Story #4040
closedAs a user, I can specify the connection_limit on any Remote
100%
Description
Problem¶
The ArtifactDownloader has a max_downloads restrictor. This is an identical feature to the connection limiting offered with aiohttp itself. Specifically aiohttp offers two features with its TCPConnector:
From their docs
- limit (int) – total number simultaneous connections. If limit is None the connector has no limit (default: 100).
Is this really a duplication?¶
Yes, it is verified via wireshark analysis. A max_downloader=20 and a limit=20 for the same number of TCP connections to the server.
Solution¶
Add connection_limit to the Remote like the rest of the attributes, e.g. proxy settings.
connection_limit in Pulp maps to TCPConnector.limit in aiohttp
What are the right default to Pulp¶
Like Pulp2, we should default connection_limit=5. If unset on a remote, a user will receive these default values.
How to Implement¶
This configuration should be read from the Remote and applied to the session by the Factory just like how the other settings are done. Since an unset value should provide default even if unset, the default actually like in the Factory code. That is here
Related issues
Updated by bmbouter over 6 years ago
- Related to Story #4039: As a plugin writer, I can specify max_concurrent_downloader and max_content_unit via DeclarativeVersion added
Updated by bmbouter over 6 years ago
- Project changed from File Support to Pulp
- Subject changed from As a user, I can specify the max_concurrent_downloads per FileRemote to As a user, I can specify the connection_limit and connection_limit_per_host on any Remote
- Description updated (diff)
This has been determined to be broadly useful so I've moved it to core which will automatically inherit onto FileRemote.
Updated by daviddavis over 6 years ago
Is it worthwhile to expose connection_limit_per_host to users? It seems hard to explain the two different fields in terms of a Pulp user's perspective and I am not sure it's very useful to set connection_limit_per_host.
Updated by bmbouter over 6 years ago
- Subject changed from As a user, I can specify the connection_limit and connection_limit_per_host on any Remote to As a user, I can specify the connection_limit on any Remote
- Description updated (diff)
I was thinking that it would be useful if a plugin writer was syncing from a mirror network where they connect to multiple hosts but I don't know of any plugin writers immediately planning on doing that so connection_limit_per_host is probably not useful right now.
I've revised this to remove connection_limit_per_host.
Updated by ttereshc over 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to ttereshc
Updated by ttereshc over 6 years ago
- Status changed from ASSIGNED to POST
- Sprint set to Sprint 43
Updated by gmbnomis over 6 years ago
bmbouter wrote:
The ArtifactDownloader has a max_downloads restrictor. This is an identical feature to the connection limiting offered with aiohttp itself. Specifically aiohttp offers two features with its TCPConnector:
Nitpick: the feature is not identical. Multiple stages may use the same remote downloader factory (e.g. one doing downloads, one doing API requests). But it is probably not worth the additional complexity to limit downloads per stage and connections globally via the connection pool.
Updated by bmbouter over 6 years ago
gmbnomis wrote:
Nitpick: the feature is not identical. Multiple stages may use the same remote downloader factory (e.g. one doing downloads, one doing API requests). But it is probably not worth the additional complexity to limit downloads per stage and globally via the connection pool.
Now that you mention it, I can see what you are saying. I also agree it's probably not worth to have it per-stage and globally. I think it would also be confusing to end-users.
Added by ttereshc over 6 years ago
Added by ttereshc over 6 years ago
Revision 19b0c26b | View on GitHub
Add connection_limit to a Remote
Use aiohttp's limit
option to limit a total number of TCP connections,
instead of using asyncio.Semaphore
directly.
Updated by ttereshc over 6 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|19b0c26bdd17e51fd468b332103bd6b9d96878ff.
Updated by gmbnomis over 6 years ago
I just realized that there is probably a huge difference between the previous implementation limiting the active downloaders and the new one relying on aiohttp's connection limit. I have not tested it, but I think that the following may happen:
If a downloader gets back one of the HTTP codes which trigger a retry, the exponential backoff kicks in https://github.com/pulp/pulp/blob/60700bc51a2f5d49030950129edc9c723333e82d/plugin/pulpcore/plugin/download/http.py#L161. This means that the downloader will free up a connection and sleep. As we may have > 200 downloaders waiting for a connection, a new downloader will hit the server immediately without a backoff. If the server sends the 429 (or other) return code because we hit a limit, it will continue to do so. Even if there is only one artifact to download per content unit, Pulp will hammer the server with 200 requests until all of them are retrying in the worst case. And 200 downloaders in retry will probably still cause lots of load.
We could reduce max_concurrent_content
to e.g. 10 to ameliorate this effect, but then, the entire stage blocks when there are 10 content units scheduled for download. The main point about max_concurrent_content
is that the stage allows declarative content not needing downloads to pass it on a "fast lane" until we have such a large number of content units that we don't want to buffer them anymore. It is not intended to limit the number of concurrent active downloaders (this is exactly what the Semaphore that is gone now was for). Furthermore, even 10 content units may need 20 downloads each and we have the same effect as before.
Thus, if the scenario outlined above is valid and if we want to limit the number of active downloaders to achieve proper backoff, I am afraid that the simplest solution is to revert to the previous implementation of the stage and use the new configuration in the remote to limit the number of downloaders (and possibly, the number of connections at the same time).
Updated by gmbnomis over 6 years ago
I verified that the retry behavior is as expected, see issue https://pulp.plan.io/issues/4075
Updated by bmbouter about 5 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Add connection_limit to a Remote
Use aiohttp's
limit
option to limit a total number of TCP connections, instead of usingasyncio.Semaphore
directly.closes #4040 https://pulp.plan.io/issues/4040