Issue #4036

aiohttp.client_exceptions.ServerDisconnectedError during sync

Added by ttereshc over 5 years ago. Updated over 4 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
Sprint: Sprint 43
Quarter:

Description

I sync the EPEL7 repo:

http POST http://localhost:8000/pulp/api/v3/repositories/ name=epel7
http POST http://localhost:8000/pulp/api/v3/remotes/rpm/ name='epel7' url='https://dl.fedoraproject.org/pub/epel/7/x86_64/'
export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r '.results[] | select(.name == "epel7") | ._href')
export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/rpm/ | jq -r '.results[] | select(.name == "epel7") | ._href')
http POST :8000${REMOTE_HREF}sync/ repository=$REPO_HREF

And I get either this:
Sep 21 07:54:21 pulp3.dev rq[8483]: pulp: rq.worker:ERROR: aiohttp.client_exceptions.ServerDisconnectedError: None
Sep 21 07:54:21 pulp3.dev rq[8483]: Traceback (most recent call last):
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/worker.py", line 793, in perform_job
Sep 21 07:54:21 pulp3.dev rq[8483]:     rv = job.perform()
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 599, in perform
Sep 21 07:54:21 pulp3.dev rq[8483]:     self._result = self._execute()
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in _execute
Sep 21 07:54:21 pulp3.dev rq[8483]:     return self.func(*self.args, **self.kwargs)
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/devel/pulp_rpm/pulp_rpm/app/tasks/synchronizing.py", line 69, in synchronize
Sep 21 07:54:21 pulp3.dev rq[8483]:     loop.run_until_complete(pipeline)
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
Sep 21 07:54:21 pulp3.dev rq[8483]:     return future.result()
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/api.py", line 111, in create_pipeline
Sep 21 07:54:21 pulp3.dev rq[8483]:     await asyncio.gather(*futures)
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/artifact_stages.py", line 180, in __call__
Sep 21 07:54:21 pulp3.dev rq[8483]:     one_units_downloads = gathered_downloaders.result()
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/backoff/_async.py", line 132, in retry
Sep 21 07:54:21 pulp3.dev rq[8483]:     ret = yield from target(*args, **kwargs)
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/download/http.py", line 169, in run
Sep 21 07:54:21 pulp3.dev rq[8483]:     async with self.session.get(self.url) as response:
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/aiohttp/client.py", line 855, in __aenter__
Sep 21 07:54:21 pulp3.dev rq[8483]:     self._resp = await self._coro
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/aiohttp/client.py", line 391, in _request
Sep 21 07:54:21 pulp3.dev rq[8483]:     await resp.start(conn)
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/aiohttp/client_reqrep.py", line 757, in start
Sep 21 07:54:21 pulp3.dev rq[8483]:     message, payload = await self._protocol.read()
Sep 21 07:54:21 pulp3.dev rq[8483]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/aiohttp/streams.py", line 543, in read
Sep 21 07:54:21 pulp3.dev rq[8483]:     await self._waiter
Sep 21 07:54:21 pulp3.dev rq[8483]: aiohttp.client_exceptions.ServerDisconnectedError: None

Or that:
Sep 21 08:24:44 pulp3.dev rq[9116]: pulp: backoff:ERROR: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: pulp: backoff:ERROR: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: pulp: backoff:ERROR: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: pulp: backoff:ERROR: Giving up run(...) after 1 tries (aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable')
Sep 21 08:24:44 pulp3.dev rq[9116]: pulp: rq.worker:ERROR: aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable'
Sep 21 08:24:44 pulp3.dev rq[9116]: Traceback (most recent call last):
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/worker.py", line 793, in perform_job
Sep 21 08:24:44 pulp3.dev rq[9116]:     rv = job.perform()
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 599, in perform
Sep 21 08:24:44 pulp3.dev rq[9116]:     self._result = self._execute()
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in _execute
Sep 21 08:24:44 pulp3.dev rq[9116]:     return self.func(*self.args, **self.kwargs)
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/devel/pulp_rpm/pulp_rpm/app/tasks/synchronizing.py", line 69, in synchronize
Sep 21 08:24:44 pulp3.dev rq[9116]:     loop.run_until_complete(pipeline)
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
Sep 21 08:24:44 pulp3.dev rq[9116]:     return future.result()
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/api.py", line 111, in create_pipeline
Sep 21 08:24:44 pulp3.dev rq[9116]:     await asyncio.gather(*futures)
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/artifact_stages.py", line 180, in __call__
Sep 21 08:24:44 pulp3.dev rq[9116]:     one_units_downloads = gathered_downloaders.result()
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/backoff/_async.py", line 132, in retry
Sep 21 08:24:44 pulp3.dev rq[9116]:     ret = yield from target(*args, **kwargs)
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/download/http.py", line 170, in run
Sep 21 08:24:44 pulp3.dev rq[9116]:     response.raise_for_status()
Sep 21 08:24:44 pulp3.dev rq[9116]:   File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/aiohttp/client_reqrep.py", line 853, in raise_for_status
Sep 21 08:24:44 pulp3.dev rq[9116]:     headers=self.headers)
Sep 21 08:24:44 pulp3.dev rq[9116]: aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable'

Files

rpmfusion_capture.pcapng (34.6 MB): rpmfusion sync w/ concurrency of 20 (bmbouter, 09/27/2018 10:28 PM)
server_disconnect_capture.pcapng (50 MB): pulp3, 5 TCP connection limit, ServerDisconnect reproduced (bmbouter, 10/17/2018 10:24 PM)

Related issues

Related to Pulp - Test #4116: Functionally test force_close with HTTP 1.1 servers (CLOSED - WONTFIX)
#1

Updated by CodeHeeler over 5 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 43
#2

Updated by bmbouter over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
#3

Updated by bmbouter over 5 years ago

I was testing against dl.fedoraproject.org with various concurrency settings. A concurrency of 3 failed. ttereshc said on IRC that a concurrency of 1 also failed...

I asked in #fedora-admin and they said:

(1) Fedora infra limits to 15 connections per IP. AIUI, with asyncio and connection reuse we are using 1 connection, so that shouldn't limit us.
(2) They looked at the logs for my IP that received ServerDisconnectedErrors, and the logs showed all 200 OK. This is especially strange: if the server says all responses were 200 OK and yet my client disconnects, something is wrong.

#4

Updated by bmbouter over 5 years ago

I wanted to test with other versions of aiohttp. I reproduced the same ServerDisconnectedError with both 3.3.2 and the latest 3.4.4.

#5

Updated by bmbouter over 5 years ago

I reproduced the issue against other, non-Fedora servers, so I don't think this is specific to Fedora:

http://download1.rpmfusion.org/free/fedora/releases/28/Everything/x86_64/os/
http://download-ipv4.eng.brq.redhat.com/released/RHEL-7/7.5/Server/x86_64/os/

#6

Updated by bmbouter over 5 years ago

Attached is a Wireshark capture of a sync from rpmfusion with a concurrency of 20, which reproduces the problem. It shows some new information:

1. Each "ArtifactDownloader" is happening on its own TCP connection. This explains why so many servers don't like this. For instance, on IRC nirik mentioned that Fedora's infra is configured to disallow >15 connections from 1 IP at a time.

2. aiohttp already has features to restrict this sort of thing, and we're not using them. They're called limit and limit_per_host here: https://docs.aiohttp.org/en/stable/client_reference.html#baseconnector (see the sketch after this list).

3. Instead we have our own limiter. Perhaps that's part of our issue. I'm going to continue investigating.
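
For illustration only, here is a minimal sketch of those connector limits (the variable names and numbers are placeholders, not Pulp's actual factory code):

import aiohttp

# Cap connections at the connector level instead of in our own limiter.
connector = aiohttp.TCPConnector(
    limit=20,          # total simultaneous connections for the session
    limit_per_host=5,  # simultaneous connections to any single host
)
session = aiohttp.ClientSession(connector=connector)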

#7

Updated by bmbouter over 5 years ago

For the 503 errors I think we should remove the giveup altogether and just have it retry over and over. I think users basically want us to keep trying at all costs.

#8

Updated by daviddavis over 5 years ago

Excellent findings. I'm curious how Pulp 2 deals with these issues. For example, how many concurrent TCP connections do we use by default in Pulp 2? And how does Pulp 2 handle 503 errors (i.e., does it retry or give up)?

#9

Updated by bmbouter over 5 years ago

Pulp 2 uses nectar for downloading, which defaults to 5 retries and 5 concurrent downloads. See the code here.

RPM only overrides this if the user defines a value on the importer, which then overrides the default here.

I'm not sure under which conditions the retry occurs.
I'm not sure how many TCP connections are used when connecting to a remote server.
I'm unclear on how 5 was determined to be the "right" number.

#10

Updated by mhrivnak over 5 years ago

Here's a summary I posted on IRC:

10:20 <mhrivnak> jortel bmbouter yes, years ago I setup an experiment that you can probably find detailed in the archives for one of the email lists. I can summarize it though.
10:21 <mhrivnak> I took a ton of RPMs from some linux distro and stuck them on a machine in phx2. That introduced real-world latency, but very consistent latency, because traffic was going over a dedicated connection from the office.
10:22 <mhrivnak> I tried a bunch of concurrency levels, and the max throughput was observed with 7 concurrent downloaders.
10:22 <mhrivnak> But the increase from about 4 to 7 was very slim.
10:23 <mhrivnak> So in the interest of being more "polite" to remote servers, we dialed 7 back to 5.
10:23 <mhrivnak> I think katello may have dialed that back to 4 even with the setting, if my memory is correct.
10:25 <mhrivnak> There have been a few occasions where users have observed that a remote server stopped responding to requests, or started responding with 4xx or 5xx codes, because they were receiving too many connections at once from pulp. In those cases, changing the setting to 2 or 3 has resolved it.

To the TCP question: each downloader establishes a single TCP connection to any given remote server, and it reuses that connection until the remote server closes it (which may or may not happen). That's the ideal situation, because establishing new TCP connections can be surprisingly expensive, especially if you add TLS on top. Depending on the HTTP client you use, you might need to tell it explicitly to reuse connections.

Worth knowing: HTTP 1.1 limits you to a single request/response at a time per connection; you cannot make concurrent requests over a single connection. One of the hot new features of HTTP 2.0 is that it does support concurrent requests over a single connection, but not a lot of clients and servers are using 2.0 yet. That said, my expectation is that N (where N > 1) concurrent downloads over a single HTTP 2.0 connection would be slower than using N connections. I would even suspect that N downloads over 1 connection with HTTP 2.0 would be closer in performance to a concurrency of 1 with HTTP 1.1 than to a concurrency of N with HTTP 1.1. It would be interesting to test, but I think concurrent requests over a single connection are more useful for small request/response payloads where processing time is needed on the remote end, for example loading a webpage that requires 25 requests for various small resources.

#11

Updated by bmbouter over 5 years ago

After doing some reading, here is some general info I found:

1. The aiohttp library is correctly reusing TCP connections. I verified this from the Wireshark data, and it confirms the claim in the aiohttp docs.

2. HTTP 1.1 does support pipelining, but aiohttp as a client does not support it, AIUI. This issue in their tracker suggests they don't support it, and I can't find anything in their client docs about it. I opened this to get more clarity: https://github.com/aio-libs/aiohttp/issues/3310

#12

Updated by bmbouter over 5 years ago

Issue #4040 should resolve all of our issues except 2 cases:

1. Where the server is responding with HTTP >3XX error codes and Pulp isn't retrying correctly.
2. Even with the settings from #4040 limiting the number of TCP connections, if the server decides to close all of them, Pulp could still need to reconnect. The @backoff behavior we have won't do this for us, because we continually hand a closed connection to future downloaders even after the remote side closed it. This case is complicated enough that I'm moving it to its own ticket and also asking for input from upstream aiohttp.

So for the remainder of this issue I want to focus only on case (1): Pulp isn't retrying on >3XX HTTP status codes.

To fix that I recommend we:

Redo the giveup() method to retry anytime it receives an error raised from raise_for_status() in aiohttp. Currently we only retry for HTTP 429, so this would fix the problem that we're not retrying for 503 or the other >3XX error codes. A rough sketch of the idea follows.
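
This sketch is hypothetical (the run() signature and the max_tries value are assumptions, not the actual downloader code); it shows a giveup predicate that never gives up on HTTP errors, so backoff retries every status raised by raise_for_status():

import aiohttp
import backoff

def giveup(exc):
    # Previously only HTTP 429 was treated as retryable. Returning False
    # here means every ClientResponseError (503 included) is retried
    # until max_tries is exhausted.
    return False

@backoff.on_exception(backoff.expo, aiohttp.ClientResponseError,
                      max_tries=10, giveup=giveup)
async def run(session, url):
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.read()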

#13

Updated by bmbouter over 5 years ago

  • Status changed from ASSIGNED to NEW

I'm at AnsibleFest so I can't fix this right now; it would be great if someone else could. I outlined the solution that I think we should go with, and we also need to do #4040.

#14

Updated by bmbouter over 5 years ago

  • Assignee deleted (bmbouter)
#15

Updated by ttereshc over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to ttereshc
#16

Updated by ttereshc over 5 years ago

  • Status changed from ASSIGNED to POST

Added by ttereshc over 5 years ago

Revision d87ebb5a | View on GitHub

Retry in case of some 5XX errors

closes #4036 https://pulp.plan.io/issues/4036

#17

Updated by ttereshc over 5 years ago

  • Status changed from POST to MODIFIED
#18

Updated by bmbouter over 5 years ago

I captured Wireshark data for Pulp 2, but since the sync works, the capture is 800+ MB, so I'm not going to post it here. I captured it with:

pulp-admin -u admin -p admin rpm repo create --repo-id rpmfusion --relative-url rpmfusion --feed http://download1.rpmfusion.org/free/fedora/releases/28/Everything/x86_64/os/
pulp-admin -u admin -p admin rpm repo sync run --repo-id rpmfusion

Here are a few quick observations from the Wireshark data: Pulp 2 limits itself to 5 concurrent connections, and that works for it. It closes TCP connections and recycles them more often; I see over 20 TCP connections made over the duration of the sync, but only 5 connected at any given time. This is different from aiohttp, which makes 5 connections immediately and never disconnects.

I'm going to continue my analysis.

#19

Updated by bmbouter over 5 years ago

After bringing back the semaphore to restrict aiohttp to 5 concurrent downloads, I captured the traffic again; it's attached.

In analyzing this capture, it appears that the client is actually the one deciding to close the connection. You can see FIN, FIN/ACK, and then RST segments flowing from the client -> server. Since it's the client side closing the connection and then experiencing the error, this is very strange...
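
For reference, the semaphore restriction is roughly this (a minimal sketch with assumed names, not the actual stage code):

import asyncio

# At most 5 downloads may run concurrently, no matter how many
# coroutines the pipeline schedules.
semaphore = asyncio.Semaphore(5)

async def download(downloader):
    # downloader.run() is an assumed coroutine that fetches one artifact.
    async with semaphore:
        return await downloader.run()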

#20

Updated by bmbouter over 5 years ago

I've been working on an easy way to reproduce this. I made a client and server that hold open N connections and then close the one whose worker receives the SIGUSR1 signal. The idea is that a client which closes all of its connections when one of them closes normally should reveal that behavior against a server like this.

server: https://gist.github.com/bmbouter/222b79bf453f8f9a77069e86e3b65ac8
client: https://gist.github.com/bmbouter/90c1eb9d0b476a0a674930aa33852b56

Run the server using gunicorn with something like:

gunicorn server:my_web_app --bind localhost:8080 --worker-class aiohttp.GunicornWebWorker -w 10

Then, when you connect via the client, you'll see which PIDs receive the connections. Send SIGUSR1 to one of the connected processes shown in the gunicorn output with kill -s SIGUSR1 <pid>.

Strangely, this does not show the issue. It shows the one connection closing with 200 OK and the other remaining open.

#21

Updated by bmbouter over 5 years ago

I tried using pulp_file with a custom server able to act like a pulp_file repo, and it also did not reproduce the issue. This suggests that it's not a bug in aiohttp. I went back and reproduced ServerDisconnectedError against rpmfusion just to be sure the original problem still reproduces, and it does.

I'm thinking more on the TCP captures as I reproduce the issue a few more times.

#22

Updated by bmbouter over 5 years ago

So, to recap, we've known there were two issues at play (listed below). I now believe one of them is real and one isn't.

1) Pulp needs to add session reconnection/re-establishment for aiohttp's session. Their issue tracker recommends that applications like Pulp carry this themselves.

2) A strange TCP closure behavior where a single TCP stream being closed normally on the server side would cause the client to initiate TCP closures on all other TCP connections to that server, e.g. comment 19.

I've realized that issue (1) is real and issue (2) is not!

To explain why we observed the behavior in issue (2): it's actually normal. It is expected that the client would immediately close all TCP connections, because aiohttp encountered a fatal exception and Pulp left it unhandled. What could aiohttp do but shut down? So it shuts down immediately, which closes all aiohttp sockets, causing the client -> server TCP RST segments observed in comment 19.

#23

Updated by bmbouter over 5 years ago

  • Status changed from MODIFIED to POST
  • Assignee changed from ttereshc to bmbouter
#24

Updated by bmbouter over 5 years ago

  • Related to Test #4116: Functionally test force_close with HTTP 1.1 servers added

Added by bmbouter over 5 years ago

Revision adec0fb7 | View on GitHub

Use force_close and concurrency restriction

The use of force_close was inspired by:

https://github.com/aio-libs/aiohttp/issues/2867#issuecomment-375278533

  • Causes Factory and Downloaders to use force_close which causes the TCP connection to close with each request.
  • Ups the default concurrency to 20
  • Moves the concurrency restriction feature to the Downloaders and updates HttpDownloader and FileDownloader to use it.
  • Stops using aiohttp for concurrency restriction because it's done in the downloaders.
  • The Factory now configures the concurrency restriction using the value on the remote.
  • creates a _run() method that subclassed Downloaders now need to implement instead of run().

https://pulp.plan.io/issues/4036 https://pulp.plan.io/issues/4075

closes #4036 closes #4075
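
For context, force_close is a standard aiohttp connector option; enabling it looks roughly like this (assumed variable names, not the actual Factory code):

import aiohttp

# force_close=True disables keep-alive: each request gets a fresh TCP
# connection that is closed after the response, so a connection the
# server has silently dropped is never handed to a later download.
connector = aiohttp.TCPConnector(force_close=True)
session = aiohttp.ClientSession(connector=connector)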

#25

Updated by bmbouter over 5 years ago

  • Status changed from POST to MODIFIED
#26

Updated by daviddavis almost 5 years ago

  • Sprint/Milestone set to 3.0.0
#27

Updated by bmbouter almost 5 years ago

  • Tags deleted (Pulp 3)
#28

Updated by bmbouter over 4 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
