Issue #1629

Frequent package download failures: nectar.downloaders.threaded:WARNING: Connection Error - http://10.19.11.110/pulp/repos//content/dist/rhel/server/6/6Server/x86_64/os/cups-1.4.2-67.el6_6.1.x86_64.rpm could not be reached.

Added by jwmatthews almost 5 years ago. Updated over 1 year ago.

Status: CLOSED - WORKSFORME
Priority: Normal
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 3. High
Version: 2.6.0
Platform Release: 2.8.0
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

We held 2 hackathon events, each with ~20+ participants using Sat 6.1 to sync ~100GB of content spread over ~8 repos.
The majority of participants, roughly 75%, ran into large numbers of package download failures.

Some saw ~16k failures in a repo with ~28k packages; others saw fewer errors.

The syncs were going against a blade in the same rack that was dedicated to serving content.
This wasn't against the CDN, but against a Pulp server in the same rack.

Env:
The Pulp server was running in a rack, exporting content.
~20+ other blades were each running an instance of Satellite in a VM (16GB of RAM in the VM, 16 virtual CPUs).
Each Satellite was syncing content from that exporting Pulp server.

Alex Krzos from the performance team looked at the environment to gather some info.
Below is info from him on what he saw:

Issue: Synced multiple Satellite 6 servers against a single Pulp repo server and observed system performance to see if system resources were a bottleneck. The repo server did not appear to be bottlenecked at all. The number of concurrent Apache processes serving content was observed at a 1s interval and did not exceed ~130 processes during this time frame.

One of the Satellite 6 servers was observed concurrently, and the log noted when a package could not be synced (nectar.downloaders.threaded:WARNING: Connection Error - http://10.19.11.110/pulp/repos//content/dist/rhel/server/6/6Server/x86_64/os/cups-1.4.2-67.el6_6.1.x86_64.rpm could not be reached.). A wget on the same URL shows that the RPM is indeed there and able to be served by Apache.

The resource usage of the Satellite server also suggested that, while there was high disk I/O (noted through ~20% CPU I/O wait time on a per-second interval), no other bottleneck should be preventing the RPM from downloading.

Looking through the httpd config files for the Pulp repo server, it was determined that a few experiments should be conducted to see whether the "requests.packages.urllib3.connectionpool" Python library mishandles Apache's configuration for when a connection is closed and when an Apache process is recycled.
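Such an experiment could look roughly like the sketch below: one shared requests session issuing more downloads than Apache's default keep-alive limit, with the connection pool's INFO messages visible. This is only an assumption of how such a test might be run, not anything shipped with Pulp or Nectar; the URL is the one from the warning above.

import logging

import requests

# Make the vendored urllib3 connection pool messages visible
# ("Starting new HTTP connection", "Resetting dropped connection").
logging.basicConfig(level=logging.INFO)

URL = ("http://10.19.11.110/pulp/repos//content/dist/rhel/server/6/6Server"
       "/x86_64/os/cups-1.4.2-67.el6_6.1.x86_64.rpm")

session = requests.Session()  # one session == one keep-alive connection pool
errors = 0
for i in range(300):  # well past Apache's default MaxKeepAliveRequests of 100
    try:
        response = session.get(URL, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        errors += 1
        print("request %d failed: %r" % (i, exc))

print("%d of 300 requests failed" % errors)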

Versions Used:
- The server sharing content was Pulp 2.7.0 (Pulp only, not Satellite)
- The clients syncing content were Sat 6.1, i.e. Pulp 2.6.0

Some system stats from Alex:

---Procs--- --------------Memory-(MB)--------------------- Swp(MB) --BlockIO-(MB)-----BlockIOaccum-(MB)-- -----Networks-(MB)--------- -----System------ --------CPU %---- Feb 03
r b swpd free buff cache dirty writeb si so bi bo bi+o biA boA bioA LOio LOioA ETHio ETHioA in cs us sy id wa st
4 4 158 238 0 9137 610 96 0 0 1 152 154 2548 21153 23701 0 558 52 20933 29846 29448 4 2 87 8 0 13:28:52
1 2 158 187 0 9171 337 142 0 0 0 344 344 2548 21497 24045 0 558 42 20976 21850 20103 4 1 87 7 0 13:28:53
1 4 158 147 0 9215 381 142 0 0 0 0 0 2548 21497 24045 0 558 54 21030 24588 19900 3 1 87 9 0 13:28:54
0 8 158 124 0 9241 373 34 0 0 0 0 0 2548 21497 24045 0 558 22 21053 9040 7986 0 0 79 20 0 13:28:55
3 4 158 148 0 9227 353 77 0 0 0 107 108 2548 21604 24152 0 558 59 21112 22888 16322 2 1 81 16 0 13:28:56

(DROP happens here) 
2   13  159       135         0      9210    373     38    0    1    5   70   75    2553   21674   24227     0     559    81   21194    53555    59050   7   3  70  20   0  13:28:57
15    2  162       159         0      9154    451      0    0    3   34   14   49    2587   21688   24275     4     563    83   21277    37692    23763   6   3  78  14   0  13:28:58
 8    2  164       215         0      9098    602      0    0    1   85    5   90    2672   21693   24365    14     578   126   21404   116107   156688  18   6  73   3   0  13:28:59
 2    2  173       193         0      9147    675      0    0    8   62   11   73    2734   21704   24438     4     582    70   21475    86398   105883  13   4  78   5   0  13:29:00
 6    1  173       256         0      9091    765      0    0    0   47   22   69    2781   21726   24507     6     589   110   21586   125251   173754  16   6  77   2   0  13:29:01
 3    1  173       248         0      9139    837     31    0    0    1   33   34    2782   21759   24541     5     595   100   21686   107694   144871  17   5  77   1   0  13:29:02
 3    6  174       208         0      9147    764    104    0    1    1  107  108    2783   21866   24649     1     596    57   21744    42238    44077   8   2  82   8   0  13:29:03
 4    2  175       183         0      9214    838     52    0    0    0   24   24    2783   21890   24673     0     596    74   21819    41504    43540   8   2  77  13   0  13:29:04
 3    2  175       141         0      9244    798    103    0    0    1  157  159    2784   22047   24831     5     601   115   21934    83234    99274  10   4  84   1   0  13:29:05
14    3  176       184         0      9226    780     78    0    0    0   94   94    2784   22141   24925     2     604    78   22012    58275    70642   9   3  76  12   0  13:29:06
 1   11  177       120         0      9279    816      0    0    0    0   40   40    2784   22181   24965     0     605    88   22101    48101    49623   8   2  83   6   0  13:29:07
 2    2  177       149         0      9252    754    101    0    0    2  144  146    2786   22325   25111     2     607    99   22201    77630    92513  11   4  77   8   0  13:29:08
 2    8  178       127         0      9276    780     66    0    0    1   31   33    2787   22356   25143     0     607    40   22242    17634    12000   6   1  73  20   0  13:29:09
 6    1  178       153         0      9225    618     65    0    0    0  195  196    2787   22551   25338     1     608    33   22276    28080    33172   7   2  75  16   0  13:29:10
logs-pulp-box-1.tgz (35.6 MB) logs-pulp-box-1.tgz jmrodri, 02/05/2016 03:24 AM
logs-139-busted.tgz (482 KB) logs-139-busted.tgz jmrodri, 02/05/2016 03:25 AM
logs-187-working-1.tgz (346 KB) logs-187-working-1.tgz jmrodri, 02/05/2016 03:27 AM
0001-Log-the-actual-connection-error.patch (901 Bytes) 0001-Log-the-actual-connection-error.patch Patch Nectar to log the actual connection error. jcline@redhat.com, 02/05/2016 09:27 PM

History

#1 Updated by mhrivnak almost 5 years ago

Was it a standard upstream 2.7 apache config? Or were there any modifications?

What OS was pulp 2.7 running on?

#2 Updated by jcline@redhat.com almost 5 years ago

Would it be possible to add the Apache logs (/var/log/httpd/) and the systemd journal (or /var/log/messages) for Pulp?

#3 Updated by jcline@redhat.com almost 5 years ago

I discovered that Pulp's Apache configuration fails to set 'MaxKeepAliveRequests'[0], so the default value of 100 is used. This causes the TCP connection to be reset after every 100 requests. That is a problem (and an easily fixed one), but I suspect it is not the only problem: it is inefficient and very noisy in the logs, yet there is nothing actually incorrect about the behaviour.

[0] https://httpd.apache.org/docs/2.2/mod/core.html#maxkeepaliverequests

#4 Updated by jmrodri almost 5 years ago

logs-pulp-box-1.tgz contains the logs and Apache configs for the server that was serving up the RPMs for syncing.

logs-139-busted.tgz contains the logs from a broken Satellite which failed to sync the content.

logs-187-working-1.tgz contains the logs from a working Satellite which seemed to sync all the content.

#5 Updated by jmrodri almost 5 years ago

This is a snippet of the "Resetting dropped connection" messages that we always seem to see:

Feb  3 14:53:41 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:41 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:41 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:41 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:41 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:42 localhost pulp: pulp_rpm.plugins.importers.yum.sync:INFO: Downloading 600 RPMs.
Feb  3 14:53:42 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:42 localhost pulp: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS
Feb  3 14:53:42 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:42 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:42 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:42 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Starting new HTTP connection (1): 10.19.11.110
Feb  3 14:53:43 localhost pulp: pulp_rpm.plugins.importers.yum.sync:INFO: Generating metadata databases.
Feb  3 14:53:44 localhost pulp: pulp_rpm.plugins.importers.yum.sync:INFO: Determining which units need to be downloaded.
Feb  3 14:53:48 localhost pulp: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS
Feb  3 14:53:50 localhost pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[ccdc88aa-5868-4c0f-abc8-51216d8aeb4c]
Feb  3 14:53:58 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:53:58 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:53:58 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:53:58 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:53:58 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:54:02 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:54:06 localhost pulp: pulp_rpm.plugins.importers.yum.sync:INFO: Determining which units need to be downloaded.
Feb  3 14:54:06 localhost pulp: pulp_rpm.plugins.importers.yum.sync:INFO: Determining which units need to be downloaded.
Feb  3 14:54:06 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110
Feb  3 14:54:06 localhost pulp: requests.packages.urllib3.connectionpool:INFO: Resetting dropped connection: 10.19.11.110

#6 Updated by jortel@redhat.com almost 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to jcline@redhat.com
  • Platform Release set to 2.8.0
  • Triaged changed from No to Yes

#7 Updated by jcline@redhat.com almost 5 years ago

It would be great if you could reproduce this with the attached patch applied (to something like /usr/lib/python2.7/site-packages/nectar/downloaders/threaded.py). It'll show us exactly what caused the connection error.

This patch is also present in nectar's master branch: https://github.com/pulp/nectar/pull/43
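For readers without the patch handy, the idea is simply to log the underlying exception rather than only a generic "could not be reached" warning. A rough illustration follows (hypothetical helper; this is not Nectar's actual code, the attached patch is authoritative):

import logging

import requests

_logger = logging.getLogger(__name__)


def fetch(session, url):
    # On a connection failure, log the actual exception so the root cause
    # (refused connection, timeout, DNS failure, ...) is visible instead of
    # only "could not be reached".
    try:
        return session.get(url, timeout=30)
    except requests.exceptions.ConnectionError as exc:
        _logger.warning("Connection Error - %s could not be reached.", url)
        _logger.warning("Underlying error was: %r", exc)
        raise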

#8 Updated by jcline@redhat.com almost 5 years ago

I believe the root cause of this is how Nectar handles failures. If a download failure occurs (and Nectar doesn't really distinguish between failure types very well), the host is blacklisted by Nectar. This includes HTTP 404, 503, etc. Once Nectar encounters that one error, it blacklists the host and skips all remaining downloads from that host.

There are several directions we could go to fix this problem:

  • Simply remove the host blacklisting feature. However, it was added because if a host became unavailable for whatever reason, Nectar would churn through potentially thousands of requests pointlessly. Still, that is the edge case rather than the much more common 503.
  • Try to make Nectar smarter about various failure scenarios.
  • Decide it's okay to fail this way and tell people to re-sync on failure. Not really a fix and not something I think is reasonable.

#9 Updated by amacdona@redhat.com almost 5 years ago

I don't think that removing the blacklist feature is an option.

The problem isn't just a huge number of failing requests, but also that each one must time out.
https://bugzilla.redhat.com/show_bug.cgi?id=1124625#c1

#10 Updated by jcline@redhat.com almost 5 years ago

Thanks Austin, it sounds like teaching Nectar a new trick is required.

#11 Updated by mhrivnak almost 5 years ago

I think your assessment about keeping the blacklist feature is good. It would be very reasonable to modify it such that 404s do not contribute to blacklisting a host. Serving 404s is usually cheap, and if a host is advertising content that isn't actually present, that's its fault.

For 5xx errors, a threshold is reasonable. One or two failures could be bad luck. More than 10 means the chance of success is slim, and you're likely to be making an unhealthy server even less healthy.
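A minimal sketch of that kind of threshold (hypothetical names; nothing like this exists in Nectar today) might count only blacklist-worthy failures per host:

import collections


class HostBlacklist(object):
    """Hypothetical per-host failure counter with a blacklist threshold."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self._failures = collections.Counter()
        self._blacklisted = set()

    def record_failure(self, host, counts_toward_blacklist):
        if not counts_toward_blacklist:
            return  # e.g. a 404: cheap for the server, don't punish the host
        self._failures[host] += 1
        if self._failures[host] >= self.threshold:
            self._blacklisted.add(host)

    def is_blacklisted(self, host):
        return host in self._blacklisted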

What specific kind of failure do you think the reporter encountered in this case?

#12 Updated by jcline@redhat.com almost 5 years ago

It's hard to say what the reason for the reporter's failure is because Nectar doesn't log anything helpful. An HTTP server is well within its rights to simply refuse a TCP connection rather than serve an HTTP 503. If it does serve a 503, it may include a Retry-After header[0], which might be worth inspecting.

My opinion is that the simplest thing to do is to ensure that 4XX responses don't blacklist the host (since those are client errors), DNS and timeout errors do blacklist the host (and that is very clearly logged), and 5XX errors are retried for whatever the configured retry count is. It's not perfect, and yes, we may be kicking an unhappy server when it's down, but it's also free to not serve HTTP 503 if it's actually hurting.

[0] https://tools.ietf.org/html/rfc2616#section-10.5.4
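To make that proposal concrete, here is a rough sketch of how a single failed download attempt might be classified (hypothetical function and names; not Nectar's actual code):

import requests

# Hypothetical outcomes for one failed download attempt, per the proposal above:
# hard network errors blacklist the host, 4xx responses fail only that file,
# and 5xx responses are retried up to the configured retry count.
BLACKLIST, FAIL_THIS_FILE, RETRY = "blacklist", "fail_this_file", "retry"


def classify_failure(failure):
    if isinstance(failure, (requests.exceptions.ConnectionError,
                            requests.exceptions.Timeout)):
        return BLACKLIST       # refused connection, DNS failure, timeout
    status = failure.status_code  # otherwise assume an HTTP response object
    if 400 <= status < 500:
        return FAIL_THIS_FILE  # e.g. 404: a client-side/publisher problem
    if 500 <= status < 600:
        return RETRY           # e.g. 503: retry, then give up on this file
    return FAIL_THIS_FILE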

#13 Updated by jcline@redhat.com almost 5 years ago

The more I look at the logs, the less convinced I am that my initial analysis is correct. The 'busted' logs have 0 RPM sync failures from what I can tell, and there are no HTTP failures logged. After working through the multi-layered try/except in Nectar, I think HTTP errors don't cause a host to be blacklisted, so we're okay there.

Connection errors are now logged in a helpful way (see the attached patch), so I propose we close this; if a more concrete problem comes up, a new issue can be opened.

#14 Updated by jcline@redhat.com almost 5 years ago

  • Status changed from ASSIGNED to CLOSED - WORKSFORME

I'm closing this since I can't really reproduce the problem, and the logs (aside from the reset TCP connections, which have been fixed on the Pulp side) don't indicate anything concrete. If this problem recurs, please reopen the issue. Thanks!

#15 Updated by bmbouter over 1 year ago

  • Tags Pulp 2 added
