Issue #9013

rhel7 rpm repo sync canceled - 'Server disconnected' Error

Added by keilr 3 months ago. Updated 11 days ago.

Status: CLOSED - DUPLICATE
Priority: Normal
Assignee: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: No
Groomed: No
Sprint Candidate: No
Tags:
Sprint:
Quarter:

Description

We would like to use Pulp 3 to sync all our RHEL repos (without Katello and so on, just plain Pulp, to keep the setup lightweight and simple). Installation and configuration are done with Ansible, using the pulp-squeezer Ansible collection.

During our proof-of-concept we faced a problem during RHEL repo sync.

Setup: pulp-oci-image 3.13.0, downloading via a web proxy

remote repository:

{
  "pulp_href": "/pulp/api/v3/remotes/rpm/rpm/c6f4f579-3ce5-4677-9515-7a71aeec2d16/",
  "pulp_created": "2021-05-19T09:29:30.880578Z",
  "name": "rhel-7-server-rpms",
  "url": "https://cdn.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os",
  "ca_cert": "-----BEGIN CERTIFICATE-----...output omitted...\n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\n...output omitted...\n-----END CERTIFICATE-----\n-----BEGIN CERTIFICATE-----\n...output omitted...\n-----END CERTIFICATE-----",
  "client_cert": "-----BEGIN CERTIFICATE-----\n...output omitted...\n-----END CERTIFICATE-----\n-----BEGIN ENTITLEMENT DATA-----\n...output omitted...\n-----END ENTITLEMENT DATA-----\n-----BEGIN RSA SIGNATURE-----\n...output omitted...\n-----END RSA SIGNATURE-----",
  "tls_validation": true,
  "proxy_url": "http://proxy.example.local:8080",
  "pulp_labels": {},
  "pulp_last_updated": "2021-05-19T09:29:30.880597Z",
  "download_concurrency": 10,
  "policy": "immediate",
  "total_timeout": null,
  "connect_timeout": null,
  "sock_connect_timeout": null,
  "sock_read_timeout": null,
  "headers": null,
  "rate_limit": null,
  "sles_auth_token": null
}

repository:

{
  "pulp_href": "/pulp/api/v3/repositories/rpm/rpm/34a26e71-56e3-4069-8aa9-e5efd9970dd2/",
  "pulp_created": "2021-05-19T09:29:32.088270Z",
  "versions_href": "/pulp/api/v3/repositories/rpm/rpm/34a26e71-56e3-4069-8aa9-e5efd9970dd2/versions/",
  "pulp_labels": {},
  "latest_version_href": "/pulp/api/v3/repositories/rpm/rpm/34a26e71-56e3-4069-8aa9-e5efd9970dd2/versions/0/",
  "name": "test-rhel7-base",
  "description": null,
  "retained_versions": null,
  "remote": null,
  "autopublish": false,
  "metadata_signing_service": null,
  "retain_package_versions": 0,
  "metadata_checksum_type": "sha256",
  "package_checksum_type": "sha256",
  "gpgcheck": 0,
  "repo_gpgcheck": 0,
  "sqlite_metadata": false
}

task details:


  {
    "pulp_href": "/pulp/api/v3/tasks/9b62d728-7121-4f33-b9cc-4b28d6df58b7/",
    "pulp_created": "2021-05-19T09:29:33.268826Z",
    "state": "failed",
    "name": "pulp_rpm.app.tasks.synchronizing.synchronize",
    "logging_cid": "4030d4f74ec34c2d8d692fdb7ad34d80",
    "started_at": "2021-05-19T09:29:33.412301Z",
    "finished_at": "2021-05-19T09:42:38.404211Z",
    "error": {
      "traceback": "  File \"/usr/local/lib/python3.6/site-packages/rq/worker.py\", line 1008, in perform_job\n    rv = job.perform()\n  File \"/usr/local/lib/python3.6/site-packages/rq/job.py\", line 706, in perform\n    self._result = self._execute()\n  File \"/usr/local/lib/python3.6/site-packages/rq/job.py\", line 729, in _execute\n    result = self.func(*self.args, **self.kwargs)\n  File \"/usr/local/lib/python3.6/site-packages/pulp_rpm/app/tasks/synchronizing.py\", line 269, in synchronize\n    dv.create()\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/plugin/stages/declarative_version.py\", line 149, in create\n    loop.run_until_complete(pipeline)\n  File \"/usr/lib64/python3.6/asyncio/base_events.py\", line 484, in run_until_complete\n    return future.result()\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py\", line 225, in create_pipeline\n    await asyncio.gather(*futures)\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py\", line 43, in __call__\n    await self.run()\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/plugin/stages/artifact_stages.py\", line 174, in run\n    pb.done += task.result()  # download_count\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/plugin/stages/artifact_stages.py\", line 200, in _handle_content_unit\n    await asyncio.gather(*downloaders_for_content)\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/plugin/stages/models.py\", line 89, in download\n    download_result = await downloader.run(extra_data=self.extra_data)\n  File \"/usr/local/lib/python3.6/site-packages/pulpcore/download/base.py\", line 241, in run\n    return await self._run(extra_data=extra_data)\n  File \"/usr/local/lib/python3.6/site-packages/pulp_rpm/app/downloaders.py\", line 88, in _run\n    async with self.session.get(url, proxy=self.proxy, auth=self.auth) as response:\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/client.py\", line 1117, in __aenter__\n    self._resp = await self._coro\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/client.py\", line 521, in _request\n    req, traces=traces, timeout=real_timeout\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/connector.py\", line 535, in connect\n    proto = await self._create_connection(req, traces, timeout)\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/connector.py\", line 890, in _create_connection\n    _, proto = await self._create_proxy_connection(req, traces, timeout)\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/connector.py\", line 1111, in _create_proxy_connection\n    resp = await proxy_resp.start(conn)\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/client_reqrep.py\", line 890, in start\n    message, payload = await self._protocol.read()  # type: ignore\n  File \"/usr/local/lib64/python3.6/site-packages/aiohttp/streams.py\", line 604, in read\n    await self._waiter\n",
      "description": "Server disconnected"
    },
    "worker": "/pulp/api/v3/workers/91fdedcc-9c95-4867-85aa-702ccd512aec/",
    "parent_task": null,
    "child_tasks": [],
    "task_group": null,
    "progress_reports": [
      {
        "message": "Parsed Comps",
        "code": "parsing.comps",
        "state": "completed",
        "total": 91,
        "done": 91,
        "suffix": null
      },
      {
        "message": "Downloading Artifacts",
        "code": "sync.downloading.artifacts",
        "state": "failed",
        "total": null,
        "done": 8711,
        "suffix": null
      },
      {
        "message": "Parsed Packages",
        "code": "parsing.packages",
        "state": "canceled",
        "total": 31771,
        "done": 11012,
        "suffix": null
      },
      {
        "message": "Downloading Metadata Files",
        "code": "downloading.metadata",
        "state": "canceled",
        "total": null,
        "done": 5,
        "suffix": null
      },
      {
        "message": "Associating Content",
        "code": "associating.content",
        "state": "canceled",
        "total": null,
        "done": 13014,
        "suffix": null
      },
      {
        "message": "Parsed Advisories",
        "code": "parsing.advisories",
        "state": "completed",
        "total": 4787,
        "done": 4787,
        "suffix": null
      }
    ],
    "created_resources": [],
    "reserved_resources_record": [
      "/pulp/api/v3/repositories/rpm/rpm/34a26e71-56e3-4069-8aa9-e5efd9970dd2/",
      "/pulp/api/v3/remotes/rpm/rpm/c6f4f579-3ce5-4677-9515-7a71aeec2d16/"
    ]
  }

It looks like the sync downloaded part of the content, but the task did not complete.

We have no clue where to start troubleshooting. We are very grateful for any help!
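As a possible starting point for reading the task output above, here is a minimal sketch that pulls out the task state, the error description, and every progress report that did not reach "completed". The helper name `summarize_task` is hypothetical (it is not part of Pulp or pulp-squeezer), and the sample data is a trimmed copy of the task JSON shown above.

```python
import json


def summarize_task(task: dict) -> dict:
    """Return the task state, error text, and progress reports that did not complete."""
    error = task.get("error") or {}
    unfinished = [
        {
            "code": report["code"],
            "state": report["state"],
            "done": report["done"],
            "total": report["total"],
        }
        for report in task.get("progress_reports", [])
        if report["state"] != "completed"
    ]
    return {
        "state": task["state"],
        "error": error.get("description"),
        "unfinished": unfinished,
    }


# Trimmed copy of the task details from this issue (traceback left out).
task = json.loads("""
{
  "state": "failed",
  "error": {"description": "Server disconnected"},
  "progress_reports": [
    {"code": "parsing.comps", "state": "completed", "total": 91, "done": 91},
    {"code": "sync.downloading.artifacts", "state": "failed", "total": null, "done": 8711},
    {"code": "parsing.packages", "state": "canceled", "total": 31771, "done": 11012}
  ]
}
""")

summary = summarize_task(task)
print(summary["state"], "-", summary["error"])
for report in summary["unfinished"]:
    print(f'{report["code"]}: {report["state"]} ({report["done"]}/{report["total"]})')
```

In this output the only report in state "failed" is the artifact download step; the "canceled" reports are stages that were still running when the pipeline stopped.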

History

#1 Updated by dalley 3 months ago

  • Status changed from NEW to CLOSED - DUPLICATE

Hi @keilr

This is an issue that has come up recently with the CDN serving the Red Hat repositories - it seems that some of Akamai's DDoS prevention is misbehaving and causing a lot of random download failures for some people.

We addressed this in pulpcore 3.14 by adding automatic retries for a much larger number of errors and reducing the default number of files downloading in parallel. So if you upgrade to a system with pulpcore 3.14 you should stop encountering these issues.
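To illustrate the general idea of per-download retries combined with a lower parallel download limit, here is a rough sketch. It is not pulpcore's implementation: the function names, the built-in `ConnectionError` (standing in for whatever the HTTP client raises on a dropped connection), the concurrency of 2, and the retry count of 3 are all assumptions chosen for the example.

```python
import asyncio


async def download_with_retries(fetch, url, retries=3, base_delay=0.0):
    """Call fetch(url); on a dropped connection, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return await fetch(url)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries, let the caller see the error
            # Exponential backoff between attempts (zero here to keep the demo fast).
            await asyncio.sleep(base_delay * 2 ** attempt)


async def download_all(fetch, urls, concurrency=5, retries=3):
    """Download all URLs with at most `concurrency` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(url):
        async with semaphore:
            return await download_with_retries(fetch, url, retries)

    return await asyncio.gather(*(one(url) for url in urls))


# Hypothetical flaky server: the first request for each URL is dropped.
calls = {}


async def flaky_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if calls[url] == 1:
        raise ConnectionError("Server disconnected")
    return f"content of {url}"


results = asyncio.run(
    download_all(flaky_fetch, ["a.rpm", "b.rpm", "c.rpm"], concurrency=2)
)
print(results)
print(calls)
```

With this approach a single dropped connection costs one extra request for that file instead of failing the whole batch.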

Although - I notice that we don't seem to have a 3.14 image tagged and uploaded. We will see about fixing that - in the meantime, I believe "latest" should work, if we haven't pushed the 3.14 image by the time you see this.

#2 Updated by keilr 3 months ago

I just found this related issue: https://pulp.plan.io/issues/6589

Pulp 3 doesn't retry if the connection is dropped.

I'll try to get it working by adding the "retry" option to the Ansible task (pulp.squeezer.rpm_sync Ansible module).

Unfortunately, this behavior makes it hard to monitor repo sync tasks. With Pulp 2 we reported the Pulp task status to Grafana and could see whether a repo sync had failed. But with Ansible we should be able to accomplish something similar.
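For the monitoring side, one option is to poll the task until it reaches a final state and report that state. The sketch below is only an illustration: `wait_for_task` and `fake_get_task` are hypothetical names, the `get_task` callable stands in for however the task JSON is actually fetched, and the set of final states is an assumption based on the states visible in the task output in this issue.

```python
import time

# Assumed final states, taken from the states seen in the task output above.
FINAL_STATES = {"completed", "failed", "canceled"}


def wait_for_task(get_task, href, interval=1.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_task(href) until the task reaches a final state, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        task = get_task(href)
        if task["state"] in FINAL_STATES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {href} still {task['state']} after {timeout}s")
        sleep(interval)


# Stand-in for a real API call: returns a scripted sequence of states.
states = iter(["waiting", "running", "failed"])


def fake_get_task(href):
    return {"pulp_href": href, "state": next(states)}


final = wait_for_task(
    fake_get_task,
    "/pulp/api/v3/tasks/example/",  # placeholder href, not a real task
    interval=0,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(final["state"])
```

The returned state could then be pushed to a dashboard as a single value per sync task.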

#3 Updated by keilr 3 months ago

dalley: Thanks a lot for your answer. Looks like we commented on the issue at the same time :) We are very happy to hear that the retry feature was added in 3.14.

#4 Updated by dalley 3 months ago

@keilr, No problem :) We've just pushed the 3.14 image.

Just to be clear - the sync task itself shouldn't fail anymore. The "retries" I'm referring to are retries of the individual HTTP download requests made during the sync, not retries of the entire sync. So in terms of monitoring, you only need to monitor the one sync task, and you shouldn't need any code changes. Basically, the sync will attempt to handle some of these errors rather than simply failing.

Thanks for linking that - I updated that thread with the info about having recently added retry support.

#5 Updated by keilr 3 months ago

Great, thanks a lot for pushing the image! I can confirm that the sync task completed successfully with Pulp 3.14.

#6 Updated by keilr 27 days ago

dalley: How many Red Hat repo sync jobs can I start in parallel without triggering the DDoS prevention?

#7 Updated by dalley 11 days ago

@keilr I have no idea; it's very inconsistent. But I rarely see it fail on the same package more than once or maybe twice, so as long as your retry count is 3+ you probably shouldn't see issues.

If the server returns an actual valid 4xx response code, the download isn't retried; retries only happen on various types of server errors.

FWIW, I'm not 100% sure it's DDoS prevention. I have heard that Akamai's reliability has been somewhat lacking, and that they are trying to remediate that whole situation.
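The retry rule described in this comment can be sketched roughly as follows. This is an illustration only, not pulpcore's code: `should_retry` is a hypothetical helper, and treating "server errors" as 5xx responses plus dropped connections (no response at all) is an assumption, since the exact list of retried errors is not given here.

```python
from typing import Optional


def should_retry(status: Optional[int], attempt: int, max_retries: int = 3) -> bool:
    """Decide whether a failed download should be attempted again.

    status is the HTTP status code, or None if the connection was dropped
    before any response arrived. attempt counts retries already made.
    """
    if attempt >= max_retries:
        return False  # retry budget used up
    if status is None:
        return True  # no response at all, e.g. "Server disconnected"
    return 500 <= status <= 599  # retry server errors, never 4xx


print(should_retry(None, attempt=0))  # dropped connection
print(should_retry(503, attempt=2))   # server error, retries left
print(should_retry(404, attempt=0))   # valid client error
print(should_retry(503, attempt=3))   # server error, budget exhausted
```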
