Project

Profile

Help

Story #5286

closed

As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion

Added by jsherril@redhat.com over 5 years ago. Updated about 3 years ago.

Status:
CLOSED - DUPLICATE
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Katello
Sprint:
Quarter:

Description

Ticket moved to GitHub: "pulp/pulpcore/1843":https://github.com/pulp/pulpcore/issues/1843


Problem

If you sync a file repository where one of the files is missing, it seems that the repository syncs as much as it can (as expected), but its reported as a fatal error, with a state of 'failed'.

This is especially problematic for remote repos that you cannot contact the maintainer, have the content you want, but don't have all content available (because it's an incomplete or corrupted repo).

Steps to reproduce:

1) create a file repository where one of the files is missing
2) create a file remote and repository and sync them

Actual task status (apologies its been yaml-fied):

- _href: "/pulp/api/v3/tasks/b6f9b619-c174-4e43-b546-0bbefdfb11e7/"
  _created: '2019-08-15T15:21:37.058+00:00'
  state: failed
  name: pulp_file.app.tasks.synchronizing.synchronize
  started_at: '2019-08-15T15:21:37.177+00:00'
  finished_at: '2019-08-15T15:21:37.382+00:00'
  non_fatal_errors: "[]"
  error:
    code: ''
    description: 404, message='Not Found'
    traceback: |2
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/worker.py", line 822, in perform_job
          rv = job.perform()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in perform
          self._result = self._execute()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 611, in _execute
          return self.func(*self.args, **self.kwargs)
        File "/usr/local/lib/pulp/src/pulp-file/pulp_file/app/tasks/synchronizing.py", line 45, in synchronize
          dv.create()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/declarative_version.py", line 169, in create
          loop.run_until_complete(pipeline)
        File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
          return future.result()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 209, in create_pipeline
          await asyncio.gather(*futures)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 43, in __call__
          await self.run()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 132, in run
          pb.done += task.result()  # download_count
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 155, in _handle_content_unit
          await asyncio.gather(*downloaders_for_content)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/models.py", line 78, in download
          download_result = await downloader.run(extra_data=self.extra_data)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/base.py", line 212, in run
          return await self._run(extra_data=extra_data)
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/backoff/_async.py", line 131, in retry
          ret = await target(*args, **kwargs)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/http.py", line 183, in _run
          response.raise_for_status()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/aiohttp/client_reqrep.py", line 942, in raise_for_status
          headers=self.headers)
  worker: "/pulp/api/v3/workers/df7e0085-b0dd-4073-b74d-9ab78ad27a03/"
  spawned_tasks: []
  progress_reports:
  - message: Downloading Metadata
    state: completed
    total: 1
    done: 1
  - message: Parsing Metadata Lines
    state: completed
    total: 2
    done: 2
  - message: Downloading Artifacts
    state: failed
    done: 0
  - message: Associating Content
    state: canceled
    done: 0
  created_resources: []
  reserved_resources_record: []
create_version: true
poll_attempts:
  total: 1
  failed: 1

Solution

It would be useful to have an option that causes sync to not fail on download errors, but instead continue and record the errors as non-fatal exceptions somehow.


Files

patch.txt (5.13 KB) patch.txt dalley, 06/25/2021 02:20 AM

Related issues

Related to Pulp - Story #8687: DeclarativeVersion API doesn't accomodate metadata mirroring use casesCLOSED - DUPLICATE

Actions
Has duplicate Pulp - Story #5880: Add ability to be more aggressive in retrying downloads for intermittent networking issuesCLOSED - DUPLICATE

Actions
Has duplicate Pulp - Story #8873: pulp 3 stops when encounters 403 errorCLOSED - DUPLICATE

Actions
Copied to File Support - Test #5467: Test - sync does not report non-fatal errors properlyCLOSED - DUPLICATEActions
Actions #1

Updated by amacdona@redhat.com over 5 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 57
Actions #2

Updated by rchan over 5 years ago

  • Sprint changed from Sprint 57 to Sprint 58
Actions #3

Updated by rchan over 5 years ago

  • Sprint changed from Sprint 58 to Sprint 59
Actions #4

Updated by kersom over 5 years ago

  • Copied to Test #5467: Test - sync does not report non-fatal errors properly added
Actions #5

Updated by dkliban@redhat.com over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
Actions #6

Updated by dkliban@redhat.com over 5 years ago

Pulp currently considers everything above a 400 status code a fatal exception. Which codes should be considered non-fatal? Is it a range of 400-499? a subset of those?

Actions #7

Updated by jsherril@redhat.com over 5 years ago

I was less concerned about the error code and more concerned that a single file in a file repo failing causing the entire repository to fail to sync, which doesn't seem right. I would think it would still sync everything else?

Although i guess you could have a situation where some dependent part of another piece failed to download (such as a docker blob associated with a manifest associated with a tag), which could leave you with a broken repo. However, maybe that is more expected than this behavior.

The idea of a single piece of broken content preventing EVERYTHING else from syncing still seems wrong to me though. How did pulp2 handle it i wonder?

Actions #8

Updated by dkliban@redhat.com over 5 years ago

Pulp 2 appended and error and kept going. A similar pattern is possible in Pulp 3. The ArtifactDownloader stage needs to make use of the Task.append_non_fatal_error() when an exception is raised by a download.

Actions #9

Updated by dkliban@redhat.com over 5 years ago

  • Status changed from ASSIGNED to POST
Actions #10

Updated by bmbouter over 5 years ago

Currently the downloaders raise an exception on 404 errors, and I think those are fatal errors. I believe it's important for the sync machinery to have an exact copy of the data claimed in the RepositoryVersion. There is some retry logic already in the downloaders, but if the server responds via http response saying it doesn't have a file, we don't retry.

I expect 404s to be rare since content in the index should be available. How are you experiencing it? What is the impact if we leave this as NOTABUG or WONTFIX?

Related, we need to make the 404 error friendlier, but that would be a separate piece of work for the fatal exception handler to know about some types.

As an aside, I can't think of a use for non-fatal errors. We should remove the non-fatal exception interface because it's unused and we could add it later.

Actions #11

Updated by jsherril@redhat.com over 5 years ago

bmbouters, i will think about this a bit,

is it also a fatal error if one of the artifact download times out, or has a checksum mismatch?

Actions #12

Updated by dkliban@redhat.com over 5 years ago

Checksum mismatch is a fatal error. I think the connection timeout is a fatal error also (but will have to double check).

We have some retry wiht backoff behavior for when we receive a 429 response code.

Actions #13

Updated by jsherril@redhat.com over 5 years ago

Thinking about this a bit more. Lets say i wanted to mirror all of ansible galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), i can't sync anything. (this was a real thing with puppet modules, where puppet forge had broken modules in their repository).

It also would mean (if i'm thinking clearly), that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.

I still think we need an option to treat these kind of errors as non_fatal.

Actions #14

Updated by bmbouter over 5 years ago

wrote:

Thinking about this a bit more. Lets say i wanted to mirror all of ansible galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), i can't sync anything. (this was a real thing with puppet modules, where puppet forge had broken modules in their repository).

It also would mean (if i'm thinking clearly), that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

I still think we need an option to treat these kind of errors as non_fatal.

Actions #15

Updated by jsherril@redhat.com over 5 years ago

I think thats an interesting idea. As long as its obvious that this problem occurred (via non-fatal errors?), and that re-syncing it would cause it to re-attempt the download of it, I like that solution.

Actions #16

Updated by gmbnomis over 5 years ago

bmbouter wrote:

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

Actions #17

Updated by rchan over 5 years ago

  • Sprint changed from Sprint 59 to Sprint 60
Actions #18

Updated by dkliban@redhat.com over 5 years ago

gmbnomis wrote:

bmbouter wrote:

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

I agree. The user can always choose to sync content on demand. Pulp should not change the policy for a subset of the content.

Actions #19

Updated by bmbouter over 5 years ago

gmbnomis wrote:

bmbouter wrote:

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

gmbnomis, I also agree. What we have now adheres to this expectation.

The non_fatal exceptions part would also be removed since no one uses them as part of this issue: https://pulp.plan.io/issues/5442 Please comment on 5442 if you have thoughts on that.

Actions #20

Updated by bmbouter over 5 years ago

  • Tracker changed from Issue to Story
  • Subject changed from sync does not report non-fatal errors properly to As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion
  • Description updated (diff)
  • Status changed from POST to NEW
  • Assignee deleted (dkliban@redhat.com)
  • % Done set to 0
  • Sprint deleted (Sprint 60)
  • Tags deleted (Katello-P2)

After talking with @jsherrill and @mccune we determined it's ok for 3.0 to fail when policy='immediate'. In the future (maybe 3.1) core will need to provide plugin writers the ability to have their sync's "continue". This needs to be done in coordination with core and plugins.

Since no chnage is needed for now I'm removing from the sprint. @jsherrill lmk if you this isn't accurate.

Actions #21

Updated by bmbouter over 5 years ago

  • Project changed from File Support to Pulp

Moving to core to have core enable this.

Actions #22

Updated by bmbouter about 5 years ago

  • Sprint/Milestone set to 3.1.0
Actions #23

Updated by bmbouter about 5 years ago

  • Tags Katello-P2 added

Identified as P2 at katello checkin meeting by @jsherrill

Actions #24

Updated by daviddavis about 5 years ago

  • Has duplicate Story #5880: Add ability to be more aggressive in retrying downloads for intermittent networking issues added
Actions #25

Updated by daviddavis almost 5 years ago

  • Sprint/Milestone changed from 3.1.0 to 3.2.0
Actions #26

Updated by daviddavis almost 5 years ago

  • Sprint/Milestone changed from 3.2.0 to 3.3.0
Actions #27

Updated by daviddavis almost 5 years ago

  • Sprint/Milestone deleted (3.3.0)
Actions #28

Updated by ggainey over 4 years ago

  • Tags Katello added
  • Tags deleted (Katello-P2)
Actions #29

Updated by dalley over 3 years ago

  • Has duplicate Story #8873: pulp 3 stops when encounters 403 error added
Actions #30

Updated by dalley over 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley
Actions #31

Updated by dalley over 3 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (dalley)

Supporting this properly will likely require refactoring DeclarativeVersion. We need to do that eventually to better support our metadata mirroring feature, so this would be a natural thing to consider along with that work.

Actions #32

Updated by dalley over 3 years ago

  • Related to Story #8687: DeclarativeVersion API doesn't accomodate metadata mirroring use cases added
Actions #33

Updated by dalley over 3 years ago

I'm attaching a partial patch to streamline the process when we pick this back up

Actions #34

Updated by ipanova@redhat.com over 3 years ago

adding also link to the PR so the discussions are also visible https://github.com/pulp/pulpcore/pull/1427

Actions #35

Updated by pulpbot about 3 years ago

  • Description updated (diff)
  • Status changed from NEW to CLOSED - DUPLICATE

Also available in: Atom PDF