
Story #5286

As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion

Added by jsherril@redhat.com 6 months ago. Updated 14 days ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

0%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
No
Sprint Candidate:
No
Tags:
Katello-P2
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:

Description

Problem

If you sync a file repository where one of the files is missing, the repository syncs as much as it can (as expected), but the result is reported as a fatal error, with a task state of 'failed'.

This is especially problematic for remote repos whose maintainer you cannot contact: they have the content you want, but not all of it is available (because the repo is incomplete or corrupted).

Steps to reproduce:

1) create a file repository where one of the files is missing
2) create a file remote and repository and sync them
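For step 1, a quick way to build such a repository is to hand-write a PULP_MANIFEST that lists a file which is never written to disk. This is only a sketch: the manifest format (one `relative_path,digest,size` line per file) is what pulp_file fixtures use, and all file names and paths here are made up.

```python
# Sketch: build a file repository whose PULP_MANIFEST references a file
# that is absent on disk, so a sync will hit a 404 for it.
import hashlib
import os
import tempfile

repo_dir = tempfile.mkdtemp(prefix="broken-file-repo-")

data = b"hello\n"
with open(os.path.join(repo_dir, "present.txt"), "wb") as f:
    f.write(data)

digest = hashlib.sha256(data).hexdigest()
lines = [
    f"present.txt,{digest},{len(data)}",
    # Listed in the manifest but never written to disk -> 404 on sync.
    f"missing.txt,{'0' * 64},1",
]
with open(os.path.join(repo_dir, "PULP_MANIFEST"), "w") as f:
    f.write("\n".join(lines) + "\n")

# Serve repo_dir (e.g. `python3 -m http.server` from that directory),
# then point a file remote at http://<host>:8000/PULP_MANIFEST and sync.
```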

Actual task status (apologies, it's been YAML-fied):

- _href: "/pulp/api/v3/tasks/b6f9b619-c174-4e43-b546-0bbefdfb11e7/"
  _created: '2019-08-15T15:21:37.058+00:00'
  state: failed
  name: pulp_file.app.tasks.synchronizing.synchronize
  started_at: '2019-08-15T15:21:37.177+00:00'
  finished_at: '2019-08-15T15:21:37.382+00:00'
  non_fatal_errors: "[]"
  error:
    code: ''
    description: 404, message='Not Found'
    traceback: |2
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/worker.py", line 822, in perform_job
          rv = job.perform()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in perform
          self._result = self._execute()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 611, in _execute
          return self.func(*self.args, **self.kwargs)
        File "/usr/local/lib/pulp/src/pulp-file/pulp_file/app/tasks/synchronizing.py", line 45, in synchronize
          dv.create()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/declarative_version.py", line 169, in create
          loop.run_until_complete(pipeline)
        File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
          return future.result()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 209, in create_pipeline
          await asyncio.gather(*futures)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 43, in __call__
          await self.run()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 132, in run
          pb.done += task.result()  # download_count
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 155, in _handle_content_unit
          await asyncio.gather(*downloaders_for_content)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/models.py", line 78, in download
          download_result = await downloader.run(extra_data=self.extra_data)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/base.py", line 212, in run
          return await self._run(extra_data=extra_data)
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/backoff/_async.py", line 131, in retry
          ret = await target(*args, **kwargs)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/http.py", line 183, in _run
          response.raise_for_status()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/aiohttp/client_reqrep.py", line 942, in raise_for_status
          headers=self.headers)
  worker: "/pulp/api/v3/workers/df7e0085-b0dd-4073-b74d-9ab78ad27a03/"
  spawned_tasks: []
  progress_reports:
  - message: Downloading Metadata
    state: completed
    total: 1
    done: 1
  - message: Parsing Metadata Lines
    state: completed
    total: 2
    done: 2
  - message: Downloading Artifacts
    state: failed
    done: 0
  - message: Associating Content
    state: canceled
    done: 0
  created_resources: []
  reserved_resources_record: []
create_version: true
poll_attempts:
  total: 1
  failed: 1

Solution

It would be useful to have an option that causes a sync not to fail on download errors, but instead to continue and record the errors as non-fatal exceptions somehow.
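As a sketch of what such an option could look like, the loop below records download failures and continues instead of raising. None of these names (`skip_unavailable`, `download`, `sync`) are real Pulp API; they only illustrate "record and continue" versus today's fail-fast behavior.

```python
# Minimal sketch of the proposed behavior, assuming a hypothetical
# skip_unavailable flag. Not Pulp's actual sync machinery.
import asyncio

class DownloadError(Exception):
    pass

async def download(url):
    # Stand-in for a real downloader; pretend one URL 404s.
    if "missing" in url:
        raise DownloadError(f"404, message='Not Found': {url}")
    return url

async def sync(urls, skip_unavailable=False):
    downloaded, non_fatal_errors = [], []
    for url in urls:
        try:
            downloaded.append(await download(url))
        except DownloadError as exc:
            if not skip_unavailable:
                raise  # current behavior: one 404 fails the whole sync
            non_fatal_errors.append(str(exc))  # proposed: record and continue
    return downloaded, non_fatal_errors

urls = ["repo/a.txt", "repo/missing.txt", "repo/b.txt"]
got, errors = asyncio.run(sync(urls, skip_unavailable=True))
```

With the flag set, the sync completes with the reachable content and one recorded non-fatal error; without it, the first `DownloadError` propagates and fails the task, matching the traceback above.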


Related issues

Duplicated by Pulp - Story #5880: Add ability to be more aggressive in retrying downloads for intermittent networking issues CLOSED - DUPLICATE Actions
Copied to File Support - Test #5467: Test - sync does not report non-fatal errors properly NEW Actions

History

#1 Updated by amacdona@redhat.com 6 months ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 57

#2 Updated by rchan 6 months ago

  • Sprint changed from Sprint 57 to Sprint 58

#3 Updated by rchan 5 months ago

  • Sprint changed from Sprint 58 to Sprint 59

#4 Updated by kersom 5 months ago

  • Copied to Test #5467: Test - sync does not report non-fatal errors properly added

#5 Updated by dkliban@redhat.com 5 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com

#6 Updated by dkliban@redhat.com 5 months ago

Pulp currently considers every status code of 400 and above a fatal exception. Which codes should be considered non-fatal? The range 400-499? A subset of those?

#7 Updated by jsherril@redhat.com 5 months ago

I was less concerned about the error code and more concerned that a single file failing in a file repo causes the entire repository sync to fail, which doesn't seem right. I would think it would still sync everything else?

Although I guess you could have a situation where some dependent part of another piece failed to download (such as a Docker blob associated with a manifest associated with a tag), which could leave you with a broken repo. However, maybe that is more expected than this behavior.

The idea of a single piece of broken content preventing EVERYTHING else from syncing still seems wrong to me, though. How did Pulp 2 handle it, I wonder?

#8 Updated by dkliban@redhat.com 5 months ago

Pulp 2 appended an error and kept going. A similar pattern is possible in Pulp 3. The ArtifactDownloader stage needs to make use of Task.append_non_fatal_error() when an exception is raised by a download.

#9 Updated by dkliban@redhat.com 5 months ago

  • Status changed from ASSIGNED to POST

#10 Updated by bmbouter 5 months ago

Currently the downloaders raise an exception on 404 errors, and I think those are fatal errors. I believe it's important for the sync machinery to have an exact copy of the data claimed in the RepositoryVersion. There is some retry logic already in the downloaders, but if the server responds via http response saying it doesn't have a file, we don't retry.

I expect 404s to be rare since content in the index should be available. How are you experiencing it? What is the impact if we leave this as NOTABUG or WONTFIX?

Related, we need to make the 404 error friendlier, but that would be a separate piece of work for the fatal exception handler to know about some types.

As an aside, I can't think of a use for non-fatal errors. We should remove the non-fatal exception interface because it's unused and we could add it later.

#11 Updated by jsherril@redhat.com 5 months ago

bmbouters, I will think about this a bit.

Is it also a fatal error if one of the artifact downloads times out, or has a checksum mismatch?

#12 Updated by dkliban@redhat.com 5 months ago

A checksum mismatch is a fatal error. I think a connection timeout is a fatal error also (but I will have to double-check).

We have some retry-with-backoff behavior for when we receive a 429 response code.
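That retry-with-backoff behavior can be sketched as below. This is a hand-rolled illustration, not Pulp's downloader (which uses the `backoff` library, per the traceback in the description); treating only 429 as retryable, and the delay values, are assumptions for the example.

```python
# Illustration: retry transient HTTP errors (like 429) with exponential
# backoff, while a 404 fails immediately. Delays shortened for the demo.
import time

RETRYABLE = {429}  # assumed set of retryable status codes

class HTTPError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def fetch_with_backoff(fetch, max_tries=4, base_delay=0.01):
    for attempt in range(max_tries):
        try:
            return fetch()
        except HTTPError as exc:
            if exc.status not in RETRYABLE or attempt == max_tries - 1:
                raise  # non-retryable (e.g. 404) or retries exhausted
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# A fake server that rate-limits the first two requests, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPError(429)
    return "content"

result = fetch_with_backoff(flaky_fetch)
```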

#13 Updated by jsherril@redhat.com 5 months ago

Thinking about this a bit more: let's say I wanted to mirror all of Ansible Galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), I can't sync anything. (This was a real thing with Puppet modules, where Puppet Forge had broken modules in their repository.)

It also would mean (if I'm thinking clearly) that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.

I still think we need an option to treat these kinds of errors as non-fatal.

#14 Updated by bmbouter 5 months ago

jsherril@redhat.com wrote:

Thinking about this a bit more: let's say I wanted to mirror all of Ansible Galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), I can't sync anything. (This was a real thing with Puppet modules, where Puppet Forge had broken modules in their repository.)

It also would mean (if I'm thinking clearly) that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.

Actually this got me thinking that in cases where a 404 is returned, sync could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?

I still think we need an option to treat these kinds of errors as non-fatal.

#15 Updated by jsherril@redhat.com 5 months ago

I think that's an interesting idea. As long as it's obvious that this problem occurred (via non-fatal errors?), and re-syncing would cause it to re-attempt the download, I like that solution.

#16 Updated by gmbnomis 5 months ago

bmbouter wrote:

Actually this got me thinking that in cases where a 404 is returned, sync could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

#17 Updated by rchan 5 months ago

  • Sprint changed from Sprint 59 to Sprint 60

#18 Updated by dkliban@redhat.com 5 months ago

gmbnomis wrote:

bmbouter wrote:

Actually this got me thinking that in cases where a 404 is returned, sync could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

I agree. The user can always choose to sync content on demand. Pulp should not change the policy for a subset of the content.

#19 Updated by bmbouter 5 months ago

gmbnomis wrote:

bmbouter wrote:

Actually this got me thinking that in cases where a 404 is returned, sync could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

@gmbnomis, I also agree. What we have now adheres to this expectation.

Separately, the non-fatal exceptions interface would be removed since no one uses it: https://pulp.plan.io/issues/5442 Please comment on 5442 if you have thoughts on that.

#20 Updated by bmbouter 5 months ago

  • Tracker changed from Issue to Story
  • Subject changed from sync does not report non-fatal errors properly to As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion
  • Description updated (diff)
  • Status changed from POST to NEW
  • Assignee deleted (dkliban@redhat.com)
  • % Done set to 0
  • Sprint deleted (Sprint 60)
  • Tags deleted (Katello-P2)

After talking with @jsherrill and @mccune we determined it's OK for 3.0 to fail when policy='immediate'. In the future (maybe 3.1) core will need to provide plugin writers the ability to have their syncs "continue". This needs to be done in coordination with core and plugins.

Since no change is needed for now, I'm removing this from the sprint. @jsherrill, let me know if this isn't accurate.

#21 Updated by bmbouter 5 months ago

  • Project changed from File Support to Pulp

Moving to core to have core enable this.

#22 Updated by bmbouter about 1 month ago

  • Sprint/Milestone set to 3.1.0

#23 Updated by bmbouter about 1 month ago

  • Tags Katello-P2 added

Identified as P2 at katello checkin meeting by @jsherrill

#24 Updated by daviddavis about 1 month ago

  • Duplicated by Story #5880: Add ability to be more aggressive in retrying downloads for intermittent networking issues added

#25 Updated by daviddavis 14 days ago

  • Sprint/Milestone changed from 3.1.0 to 3.2.0
