Story #5286
closedAs a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion
0%
Description
Ticket moved to GitHub: "pulp/pulpcore/1843":https://github.com/pulp/pulpcore/issues/1843
Problem¶
If you sync a file repository where one of the files is missing, it seems that the repository syncs as much as it can (as expected), but its reported as a fatal error, with a state of 'failed'.
This is especially problematic for remote repos that you cannot contact the maintainer, have the content you want, but don't have all content available (because it's an incomplete or corrupted repo).
Steps to reproduce:
1) create a file repository where one of the files is missing
2) create a file remote and repository and sync them
Actual task status (apologies its been yaml-fied):
- _href: "/pulp/api/v3/tasks/b6f9b619-c174-4e43-b546-0bbefdfb11e7/"
_created: '2019-08-15T15:21:37.058+00:00'
state: failed
name: pulp_file.app.tasks.synchronizing.synchronize
started_at: '2019-08-15T15:21:37.177+00:00'
finished_at: '2019-08-15T15:21:37.382+00:00'
non_fatal_errors: "[]"
error:
code: ''
description: 404, message='Not Found'
traceback: |2
File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/worker.py", line 822, in perform_job
rv = job.perform()
File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in perform
self._result = self._execute()
File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 611, in _execute
return self.func(*self.args, **self.kwargs)
File "/usr/local/lib/pulp/src/pulp-file/pulp_file/app/tasks/synchronizing.py", line 45, in synchronize
dv.create()
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/declarative_version.py", line 169, in create
loop.run_until_complete(pipeline)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 209, in create_pipeline
await asyncio.gather(*futures)
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 43, in __call__
await self.run()
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 132, in run
pb.done += task.result() # download_count
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 155, in _handle_content_unit
await asyncio.gather(*downloaders_for_content)
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/models.py", line 78, in download
download_result = await downloader.run(extra_data=self.extra_data)
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/base.py", line 212, in run
return await self._run(extra_data=extra_data)
File "/usr/local/lib/pulp/lib64/python3.6/site-packages/backoff/_async.py", line 131, in retry
ret = await target(*args, **kwargs)
File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/http.py", line 183, in _run
response.raise_for_status()
File "/usr/local/lib/pulp/lib64/python3.6/site-packages/aiohttp/client_reqrep.py", line 942, in raise_for_status
headers=self.headers)
worker: "/pulp/api/v3/workers/df7e0085-b0dd-4073-b74d-9ab78ad27a03/"
spawned_tasks: []
progress_reports:
- message: Downloading Metadata
state: completed
total: 1
done: 1
- message: Parsing Metadata Lines
state: completed
total: 2
done: 2
- message: Downloading Artifacts
state: failed
done: 0
- message: Associating Content
state: canceled
done: 0
created_resources: []
reserved_resources_record: []
create_version: true
poll_attempts:
total: 1
failed: 1
Solution¶
It would be useful to have an option that causes sync to not fail on download errors, but instead continue and record the errors as non-fatal exceptions somehow.
Files
Related issues
Updated by amacdona@redhat.com over 5 years ago
- Triaged changed from No to Yes
- Sprint set to Sprint 57
Updated by kersom over 5 years ago
- Copied to Test #5467: Test - sync does not report non-fatal errors properly added
Updated by dkliban@redhat.com about 5 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
Updated by dkliban@redhat.com about 5 years ago
Pulp currently considers everything above a 400 status code a fatal exception. Which codes should be considered non-fatal? Is it a range of 400-499? a subset of those?
Updated by jsherril@redhat.com about 5 years ago
I was less concerned about the error code and more concerned that a single file in a file repo failing causing the entire repository to fail to sync, which doesn't seem right. I would think it would still sync everything else?
Although i guess you could have a situation where some dependent part of another piece failed to download (such as a docker blob associated with a manifest associated with a tag), which could leave you with a broken repo. However, maybe that is more expected than this behavior.
The idea of a single piece of broken content preventing EVERYTHING else from syncing still seems wrong to me though. How did pulp2 handle it i wonder?
Updated by dkliban@redhat.com about 5 years ago
Pulp 2 appended and error and kept going. A similar pattern is possible in Pulp 3. The ArtifactDownloader stage needs to make use of the Task.append_non_fatal_error() when an exception is raised by a download.
Updated by dkliban@redhat.com about 5 years ago
- Status changed from ASSIGNED to POST
Updated by bmbouter about 5 years ago
Currently the downloaders raise an exception on 404 errors, and I think those are fatal errors. I believe it's important for the sync machinery to have an exact copy of the data claimed in the RepositoryVersion. There is some retry logic already in the downloaders, but if the server responds via http response saying it doesn't have a file, we don't retry.
I expect 404s to be rare since content in the index should be available. How are you experiencing it? What is the impact if we leave this as NOTABUG or WONTFIX?
Related, we need to make the 404 error friendlier, but that would be a separate piece of work for the fatal exception handler to know about some types.
As an aside, I can't think of a use for non-fatal errors. We should remove the non-fatal exception interface because it's unused and we could add it later.
Updated by jsherril@redhat.com about 5 years ago
bmbouters, i will think about this a bit,
is it also a fatal error if one of the artifact download times out, or has a checksum mismatch?
Updated by dkliban@redhat.com about 5 years ago
Checksum mismatch is a fatal error. I think the connection timeout is a fatal error also (but will have to double check).
We have some retry wiht backoff behavior for when we receive a 429 response code.
Updated by jsherril@redhat.com about 5 years ago
Thinking about this a bit more. Lets say i wanted to mirror all of ansible galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), i can't sync anything. (this was a real thing with puppet modules, where puppet forge had broken modules in their repository).
It also would mean (if i'm thinking clearly), that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.
I still think we need an option to treat these kind of errors as non_fatal.
Updated by bmbouter about 5 years ago
jsherril@redhat.com wrote:
Thinking about this a bit more. Lets say i wanted to mirror all of ansible galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), i can't sync anything. (this was a real thing with puppet modules, where puppet forge had broken modules in their repository).
It also would mean (if i'm thinking clearly), that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.
Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?
I still think we need an option to treat these kind of errors as non_fatal.
Updated by jsherril@redhat.com about 5 years ago
I think thats an interesting idea. As long as its obvious that this problem occurred (via non-fatal errors?), and that re-syncing it would cause it to re-attempt the download of it, I like that solution.
Updated by gmbnomis about 5 years ago
bmbouter wrote:
Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?
Just my two cents: I also thought about proposing this behavior and quickly rejected it.
If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.
Updated by dkliban@redhat.com about 5 years ago
gmbnomis wrote:
bmbouter wrote:
Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?
Just my two cents: I also thought about proposing this behavior and quickly rejected it.
If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.
I agree. The user can always choose to sync content on demand. Pulp should not change the policy for a subset of the content.
Updated by bmbouter about 5 years ago
gmbnomis wrote:
bmbouter wrote:
Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?
Just my two cents: I also thought about proposing this behavior and quickly rejected it.
If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.
gmbnomis, I also agree. What we have now adheres to this expectation.
The non_fatal exceptions part would also be removed since no one uses them as part of this issue: https://pulp.plan.io/issues/5442 Please comment on 5442 if you have thoughts on that.
Updated by bmbouter about 5 years ago
- Tracker changed from Issue to Story
- Subject changed from sync does not report non-fatal errors properly to As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion
- Description updated (diff)
- Status changed from POST to NEW
- Assignee deleted (
dkliban@redhat.com) - % Done set to 0
- Sprint deleted (
Sprint 60) - Tags deleted (
Katello-P2)
After talking with @jsherrill and @mccune we determined it's ok for 3.0 to fail when policy='immediate'. In the future (maybe 3.1) core will need to provide plugin writers the ability to have their sync's "continue". This needs to be done in coordination with core and plugins.
Since no chnage is needed for now I'm removing from the sprint. @jsherrill lmk if you this isn't accurate.
Updated by bmbouter about 5 years ago
- Project changed from File Support to Pulp
Moving to core to have core enable this.
Updated by bmbouter almost 5 years ago
- Tags Katello-P2 added
Identified as P2 at katello checkin meeting by @jsherrill
Updated by daviddavis almost 5 years ago
- Has duplicate Story #5880: Add ability to be more aggressive in retrying downloads for intermittent networking issues added
Updated by daviddavis almost 5 years ago
- Sprint/Milestone changed from 3.1.0 to 3.2.0
Updated by daviddavis almost 5 years ago
- Sprint/Milestone changed from 3.2.0 to 3.3.0
Updated by ggainey over 4 years ago
- Tags Katello added
- Tags deleted (
Katello-P2)
Updated by dalley over 3 years ago
- Has duplicate Story #8873: pulp 3 stops when encounters 403 error added
Updated by dalley over 3 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by dalley over 3 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (
dalley)
Supporting this properly will likely require refactoring DeclarativeVersion. We need to do that eventually to better support our metadata mirroring feature, so this would be a natural thing to consider along with that work.
Updated by dalley over 3 years ago
- Related to Story #8687: DeclarativeVersion API doesn't accomodate metadata mirroring use cases added
Updated by ipanova@redhat.com over 3 years ago
adding also link to the PR so the discussions are also visible https://github.com/pulp/pulpcore/pull/1427
Updated by pulpbot almost 3 years ago
- Description updated (diff)
- Status changed from NEW to CLOSED - DUPLICATE