Story #7659

[EPIC] As a user, orphan cleanup does not block all other tasks

Added by bmbouter over 3 years ago. Updated over 2 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date: -
Due date: -
% Done: 100%
Estimated time: (Total: 0:00 h)
Platform Release: -
Groomed: Yes
Sprint Candidate: Yes
Tags: -
Sprint: Sprint 107
Quarter: Q2-2021

Description

Background

When orphan cleanup runs, it blocks all tasks submitted after it until all workers become idle, and then executes the orphan cleanup on the resource manager itself. The orphan cleanup operation itself also takes on the order of minutes, which is a non-trivial amount of time.

Problem

On a large Pulp system with many workers running long operations, a user will experience very significant stalling of all newly submitted work. This occurs even when the user is interested in removing a single piece of content. For stakeholders concerned with tasking throughput on large installations, e.g. galaxy_ng, this is not viable.

A simple solution

Have the orphan cleanup run asynchronously and without any locks. This will allow any worker to work on it in parallel with other Pulp task types.

Handling failures

It's a race between another task associating an orphan with a repository version and orphan cleanup deleting an orphan.

In the case orphan_cleanup wins, plugin writers will need to be told they can no longer guarantee that just because content was there a moment ago, it is there now. I expect the exception can just bubble up to the user, and the user can restart the job, at which point code like sync will get it right the second time.

In the case the other task associates the orphan (making it a non-orphan) after orphan_cleanup has identified it as an orphan, we need to ensure the database stops orphan_cleanup from deleting it via on_delete=PROTECT.
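The database-level protection described above can be illustrated with a minimal, self-contained sqlite3 sketch. The table and column names are invented for illustration and are not Pulp's actual schema; `ON DELETE RESTRICT` plays the role that Django's `on_delete=PROTECT` plays at the ORM level: a delete of a referenced row fails at the database.

```python
import sqlite3

# Illustrative demo (not Pulp's schema): deleting a content row that a
# repository version still references is blocked by the database itself.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite requires enabling FK enforcement
conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE repo_content ("
    "  id INTEGER PRIMARY KEY,"
    "  content_id INTEGER REFERENCES content(id) ON DELETE RESTRICT)"
)
conn.execute("INSERT INTO content (id) VALUES (1)")
conn.execute("INSERT INTO repo_content (id, content_id) VALUES (1, 1)")

try:
    # orphan_cleanup losing the race: the unit became a non-orphan first
    conn.execute("DELETE FROM content WHERE id = 1")
    protected = False
except sqlite3.IntegrityError:
    protected = True  # the database refused the delete

print(protected)  # True
```

The same guarantee comes for free in Django once the foreign key is declared with `on_delete=models.PROTECT`, which raises `ProtectedError` instead of a raw integrity error.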

Proposal for complex solution

Add a new field timestamp_of_interest on the artifact and content models

  • This field should be set at the artifact/content creation time
  • This field should be updated whenever we work with the existing artifact/content
  • Add a configurable option preserve_orphaned_content = X seconds/minutes. It can be either a global setting or a parameter to the orphan cleanup call.

It is expected that the artifact will be added to the content (unorphaning the artifact) within X seconds/minutes (TBD) of the timestamp_of_interest. Similarly, content will be added to a repo version (unorphaning the content) also within X seconds/minutes (TBD).
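The lifecycle above can be sketched in a few lines. Every name here is an illustrative assumption rather than Pulp's actual API: `touch` stands in for setting/refreshing timestamp_of_interest, and `PRESERVE_ORPHANED_CONTENT` for the configurable X window.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grace window ("X seconds/minutes"); in the proposal this is
# a global setting or a parameter to the orphan cleanup call.
PRESERVE_ORPHANED_CONTENT = timedelta(minutes=30)


def touch(unit, now=None):
    """Set timestamp_of_interest at creation, refresh it on every use."""
    unit["timestamp_of_interest"] = now or datetime.now(timezone.utc)
    return unit


def within_grace(unit, now=None, grace=PRESERVE_ORPHANED_CONTENT):
    """True while the unit must NOT yet be treated as an orphan."""
    now = now or datetime.now(timezone.utc)
    return now - unit["timestamp_of_interest"] <= grace


t0 = datetime(2021, 6, 1, tzinfo=timezone.utc)
artifact = touch({"digest": "abc"}, now=t0)  # created: timestamp set

print(within_grace(artifact, now=t0 + timedelta(minutes=5)))   # True
print(within_grace(artifact, now=t0 + timedelta(minutes=45)))  # False
```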

Sync Pipeline

This timestamp will be set in the sync pipeline in the QueryExistingArtifacts and QueryExistingContents stages.

When querying for existing artifacts/content and they are found, set the timestamp. If committing this transaction fails because orphan cleanup committed a transaction that removed the objects that had timestamp_of_interest set, retry the exact same transaction; the second time, the removed objects will not be found and the pipeline will proceed normally to re-download/re-create the artifact/content in the later stages.
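The retry rule can be simulated without a database. Everything here is an invented stand-in for illustration (there is no `RowsVanishedError` in Pulp): the first pass fails because orphan cleanup deleted a unit mid-transaction; the identical second pass simply no longer finds it, and the pipeline falls through to re-creating it downstream.

```python
class RowsVanishedError(Exception):
    """Stand-in for the DB error raised when touched rows were deleted."""


def query_and_touch_existing(store, digests):
    """Find existing units and bump their timestamp, retrying exactly once.

    After orphan cleanup commits, deleted units stop matching the existence
    query, so the second attempt proceeds normally.
    """
    for attempt in (1, 2):
        found = [d for d in digests if d in store]
        try:
            for d in found:
                store.touch(d)  # sets timestamp_of_interest in this txn
            return found
        except RowsVanishedError:
            if attempt == 2:
                raise  # give up only if the retry also fails


class FakeStore:
    """In-memory stand-in that loses one unit between query and touch."""

    def __init__(self, units, vanishing):
        self.units = set(units)
        self.vanishing = vanishing
        self.raced = False

    def __contains__(self, digest):
        return digest in self.units

    def touch(self, digest):
        if digest == self.vanishing and not self.raced:
            self.raced = True
            self.units.discard(digest)  # orphan cleanup committed first
            raise RowsVanishedError(digest)


store = FakeStore({"a", "b"}, vanishing="b")
found = query_and_touch_existing(store, ["a", "b"])
print(found)  # ['a'] -- "b" vanished, so the retry no longer sees it
```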

For newly created artifacts and content we need to set this timestamp as well; this way they will be marked as planned-to-be-used (i.e. "I plan to add this artifact to the content" or "I plan to add this content to the repo").

Upload

  • For one-shot upload, handle artifact and content creation in one transaction.
  • For non-one-shot upload, set the timestamp during content/artifact creation, or, if the artifact/content already exists, update the timestamp. It is expected users will make their second call to associate the orphaned artifact within X seconds/minutes (TBD) of the timestamp_of_interest.

Modify

  • Set the timestamp on the content specified in add_content_hrefs to prevent the case where orphan cleanup would remove it in the middle of the running task.

Orphan cleanup logic (orphan clean up task)

The orphan cleanup task will be able to run in parallel without locks.

Remove Content that:

  • has no membership (does not belong to any repo version)
  • has a timestamp_of_interest older than X seconds (it was marked to be used X seconds ago but still does not belong to any repo version)

Remove Artifact that:

  • has no membership (does not belong to any content)
  • has a timestamp_of_interest older than X seconds (it was marked to be used X seconds ago but still does not belong to any content)
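The two rules above combine into a single filter; a minimal sketch with an invented in-memory data model (not Pulp's ORM) makes the conjunction explicit: a unit is removed only if it has no membership AND its timestamp_of_interest has aged past X.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grace window "X"; configurable in the proposal.
X = timedelta(seconds=600)


def select_orphans(units, memberships, now):
    """Return ids of units with no membership whose timestamp is older than X."""
    return [
        u["id"]
        for u in units
        if u["id"] not in memberships
        and now - u["timestamp_of_interest"] > X
    ]


now = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
units = [
    {"id": "c1", "timestamp_of_interest": now - timedelta(hours=1)},    # old, unreferenced
    {"id": "c2", "timestamp_of_interest": now - timedelta(hours=1)},    # old, but still a member
    {"id": "c3", "timestamp_of_interest": now - timedelta(seconds=30)}, # recent: inside grace window
]
print(select_orphans(units, memberships={"c2"}, now=now))  # ['c1']
```

The same shape applies to artifacts, with "membership" meaning "referenced by any content" instead of "in any repo version".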

Under consideration

Some syncs can take a long time to finish; because of that, we could look at the most recent and longest-running tasks before removing orphans.


Sub-issues: 2 (0 open, 2 closed)

Task #8823: Add timestamp_of_interest to Content and Artifact (CLOSED - CURRENTRELEASE, daviddavis)

Task #8824: Convert orphan cleanup into a non-blocking task (CLOSED - CURRENTRELEASE, ipanova@redhat.com)
Actions #1

Updated by bmbouter over 3 years ago

  • Description updated (diff)
Actions #2

Updated by bmbouter over 3 years ago

  • Description updated (diff)
Actions #3

Updated by iballou over 3 years ago

I can't think of any reason why the proposed solution would cause trouble in Katello. We'll just want to make sure there is an obvious path for keeping Pulp data consistent.

Actions #4

Updated by ipanova@redhat.com over 3 years ago

In the case orphan_cleanup wins, plugin writers will need to be told they can no longer guarantee that just because content was there a moment ago, it is there now. I expect the exception can just bubble up to the user, and the user can restart the job, at which point code like sync will get it right the second time.

In this case, my preference would be to extend Stages API to handle this exception behind the scene, re-download the content, rather than asking users to re-run the sync.

However, this does not solve the upload use case: how do we handle the case when content was initially brought into Pulp not by a sync operation but by upload? I don't think telling users we no longer guarantee that just because content was there a moment ago, it is there now will inspire much confidence. Our value proposition is to reliably manage content; this statement contradicts it and worries me, especially when no good recovery path is provided.

Actions #5

Updated by bmbouter over 3 years ago

ipanova@redhat.com wrote:

In this case, my preference would be to extend Stages API to handle this exception behind the scene, re-download the content, rather than asking users to re-run the sync.

I agree with you on this in the long term, but the complexity it brings I don't think is worth it right now. Consider these reasons and let me know if you still think it is.

  1. I think this will make this a plugin-api breaking change, requiring any plugin with a custom pipeline (which is several) to take additional action to port onto it. This would be breaking because I believe to send content from later stages back to earlier stages would require the stages to reference each other.

  2. The level of effort goes up significantly with this requirement when we could deliver this improvement later and keep the changes smaller.

However this does not solve the upload usecase - How do we solve the case when content was initially brought into pulp not by sync operation but upload? I don't think telling we no longer guarantee that just because content was there a moment ago, it is there now will bring a lot of confidence to our users. Our value proposition is to reliably manage content and this statement contradicts to it and it worries me, especially when there is not provided good recovery path.

I acknowledge that this is a step backwards in reliability, but I want to make two points about this.

  1. If we leave the situation as-is, any Pulp system will block all new tasks for a really long time every time. Really long, like hours, because it has to wait for every task on every worker to stop. Some syncs themselves can take hours, not to mention the many operations running on all workers. This is a huge cost for users to pay and I don't think we can justify that user experience.

  2. The upload error case is a race condition, one that would occur I believe rarely and is easily recoverable.

My perspective is that, when I weigh the user experience pain of deleting content (hours, unscalable, and almost unworkable) against the downsides to users (brief and recoverable), this brings overall huge benefits.

Actions #6

Updated by ipanova@redhat.com over 3 years ago

bmbouter wrote:

ipanova@redhat.com wrote:

In this case, my preference would be to extend Stages API to handle this exception behind the scene, re-download the content, rather than asking users to re-run the sync.

I agree with you on this in the long term, but the complexity it brings I don't think is worth it right now. Consider these reasons and let me know if you still think it is.

  1. I think this will make this a plugin-api breaking change, requiring any plugin with a custom pipeline (which is several) to take additional action to port onto it. This would be breaking because I believe to send content from later stages back to earlier stages would require the stages to reference each other.

  2. The level of effort goes up significantly with this requirement when we could deliver this improvement later and keep the changes smaller.

I agree that the current situation with the orphan clean-up is barely acceptable and we need to do something about it. However, regardless of how much effort is invested now or later, this change will remain a breaking one.

However, this does not solve the upload use case: how do we handle the case when content was initially brought into Pulp not by a sync operation but by upload? I don't think telling users we no longer guarantee that just because content was there a moment ago, it is there now will inspire much confidence. Our value proposition is to reliably manage content; this statement contradicts it and worries me, especially when no good recovery path is provided.

I acknowledge that this is a step backwards in reliability, but I want to make two points about this.

  1. If we leave the situation as-is, any Pulp system will block all new tasks for a really long time every time. Really long, like hours, because it has to wait for every task on every worker to stop. Some syncs themselves can take hours, not to mention the many operations running on all workers. This is a huge cost for users to pay and I don't think we can justify that user experience.

  2. The upload error case is a race condition, one that would occur I believe rarely and is easily recoverable.

My perspective is that, when I weigh the user experience pain of deleting content (hours, unscalable, and almost unworkable) against the downsides to users (brief and recoverable), this brings overall huge benefits.

It is hard to argue against the above-mentioned points, but I still feel uneasy about taking a step back on reliability. I would love to hear what other people think.

Actions #7

Updated by ipanova@redhat.com over 3 years ago

  • Description updated (diff)
Actions #8

Updated by ipanova@redhat.com over 3 years ago

  • Description updated (diff)
Actions #9

Updated by daviddavis over 3 years ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes
Actions #10

Updated by daviddavis about 3 years ago

  • Quarter set to Q1-2021
Actions #11

Updated by ipanova@redhat.com almost 3 years ago

  • Description updated (diff)
Actions #12

Updated by daviddavis almost 3 years ago

  • Quarter changed from Q1-2021 to Q2-2021
Actions #13

Updated by daviddavis almost 3 years ago

  • Description updated (diff)
Actions #14

Updated by daviddavis almost 3 years ago

  • Description updated (diff)
Actions #15

Updated by daviddavis almost 3 years ago

  • Sprint set to Sprint 95
Actions #16

Updated by daviddavis almost 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis
Actions #17

Updated by daviddavis almost 3 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (daviddavis)
Actions #18

Updated by daviddavis almost 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis
Actions #19

Updated by rchan almost 3 years ago

  • Sprint changed from Sprint 95 to Sprint 96
Actions #20

Updated by rchan almost 3 years ago

  • Sprint changed from Sprint 96 to Sprint 97
Actions #21

Updated by daviddavis almost 3 years ago

if one shot upload - handle in 1 transaction artifact and content creation.

This is impossible (at least not without some major refactoring). Currently, artifact creation is handled in the viewset while content creation is handled by a task.

That said I don't think it's a big deal. Worst case scenario the content creation fails and the user just has to adjust the preservation setting and retry it.

Actions #22

Updated by bmbouter almost 3 years ago

daviddavis wrote:

if one shot upload - handle in 1 transaction artifact and content creation.

This is impossible (at least not without some major refactoring). Currently, artifact creation is handled in the viewset while content creation is handled by a task.

That said I don't think it's a big deal. Worst case scenario the content creation fails and the user just has to adjust the preservation setting and retry it.

In looking at those links, I agree with your conclusion. I also think it's not a very big deal either.

Actions #23

Updated by ipanova@redhat.com almost 3 years ago

daviddavis wrote:

if one shot upload - handle in 1 transaction artifact and content creation.

This is impossible (at least not without some major refactoring). Currently, artifact creation is handled in the viewset while content creation is handled by a task.

That said I don't think it's a big deal. Worst case scenario the content creation fails and the user just has to adjust the preservation setting and retry it.

+1.

Actions #24

Updated by daviddavis almost 3 years ago

We use the general_create task to create content for one-shot uploads which basically sets up a subclass of SingleArtifactContentUploadSerializer and calls save() on it.

I'm not sure exactly where it's best to update the timestamp if there's existing content. I'm imagining I could update SingleArtifactContentUploadSerializer, but I am not sure how to do this in a general way that works for all plugins. For one thing, consider the pulp_rpm package case, where it's extracting information from the artifact file and then raising a general ValidationError.

Actions #25

Updated by mdellweg almost 3 years ago

I see what you are saying. In the plugin code where we see the package with exists() we'd need to bump its time of interest.

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

Actions #26

Updated by ipanova@redhat.com almost 3 years ago

mdellweg wrote:

I see what you are saying. In the plugin code where we see the package with exists() we'd need to bump its time of interest.

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

Yeah.. this is plugin-specific; I am tempted to say that it should be the plugin's responsibility to add the content timestamp for the one-shot upload case. I also don't see a generic way to do it in core :/

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

Right, it depends whether you have provided the repository you want the package to be added to https://github.com/pulp/pulpcore/blob/aed355fbc380a13a4a767e2627cbd54c7ff8f107/pulpcore/plugin/viewsets/content.py#L85

Actions #27

Updated by ipanova@redhat.com almost 3 years ago

We will probably fall into the same trap with /modify for the plugins that define their own endpoints https://github.com/pulp/pulp_container/blob/master/pulp_container/app/viewsets.py#L592

Actions #28

Updated by daviddavis almost 3 years ago

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

That sounds like a possible idea. We could notify plugins that they need to start returning the content unit instead of raising an exception and I can have core always bump the timestamp.

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

This is true.

I am tempted to say that it should be the plugin's responsibility to add the content timestamp for the one-shot upload case.

That sounds like a good idea to me too.

We will probably fall into the same trap with /modify for the plugins that define their own endpoints

Yes, good point.

Actions #29

Updated by ipanova@redhat.com almost 3 years ago

daviddavis wrote:

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

That sounds like a possible idea. We could notify plugins that they need to start returning the content unit instead of raising an exception and I can have core always bump the timestamp.

Here is a ticket related to this https://pulp.plan.io/issues/7114#note-1

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

This is true.

I am tempted to say that it should be the plugin's responsibility to add the content timestamp for the one-shot upload case.

That sounds like a good idea to me too.

How do we proceed? It does not feel 100% correct to roll out the non-blocking orphan clean-up without the upload part, and it seems like (ideally) we first need to adjust the upload workflow in plugins so it can facilitate our needs in core. Or do we roll out this change the way it is and basically tell the plugins to take care of upload and other modify-like endpoints, otherwise they risk losing orphaned artifacts/content without the timestamp the moment orphan cleanup is triggered?

We will probably fall into the same trap with /modify for the plugins that define their own endpoints

Yes, good point.

Also, I have noticed that this ticket does not capture the configurable option for X seconds/minutes to keep the orphans. I recall this has been discussed and counter-proposed to the hard-coded value. Should we add it?

Actions #30

Updated by daviddavis almost 3 years ago

  • Subject changed from As a user, orphan cleanup does not block all other tasks to [EPIC] As a user, orphan cleanup does not block all other tasks
Actions #31

Updated by rchan almost 3 years ago

  • Sprint changed from Sprint 97 to Sprint 98
Actions #32

Updated by ipanova@redhat.com almost 3 years ago

  • Sprint/Milestone set to Content/disk space management
Actions #33

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 98 to Sprint 99
Actions #34

Updated by ipanova@redhat.com over 2 years ago

  • Sprint/Milestone changed from Content/disk space management to 3.15.0
Actions #35

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 99 to Sprint 100
Actions #36

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 100 to Sprint 101
Actions #37

Updated by ipanova@redhat.com over 2 years ago

  • Sprint changed from Sprint 101 to Sprint 102
Actions #38

Updated by daviddavis over 2 years ago

  • Status changed from ASSIGNED to NEW
Actions #39

Updated by daviddavis over 2 years ago

  • Assignee deleted (daviddavis)
Actions #40

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 102 to Sprint 103
Actions #41

Updated by daviddavis over 2 years ago

  • Sprint/Milestone deleted (3.15.0)

Removing from 3.15 since the 3.15 blocker subtasks are done.

Actions #42

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 103 to Sprint 104
Actions #43

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 104 to Sprint 105
Actions #44

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 105 to Sprint 106
Actions #45

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 106 to Sprint 107
Actions #46

Updated by ipanova@redhat.com over 2 years ago

  • Status changed from NEW to CLOSED - CURRENTRELEASE
