Story #7659

[EPIC] As a user, orphan cleanup does not block all other tasks

Added by bmbouter 10 months ago. Updated 16 days ago.

Status:
ASSIGNED
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

33%

Estimated time:
(Total: 0:00 h)
Platform Release:
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Sprint:
Sprint 101
Quarter:
Q2-2021

Description

Background

When orphan cleanup runs, it blocks all tasks submitted after it until every worker becomes idle, and then the orphan cleanup executes on the resource manager itself. Orphan cleanup is also a non-trivial operation in its own right, taking on the order of minutes.

Problem

On a large Pulp system with many workers running long operations, a user will experience very significant stalling of all newly submitted work. This occurs even when the user only wants to remove a single piece of content. For stakeholders concerned with tasking throughput on large installations, e.g. galaxy_ng, this is not viable.

A simple solution

Have the orphan cleanup run asynchronously and without any locks. This will allow any worker to work on it in parallel with other Pulp task types.

Handling failures

It's a race between another task associating an orphan with a repository version and orphan cleanup deleting an orphan.

In the case orphan_cleanup wins, plugin writers will need to be told they can no longer guarantee that just because content was there a moment ago, it is there now. I expect the exception can just bubble up to the user, and the user can restart the job, at which point code like sync will get it right the second time.

In the case the other task associates the orphan, making it a non-orphan, after orphan_cleanup has identified it as an orphan, we need to ensure the database stops orphan_cleanup from deleting it via on_delete=PROTECT.
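The database-level protection described above can be demonstrated without Django at all. The sketch below uses an in-memory SQLite database with a RESTRICT foreign key (the same effect Django's on_delete=PROTECT produces): once a repo-version row references a content row, deleting that content row raises an integrity error instead of silently removing it. Table and column names here are illustrative, not Pulp's actual schema.

```python
import sqlite3

# In-memory database; foreign key enforcement is off by default in SQLite
# and must be enabled per connection.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Illustrative schema: a content table and a membership table whose
# foreign key refuses deletion of a still-referenced content row,
# analogous to Django's on_delete=PROTECT.
conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE repo_version_content ("
    "  content_id INTEGER REFERENCES content(id) ON DELETE RESTRICT)"
)
conn.execute("INSERT INTO content (id) VALUES (1)")
conn.execute("INSERT INTO repo_version_content (content_id) VALUES (1)")

try:
    # Orphan cleanup racing against the association: the unit is no
    # longer an orphan, so the delete must fail.
    conn.execute("DELETE FROM content WHERE id = 1")
    protected = False
except sqlite3.IntegrityError:
    protected = True  # the database refused to delete a non-orphan
```

This is exactly the backstop the proposal relies on: even if orphan cleanup's candidate list is stale, the final DELETE cannot remove content that gained a membership in the meantime.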

Proposal for complex solution

Add a new field timestamp_of_interest on the artifact and content models:

  • This field should be set at artifact/content creation time
  • This field should be updated whenever we work with an existing artifact/content
  • Add a configurable option preserve_orphaned_content = X seconds/minutes. It can be either a global setting or a parameter to the orphan cleanup call.

It is expected that an artifact will be added to content (unorphaning the artifact) within X seconds/minutes (TBD) of its timestamp_of_interest. Similarly, content will be added to a repo version (unorphaning the content) also within X seconds/minutes (TBD).
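The grace-window behavior above can be sketched in plain Python. This is an illustration only: Unit stands in for an Artifact/Content row, and PRESERVE_ORPHANED_CONTENT for the proposed preserve_orphaned_content setting; the real names, value, and ORM plumbing are TBD.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical value for the proposed preserve_orphaned_content setting.
PRESERVE_ORPHANED_CONTENT = timedelta(minutes=5)

@dataclass
class Unit:
    """Stand-in for an Artifact or Content row with the proposed field."""
    timestamp_of_interest: datetime

def touch(unit: Unit, now: datetime) -> None:
    """Bump the timestamp whenever we work with an existing unit."""
    unit.timestamp_of_interest = now

def within_grace(unit: Unit, now: datetime) -> bool:
    """True while the unit is inside the preservation window, meaning
    cleanup must keep it even if it currently has no membership."""
    return now - unit.timestamp_of_interest < PRESERVE_ORPHANED_CONTENT
```

The key point is that any workflow touching an existing unit (sync, upload, modify) resets the window, so only units nobody has shown interest in for X seconds/minutes become eligible for removal.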

Sync Pipeline

The timestamp_of_interest will be set in the sync pipeline, in the QueryExistingArtifacts and QueryExistingContents stages.

When querying for existing artifacts/content and they are found, set the timestamp. If committing this transaction fails because orphan cleanup committed a transaction that removed the objects whose timestamp_of_interest was being set, retry the exact same transaction. The second time, the removed objects will not be found, and the pipeline will proceed normally to re-download/re-create the artifact/content in the later stages.

For newly created artifacts and content we need to set this timestamp as well; this way they are marked as planned-to-be-used (i.e. "I plan to add this artifact to the content" or "I plan to add this content to the repo").
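The retry described above can be sketched as a small wrapper. ConflictError is a stand-in for whatever integrity error the database raises when orphan cleanup deletes rows out from under the query's transaction; a single retry re-runs the identical query, and the deleted units simply no longer show up as "existing".

```python
class ConflictError(Exception):
    """Stand-in for the DB integrity error raised when orphan cleanup
    removed the rows this transaction had just timestamped."""

def query_existing_with_retry(query_and_touch):
    """Run the query-and-set-timestamp transaction, retrying once.

    query_and_touch is a callable representing one attempt at the
    QueryExisting* stage's transaction.
    """
    try:
        return query_and_touch()
    except ConflictError:
        # Second attempt: units removed by orphan cleanup are simply not
        # found, so the pipeline falls through to re-download/re-create
        # them in the later stages.
        return query_and_touch()
```

Because the second attempt is the exact same query, no pipeline restructuring is needed: a lost race just degrades to "unit not found", which the existing stages already handle.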

Upload

  • If one-shot upload: handle artifact and content creation in one transaction.
  • For non-one-shot upload, set the timestamp during content/artifact creation, or, if the artifact/content already exists, update the timestamp. It is expected users will make their second call to associate the orphaned artifact within X seconds/minutes (TBD) of timestamp_of_interest.

Modify

  • Set the timestamp on the content specified in add_content_hrefs to prevent orphan cleanup from removing it in the middle of the running task.

Orphan cleanup logic (orphan clean up task)

Will be able to run in parallel without locks.

Remove Content that:

  • has no membership (does not belong to any repo version)
  • has a timestamp_of_interest older than X seconds (it was marked to be used X seconds ago but still does not belong to any repo version)

Remove Artifact that:

  • has no membership (does not belong to any content)
  • has a timestamp_of_interest older than X seconds (it was marked to be used X seconds ago but still does not belong to any content)
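The selection criteria above can be sketched over plain dicts (the real task would of course express this as database queries). Both conditions must hold: no membership AND a timestamp_of_interest older than the grace window X. The function and argument names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def select_orphans(timestamps, memberships, now, grace):
    """Pick unit ids eligible for removal.

    timestamps:  {unit_id: timestamp_of_interest}
    memberships: set of unit_ids that belong to a repo version (for
                 Content) or to a Content unit (for Artifacts)
    grace:       the preserve_orphaned_content window X
    """
    return sorted(
        uid for uid, ts in timestamps.items()
        # Orphaned AND past the grace window -> safe to remove.
        if uid not in memberships and now - ts > grace
    )
```

Note how a recently touched unit survives even with no membership, and a long-untouched unit survives as long as it has a membership; only units failing both checks are removed, and the on_delete=PROTECT constraint still backstops any race.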

Under consideration

Some syncs can take a long time to finish; because of that, we could look at the most recent and longest-running tasks before removing orphans.


Subtasks

Task #8823: Add timestamp_of_interest to Content and Artifact (CLOSED - CURRENTRELEASE, daviddavis)

Task #8824: Convert orphan cleanup into a non-blocking task (ASSIGNED, daviddavis)

Task #8960: Ensure that pulp imports can run concurrently with orphan cleanup (ASSIGNED, daviddavis)

History

#1 Updated by bmbouter 10 months ago

  • Description updated (diff)

#2 Updated by bmbouter 10 months ago

  • Description updated (diff)

#3 Updated by iballou 10 months ago

I can't think of any reason why the proposed solution would cause trouble in Katello. We'll just want to make sure there is an obvious path for keeping Pulp data consistent.

#4 Updated by ipanova@redhat.com 10 months ago

In the case orphan_cleanup wins, plugin writers will need to be told they can no longer guarantee that just because content was there a moment ago, it is there now. I expect the exception can just bubble up to the user, and the user can restart the job, at which point code like sync will get it right the second time.

In this case, my preference would be to extend Stages API to handle this exception behind the scene, re-download the content, rather than asking users to re-run the sync.

However, this does not solve the upload use case: how do we solve the case when content was initially brought into Pulp not by a sync operation but by upload? I don't think telling users we no longer guarantee that just because content was there a moment ago, it is there now will inspire a lot of confidence. Our value proposition is to reliably manage content; this statement contradicts it and worries me, especially when no good recovery path is provided.

#5 Updated by bmbouter 10 months ago

ipanova@redhat.com wrote:

In this case, my preference would be to extend Stages API to handle this exception behind the scenes, re-download the content, rather than asking users to re-run the sync.

I agree with you on this in the long term, but I don't think the complexity it brings is worth it right now. Consider these reasons and let me know if you still think it is.

  1. I think this will make this a plugin-api breaking change, requiring any plugin with a custom pipeline (which is several) to take additional action to port onto it. This would be breaking because I believe to send content from later stages back to earlier stages would require the stages to reference each other.

  2. The level of effort goes up significantly with this requirement when we could deliver this improvement later and keep the changes smaller.

However, this does not solve the upload use case: how do we solve the case when content was initially brought into Pulp not by a sync operation but by upload? I don't think telling users we no longer guarantee that just because content was there a moment ago, it is there now will inspire a lot of confidence. Our value proposition is to reliably manage content; this statement contradicts it and worries me, especially when no good recovery path is provided.

I acknowledge that this is a step backwards in reliability, but I want to make two points about this.

  1. If we leave the situation as-is, any Pulp system will block all new tasks for a really long time, every time. Really long, as in hours, because it has to wait for every task on every worker to stop. Some syncs themselves can take hours, not to mention the many operations running on all workers. This is a huge cost for users to pay and I don't think we can justify that user experience.

  2. The upload error case is a race condition, one that I believe would occur rarely and is easily recoverable.

My perspective is that, when I weigh the user experience pain of deleting content today (hours of blocking, unscalable, and almost unworkable) against the downsides to users (brief and recoverable), this brings overall huge benefits.

#6 Updated by ipanova@redhat.com 10 months ago

bmbouter wrote:

ipanova@redhat.com wrote:

In this case, my preference would be to extend Stages API to handle this exception behind the scenes, re-download the content, rather than asking users to re-run the sync.

I agree with you on this in the long term, but I don't think the complexity it brings is worth it right now. Consider these reasons and let me know if you still think it is.

  1. I think this will make this a plugin-api breaking change, requiring any plugin with a custom pipeline (which is several) to take additional action to port onto it. This would be breaking because I believe to send content from later stages back to earlier stages would require the stages to reference each other.

  2. The level of effort goes up significantly with this requirement when we could deliver this improvement later and keep the changes smaller.

I agree that the current situation with the orphan clean-up is barely acceptable and we need to do something about it. However, regardless of how much effort is invested now or later, this change will remain a breaking one.

However, this does not solve the upload use case: how do we solve the case when content was initially brought into Pulp not by a sync operation but by upload? I don't think telling users we no longer guarantee that just because content was there a moment ago, it is there now will inspire a lot of confidence. Our value proposition is to reliably manage content; this statement contradicts it and worries me, especially when no good recovery path is provided.

I acknowledge that this is a step backwards in reliability, but I want to make two points about this.

  1. If we leave the situation as-is, any Pulp system will block all new tasks for a really long time, every time. Really long, as in hours, because it has to wait for every task on every worker to stop. Some syncs themselves can take hours, not to mention the many operations running on all workers. This is a huge cost for users to pay and I don't think we can justify that user experience.

  2. The upload error case is a race condition, one that would occur I believe rarely and is easily recoverable.

My perspective is that, when I weigh the user experience pain of deleting content today (hours of blocking, unscalable, and almost unworkable) against the downsides to users (brief and recoverable), this brings overall huge benefits.

It is hard to argue against the above-mentioned points, but I still feel uneasy about taking a step back on reliability. I would love to hear what other people think.

#7 Updated by ipanova@redhat.com 8 months ago

  • Description updated (diff)

#8 Updated by ipanova@redhat.com 8 months ago

  • Description updated (diff)

#9 Updated by daviddavis 8 months ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

#10 Updated by daviddavis 7 months ago

  • Quarter set to Q1-2021

#11 Updated by ipanova@redhat.com 4 months ago

  • Description updated (diff)

#12 Updated by daviddavis 4 months ago

  • Quarter changed from Q1-2021 to Q2-2021

#13 Updated by daviddavis 3 months ago

  • Description updated (diff)

#14 Updated by daviddavis 3 months ago

  • Description updated (diff)

#15 Updated by daviddavis 3 months ago

  • Sprint set to Sprint 95

#16 Updated by daviddavis 3 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#17 Updated by daviddavis 3 months ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (daviddavis)

#18 Updated by daviddavis 3 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#19 Updated by rchan 3 months ago

  • Sprint changed from Sprint 95 to Sprint 96

#20 Updated by rchan 3 months ago

  • Sprint changed from Sprint 96 to Sprint 97

#21 Updated by daviddavis 3 months ago

if one shot upload - handle in 1 transaction artifact and content creation.

This is impossible (at least not without some major refactoring). Currently, artifact creation is handled in the viewset while content creation is handled by a task.

That said I don't think it's a big deal. Worst case scenario the content creation fails and the user just has to adjust the preservation setting and retry it.

#22 Updated by bmbouter 3 months ago

daviddavis wrote:

if one shot upload - handle in 1 transaction artifact and content creation.

This is impossible (at least not without some major refactoring). Currently, artifact creation is handled in the viewset while content create is handled by a task.

That said I don't think it's a big deal. Worst case scenario the content creation fails and the user just has to adjust the preservation setting and retry it.

In looking at those links, I agree with your conclusion. I also don't think it's a very big deal.

#23 Updated by ipanova@redhat.com 2 months ago

daviddavis wrote:

if one shot upload - handle in 1 transaction artifact and content creation.

This is impossible (at least not without some major refactoring). Currently, artifact creation is handled in the viewset while content create is handled by a task.

That said I don't think it's a big deal. Worst case scenario the content creation fails and the user just has to adjust the preservation setting and retry it.

+1 .

#24 Updated by daviddavis 2 months ago

We use the general_create task to create content for one-shot uploads which basically sets up a subclass of SingleArtifactContentUploadSerializer and calls save() on it.

I'm not sure exactly where it's best to update the timestamp if there's existing content. I'm imagining I could update SingleArtifactContentUploadSerializer, but I am not sure how to do this in a general way that works for all plugins. For one thing, consider the pulp_rpm package case, where it's extracting information from the artifact file and then raising a general ValidationError.

#25 Updated by mdellweg 2 months ago

I see what you are saying. In the plugin code where we see the package with exists() we'd need to bump its time of interest.

What if, instead of a validation error, the upload task returned the existing content unit (as if it were just uploaded)? We might need to add ownership permissions at that point too.

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

#26 Updated by ipanova@redhat.com 2 months ago

mdellweg wrote:

I see what you are saying. In the plugin code where we see the package with exists() we'd need to bump its time of interest.

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

Yeah, this is plugin-specific. I am tempted to say that it should be the plugin's responsibility to add the content timestamp for the one-shot upload case. I also don't see a generic way to do it in core. :/

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

Right, it depends on whether you have provided the repository you want the package to be added to: https://github.com/pulp/pulpcore/blob/aed355fbc380a13a4a767e2627cbd54c7ff8f107/pulpcore/plugin/viewsets/content.py#L85

#27 Updated by ipanova@redhat.com 2 months ago

We will probably fall into the same trap with /modify for the plugins that define their own endpoints: https://github.com/pulp/pulp_container/blob/master/pulp_container/app/viewsets.py#L592

#28 Updated by daviddavis 2 months ago

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

That sounds like a possible idea. We could notify plugins that they need to start returning the content unit instead of raising an exception and I can have core always bump the timestamp.

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

This is true.

I am tempted to say that it should be plugins responsibility to add the content timestamp for the one-shot upload case.

That sounds like a good idea to me too.

We will probably fall into the same trap with /modify for the plugins that define their own endpoints

Yes, good point.

#29 Updated by ipanova@redhat.com 2 months ago

daviddavis wrote:

What if instead of a validation error, the upload task would return the existing content unit (as if it were just uploaded; We might need to add ownership permissions at that point too.)?

That sounds like a possible idea. We could notify plugins that they need to start returning the content unit instead of raising an exception and I can have core always bump the timestamp.

Here is a ticket related to this https://pulp.plan.io/issues/7114#note-1

Also how is the situation you are describing worse than what we already have? If an orphan cleanup is scheduled right after your upload task, you will never see the content you created even with today's code.

This is true.

I am tempted to say that it should be plugins responsibility to add the content timestamp for the one-shot upload case.

That sounds like a good idea to me too.

How do we proceed? It does not feel 100% correct to roll out the non-blocking orphan clean-up without the upload part, and it seems like (ideally) we first need to adjust the upload workflow in plugins so it can facilitate our needs in core. Or do we roll out this change the way it is and basically tell the plugins to take care of upload and other modify-like endpoints, otherwise they risk losing orphaned artifacts/content without the timestamp the moment orphan cleanup is triggered?

We will probably fall into the same trap with /modify for the plugins that define their own endpoints

Yes, good point.

Also, I have noticed that this ticket does not capture the configurable option for X seconds/minutes to keep the orphans. I recall this was discussed and proposed as an alternative to a hard-coded value. Should we add it?

#30 Updated by daviddavis 2 months ago

  • Subject changed from As a user, orphan cleanup does not block all other tasks to [EPIC] As a user, orphan cleanup does not block all other tasks

#31 Updated by rchan about 2 months ago

  • Sprint changed from Sprint 97 to Sprint 98

#32 Updated by ipanova@redhat.com about 2 months ago

  • Sprint/Milestone set to Content/disk space management

#33 Updated by rchan about 1 month ago

  • Sprint changed from Sprint 98 to Sprint 99

#34 Updated by ipanova@redhat.com about 1 month ago

  • Sprint/Milestone changed from Content/disk space management to 3.15.0

#35 Updated by rchan 29 days ago

  • Sprint changed from Sprint 99 to Sprint 100

#36 Updated by rchan 16 days ago

  • Sprint changed from Sprint 100 to Sprint 101
