Issue #3045
closed
Running orphan cleanup tasks simultaneously leads to high mongod cpu usage
Description
This is often seen on Katello setups with smart proxies.
Even though this can be improved on the Katello side (by not running orphan cleanup after each sync on smart proxies), there is no need for two or more orphan cleanup tasks to run in parallel. The suggestion here is to prevent that.
Making this change would also avoid the race condition reported in #3043.
As per mhrivnak comment:
I don't see any harm in us making it a resource-reserving task. We aren't gaining much by running multiple in parallel. It should be a simple 1-line change to call "apply_async_with_reservation(...)" instead of "apply_async()"
Check related BZ for more details.
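For illustration, here is a minimal sketch of what that one-line change might look like in the code that dispatches orphan cleanup. The module paths, tag helpers, and the exact signature of apply_async_with_reservation below are assumptions based on Pulp 2's reserved-resource tasking, not something confirmed in this issue:

```python
# Hypothetical sketch only: imports, tag helpers, and the reservation
# signature are assumptions about Pulp 2's tasking, not verified code.
from pulp.common import tags
from pulp.server.controllers import content as content_controller

task_tags = [tags.action_tag('delete_orphans')]

# Current behavior (assumed): any number of orphan cleanups may run in parallel.
# async_task = content_controller.delete_all_orphans.apply_async(tags=task_tags)

# Suggested behavior: reserve a single shared "orphans" resource so that a
# second cleanup queues behind the first instead of running concurrently.
async_task = content_controller.delete_all_orphans.apply_async_with_reservation(
    tags.RESOURCE_CONTENT_UNIT_TYPE, 'orphans', tags=task_tags)
```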
- Related to Issue #3043: Race condition during orphan cleanup added
- Description updated (diff)
- Tags Easy Fix added
- Priority changed from Normal to High
- Sprint/Milestone set to 45
- Triaged changed from No to Yes
I don't think we should make any changes to restrict the concurrency of orphan cleanup. A change here is helpful for this pattern of usage, but other usage patterns would be harmed by it. Consider this use case:
1) A user performs an operation like a sync
2) The user wants to ensure that any orphans are deleted
Assuming users are doing the above workflow concurrently, if orphan cleanups are linearized then the following would happen:
1) a user starts a sync on repo A
2) a user starts a sync on repo B
3) repo sync A completes and the user dispatches an orphan cleanup which begins immediately
4) repo sync B completes and the user dispatches an orphan cleanup which does not begin immediately
5) orphan cleanup from step (3) finishes
6) orphan cleanup from step (4) starts
7) orphan cleanup from step (4) finishes
So with the above pattern, we are introducing additional delay between step (4) and step (7). My concern is that we are trying to be smarter than our users. If users load Pulp up with a bunch of tasks, they may just want Pulp to do them as fast as possible.
How about if we just make it optional? We could preserve the existing behavior, and let the API user optionally request that the operation use a reservation. That would give the user the most flexibility to push their deployment if it can handle parallel orphan cleanups, or restrict them to one at a time if not.
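To make that concrete, here is a rough sketch of how an opt-in request flag might look in the view that dispatches the cleanup. The use_reservation field, the view code, and the helper names are hypothetical placeholders, not an existing Pulp API:

```python
# Hypothetical sketch only: 'use_reservation' is an invented request field and
# the surrounding view/task names are placeholders, not existing Pulp code.
def post(self, request):
    task_tags = [tags.action_tag('delete_orphans')]
    if request.body_as_json.get('use_reservation', False):
        # Opt-in: serialize orphan cleanups behind a shared reservation.
        task = delete_all_orphans.apply_async_with_reservation(
            tags.RESOURCE_CONTENT_UNIT_TYPE, 'orphans', tags=task_tags)
    else:
        # Default: preserve existing behavior; cleanups may run in parallel.
        task = delete_all_orphans.apply_async(tags=task_tags)
    raise OperationPostponed(task)
```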
I have two concerns with doing this optionally. (1) I don't think implementing that option will create much value; users can issue the cancellations that make sense for their given workflows, so I don't think an option that also cancels tasks is valuable. (2) That option could have very unintended consequences on a multi-tenant Pulp system, which we probably can't accept.
One way that we can help (maybe-ish) is to put a tip or note section in the orphan cleanup docs about that situation. Even that, though, I don't think makes perfect sense. We could probably put a similar note that reads "maybe cancel tasks if you are dispatching a bunch of redundant work to Pulp" all over Pulp's docs.
So with the multi-tenancy concerns, and that users can already manage task cancellation on their own, I think closing as NOTABUG would be the best. What do others think about these concerns?
I agree in general that we should take into account more than one scenario and give users more flexibility to control different parts of Pulp.
In the case of orphan cleanup, I'm not sure what users gain by running it in parallel, or what the issue is with having orphans in the db for a while.
What is a case where this delay between orphan tasks matters? For sync/publish/some other tasks that makes sense, but I'm not sure it has value for the orphan one. Having orphans has no impact on the operations a user performs in Pulp; moreover, it can help avoid re-downloading content.
As seen in the BZ, mongo uses a lot of CPU and everything slows down. We could run some tests, but I think running orphan tasks sequentially may even speed up the process, especially when many more than two orphan tasks are running in parallel. Also, on the second run there will potentially be fewer orphans to go through and clean up.
Imagine a user who follows a workflow where they want to sync, clean up orphans, and then do something else after the orphan cleanup has completed. Users who do that benefit from these tasks running in parallel because overall the task wait time is lower.
Maybe there are some installations that would benefit from serialized execution of this task type. That would be a feature, not an issue. It also seems relatively low priority, since users can resolve their CPU load issue by cancelling or not submitting orphan cleanup tasks, without us making a change.
So after thinking more about this, maybe leave as open but switch to a feature and send through feature planning. What do others think about this?
To recap some irc discussion, I believe we decided if we were to make an adjustment it would be a Pulp installation-wide setting. Since that is a feature it needs more planning since this was being treated as a bugfix. I think we should take it off the sprint, but I want to hear from others before I change that.
- Priority changed from High to Normal
- Sprint/Milestone deleted (45)
This is being worked around a different way, and we don't have agreement on how to proceed with this issue, thus we're taking it off the sprint. We can re-visit in the future if necessary.
- Sprint Candidate changed from Yes to No
- Status changed from NEW to CLOSED - WONTFIX
- Tags deleted (Easy Fix)
It's not on the pulp2 roadmap.
The workaround is not to run orphan cleanup in parallel.