_release_resource should be acks_late and idempotent
In pulp.server.async.tasks, if I look at _release_resource, the job of this task seems to be:
- if a task is still marked as running then mark it as failed
- delete the ReservedResource associated with a task
I notice the task isn't marked as acks_late=True.
Therefore it looks like the following scenario can happen:
- _release_resource task is dequeued & acknowledged from qpid
- body of _release_resource starts executing
- fails to complete e.g. because query to mongo fails
- tasks are left running and ReservedResource is left in DB indefinitely
Should this rather be made idempotent and marked as acks_late=True, so that the task would be retried in the above scenario?
#2 Updated by bmbouter over 3 years ago
- Description updated (diff)
I updated the description per your comment. I also added you to the 'Contributor' group which will allow you to update issues in the future. If there are others that also want to edit issues in pulp.plan.io please let me know what their usernames are.
#3 Updated by bmbouter over 3 years ago
- Tags RCM added
The description of the responsibilities of _release_resource is spot on. I also want more reliability to ensure tasks are not left running and ReservedResource records are left in DB indefinitely. Let's talk about how to accomplish that.
Adding acks_late by itself won't really help the failure scenario you describe because if an exception is raised the task is still acknowledged . We could add acks_late, make it idempotent, and have a Python failure cause a Celery retry . From my reading of the retry feature it will dispatch a new task with the same task id to the same queue which will cause it to be processed out of order. The out of order processing is safe, but it will still have the task in the running state and the ReservedResource record will exist for longer than it should. It would probably be a good improvement though.
I want to put some more detail on what happens when the "mongo query fails". It will raise an Exception that will be a descendant of the PyMongo exception AutoReconnect. There is a Pulp configuration option called unsafe_autoretry which wraps all mongo operations in a try-except block and catches any AutoReconnect and retries the operation. If enabled this would cause the scenario of a mongo failure in _release_resource to wait and retry until it resolved. This would effectively prevent the failure scenario described. It's an option. The feature is called "unsafe retry" for a reason because there are some potentially negative side effects. I wanted to highlight how the failure scenario is related to that failure handling feature of Pulp.
One last thing to note is that whenever the worker who has the "ghost task" (the one that didn't get cleaned up due to the failure in _release_resource()) will be marked as cancelled upon worker restart. So it's an issue, but it wouldn't be a permanent data quality issue because at some point the worker will restart. This is a related FYI. I still think making an improvement here would be good.
Given all ^ do you think a change should be made to _release_resource to use retry if a Python failure occurs, set acks_late, and ensure the task is idempotent?
#4 Updated by rmcgover over 3 years ago
It sounds like introducing retry and acks_late is an improvement over the current situation.
Regarding severity - I think this is a minor issue, and we've never had any wrong data traced back to this. This is just an apparent gap that I noticed while reading how the reserved resource system works.
It's noticeable because queue_reserved_task has acks_late=True while release_resource doesn't. To me there's no obvious reason why it would make sense for one of those to retry if interrupted, but not the other. I thought it might be written that way because queue_reserved_task sleeps and so is more likely to be interrupted, but release_resource of course also has a chance of being interrupted.
#5 Updated by bmbouter over 3 years ago
- Severity changed from 2. Medium to 1. Low
I want more reliability here, but what about this type of failure scenario. The failures to guard against here are Python exceptions, probably raised due to some database availability issue. If we add acks_late and retry=True and make it idempotent couldn't this happen?
0. The _release_resource task is running
1. The database becomes unavailable such that when you query it you get a Python exception (subclass or pymongo.exceptions.AutoReconnect).
2. The _release_resource task fails and is set back to the queue because of retry=True
3. Go back to step 0
^ will cause an infinite loop until an administrator intervenes. It may be better for the task to just fail, or the user can use the Pulp feature unsafe_retry=True. What do you think about this case?
Thanks for noting the severity also. I'm going to drop it to low because it seems we both see it that way.
To share some of the thinking I had while developing this, I used acks_late on _queue_reserved_task() to guard against the resource manager dying (like power loss or OOM) while processing a task. I can't guard against that type of failure with _release_resource task because of Issue #489.
It's nice to talk to someone who has looked at the work I've put together so closely.
#6 Updated by rmcgover over 3 years ago
I'm not sure how practical it is with celery, but my preference would be for it to happen like you said except:
- There should be some maximum number of retries rather than an infinite loop
- If maximum number of retries is reached, that is signalled in some way on which I can build an alert (for example, the failure goes to system logs and can be reliably extracted by a certain pattern)
I wouldn't consider using unsafe_autoretry at the moment since it seems to apply across the entire application, which doesn't seem like it could possibly be safe unless every query was designed to be idempotent.
Also I don't want to have retry happening on all tasks: for regular tasks triggered via API, if a failure occurs there I would prefer it be reported back to the caller who can make the decision whether to try again.
#8 Updated by bmbouter over 3 years ago
@rmcgover. Thanks for writing. I added some checklist items and we triaged today.
I like having a maximum number of retries, but what is the right number? I don't think any user would want to set this value so I wrote the check list as if we are hard coding it. I totally guessed and thought maybe 10 was right. Thoughts on this approach?
#10 Updated by bmbouter almost 2 years ago
Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.
Please register to edit this issue