Story #2186
closedAs a user, pulp-manage-db refuses to run if other pulp processes are running
Added by jsherril@redhat.com over 8 years ago. Updated over 5 years ago.
100%
Description
pulp db migrations are not meant to be run while pulp processes are active and can cause major data issues if there are active workers.
There should be a test to prevent this at the startup of the pulp-manage-db process.
[UPDATE] See implementation plan on comment 21.
Updated by mhrivnak over 8 years ago
- Tracker changed from Issue to Story
- Subject changed from check for active pulp workers at start of manage-db to As a user, pulp-manage-db refuses to run if other pulp processes are running
- Description updated (diff)
Updated by mhrivnak over 8 years ago
On a multi-machine deployment, this could be difficult to guarantee, but pulp-manage-db could make a best effort.
Updated by bmbouter over 8 years ago
Pulp is a clustered software. We have to assume that there are multiple machines with pulp workers running on them. Can you elaborate on what a best-effort solution would look like?
Updated by mhrivnak over 8 years ago
A best effort could, for example, check the local process list for anything (besides itself) containing the word "pulp", and if found, ask the user if they're sure they want to proceed. Slightly fancier would be to check the database for any known workers or any tasks listed in a running state.
Of course that would not be a guarantee or catch all cases, but even the simple process check would have caught all of the recent cases we've been seeing where people ran pulp-manage-db on a single-system deployment while pulp was running. Those cases lead to database corruption.
Updated by bmbouter over 8 years ago
- Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running to As a user, pulp-manage-db refuses to run if other pulp processes are running locally
I had not considered checking something locally. I always think about Pulp as being clustered. The process check would provide a level of fault tolerance for single-box deployments so we can leave this open with that implementation in mind. I re-titled the story to math that it's only a partial solution.
Updated by mhrivnak about 8 years ago
- Sprint/Milestone set to 27
- Groomed changed from No to Yes
- Sprint Candidate changed from No to Yes
Updated by bizhang about 8 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bizhang
Updated by bizhang about 8 years ago
I had a discussion with Brian regarding this, and here are the pros and cons of each opinion:
1.) Check DB listing
- Would work in a clustered environment
- status.get_workers() will have celery_beat worker left over even when all processes have stopped
- We can try to check the worker timestamp but can't tell for sure until 390s have elapsed
- there are some failure scenarios where processes could be running but the timestamps have stopped being recorded
- if we proceed with this changes need to made to scheduler.py for a sigterm handler to mitigate the above concerns
2.) Check local process listing
- Can tell if there are any workers left real time
- does not support clustered upgrade
Personally I believe that 1 is the correct way to go, and will make the changes in scheduler.py to support this.
Updated by bmbouter about 8 years ago
I see a lot of usable value in option 1 also. Thanks bizhang!
To correct a point I made from IRC, I think you can use a failover time of 300 seconds. The 390 number is the maximum delay until recover from failover, but if all records are over 300 seconds old then we can continue with the migrations.
Updated by bmbouter about 8 years ago
Also if the database is down we should check that pulp-manage-db handles it gracefully. It wouldn't be good to have a traceback occur because this database check can't connect. I'm not sure what it does today or where that code lives.
Updated by bizhang about 8 years ago
- Status changed from ASSIGNED to POST
Added by bizhang about 8 years ago
Added by bizhang about 8 years ago
Revision 4f43a85d | View on GitHub
Prompt user of running workers/tasks on migration
Add close() method to celery scheduler for clean exit. Catch ServerSelectionTimeoutError to handle when mongo is down.
Updated by bizhang about 8 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|4f43a85dd568f4a0b50ae9b07bbec7138861e92b.
Added by bizhang about 8 years ago
Revision 75f0ac09 | View on GitHub
datetime.total_time() not supported by python 2.6
Added by semyers almost 8 years ago
Revision 70f8726b | View on GitHub
Update release notes about migration caveat
Added by bizhang almost 8 years ago
Revision bf2d1d4a | View on GitHub
Revert "Update release notes about migration caveat"
This reverts commit 70f8726b96decbcfa9ef1b799809beeb864d50a4.
re #2186 #2468 https://pulp.plan.io/issues/2186 https://pulp.plan.io/issues/2468
Added by bizhang almost 8 years ago
Revision bf2d1d4a | View on GitHub
Revert "Update release notes about migration caveat"
This reverts commit 70f8726b96decbcfa9ef1b799809beeb864d50a4.
re #2186 #2468 https://pulp.plan.io/issues/2186 https://pulp.plan.io/issues/2468
Added by bizhang almost 8 years ago
Revision 38e60824 | View on GitHub
Revert "datetime.total_time() not supported by python 2.6"
This reverts commit 75f0ac094ec3dd0bc0efd96670fe036ae7aa1982.
Added by bizhang almost 8 years ago
Revision d633f819 | View on GitHub
Revert "Prompt user of running workers/tasks on migration"
This reverts commit 4f43a85dd568f4a0b50ae9b07bbec7138861e92b.
Added by bizhang almost 8 years ago
Revision 38e60824 | View on GitHub
Revert "datetime.total_time() not supported by python 2.6"
This reverts commit 75f0ac094ec3dd0bc0efd96670fe036ae7aa1982.
Added by bizhang almost 8 years ago
Revision d633f819 | View on GitHub
Revert "Prompt user of running workers/tasks on migration"
This reverts commit 4f43a85dd568f4a0b50ae9b07bbec7138861e92b.
Updated by bizhang almost 8 years ago
- Status changed from 5 to ASSIGNED
- Sprint/Milestone changed from 27 to 30
- Platform Release deleted (
2.11.0)
Updated by bmbouter almost 8 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (
bizhang) - Sprint/Milestone deleted (
30) - Groomed changed from Yes to No
Taking this issue back to the planning phase and I will send out a note to pulp-dev to discuss options.
Updated by semyers almost 8 years ago
https://github.com/pulp/pulp/pull/2883 is the PR that reverted this feature. I've merged the revert forward to master, so evidence of this feature no longer exists from 2.11-dev forward; you ca use the referenced PR to find easy commit hashes for restoring it when we pick this work back up.
Updated by bmbouter almost 8 years ago
- Description updated (diff)
Rewriting the bug based on the discussion from pulp-dev[0].
Here is a pseudo code representation:
most_recent_worker_timestamp = Workers.objects.get_the_most_recent_worker_timestamp
# show a message to the users indicating which workers are being checked to see if they are running (show the names) and how many seconds the user has to wait
sleep(most_recent_worker_timestamp + 60 seconds - now) # sleep until the time that 60 seconds have passed since the most recent timestamp was observed
# check the db again and see if there is a newer timestamp than most_recent_worker_timestamp
if there_is_newer:
# show the user a message indicating there are still workers running and show them the worker names
# exit with a non zero exit code
else:
# keep going!
To facilitate this, we should adjust the timings of heartbeats for pulp_celerybeat, pulp_resource_manager, and pulp_workers. All of them should use a heartbeat value of 20 seconds. For pulp_celerybeat that is set here[1]. For pulp_workers that is done through the init scripts and systemd scripts which pass a command line argument. Here is 1 link[2] as an example, but this is probably in 4 places at least.
Also can the heartbeat timing changes all be in 1 commit and the other changes (code and docs) be in another commit. Packagers may want to pick the heartbeat changes without the pulp-manage-db changes both coming together.
[0]: https://www.redhat.com/archives/pulp-dev/2016-December/msg00025.html
[1]: https://github.com/werwty/pulp/blob/4f43a85dd568f4a0b50ae9b07bbec7138861e92b/common/pulp/common/constants.py#L68
[2]: https://github.com/pulp/pulp/blob/master/server/usr/lib/systemd/system/pulp_resource_manager.service#L12
Updated by mhrivnak almost 8 years ago
- Sprint/Milestone set to 31
- Groomed changed from No to Yes
Updated by bizhang almost 8 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bizhang
Updated by bizhang almost 8 years ago
- Status changed from ASSIGNED to POST
Updated by bizhang almost 8 years ago
The heartbeat changes have been moved to a separate story #2509
Added by bizhang almost 8 years ago
Revision 798a9315 | View on GitHub
Add pulp-manage-db check for running workers.
pulp-manage-db command now waits to check if workers have been stopped. If there are running workers the migration is stopped
Added by bizhang almost 8 years ago
Revision 798a9315 | View on GitHub
Add pulp-manage-db check for running workers.
pulp-manage-db command now waits to check if workers have been stopped. If there are running workers the migration is stopped
Updated by bizhang almost 8 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|798a9315c42f624bcf2a785acd19ee7c194e77d4.
Updated by bmbouter almost 8 years ago
- Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running locally to As a user, pulp-manage-db refuses to run if other pulp processes are running
Updated by semyers almost 8 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Prompt user of running workers/tasks on migration
Add close() method to celery scheduler for clean exit. Catch ServerSelectionTimeoutError to handle when mongo is down.
closes #2186 https://pulp.plan.io/issues/2186