As a user, pulp-manage-db refuses to run if other pulp processes are running
Pulp DB migrations are not meant to be run while Pulp processes are active; running them while workers are active can cause major data issues.
There should be a test to prevent this at the startup of the pulp-manage-db process.
[UPDATE] See implementation plan on comment 21.
Updated by mhrivnak over 6 years ago
A best effort could, for example, check the local process list for anything (besides itself) containing the word "pulp", and if found, ask the user if they're sure they want to proceed. Slightly fancier would be to check the database for any known workers or any tasks listed in a running state.
Of course that would not be a guarantee or catch all cases, but even the simple process check would have caught all of the recent cases we've been seeing where people ran pulp-manage-db on a single-system deployment while pulp was running. Those cases led to database corruption.
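The "best effort" process check described above could be sketched as a small pure function. The helper name and the (pid, name) tuple input are assumptions for illustration, not real Pulp code; in practice the tuples would come from something like parsing `ps -e -o pid=,comm=` output:

```python
import os


def find_pulp_processes(processes, own_pid=None):
    """Return (pid, name) pairs whose name contains 'pulp', excluding ourselves.

    processes: iterable of (pid, command-name) tuples, e.g. parsed from the
    local process list. own_pid defaults to the calling process's pid so that
    pulp-manage-db does not flag itself.
    """
    own_pid = os.getpid() if own_pid is None else own_pid
    return [(pid, name) for pid, name in processes
            if "pulp" in name and pid != own_pid]
```

If this returns a non-empty list, the tool would warn and ask the user to confirm before proceeding. Keeping process enumeration out of the function makes the check easy to unit-test.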
Updated by bmbouter over 6 years ago
- Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running to As a user, pulp-manage-db refuses to run if other pulp processes are running locally
I had not considered checking something locally. I always think about Pulp as being clustered. The process check would provide a level of fault tolerance for single-box deployments, so we can leave this open with that implementation in mind. I re-titled the story to match that it's only a partial solution.
Updated by bizhang about 6 years ago
I had a discussion with Brian regarding this, and here are the pros and cons of each opinion:
1.) Check DB listing
- Would work in a clustered environment
- status.get_workers() will have celery_beat worker left over even when all processes have stopped
- We can try to check the worker timestamp but can't tell for sure until 390s have elapsed
- there are some failure scenarios where processes could be running but the timestamps have stopped being recorded
- if we proceed with this, changes need to be made to scheduler.py for a sigterm handler to mitigate the above concerns
2.) Check local process listing
- Can tell in real time if there are any workers left
- does not support clustered upgrade
Personally I believe that 1 is the correct way to go, and will make the changes in scheduler.py to support this.
Updated by bmbouter about 6 years ago
I see a lot of usable value in option 1 also. Thanks bizhang!
To correct a point I made on IRC, I think you can use a failover time of 300 seconds. The 390 number is the maximum delay until recovery from failover, but if all records are over 300 seconds old then we can continue with the migrations.
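The 300-second rule above could be sketched as a pure function over worker heartbeat timestamps. The function and field names are assumptions for illustration, not the actual Pulp model API:

```python
from datetime import datetime, timedelta

# Failover window suggested above: records older than this are considered dead.
WORKER_TIMEOUT = timedelta(seconds=300)


def workers_look_alive(last_heartbeats, now=None):
    """Return True if any worker heartbeat is newer than the failover window.

    last_heartbeats: iterable of datetime objects, e.g. the last-heartbeat
    field of each worker record in the DB (hypothetical field).
    """
    now = now or datetime.utcnow()
    return any(now - hb < WORKER_TIMEOUT for hb in last_heartbeats)
```

pulp-manage-db would refuse to run (or prompt the user) while this returns True. Injecting `now` keeps the check deterministic in tests.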
Updated by semyers about 6 years ago
https://github.com/pulp/pulp/pull/2883 is the PR that reverted this feature. I've merged the revert forward to master, so evidence of this feature no longer exists from 2.11-dev forward; you can use the referenced PR to find easy commit hashes for restoring it when we pick this work back up.
Updated by bmbouter almost 6 years ago
- Description updated (diff)
Rewriting the bug based on the discussion from pulp-dev.
Here is a pseudo code representation:
most_recent_worker_timestamp = Workers.objects.get_the_most_recent_worker_timestamp
# show a message to the users indicating which workers are being checked to see
# if they are running (show the names) and how many seconds the user has to wait
sleep(most_recent_worker_timestamp + 60 seconds - now)
# sleep until the time that 60 seconds have passed since the most recent timestamp was observed
# check the db again and see if there is a newer timestamp than most_recent_worker_timestamp
if there_is_newer:
    # show the user a message indicating there are still workers running and show them the worker names
    # exit with a non zero exit code
else:
    # keep going!
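The pseudo code could be made concrete as a small function with the DB query, clock, and sleep injected. The names here (`get_newest_heartbeat`, Unix-timestamp return values) are assumptions for illustration, not real Pulp APIs:

```python
import time


def safe_to_migrate(get_newest_heartbeat, grace=60, sleep=time.sleep, now=time.time):
    """Wait out the grace period, then re-check for newer worker heartbeats.

    get_newest_heartbeat() returns the newest worker heartbeat as a Unix
    timestamp, or None if no workers are recorded (hypothetical callable).
    Returns True if migrations may proceed, False if a worker heartbeat
    appeared during the grace period.
    """
    newest = get_newest_heartbeat()
    if newest is None:
        return True  # no workers recorded at all
    wait = newest + grace - now()
    if wait > 0:
        sleep(wait)  # wait until `grace` seconds have passed since the newest heartbeat
    # a newer timestamp means a worker is still alive and heartbeating
    return get_newest_heartbeat() <= newest
```

The caller would print which workers are being checked before sleeping, and exit non-zero when this returns False. Injecting `sleep` and `now` makes the timing logic testable without real delays.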
To facilitate this, we should adjust the timings of heartbeats for pulp_celerybeat, pulp_resource_manager, and pulp_workers. All of them should use a heartbeat value of 20 seconds. For pulp_celerybeat that is set here. For pulp_workers that is done through the init scripts and systemd scripts which pass a command line argument. Here is 1 link as an example, but this is probably in 4 places at least.
Also, can the heartbeat timing changes all be in one commit and the other changes (code and docs) be in another commit? Packagers may want to cherry-pick the heartbeat changes without also taking the pulp-manage-db changes.