Story #2186

As a user, pulp-manage-db refuses to run if other pulp processes are running

Added by jsherril@redhat.com over 7 years ago. Updated almost 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

100%

Estimated time:
(Total: 0:00 h)
Platform Release:
2.12.0
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 13
Quarter:

Description

pulp db migrations are not meant to be run while pulp processes are active and can cause major data issues if there are active workers.

There should be a check at the startup of the pulp-manage-db process to prevent this.

[UPDATE] See the implementation plan in comment 21.


Sub-issues 1 (0 open, 1 closed)

Story #2469: As a user, I should be able to specify an option to pulp-manage-db to assume yes for the running workers question (CLOSED - WONTFIX)

Actions #1

Updated by mhrivnak over 7 years ago

  • Tracker changed from Issue to Story
  • Subject changed from check for active pulp workers at start of manage-db to As a user, pulp-manage-db refuses to run if other pulp processes are running
  • Description updated (diff)
Actions #2

Updated by mhrivnak over 7 years ago

On a multi-machine deployment, this could be difficult to guarantee, but pulp-manage-db could make a best effort.

Actions #3

Updated by bmbouter over 7 years ago

Pulp is clustered software. We have to assume that there are multiple machines with pulp workers running on them. Can you elaborate on what a best-effort solution would look like?

Actions #4

Updated by mhrivnak over 7 years ago

A best effort could, for example, check the local process list for anything (besides itself) containing the word "pulp", and if found, ask the user if they're sure they want to proceed. Slightly fancier would be to check the database for any known workers or any tasks listed in a running state.

Of course that would not be a guarantee or catch all cases, but even the simple process check would have caught all of the recent cases we've been seeing where people ran pulp-manage-db on a single-system deployment while pulp was running. Those cases led to database corruption.
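
A minimal sketch of that kind of best-effort local check (the psutil dependency and the function names are illustrative assumptions, not what Pulp actually shipped):

import sys

import psutil  # assumption: psutil, or any other way of listing local processes

def local_pulp_processes():
    """Return command lines of other local processes that mention 'pulp'."""
    me = psutil.Process().pid
    found = []
    for proc in psutil.process_iter():
        try:
            if proc.pid == me:
                continue
            cmdline = ' '.join(proc.cmdline())
        except psutil.Error:
            continue  # process went away or access was denied
        if 'pulp' in cmdline:
            found.append(cmdline)
    return found

def confirm_or_abort():
    """Warn about Pulp-looking processes and ask the user before proceeding."""
    running = local_pulp_processes()
    if not running:
        return
    print('Processes that look like Pulp are still running:')
    for cmdline in running:
        print('  ' + cmdline)
    if input('Proceed with migrations anyway? (y/N) ').strip().lower() != 'y':
        sys.exit(1)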

Actions #5

Updated by bmbouter over 7 years ago

  • Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running to As a user, pulp-manage-db refuses to run if other pulp processes are running locally

I had not considered checking something locally; I always think about Pulp as being clustered. The process check would provide a level of fault tolerance for single-box deployments, so we can leave this open with that implementation in mind. I re-titled the story to reflect that it's only a partial solution.

Actions #8

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone set to 27
  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes
Actions #9

Updated by bizhang over 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bizhang
Actions #10

Updated by bizhang over 7 years ago

I had a discussion with Brian regarding this, and here are the pros and cons of each option:

1.) Check the DB worker listing

  • Would work in a clustered environment
  • status.get_workers() will still list the celery_beat worker even after all processes have stopped
    • We can check the worker timestamps, but can't tell for sure until 390s have elapsed
    • There are some failure scenarios where processes could be running but the timestamps have stopped being recorded
    • If we proceed with this, changes need to be made to scheduler.py to add a sigterm handler to mitigate the above concerns

2.) Check the local process listing

  • Can tell in real time whether any workers are still running
  • Does not support a clustered upgrade

Personally I believe that option 1 is the correct way to go, and I will make the changes in scheduler.py to support this.

Actions #11

Updated by bmbouter over 7 years ago

I see a lot of usable value in option 1 also. Thanks bizhang!

To correct a point I made on IRC, I think you can use a failover time of 300 seconds. The 390 number is the maximum delay until recovery from failover, but if all records are over 300 seconds old then we can continue with the migrations.
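
As a rough illustration of that rule (the worker record shape here is an assumption; the real query goes through Pulp's models):

from datetime import datetime, timedelta

FAILOVER_SECONDS = 300  # if every heartbeat is older than this, no worker can still be alive

def workers_look_alive(worker_records):
    """Return True if any worker heartbeat is newer than the failover window."""
    cutoff = datetime.utcnow() - timedelta(seconds=FAILOVER_SECONDS)
    return any(record['last_heartbeat'] > cutoff for record in worker_records)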

Actions #12

Updated by bmbouter over 7 years ago

Also, if the database is down, we should check that pulp-manage-db handles it gracefully. It wouldn't be good to have a traceback occur because this database check can't connect. I'm not sure what it does today or where that code lives.
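
The eventual fix caught pymongo's ServerSelectionTimeoutError (see the commit attached to comment #13 below). A minimal sketch of that kind of guard, with do_worker_check standing in for whatever function runs the query:

import sys

from pymongo.errors import ServerSelectionTimeoutError

def check_workers_or_exit(do_worker_check):
    """Run the worker check, but fail with a readable message if mongo is unreachable."""
    try:
        do_worker_check()
    except ServerSelectionTimeoutError:
        print('Unable to connect to the database; is MongoDB running and reachable?')
        sys.exit(1)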

Actions #13

Updated by bizhang over 7 years ago

  • Status changed from ASSIGNED to POST

Added by bizhang over 7 years ago

Revision 4f43a85d | View on GitHub

Prompt user of running workers/tasks on migration

Add close() method to celery scheduler for clean exit. Catch ServerSelectionTimeoutError to handle when mongo is down.

closes #2186 https://pulp.plan.io/issues/2186

Actions #14

Updated by bizhang over 7 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #15

Updated by semyers over 7 years ago

  • Platform Release set to 2.11.0

Added by bizhang over 7 years ago

Revision 75f0ac09 | View on GitHub

datetime.total_time() not supported by python 2.6

Actions #16

Updated by semyers over 7 years ago

  • Status changed from MODIFIED to 5

Added by semyers over 7 years ago

Revision 70f8726b | View on GitHub

Update release notes about migration caveat

closes #2468 https://pulp.plan.io/issues/2468

Added by bizhang over 7 years ago

Revision bf2d1d4a | View on GitHub

Revert "Update release notes about migration caveat"

This reverts commit 70f8726b96decbcfa9ef1b799809beeb864d50a4.

re #2186 #2468 https://pulp.plan.io/issues/2186 https://pulp.plan.io/issues/2468

Added by bizhang over 7 years ago

Revision 38e60824 | View on GitHub

Revert "datetime.total_time() not supported by python 2.6"

This reverts commit 75f0ac094ec3dd0bc0efd96670fe036ae7aa1982.

re #2186 https://pulp.plan.io/issues/2186

Added by bizhang over 7 years ago

Revision d633f819 | View on GitHub

Revert "Prompt user of running workers/tasks on migration"

This reverts commit 4f43a85dd568f4a0b50ae9b07bbec7138861e92b.

re #2186 https://pulp.plan.io/issues/2186

Actions #17

Updated by bizhang over 7 years ago

  • Status changed from 5 to ASSIGNED
  • Sprint/Milestone changed from 27 to 30
  • Platform Release deleted (2.11.0)
Actions #18

Updated by bmbouter over 7 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (bizhang)
  • Sprint/Milestone deleted (30)
  • Groomed changed from Yes to No

Taking this issue back to the planning phase and I will send out a note to pulp-dev to discuss options.

Actions #19

Updated by semyers over 7 years ago

https://github.com/pulp/pulp/pull/2883 is the PR that reverted this feature. I've merged the revert forward to master, so evidence of this feature no longer exists from 2.11-dev forward; you can use the referenced PR to find the commit hashes needed to restore it when we pick this work back up.

Actions #21

Updated by bmbouter over 7 years ago

  • Description updated (diff)

Rewriting the bug based on the discussion from pulp-dev[0].

Here is a pseudo code representation:

most_recent_worker_timestamp = Workers.objects.get_the_most_recent_worker_timestamp
# show a message to the users indicating which workers are being checked to see if they are running (show the names) and how many seconds the user has to wait
sleep(most_recent_worker_timestamp + 60 seconds - now)  # sleep until the time that 60 seconds have passed since the most recent timestamp was observed
# check the db again and see if there is a newer timestamp than most_recent_worker_timestamp
if there_is_newer:
    # show the user a message indicating there are still workers running and show them the worker names
    # exit with a non zero exit code
else:
    # keep going!

To facilitate this, we should adjust the heartbeat timings for pulp_celerybeat, pulp_resource_manager, and pulp_workers. All of them should use a heartbeat value of 20 seconds. For pulp_celerybeat that is set here[1]. For pulp_workers it is done through the init scripts and systemd scripts, which pass a command line argument; here is one link[2] as an example, but this is probably in at least 4 places.

Also, can the heartbeat timing changes all be in one commit and the other changes (code and docs) in another commit? Packagers may want to cherry-pick the heartbeat changes without pulling in the pulp-manage-db changes at the same time.

[0]: https://www.redhat.com/archives/pulp-dev/2016-December/msg00025.html
[1]: https://github.com/werwty/pulp/blob/4f43a85dd568f4a0b50ae9b07bbec7138861e92b/common/pulp/common/constants.py#L68
[2]: https://github.com/pulp/pulp/blob/master/server/usr/lib/systemd/system/pulp_resource_manager.service#L12
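
A runnable sketch of the plan above, assuming a hypothetical get_worker_records() helper that returns (worker_name, last_heartbeat) pairs from the database; the real implementation lives in Pulp's manage code and differs in detail:

import sys
import time
from datetime import datetime, timedelta

WAIT_SECONDS = 60  # per the plan: wait until 60 seconds past the newest observed heartbeat

def refuse_to_run_if_workers_active(get_worker_records):
    records = dict(get_worker_records())
    if not records:
        return  # no heartbeats recorded at all; proceed with migrations
    newest = max(records.values())
    print('Checking whether these workers are still running: %s' % ', '.join(sorted(records)))
    wait = (newest + timedelta(seconds=WAIT_SECONDS) - datetime.utcnow()).total_seconds()
    if wait > 0:
        print('Waiting %d seconds for any running workers to record a new heartbeat...' % wait)
        time.sleep(wait)
    # Check the DB again: a timestamp newer than 'newest' means a worker is still alive.
    still_running = [name for name, ts in dict(get_worker_records()).items() if ts > newest]
    if still_running:
        print('Workers still appear to be running: %s' % ', '.join(still_running))
        print('Stop all Pulp services before running pulp-manage-db.')
        sys.exit(1)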

Actions #22

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone set to 31
  • Groomed changed from No to Yes
Actions #23

Updated by bizhang over 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bizhang
Actions #24

Updated by bizhang about 7 years ago

  • Status changed from ASSIGNED to POST
Actions #25

Updated by bizhang about 7 years ago

The heartbeat changes have been moved to a separate story, #2509.

Added by bizhang about 7 years ago

Revision 798a9315 | View on GitHub

Add pulp-manage-db check for running workers.

pulp-manage-db command now waits to check if workers have been stopped. If there are running workers the migration is stopped

closes #2186 https://pulp.plan.io/issues/2186

Actions #26

Updated by bizhang about 7 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #27

Updated by semyers about 7 years ago

  • Platform Release set to 2.12.0
Actions #29

Updated by semyers about 7 years ago

  • Status changed from MODIFIED to 5
Actions #30

Updated by bmbouter about 7 years ago

  • Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running locally to As a user, pulp-manage-db refuses to run if other pulp processes are running
Actions #32

Updated by semyers about 7 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #33

Updated by bmbouter about 6 years ago

  • Sprint set to Sprint 13
Actions #34

Updated by bmbouter about 6 years ago

  • Sprint/Milestone deleted (31)
Actions #35

Updated by bmbouter almost 5 years ago

  • Tags Pulp 2 added
