Pulp

Updated by mhrivnak almost 8 years ago

Tracker changed from Issue to Story
Subject changed from check for active pulp workers at start of manage-db to As a user, pulp-manage-db refuses to run if other pulp processes are running
Description updated (diff)

Actions

Updated by mhrivnak almost 8 years ago

On a multi-machine deployment, this could be difficult to guarantee, but pulp-manage-db could make a best effort.

Actions

Updated by bmbouter almost 8 years ago

Pulp is a clustered software. We have to assume that there are multiple machines with pulp workers running on them. Can you elaborate on what a best-effort solution would look like?

Actions

Updated by mhrivnak almost 8 years ago

A best effort could, for example, check the local process list for anything (besides itself) containing the word "pulp", and if found, ask the user if they're sure they want to proceed. Slightly fancier would be to check the database for any known workers or any tasks listed in a running state.

Of course that would not be a guarantee or catch all cases, but even the simple process check would have caught all of the recent cases we've been seeing where people ran pulp-manage-db on a single-system deployment while pulp was running. Those cases lead to database corruption.

Actions

Updated by bmbouter almost 8 years ago

Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running to As a user, pulp-manage-db refuses to run if other pulp processes are running locally

I had not considered checking something locally. I always think about Pulp as being clustered. The process check would provide a level of fault tolerance for single-box deployments so we can leave this open with that implementation in mind. I re-titled the story to math that it's only a partial solution.

Actions

Updated by mhrivnak over 7 years ago

Sprint/Milestone set to 27
Groomed changed from No to Yes
Sprint Candidate changed from No to Yes

Actions

Updated by bizhang over 7 years ago

Status changed from NEW to ASSIGNED
Assignee set to bizhang

Actions

#10

Updated by bizhang over 7 years ago

I had a discussion with Brian regarding this, and here are the pros and cons of each opinion:

1.) Check DB listing

Would work in a clustered environment
status.get_workers() will have celery_beat worker left over even when all processes have stopped
- We can try to check the worker timestamp but can't tell for sure until 390s have elapsed
- there are some failure scenarios where processes could be running but the timestamps have stopped being recorded
- if we proceed with this changes need to made to scheduler.py for a sigterm handler to mitigate the above concerns

2.) Check local process listing

Can tell if there are any workers left real time
does not support clustered upgrade

Personally I believe that 1 is the correct way to go, and will make the changes in scheduler.py to support this.

Actions

#11

Updated by bmbouter over 7 years ago

I see a lot of usable value in option 1 also. Thanks bizhang!

To correct a point I made from IRC, I think you can use a failover time of 300 seconds. The 390 number is the maximum delay until recover from failover, but if all records are over 300 seconds old then we can continue with the migrations.

Actions

#12

Updated by bmbouter over 7 years ago

Also if the database is down we should check that pulp-manage-db handles it gracefully. It wouldn't be good to have a traceback occur because this database check can't connect. I'm not sure what it does today or where that code lives.

Actions

Revision 4f43a85d | View on GitHub

#13

Updated by bizhang over 7 years ago

Status changed from ASSIGNED to POST

PR up: https://github.com/pulp/pulp/pull/2781

Added by bizhang over 7 years ago

Prompt user of running workers/tasks on migration

Add close() method to celery scheduler for clean exit. Catch ServerSelectionTimeoutError to handle when mongo is down.

Revision 4f43a85d | View on GitHub

Added by bizhang over 7 years ago

Prompt user of running workers/tasks on migration

Add close() method to celery scheduler for clean exit. Catch ServerSelectionTimeoutError to handle when mongo is down.

Actions

#14

Updated by bizhang over 7 years ago

Status changed from POST to MODIFIED
% Done changed from 0 to 100

Applied in changeset pulp|4f43a85dd568f4a0b50ae9b07bbec7138861e92b.

Actions

Revision 75f0ac09 | View on GitHub

#15

Updated by semyers over 7 years ago

Platform Release set to 2.11.0

Added by bizhang over 7 years ago

datetime.total_time() not supported by python 2.6

Actions

Revision 70f8726b | View on GitHub

#16

Updated by semyers over 7 years ago

Status changed from MODIFIED to 5

Added by semyers over 7 years ago

Update release notes about migration caveat

closes #2468 https://pulp.plan.io/issues/2468

Added by bizhang over 7 years ago

Revision bf2d1d4a | View on GitHub

Revert "Update release notes about migration caveat"

This reverts commit 70f8726b96decbcfa9ef1b799809beeb864d50a4.

re #2186 #2468 https://pulp.plan.io/issues/2186 https://pulp.plan.io/issues/2468

Added by bizhang over 7 years ago

Revision bf2d1d4a | View on GitHub

Revert "Update release notes about migration caveat"

This reverts commit 70f8726b96decbcfa9ef1b799809beeb864d50a4.

re #2186 #2468 https://pulp.plan.io/issues/2186 https://pulp.plan.io/issues/2468

Added by bizhang over 7 years ago

Revision 38e60824 | View on GitHub

Revert "datetime.total_time() not supported by python 2.6"

This reverts commit 75f0ac094ec3dd0bc0efd96670fe036ae7aa1982.

Revision d633f819 | View on GitHub

Added by bizhang over 7 years ago

Revert "Prompt user of running workers/tasks on migration"

This reverts commit 4f43a85dd568f4a0b50ae9b07bbec7138861e92b.

Revision 38e60824 | View on GitHub

Added by bizhang over 7 years ago

Revert "datetime.total_time() not supported by python 2.6"

This reverts commit 75f0ac094ec3dd0bc0efd96670fe036ae7aa1982.

Revision d633f819 | View on GitHub

Added by bizhang over 7 years ago

Revert "Prompt user of running workers/tasks on migration"

This reverts commit 4f43a85dd568f4a0b50ae9b07bbec7138861e92b.

Actions

#17

Updated by bizhang over 7 years ago

Status changed from 5 to ASSIGNED
Sprint/Milestone changed from 27 to 30
Platform Release deleted (~~2.11.0~~)

Actions

#18

Updated by bmbouter over 7 years ago

Status changed from ASSIGNED to NEW
Assignee deleted (~~bizhang~~)
Sprint/Milestone deleted (30)
Groomed changed from Yes to No

Taking this issue back to the planning phase and I will send out a note to pulp-dev to discuss options.

Actions

#19

Updated by semyers over 7 years ago

https://github.com/pulp/pulp/pull/2883 is the PR that reverted this feature. I've merged the revert forward to master, so evidence of this feature no longer exists from 2.11-dev forward; you ca use the referenced PR to find easy commit hashes for restoring it when we pick this work back up.

Actions

[0]: https://www.redhat.com/archives/pulp-dev/2016-December/msg00025.html
[1]: https://github.com/werwty/pulp/blob/4f43a85dd568f4a0b50ae9b07bbec7138861e92b/common/pulp/common/constants.py#L68
[2]: https://github.com/pulp/pulp/blob/master/server/usr/lib/systemd/system/pulp_resource_manager.service#L12

#21

Updated by bmbouter over 7 years ago

Description updated (diff)

Rewriting the bug based on the discussion from pulp-dev[0].

Here is a pseudo code representation:

most_recent_worker_timestamp = Workers.objects.get_the_most_recent_worker_timestamp
# show a message to the users indicating which workers are being checked to see if they are running (show the names) and how many seconds the user has to wait
sleep(most_recent_worker_timestamp + 60 seconds - now)  # sleep until the time that 60 seconds have passed since the most recent timestamp was observed
# check the db again and see if there is a newer timestamp than most_recent_worker_timestamp
if there_is_newer:
    # show the user a message indicating there are still workers running and show them the worker names
    # exit with a non zero exit code
else:
    # keep going!

To facilitate this, we should adjust the timings of heartbeats for pulp_celerybeat, pulp_resource_manager, and pulp_workers. All of them should use a heartbeat value of 20 seconds. For pulp_celerybeat that is set here[1]. For pulp_workers that is done through the init scripts and systemd scripts which pass a command line argument. Here is 1 link[2] as an example, but this is probably in 4 places at least.

Also can the heartbeat timing changes all be in 1 commit and the other changes (code and docs) be in another commit. Packagers may want to pick the heartbeat changes without the pulp-manage-db changes both coming together.

Actions

#22

Updated by mhrivnak over 7 years ago

Sprint/Milestone set to 31
Groomed changed from No to Yes

Actions

#23

Updated by bizhang over 7 years ago

Status changed from NEW to ASSIGNED
Assignee set to bizhang

Actions

PR: https://github.com/pulp/pulp/pull/2903

#24

Updated by bizhang over 7 years ago

Status changed from ASSIGNED to POST

Actions

Revision 798a9315 | View on GitHub

#25

Updated by bizhang over 7 years ago

The heartbeat changes have been moved to a separate story #2509

Added by bizhang over 7 years ago

Add pulp-manage-db check for running workers.

pulp-manage-db command now waits to check if workers have been stopped. If there are running workers the migration is stopped

Revision 798a9315 | View on GitHub

Added by bizhang over 7 years ago

Add pulp-manage-db check for running workers.

pulp-manage-db command now waits to check if workers have been stopped. If there are running workers the migration is stopped

Actions

#26

Updated by bizhang over 7 years ago

Status changed from POST to MODIFIED
% Done changed from 0 to 100

Applied in changeset pulp|798a9315c42f624bcf2a785acd19ee7c194e77d4.

Actions

#27

Updated by semyers over 7 years ago

Platform Release set to 2.12.0

Actions

#29

Updated by semyers over 7 years ago

Status changed from MODIFIED to 5

Actions

#30

Updated by bmbouter over 7 years ago

Subject changed from As a user, pulp-manage-db refuses to run if other pulp processes are running locally to As a user, pulp-manage-db refuses to run if other pulp processes are running

Actions

#32

Updated by semyers over 7 years ago

Status changed from 5 to CLOSED - CURRENTRELEASE

Actions

#33

Updated by bmbouter over 6 years ago

Sprint set to Sprint 13

Actions

#34

Updated by bmbouter over 6 years ago

Sprint/Milestone deleted (31)

Actions