Issue #757

How to stop workers on remote machine

Added by jluza over 6 years ago. Updated over 2 years ago.

Status: CLOSED - WORKSFORME
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version: 2.5
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

If I stop workers on a remote machine, the resources reserved for those workers stay locked. I have just been told that there is no way to stop only specific workers in Pulp. This could be a serious problem, because if some workers crash (for various reasons, including a hardware incident), the resources stay locked even after the workers are restarted.
I don't know of any way to remove a reserved resource other than going directly into mongo and removing the lock manually. This is quite impractical and also hard to do, because we don't have (and won't have) direct access to the mongo database.

Please also consider this issue an RFE for:
- a way to stop workers
- a safe worker stop should be something like this: remove the worker from db.workers so it can't accept any tasks, [wait for the running task], remove the reserved resources for this worker, kill the worker

- a way to unlock reserved resources
- a worker could also keep a list of its reserved resources in memory. After restarting, the worker would check the db for any remaining reserved resources and remove them. I don't think it's possible to resume a task after a worker restart, so the reserved resources wouldn't need to stay reserved.
- the resource manager could also remove reserved resources in the process of removing inactive workers.
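
The stop sequence proposed above can be sketched as follows. This is only an illustration under stated assumptions: plain Python containers stand in for the workers and reserved_resources collections, and safe_stop_worker, wait_for_task, kill, and the worker names are hypothetical, not Pulp APIs or real data.

```python
from typing import Callable, Dict, List


def safe_stop_worker(
    worker_name: str,
    workers: Dict[str, float],
    reserved_resources: List[dict],
    wait_for_task: Callable[[str], None],
    kill: Callable[[str], None],
) -> List[dict]:
    """Stop one worker in the order proposed in this issue.

    All arguments are in-memory stand-ins (assumptions), not Pulp objects.
    Returns the reservations still held by other workers.
    """
    # 1. Remove the worker record so it cannot be assigned new tasks.
    workers.pop(worker_name, None)
    # 2. Wait for the task the worker is currently running, if any.
    wait_for_task(worker_name)
    # 3. Release every resource reserved for this worker.
    remaining = [r for r in reserved_resources if r["worker_name"] != worker_name]
    # 4. Only now terminate the worker process.
    kill(worker_name)
    return remaining


if __name__ == "__main__":
    # Hypothetical worker names and resource ids, for illustration only.
    workers = {"worker-0@host-b": 1.0, "worker-1@host-b": 2.0}
    reservations = [
        {"worker_name": "worker-0@host-b", "resource_id": "repository:a"},
        {"worker_name": "worker-1@host-b", "resource_id": "repository:b"},
    ]
    left = safe_stop_worker(
        "worker-0@host-b",
        workers,
        reservations,
        wait_for_task=lambda name: None,  # placeholder: nothing is running
        kill=lambda name: None,           # placeholder: no real process
    )
    print(sorted(workers), [r["resource_id"] for r in left])
```

The ordering is the point of the sketch: the worker record is removed before anything else, so no new task can be routed to the worker while its reservations are being released.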

History

#1 Updated by bmbouter over 6 years ago

  • Status changed from NEW to 7
  • Triaged changed from No to Yes

jluza wrote:

If I stop workers on a remote machine, the resources reserved for those workers stay locked. I have just been told that there is no way to stop only specific workers in Pulp. This could be a serious problem, because if some workers crash (for various reasons, including a hardware incident), the resources stay locked even after the workers are restarted.

The upcoming 2.6.0 behavior handles this situation correctly.

I don't know of any way to remove a reserved resource other than going directly into mongo and removing the lock manually. This is quite impractical and also hard to do, because we don't have (and won't have) direct access to the mongo database.

Please also consider this issue an RFE for:
- a way to stop workers

The normal method for starting and stopping pulp_workers should be used for this.

- a safe worker stop should be something like this: remove the worker from db.workers so it can't accept any tasks, [wait for the running task], remove the reserved resources for this worker, kill the worker

- a way to unlock reserved resources

The upcoming Pulp 2.6.0 will automatically recover locked resources.

- a worker could also keep a list of its reserved resources in memory. After restarting, the worker would check the db for any remaining reserved resources and remove them. I don't think it's possible to resume a task after a worker restart, so the reserved resources wouldn't need to stay reserved.

The database is the reference for the shared state of reserved resources. It works well to keep it there.

- the resource manager could also remove reserved resources in the process of removing inactive workers.

This is also the behavior of 2.6.0.

I'm closing this because 2.6.0 handles all situations requested here. Try the 2.6.0 beta and let us know of improvements you see in its behavior.

#2 Updated by mhrivnak over 6 years ago

It seems there is some confusion about this bug report. The problem statement I believe is just this:

"If I stop workers on remote machine, reserved resources for these workers stay locked. I just been told, there's no way how to stop only specified workers in pulp."

That is not correct, so apparently there has been some misunderstanding. I will clarify the important points.

First, it does not matter what machine the worker is running on. When stopping a worker, the behavior will be the same regardless of what machine the process is running on.

If a worker stops for any reason, its queued tasks will get cancelled automatically. If the worker re-starts, pre-existing tasks will get cancelled/cleaned up when it starts. If the worker is down for more than 5 minutes, pulp will automatically clean up its leftover tasks.

You can stop an individual worker on el7 by issuing a command like this:

systemctl stop pulp_worker-0

Otherwise you can send a worker SIGTERM to have it exit as soon as it completes its current task, or SIGQUIT to make it cancel its current task and exit immediately.

It is recommended that you stop a worker while it is idle, so that no tasks need to get cancelled.
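
The two signals described above can be sent from a script as well as from a shell. The following is a minimal sketch, assuming the caller already knows the PID of the worker process; stop_signal and stop_worker are hypothetical helper names, not part of Pulp or Celery.

```python
import os
import signal


def stop_signal(graceful: bool) -> signal.Signals:
    """Pick the signal described in the comment above.

    SIGTERM: exit as soon as the current task completes.
    SIGQUIT: cancel the current task and exit immediately.
    """
    return signal.SIGTERM if graceful else signal.SIGQUIT


def stop_worker(pid: int, graceful: bool = True) -> None:
    """Send the chosen signal to a worker process.

    `pid` must be the PID of a worker process; finding it is left to
    the caller (assumption: it is read from a pidfile or process list).
    """
    os.kill(pid, stop_signal(graceful))
```

Defaulting to the graceful signal matches the recommendation to avoid cancelling tasks where possible.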

#3 Updated by dgregor@redhat.com over 6 years ago

mhrivnak wrote:

If a worker stops for any reason, its queued tasks will get cancelled automatically. If the worker re-starts, pre-existing tasks will get cancelled/cleaned up when it starts. If the worker is down for more than 5 minutes, pulp will automatically clean up its leftover tasks.

Can you clarify "its queued tasks will get cancelled automatically"? Are tasks considered queued when they are in WAITING state, or only when RUNNING?

#4 Updated by jluza over 6 years ago

OK, so again:

Here is my reproducer:

We have a shellshock test: 34 packages to 177 repos. I waited until the process reached the stage of publishing repositories, then checked through flower that tasks were assigned to a remote worker. Just as a reminder: we have 4 workers running on the fte01 server, which also runs celerybeat and resource_manager, and 4 workers on fte02.
celery -A pulp.server.async.app status reports 9 workers (4+4+resource_manager).
So while the fte02 workers were publishing some repositories, I ran exactly this on the fte02 server:

$ sudo sh -c "service httpd stop; service pulp_workers stop"

And now, at least 10 minutes after I executed the command above:

on fte01, log in to mongo and run:

db.reserved_resources.find()

and here's the output:

{ "_id" : "1ad0c039-e986-4857-b866-90eff205732e", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-desktop-rpms__5Client__i386" }
{ "_id" : "7e36cb23-28e6-4929-b4bc-a5108fe05510", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-desktop-source-rpms__5Client__i386" }
{ "_id" : "3656ca56-3d31-46cc-9f45-2e9b24ae076a", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-desktop-source-rpms__5Client__x86_64" }
{ "_id" : "c43373b9-dcce-480a-83ad-e27b755caba3", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-for-power-rpms__5Server__ppc" }
{ "_id" : "e52eceba-67b7-483c-aae4-1d9bec92d8f2", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-server-debug-rpms__5Server__ia64" }
{ "_id" : "b5a12c77-a772-4971-bf25-8d3a63a0b323", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-server-debug-rpms__5Server__x86_64" }
{ "_id" : "3e85aef3-9a3b-4078-8196-45b947647e49", "worker_name" : "", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-server-rhui-debug-rpms__5Server__i386" }

pulpdocker:PRIMARY> db.workers.find()
{ "_id" : "", "last_heartbeat" : 1426667938.929607 }
{ "_id" : "", "last_heartbeat" : 1426667939.70187 }
{ "_id" : "", "last_heartbeat" : 1426667938.37051 }
{ "_id" : "", "last_heartbeat" : 1426667937.950169 }
{ "_id" : "", "last_heartbeat" : 1426676794.669609 }
{ "_id" : "", "last_heartbeat" : 1426676795.809993 }
{ "_id" : "", "last_heartbeat" : 1426676800.822848 }

So it looks like only one of the 4 fte02 workers was removed.
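
One way to read the last_heartbeat values above is to compare each one with the newest. The sketch below does that for the seven timestamps listed, using the 5-minute figure from comment #2 as the threshold. stale_heartbeats is a hypothetical helper, and treating the newest heartbeat as "now" is an assumption, since the time of the query is not recorded here.

```python
from typing import List

# last_heartbeat values copied from the db.workers.find() output above.
HEARTBEATS = [
    1426667938.929607,
    1426667939.70187,
    1426667938.37051,
    1426667937.950169,
    1426676794.669609,
    1426676795.809993,
    1426676800.822848,
]

# Five minutes, the cleanup delay mentioned in comment #2.
TIMEOUT_SECONDS = 300.0


def stale_heartbeats(heartbeats: List[float], now: float, timeout: float) -> List[float]:
    """Return the heartbeats that are older than `timeout` seconds at `now`."""
    return [hb for hb in heartbeats if now - hb > timeout]


if __name__ == "__main__":
    # Assumption: use the newest heartbeat as the reference time.
    now = max(HEARTBEATS)
    stale = stale_heartbeats(HEARTBEATS, now, TIMEOUT_SECONDS)
    print(len(stale), "of", len(HEARTBEATS), "records are stale")
```

With these values, four of the seven records are older than the threshold relative to the newest heartbeat.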

on fte02:

service --status-all

Using config script: /etc/default/pulp_workers
node reserved_resource_worker-0 is stopped...
node reserved_resource_worker-1 is stopped...
node reserved_resource_worker-2 is stopped...
node reserved_resource_worker-3 is stopped...

Just a reminder: we merged from upstream pulp-2.5.3-1, and I'm pretty sure we didn't make any changes to the handling of workers or celery.

#5 Updated by jluza over 6 years ago

A small update:

This is the output log from stopping pulp_workers:
[root@pulp-fte02 ~]# sudo sh -c "service httpd stop; service pulp_workers stop"
Stopping httpd: [ OK ]
celery init v10.0.
Using config script: /etc/default/pulp_workers
celery multi v3.1.11 (Cipater)

Stopping nodes...

> : QUIT -> 26594
> : QUIT -> 26540
> : QUIT -> 26563
> : QUIT -> 26621

Waiting for 4 nodes -> 26594, 26540, 26563, 26621........

> : OK

Waiting for 3 nodes -> 26540, 26563, 26621.............. (lots of dots here) ........................

> : OK

Waiting for 2 nodes -> 26540, 26621.....

> : OK

Waiting for 1 node -> 26540......

> : OK

#6 Updated by mhrivnak over 6 years ago

wrote:

Can you clarify "its queued tasks will get cancelled automatically"? Are tasks considered queued when they are in WAITING state, or only when RUNNING?

Any task in WAITING state is in the queue (literally in a queue on the message broker).

#7 Updated by mhrivnak over 6 years ago

  • Triaged changed from Yes to No

It appears that the deployment you are using has highly modified code that basically constitutes a fork of pulp. In its current state, it has changes to many parts of the code that affect worker cleanup. We are not able to reproduce this problem with a pure upstream release of 2.5.3, so we will leave this closed.

If you can reproduce this problem on upstream pulp, please re-open.

#8 Updated by bmbouter over 6 years ago

  • Status changed from 7 to CLOSED - WORKSFORME

#9 Updated by bmbouter over 6 years ago

  • Severity changed from Medium to 2. Medium

#10 Updated by bmbouter over 6 years ago

  • Triaged changed from No to Yes

#11 Updated by bmbouter over 2 years ago

  • Tags Pulp 2 added
