OK, so again:
Here is my reproducer:
We have a shellshock test: 34 packages to 177 repos. I waited until the process reached the stage of publishing repositories, then checked through flower that tasks were assigned to the remote workers. Just as a reminder: we have 4 workers running on the fte01 server, where celerybeat and resource_manager are also running, and 4 workers on fte02.
celery -A pulp.server.async.app status reports 9 workers (4 + 4 + resource_manager).
So while the fte02 workers were publishing some repositories, I ran exactly this on the fte02 server:
$ sudo sh -c "service httpd stop; service pulp_workers stop"
It has now been at least 10 minutes since I executed the command above.
On fte01, I logged into mongo and ran:
db.reserved_resources.find()
and here's the output:
{ "_id" : "1ad0c039-e986-4857-b866-90eff205732e", "worker_name" : "reserved_resource_worker-2@pulp-fte02.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-desktop-rpms__5Client__i386" }
{ "_id" : "7e36cb23-28e6-4929-b4bc-a5108fe05510", "worker_name" : "reserved_resource_worker-1@pulp-fte02.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-desktop-source-rpms__5Client__i386" }
{ "_id" : "3656ca56-3d31-46cc-9f45-2e9b24ae076a", "worker_name" : "reserved_resource_worker-3@pulp-fte02.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-desktop-source-rpms__5Client__x86_64" }
{ "_id" : "c43373b9-dcce-480a-83ad-e27b755caba3", "worker_name" : "reserved_resource_worker-1@pulp-fte01.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-for-power-rpms__5Server__ppc" }
{ "_id" : "e52eceba-67b7-483c-aae4-1d9bec92d8f2", "worker_name" : "reserved_resource_worker-0@pulp-fte01.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-server-debug-rpms__5Server__ia64" }
{ "_id" : "b5a12c77-a772-4971-bf25-8d3a63a0b323", "worker_name" : "reserved_resource_worker-3@pulp-fte01.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-server-debug-rpms__5Server__x86_64" }
{ "_id" : "3e85aef3-9a3b-4078-8196-45b947647e49", "worker_name" : "reserved_resource_worker-2@pulp-fte01.web.dev.int.devlab.redhat.com", "_ns" : "reserved_resources", "resource_id" : "repository:rhel-5-server-rhui-debug-rpms__5Server__i386" }
pulpdocker:PRIMARY> db.workers.find()
{ "_id" : "reserved_resource_worker-0@pulp-fte01.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426667938.929607 }
{ "_id" : "reserved_resource_worker-1@pulp-fte01.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426667939.70187 }
{ "_id" : "reserved_resource_worker-2@pulp-fte01.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426667938.37051 }
{ "_id" : "reserved_resource_worker-3@pulp-fte01.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426667937.950169 }
{ "_id" : "reserved_resource_worker-2@pulp-fte02.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426676794.669609 }
{ "_id" : "reserved_resource_worker-3@pulp-fte02.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426676795.809993 }
{ "_id" : "reserved_resource_worker-1@pulp-fte02.web.dev.int.devlab.redhat.com", "last_heartbeat" : 1426676800.822848 }
So it looks like only one of the 4 fte02 workers (reserved_resource_worker-0) was removed.
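The comparison above can also be scripted. This is only a minimal sketch: it assumes the documents were copied by hand from the mongo shell output into Python lists (trimmed to the fields that matter), and the helper name leftover_for_host is hypothetical, not something Pulp provides.

```python
# Hypothetical helper, not part of Pulp: list what is still recorded for a
# host whose workers have been stopped.
def leftover_for_host(workers, reservations, host):
    suffix = "@" + host
    # Worker documents that are still present in db.workers for this host.
    still_recorded = sorted(
        w["_id"] for w in workers if w["_id"].endswith(suffix)
    )
    # Reservations that still point at a worker on this host.
    still_reserved = sorted(
        (r["worker_name"], r["resource_id"])
        for r in reservations
        if r["worker_name"].endswith(suffix)
    )
    return still_recorded, still_reserved


FTE01 = "pulp-fte01.web.dev.int.devlab.redhat.com"
FTE02 = "pulp-fte02.web.dev.int.devlab.redhat.com"

# Trimmed copies of the db.workers.find() output above.
workers = [{"_id": "reserved_resource_worker-%d@%s" % (n, FTE01)} for n in range(4)]
workers += [{"_id": "reserved_resource_worker-%d@%s" % (n, FTE02)} for n in (2, 3, 1)]

# Trimmed copies of the fte02 rows from db.reserved_resources.find() above.
reservations = [
    {"worker_name": "reserved_resource_worker-2@" + FTE02,
     "resource_id": "repository:rhel-5-desktop-rpms__5Client__i386"},
    {"worker_name": "reserved_resource_worker-1@" + FTE02,
     "resource_id": "repository:rhel-5-desktop-source-rpms__5Client__i386"},
    {"worker_name": "reserved_resource_worker-3@" + FTE02,
     "resource_id": "repository:rhel-5-desktop-source-rpms__5Client__x86_64"},
]

recorded, reserved = leftover_for_host(workers, reservations, FTE02)
for name in recorded:
    print("still in db.workers:", name)
for name, resource in reserved:
    print("still reserved:", resource, "by", name)
```

With the data from this comment, it lists the three fte02 workers that are still in db.workers and the three reservations they still hold.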
And on fte02 the workers really are stopped:
service --status-all
Using config script: /etc/default/pulp_workers
node reserved_resource_worker-0 is stopped...
node reserved_resource_worker-1 is stopped...
node reserved_resource_worker-2 is stopped...
node reserved_resource_worker-3 is stopped...
Just a reminder: we merged from upstream pulp-2.5.3-1, and I'm pretty sure we didn't make any changes to the handling of workers or the celery stuff.