Issue #1838 (closed)

Tasks being stuck

Added by mihai.ibanescu@gmail.com over 8 years ago. Updated over 4 years ago.

Status:
CLOSED - NOTABUG
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
2.7.1
Platform Release:
OS:
RHEL 7
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

I now have 3 tasks that are stuck in "Waiting".

We have two hosts running as an HA cluster, with corosync as the heartbeat. Celery runs on both, so both should process tasks. The resource manager runs on only one host and is moved to the other if corosync determines the primary is dead.

Here is some debug output:

2016-04-12 09:22:45,763 - DEBUG - sending GET request to /pulp/api/v2/tasks/622041ac-e9e4-4a15-bd7c-7c98a17782e0/
2016-04-12 09:22:46,023 - INFO - GET request to /pulp/api/v2/tasks/622041ac-e9e4-4a15-bd7c-7c98a17782e0/ with parameters None
2016-04-12 09:22:46,023 - INFO - Response status : 200 

2016-04-12 09:22:46,023 - INFO - Response body :
 {
  "exception": null, 
  "task_type": "pulp.server.managers.repo.publish.publish", 
  "_href": "/pulp/api/v2/tasks/622041ac-e9e4-4a15-bd7c-7c98a17782e0/", 
  "task_id": "622041ac-e9e4-4a15-bd7c-7c98a17782e0", 
  "tags": [
    "pulp:repository:thirdparty-snapshot-rpm-latest", 
    "pulp:action:publish"
  ], 
  "finish_time": null, 
  "_ns": "task_status", 
  "start_time": null, 
  "traceback": null, 
  "spawned_tasks": [], 
  "progress_report": {}, 
  "queue": "None.dq", 
  "state": "waiting", 
  "worker_name": null, 
  "result": null, 
  "error": null, 
  "_id": {
    "$oid": "5705bd46cbdef6e14906bf98"
  }, 
  "id": "5705bd46cbdef6e14906bf98"
}

Operations:       publish
Resources:        thirdparty-snapshot-rpm-latest (repository)
State:            Waiting
Start Time:       Unstarted
Finish Time:      Incomplete
Result:           Incomplete
Task Id:          622041ac-e9e4-4a15-bd7c-7c98a17782e0
Progress Report:  
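
For completeness, the record above can also be fetched directly from the REST API; a minimal sketch with curl (assuming the default admin:admin credentials, which you should substitute with your own; -k skips certificate verification):

# Fetch the stuck task's status document and pretty-print it
curl -s -k -u admin:admin \
  https://localhost/pulp/api/v2/tasks/622041ac-e9e4-4a15-bd7c-7c98a17782e0/ \
  | python -m json.tool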

Output of ps afuxw | grep celery:

On host1:

root      2921  0.0  0.0 112640   960 pts/2    S+   09:31   0:00  |                       \_ grep --color=auto celery
apache   21996  0.1  0.0 519060 62080 ?        Ssl  Apr06  10:43 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
apache   22119  2.6  0.1 654736 193452 ?       Rl   Apr06 220:36  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
apache   21998  0.1  0.0 518364 61656 ?        Ssl  Apr06  10:12 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30
apache   22124  0.3  0.0 544160 80196 ?        Sl   Apr06  25:32  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30
apache   22000  0.1  0.0 519052 61984 ?        Ssl  Apr06  10:56 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30
apache   22129  2.3  0.2 669752 208464 ?       Dl   Apr06 198:42  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30
apache   22002  0.1  0.0 518980 62028 ?        Ssl  Apr06  10:50 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30
apache   22126  2.5  0.4 867344 405440 ?       Dl   Apr06 217:02  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30
apache   22004  0.1  0.0 518972 62176 ?        Ssl  Apr06  10:41 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-4@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-4.pid --heartbeat-interval=30
apache   22128  2.3  0.2 681192 219840 ?       Dl   Apr06 196:41  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-4@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-4.pid --heartbeat-interval=30
apache   22006  0.1  0.0 518500 61580 ?        Ssl  Apr06  10:17 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-5@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-5.pid --heartbeat-interval=30
apache   22132  0.0  0.0 518960 54696 ?        Sl   Apr06   7:16  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-5@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-5.pid --heartbeat-interval=30
apache   22008  0.1  0.0 518364 61624 ?        Ssl  Apr06  10:20 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-6@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-6.pid --heartbeat-interval=30
apache   22120  0.3  0.0 519700 57868 ?        Dl   Apr06  31:11  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-6@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-6.pid --heartbeat-interval=30
apache   22010  0.1  0.0 518700 61616 ?        Ssl  Apr06  10:24 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
apache   22121  1.6  0.2 671912 210604 ?       Rl   Apr06 138:42  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
apache   21270  0.3  0.0 487004 27936 ?        Ssl  Apr11   2:41 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
apache   17185  0.5  0.0 522104 65144 ?        Ssl  08:59   0:10 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache   17289  5.9  0.0 518356 54268 ?        Sl   08:59   1:55  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30

On host2:

root      4431  0.0  0.0 112640   960 pts/0    S+   09:32   0:00  |                       \_ grep --color=auto celery
apache   14669  0.1  0.0 520664 63784 ?        Ssl  Apr06  12:17 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
apache   15042  1.9  0.1 652572 190552 ?       Dl   Apr06 166:59  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
apache   14671  0.1  0.0 520672 63668 ?        Ssl  Apr06  12:24 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30
apache   15046  2.4  0.1 618272 153048 ?       Sl   Apr06 205:57  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30
apache   14674  0.1  0.0 520168 63324 ?        Ssl  Apr06  12:07 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30
apache   15044  2.7  0.1 645860 184516 ?       Rl   Apr06 234:59  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30
apache   14676  0.1  0.0 520672 63816 ?        Ssl  Apr06  12:12 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30
apache   15048  2.7  0.2 665080 203128 ?       Dl   Apr06 230:19  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30
apache   14678  0.1  0.0 520664 63724 ?        Ssl  Apr06  12:18 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-4@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-4.pid --heartbeat-interval=30
apache   15045  2.3  0.2 680920 219648 ?       Rl   Apr06 201:53  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-4@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-4.pid --heartbeat-interval=30
apache   14681  0.1  0.0 520680 63792 ?        Ssl  Apr06  12:07 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-5@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-5.pid --heartbeat-interval=30
apache   15041  2.6  0.2 666260 204232 ?       Dl   Apr06 223:23  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-5@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-5.pid --heartbeat-interval=30
apache   14684  0.1  0.0 520168 63304 ?        Ssl  Apr06  11:44 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-6@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-6.pid --heartbeat-interval=30
apache   15043  0.1  0.0 534632 71388 ?        Sl   Apr06  13:16  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-6@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-6.pid --heartbeat-interval=30
apache   14693  0.1  0.0 520940 64036 ?        Ssl  Apr06  13:41 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
apache   15047  2.8  0.2 667648 205668 ?       Rl   Apr06 240:37  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-7@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-7.pid --heartbeat-interval=30
apache    1909  0.4  0.0 521980 64864 ?        Ssl  08:57   0:09 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache    2020  5.4  0.0 518348 54256 ?        Sl   08:57   1:51  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
#1

Updated by mihai.ibanescu@gmail.com over 8 years ago

We are running on RabbitMQ.
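
To double-check that both Pulp nodes are talking to the same broker, the cluster membership can be inspected on each host (standard rabbitmqctl command; the output lists the clustered nodes):

# Run on both hosts; both should report the same cluster members
rabbitmqctl cluster_status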

#2

Updated by bmbouter over 8 years ago

  • Description updated (diff)
#3

Updated by mihai.ibanescu@gmail.com over 8 years ago

[root@repulpmst01r ~]# rabbitmqctl list_queues
Listing queues ...
resource_manager@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-1@repulpmst01r.unx.sas.com.dq    0
celeryev.df343c1f-2803-4a44-990b-14d6d3c13801    0
reserved_resource_worker-0@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-4@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-6@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-7@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-6@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-1@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-3@repulpmst02r.unx.sas.com.dq    0
celeryev.5dcb0d61-11cf-4ab4-aa4e-adbc018d130a    0
celeryev.c9c6b8fb-f95c-4532-a6bf-5bb91e3a1865    0
celeryev.9e2f0f1e-5eb8-4e7e-a488-0d234ac91f01    0
reserved_resource_worker-3@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-7@repulpmst02r.unx.sas.com.dq    0
resource_manager@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-6@repulpmst01r.unx.sas.com.dq    0
celeryev.55205282-58b1-43fc-9647-bc201bedca65    0
reserved_resource_worker-5@repulpmst01r.unx.sas.com.celery.pidbox    0
celery    0
celeryev.54e4bdb1-aa71-423c-8589-ece93ed05c41    0
reserved_resource_worker-5@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-1@repulpmst02r.unx.sas.com.dq    0
celeryev.f8361df4-1a2b-4e49-861a-0da77d373c88    0
reserved_resource_worker-4@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-5@repulpmst02r.unx.sas.com.celery.pidbox    0
celeryev.ef25e2df-0c1b-4f92-9452-18765cb8b0ab    0
celeryev.1606c43b-42c2-4eb9-80be-82da99a5230d    0
reserved_resource_worker-0@repulpmst01r.unx.sas.com.dq    0
celeryev.10341687-d167-4033-9e39-6d8057160153    0
celeryev.4842ebb5-f26b-48ff-af59-f30a735a16ab    0
reserved_resource_worker-5@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-2@repulpmst02r.unx.sas.com.dq    0
celeryev.853c0d3d-f8b0-43d8-a73e-5d0024f83886    0
celeryev.80c223e1-1192-41d1-9007-94366e25f502    0
celeryev.9cf0c549-650d-453f-8ed4-43dd2c2058d6    0
celeryev.78f2aa99-0226-4998-a446-10d0ea348b82    0
reserved_resource_worker-2@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-3@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-7@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-6@repulpmst02r.unx.sas.com.dq    0
resource_manager@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-4@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-2@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-3@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-0@repulpmst01r.unx.sas.com.celery.pidbox    0
celeryev.2aac488f-f8b6-4b50-aea7-9014dac2693e    0
reserved_resource_worker-0@repulpmst02r.unx.sas.com.celery.pidbox    0
resource_manager    0
reserved_resource_worker-1@repulpmst01r.unx.sas.com.celery.pidbox    0
resource_manager@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-2@repulpmst01r.unx.sas.com.dq    0
celeryev.2ac48cd5-dfcc-455f-ba95-0a76b8dd6acc    0
pulp.task    0
reserved_resource_worker-4@repulpmst01r.unx.sas.com.celery.pidbox    0
celeryev.fefb61fc-2f85-4fde-8eea-2d4a3757d83a    0
celeryev.f8588021-36b9-4b9e-8795-b53711acd0eb    0
reserved_resource_worker-7@repulpmst01r.unx.sas.com.dq    0
[root@repulpmst02r ~]# rabbitmqctl list_queues
Listing queues ...
resource_manager@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-1@repulpmst01r.unx.sas.com.dq    0
celeryev.df343c1f-2803-4a44-990b-14d6d3c13801    0
reserved_resource_worker-0@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-4@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-6@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-7@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-6@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-1@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-3@repulpmst02r.unx.sas.com.dq    0
celeryev.5dcb0d61-11cf-4ab4-aa4e-adbc018d130a    0
celeryev.c9c6b8fb-f95c-4532-a6bf-5bb91e3a1865    0
celeryev.9e2f0f1e-5eb8-4e7e-a488-0d234ac91f01    0
reserved_resource_worker-3@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-7@repulpmst02r.unx.sas.com.dq    0
resource_manager@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-6@repulpmst01r.unx.sas.com.dq    0
celeryev.55205282-58b1-43fc-9647-bc201bedca65    0
reserved_resource_worker-5@repulpmst01r.unx.sas.com.celery.pidbox    0
celery    0
celeryev.54e4bdb1-aa71-423c-8589-ece93ed05c41    0
reserved_resource_worker-5@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-1@repulpmst02r.unx.sas.com.dq    0
celeryev.f8361df4-1a2b-4e49-861a-0da77d373c88    0
reserved_resource_worker-4@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-5@repulpmst02r.unx.sas.com.celery.pidbox    0
celeryev.ef25e2df-0c1b-4f92-9452-18765cb8b0ab    0
celeryev.1606c43b-42c2-4eb9-80be-82da99a5230d    0
reserved_resource_worker-0@repulpmst01r.unx.sas.com.dq    0
celeryev.10341687-d167-4033-9e39-6d8057160153    0
celeryev.4842ebb5-f26b-48ff-af59-f30a735a16ab    0
reserved_resource_worker-5@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-2@repulpmst02r.unx.sas.com.dq    0
celeryev.853c0d3d-f8b0-43d8-a73e-5d0024f83886    0
celeryev.80c223e1-1192-41d1-9007-94366e25f502    0
celeryev.9cf0c549-650d-453f-8ed4-43dd2c2058d6    0
celeryev.78f2aa99-0226-4998-a446-10d0ea348b82    0
reserved_resource_worker-2@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-3@repulpmst01r.unx.sas.com.dq    0
reserved_resource_worker-7@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-6@repulpmst02r.unx.sas.com.dq    0
resource_manager@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-4@repulpmst02r.unx.sas.com.dq    0
reserved_resource_worker-2@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-3@repulpmst02r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-0@repulpmst01r.unx.sas.com.celery.pidbox    0
celeryev.2aac488f-f8b6-4b50-aea7-9014dac2693e    0
reserved_resource_worker-0@repulpmst02r.unx.sas.com.celery.pidbox    0
resource_manager    0
reserved_resource_worker-1@repulpmst01r.unx.sas.com.celery.pidbox    0
resource_manager@repulpmst01r.unx.sas.com.celery.pidbox    0
reserved_resource_worker-2@repulpmst01r.unx.sas.com.dq    0
celeryev.2ac48cd5-dfcc-455f-ba95-0a76b8dd6acc    0
pulp.task    0
reserved_resource_worker-4@repulpmst01r.unx.sas.com.celery.pidbox    0
celeryev.fefb61fc-2f85-4fde-8eea-2d4a3757d83a    0
celeryev.f8588021-36b9-4b9e-8795-b53711acd0eb    0
reserved_resource_worker-7@repulpmst01r.unx.sas.com.dq    0
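
One additional check that might be useful here (a suggestion, not output from the affected system): listing consumer counts alongside queue depth shows whether anything is actually subscribed to the resource_manager queue:

# Show queue name, message backlog, and number of attached consumers
rabbitmqctl list_queues name messages consumers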
#4

Updated by bmbouter over 8 years ago

You only have one RabbitMQ broker, right? All Pulp services need to use the same broker for Celery communication.

The "worker_name": null tells me that the task never got processed by the resource_manager and assigned to a specific worker. The resource_manager reads out of the resource_manager queue which shows a depth of 0 so the work is effectively "gone" by Pulp.

One thing about your deployment: you have two resource_managers in the cluster. Pulp is currently designed to work with only one resource manager per cluster, and having two could lead to strange issues. Could you do the following (a rough sketch of the corresponding commands follows the list):

0) Verify all of your workers are configured to connect to the same broker (see the server.conf settings on all nodes).
1) Ensure there is only one resource_manager running in the entire cluster.
2) Cancel all tasks so that none are in the waiting or running states.
3) Try to reproduce the issue and post back.
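
A rough sketch of the corresponding commands (service and file names are those of a standard Pulp 2 install; adjust for your layout, and <task-id> is a placeholder):

# 0) broker each node points at (broker_url in the [tasks] section)
grep -A 3 '^\[tasks\]' /etc/pulp/server.conf

# 1) confirm where the resource manager runs; stop it on the standby host
systemctl status pulp_resource_manager   # run on both hosts
systemctl stop pulp_resource_manager     # only on the host that should not run it

# 2) cancel anything still waiting or running
pulp-admin tasks list
pulp-admin tasks cancel --task-id <task-id>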

#5

Updated by mihai.ibanescu@gmail.com over 8 years ago

I will try to answer this to the best of my understanding, given that I am not the one who set up the cluster.

Each node connects to RabbitMQ at localhost.

The queues on RabbitMQ are configured to be HA, so they should be shared between the two RabbitMQ nodes.

There is only one resource manager in the cluster. Or at least there should only be one.

If you have different recommendations for connecting to a highly available RabbitMQ message bus, please let us know. We want an active/active setup as much as possible, with the understanding that there should be only one resource manager, which gets moved around by Pacemaker.
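
For reference, queue mirroring of this sort is typically enabled with a RabbitMQ policy along these lines (a sketch of the standard command; the policy name and pattern on our cluster may differ):

# Mirror all queues across the cluster nodes
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'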

#6

Updated by bmbouter over 8 years ago

mihai.ibanescu@gmail.com wrote:

> There is only one resource manager in the cluster. Or at least there should only be one.
>
> If you have different recommendations for connecting to a highly available RabbitMQ message bus, please let us know. We want an active/active setup as much as possible, with the understanding that there should be only one resource manager, which gets moved around by Pacemaker.

Based on your ps output in the issue description, you have two resource managers running. Please re-read the steps from comment 4 and try to reproduce the issue with only one resource manager running in your environment. This issue was skipped at triage today because it's unclear whether it is a legitimate bug. We'll need more info to move forward on this.

Also, joining #pulp would be a good way for us to resolve the issue synchronously. Feel free to ping my nick in there; I'm 'bmbouter'.

#7

Updated by bmbouter over 8 years ago

  • Status changed from NEW to CLOSED - NOTABUG
  • Triaged changed from No to Yes

After IRC discussion, it is believed that this issue was environmental: two resource managers were being run. I'm closing it as NOTABUG. If you experience an issue in the future, please reopen it, e-mail pulp-list, or discuss on IRC.

#8

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
#9

Updated by bmbouter over 4 years ago

  • Category deleted (14)

We are removing the 'API' category per open floor discussion June 16, 2020.
