Issue #1363

closed

tasks randomly stuck at waiting or running

Added by bmbouter over 8 years ago. Updated almost 5 years ago.

Status: CLOSED - NOTABUG
Priority: High
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

A Pulp system works normally and processes many tasks over a long period of time. Then, at some random time, a Pulp celery process (pulp_resource_manager, pulp_workers, or pulp_celerybeat) deadlocks. This is observable as a task that stays in the running state and never finishes, or stays in the waiting state and never starts. A task that is being processed is expected to write log statements, so if a worker has picked up a task you should see progress in the log from that worker.

We believe this only affects Qpid users. One of Pulp's dependencies, a package called python-qpid, had a deadlocking problem that was present until late 2015 but has since been fixed on almost all distros. We are considering whether there is a second root cause in python-qpid that also causes deadlock, and we are looking for users who are running one of the "fixed" versions of python-qpid listed below and still experiencing deadlock. If you have not upgraded to one of these versions, you really should.

RHEL6 - python-qpid-0.32-12.el6 which you should get from the "Qpid at Copr" repo here[0]
RHEL7 - python-qpid-0.32-12.el7 which you'll get from epel7
Fedora 22 - No fix is available for you at this time. You are exposed to deadlock.
Fedora 23 - python-qpid-0.32-12.f23
Fedora 24 - python-qpid-0.32-12.f24
Fedora Rawhide - python-qpid-0.32-12.f24
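
To check quickly whether an installed build is at least the fixed 0.32-12 release listed above, something like the following sketch may help. The function name qpid_is_fixed is made up for this example, the comparison relies on GNU `sort -V`, and it only approximates RPM's own version ordering, so treat the result as a hint rather than a definitive answer.

```shell
#!/bin/sh
# Sketch only: qpid_is_fixed is a hypothetical helper, not part of Pulp or Qpid.
# It approximates RPM version ordering with GNU `sort -V`.

FIXED="0.32-12"   # fixed version-release named in this issue

qpid_is_fixed() {
    # $1 is a version-release string such as "0.32-12.el7".
    # Drop the trailing dist tag (".el7", ".fc23", ...) before comparing.
    candidate="${1%%.[a-z]*}"
    # The candidate is new enough when FIXED sorts first (or the two are equal).
    lowest="$(printf '%s\n%s\n' "$FIXED" "$candidate" | sort -V | head -n 1)"
    [ "$lowest" = "$FIXED" ]
}

# Example use against the installed package (assumes an RPM-based system):
#   installed="$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' python-qpid)"
#   qpid_is_fixed "$installed" && echo "fixed build" || echo "please upgrade"
```

Running the commented example on Fedora 22 would report an older build, since no fixed package is listed for it.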

If you experience deadlock while running one of the versions above, please gather the output and files produced by the following commands, then tar them up or put them online somewhere. The core files are very important to deliver, but they will be too large to attach to this issue. Ideally, post a link to the large files on this issue.

# the python-qpid version you are running
rpm -qa | grep python-qpid

# some process information
ps -awfux
ps -efLm

# Qpid queue information
qpid-stat -q

# core dumps of your celery processes
for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done

Please also post all logs, including the Pulp logs. Make sure they cover the period from when the tasks were started up to the current time.

Also get dumps of two MongoDB collections:

mongo pulp_database --eval "db.task_status.find().pretty()" > task_status.json
mongo pulp_database --eval "db.reserved_resources.find().pretty()" > reserved_resources.json
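
If it is easier, the commands above can be wrapped in one script that writes everything into a single directory and tars it up. This is only a sketch: OUTDIR, save_output, and collect_all are names invented for this example, and it assumes rpm, qpid-stat, mongo, and gcore are on the PATH. Logs are not collected here and still need to be added by hand.

```shell
#!/bin/sh
# Sketch only: OUTDIR, save_output and collect_all are hypothetical names.

OUTDIR="${OUTDIR:-pulp-deadlock-$(date +%Y%m%d-%H%M%S)}"

save_output() {
    # Usage: save_output FILENAME COMMAND [ARGS...]
    # Stores stdout and stderr of COMMAND in $OUTDIR/FILENAME and keeps
    # going even if the command fails or is not installed.
    name="$1"; shift
    mkdir -p "$OUTDIR"
    "$@" > "$OUTDIR/$name" 2>&1 || echo "warning: '$*' exited non-zero" >&2
}

collect_all() {
    # the python-qpid version you are running
    save_output python-qpid.txt sh -c 'rpm -qa | grep python-qpid'
    # some process information
    save_output ps-awfux.txt ps -awfux
    save_output ps-efLm.txt ps -efLm
    # Qpid queue information
    save_output qpid-stat-q.txt qpid-stat -q
    # dumps of the two MongoDB collections
    save_output task_status.json \
        mongo pulp_database --eval "db.task_status.find().pretty()"
    save_output reserved_resources.json \
        mongo pulp_database --eval "db.reserved_resources.find().pretty()"
    # core dumps of the celery processes, written as $OUTDIR/core.<pid>
    for pid in $(ps -awfux | grep celery | grep "@" | awk '{ print $2 }'); do
        gcore -o "$OUTDIR/core" "$pid"
    done
    # bundle everything for upload
    tar -czf "$OUTDIR.tar.gz" "$OUTDIR"
}
```

Nothing runs until collect_all is called, for example by adding a final line containing just collect_all. Run it as a user that is allowed to attach to the celery processes, and expect the resulting tarball to be large because of the core files.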

[0]: http://qpid.apache.org/packages.html#epel
