Issue #1363

tasks randomly stuck at waiting or running

Added by bmbouter over 3 years ago. Updated 6 months ago.

Status: CLOSED - NOTABUG
Priority: High
Assignee: -
Category: -
Sprint/Milestone: -
Severity: 2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible: No
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint:

Description

A pulp system will be working normally. It will process many tasks for a long period of time. At some random time, a Pulp celery process (pulp_resource_manager, pulp_workers, or pulp_celerybeat) will deadlock. This is observable as a task being in the running state and never finishing, or being in the waiting state and never starting. A processing task is expected to write log statements, so if a task has been picked up you should see progress in the log from that worker.

We believe this only affects Qpid users. One of Pulp's dependencies, a package called python-qpid, had a deadlocking problem that was present until late 2015 but has since been fixed on almost all distros. We suspect there may be a second root cause in python-qpid that also causes deadlock, and we are looking for users who are running one of the "fixed" versions of python-qpid listed below and still experiencing deadlock. If you have not upgraded to one of these versions, you really should.

RHEL6 - python-qpid-0.32-12.el6, which you should get from the "Qpid at Copr" repo [0]
RHEL7 - python-qpid-0.32-12.el7, which you'll get from epel7
Fedora 22 - There is not a fix available at this time for you. You are exposed to deadlock.
Fedora 23 - python-qpid-0.32-12.f23
Fedora 24 - python-qpid-0.32-12.f24
Fedora Rawhide - python-qpid-0.32-12.f24

If you experience deadlock while running one of the versions listed above, please gather the output and files produced by the following commands and tar them up or put them online somewhere. The core files are very important, but they will be too large to attach to this issue, so ideally post a link to them on the issue.

# the python-qpid version you are running
rpm -qa | grep python-qpid

# some process information
ps -awfux
ps -efLm

# Qpid queue information
qpid-stat -q

# core dumps of your celery processes
for pid in $(ps -awfux| grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done

Also please post all logs, including the Pulp logs. Make sure they cover the period from when the stuck tasks were started up to the current time.

Also get dumps of two MongoDB collections:

mongo pulp_database --eval "db.task_status.find().pretty()" > task_status.json
mongo pulp_database --eval "db.reserved_resources.find().pretty()" > reserved_resources.json
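The collection steps above could be bundled into a single script along these lines. This is only a sketch: the output directory name (`pulp-deadlock-report`), the `collect` helper, and the per-command file names are hypothetical choices made for illustration, not anything Pulp provides. Tools that are not installed are noted and skipped rather than treated as errors.

```shell
#!/bin/sh
# Sketch only: bundles the diagnostics requested in this issue.
# OUT and the file names below are hypothetical, not defined by Pulp.
OUT="${OUT:-pulp-deadlock-report}"
mkdir -p "$OUT"

# Run a command, saving stdout to $OUT/<name> and stderr to $OUT/<name>.err.
# Tools that are not installed are noted in the .err file and skipped.
collect() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$OUT/$name" 2> "$OUT/$name.err" \
            || echo "exit status $?" >> "$OUT/$name.err"
    else
        echo "skipped: $1 not installed" > "$OUT/$name.err"
    fi
}

# the python-qpid version you are running
collect rpm-qa.txt rpm -qa
grep python-qpid "$OUT/rpm-qa.txt" > "$OUT/python-qpid-version.txt" 2>/dev/null || true

# some process information
collect ps-awfux.txt ps -awfux
collect ps-efLm.txt ps -efLm

# Qpid queue information
collect qpid-stat-q.txt qpid-stat -q

# dumps of the two MongoDB collections
collect task_status.json mongo pulp_database --eval "db.task_status.find().pretty()"
collect reserved_resources.json mongo pulp_database --eval "db.reserved_resources.find().pretty()"

# core dumps of the celery processes, written as $OUT/core.<pid>
if command -v gcore >/dev/null 2>&1; then
    for pid in $(ps -awfux 2>/dev/null | grep celery | grep "@" | awk '{ print $2 }'); do
        gcore -o "$OUT/core" "$pid"
    done
fi

# bundle everything into one archive for uploading
tar -czf "$OUT.tar.gz" "$OUT"
```

Run it as root on the affected Pulp host so that `gcore` can attach to the celery processes. The resulting archive will be large because of the core files, so share it through a link rather than as an attachment. The Pulp logs still need to be collected separately.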

[0]: http://qpid.apache.org/packages.html#epel

History

#1 Updated by mhrivnak over 3 years ago

  • Triaged changed from No to Yes

#2 Updated by bmbouter about 3 years ago

  • Description updated (diff)

#3 Updated by bmbouter about 3 years ago

  • Description updated (diff)

#4 Updated by bmbouter about 3 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (bmbouter)

This issue's purpose is to gather data about the problem. I'm not actively working on it, so I'm setting back to NEW.

#6 Updated by bmbouter almost 3 years ago

  • Status changed from NEW to CLOSED - NOTABUG

This bug has not collected any new reports of "stuck task" issues. As such I'm closing it as NOTABUG.
