Issue #1363

Updated by bmbouter over 8 years ago

A Pulp system will be working normally, processing many tasks over a long period of time. Then, at some random time, a single Pulp celery process (pulp_resource_manager, pulp_workers, or pulp_celerybeat) will halt, seeming to deadlock. This is observable as a task being in the running state and never finishing, or being in the waiting state and never starting. A processing task is expected to write log statements, so if a task has been picked up you should see progress in the log from that worker.

 We believe this only affects Qpid users. One of Pulp's dependencies, a package called python-qpid, had a deadlocking problem which was present until late in 2015 but has been fixed on almost all distros. We are considering that there may be a second root cause in python-qpid which will cause deadlock, and we are searching for users who are running one of the "fixed" versions of python-qpid and still experiencing deadlock. If you have not upgraded to one of these versions you really should.

 RHEL6 - python-qpid-0.32-12.el6 which you should get from the "Qpid Copr" repo here[0]
 RHEL7 - python-qpid-0.32-12.el7 which you'll get from epel7 
 Fedora 22 - There is not a fix available at this time for you. You are exposed to deadlock. 
 Fedora 23 - python-qpid-0.32-12.f23 
 Fedora 24 - python-qpid-0.32-12.f24 
 Fedora Rawhide - python-qpid-0.32-12.f24
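
 To check a host against the list above, a minimal sketch along these lines may help. It assumes an RPM-based system and that GNU @sort -V@ ordering is close enough to RPM's own version comparison for this purpose; the @version_ge@ helper is hypothetical and not part of Pulp or python-qpid.

```shell
#!/bin/sh
# Sketch: compare the installed python-qpid build against the fixed 0.32-12.
# ASSUMPTION: GNU `sort -V` ordering approximates RPM's version comparison here.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

fixed="0.32-12"
if command -v rpm >/dev/null 2>&1 &&
   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' python-qpid 2>/dev/null); then
    # Drop the dist tag (.el7 and similar) so only version-release is compared.
    base=$(printf '%s\n' "$installed" | head -n 1 | sed 's/\.[A-Za-z].*$//')
    if version_ge "$base" "$fixed"; then
        echo "python-qpid $installed is at or above $fixed"
    else
        echo "python-qpid $installed predates $fixed; please upgrade"
    fi
else
    echo "python-qpid not found via rpm"
fi
```

 The dist tag is stripped before comparing so that the check depends only on the version and release numbers, not on how a particular distro names its builds.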

 Workaround: restarting the stuck process will mark the task as cancelled if it is already running, and work will resume normally until the deadlock happens again at some random point in the future.

 If you experience deadlock while running one of the versions listed above, please gather the output/files of the following commands and tar them up or put them online somehow. The core files will be too large to attach to this issue, and it is very important that they are delivered. Ideally you would post a link on the issue to the large files.

 <pre> 
 # the python-qpid version you are running 
 rpm -qa | grep python-qpid 

 # some process information 
 ps -awfux 
 ps -efLm 

 # Qpid queue information 
 qpid-stat -q 

 # core dumps of your celery processes 
 for pid in $(ps -awfux | grep celery | grep "@" | awk '{ print $2 }'); do gcore $pid; done
 </pre> 

 Please also post all logs, including the Pulp logs, and make sure they cover the period from when the tasks were started up to the current time.

 Also get dumps of two mongodb collections: 
 <pre> 
 mongo pulp_database --eval "db.task_status.find().pretty()" > task_status.json 
 mongo pulp_database --eval "db.reserved_resources.find().pretty()" > reserved_resources.json 
 </pre> 
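
 The commands above can be gathered into a single tarball with something like the following sketch. The @capture@ helper, the output directory, and the file names are illustrative choices, not part of Pulp. Core dumps are left out here because of their size and would need to be collected and shared separately.

```shell
#!/bin/sh
# Sketch: run the requested diagnostics and bundle their output into one archive.
# The directory and file names below are arbitrary choices for illustration.
outdir=$(mktemp -d "${TMPDIR:-/tmp}/pulp-deadlock.XXXXXX")

# Run a command, saving stdout and stderr to $outdir/<file>.
# A missing or failing command is noted in the file instead of aborting the run.
capture() {
    file=$1
    shift
    if ! "$@" > "$outdir/$file" 2>&1; then
        echo "command failed or not available: $*" >> "$outdir/$file"
    fi
}

capture python-qpid-version.txt sh -c 'rpm -qa | grep python-qpid'
capture ps-awfux.txt ps -awfux
capture ps-efLm.txt ps -efLm
capture qpid-stat-q.txt qpid-stat -q
capture task_status.json mongo pulp_database --eval "db.task_status.find().pretty()"
capture reserved_resources.json mongo pulp_database --eval "db.reserved_resources.find().pretty()"

# Pack the whole directory so it can be attached or uploaded in one piece.
tar -czf "$outdir.tar.gz" -C "$(dirname "$outdir")" "$(basename "$outdir")"
echo "diagnostics written to $outdir.tar.gz"
```

 Recording failures in the output files rather than stopping means a host without, say, @qpid-stat@ still produces a usable archive, and whoever reads it can see which commands did not run.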

 [0]: http://qpid.apache.org/packages.html#epel
