Issue #2045
Task stuck at waiting if child process segfaults
Status: CLOSED
Description
1. Start a sync or publish
2. While the sync or publish is running, have the child celery task segfault (one way to trigger this is sketched after these steps)
3. Observe a traceback like this one in the logs:
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_run_in_thread: assertion 'G_IS_SIMPLE_ASYNC_RESULT (simple)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_new: assertion '!source_object || G_IS_OBJECT (source_object)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_set_op_res_gpointer: assertion 'G_IS_SIMPLE_ASYNC_RESULT (simple)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_run_in_thread: assertion 'G_IS_SIMPLE_ASYNC_RESULT (simple)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) Task pulp.server.managers.repo.sync.sync[252ad894-d037-4bdb-bcd6-cc1b623fcc5e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) Traceback (most recent call last):
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1169, in mark_as_worker_lost
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) human_status(exitcode)),
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: reserved_resource_worker-1@hp-dl380pgen8-02-vm-9.lab.bos.redhat.com ready.
4. Observe that the parent celery process spawns an additional worker and begins processing additional tasks as normal
5. Observe that the task which was running when the segfault occurred never leaves the RUNNING state, and it never will
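For reference, a minimal way to trigger step 2 in a throwaway test environment (this helper is not part of Pulp; the app name and broker URL below are placeholders):

# Throwaway reproduction helper, not part of Pulp: a task that segfaults its
# own worker child process so the parent sees WorkerLostError (signal 11).
import os
import signal

from celery import Celery

app = Celery('segfault_repro', broker='amqp://localhost//')  # placeholder broker URL


@app.task
def segfault_self():
    # Delivering SIGSEGV to the current process kills the billiard child
    # without giving the task a chance to update its own TaskStatus.
    os.kill(os.getpid(), signal.SIGSEGV)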
Related issues
Updated by bmbouter over 8 years ago
I can think of three places for us to put some recovery logic for this.
Two good options:
1. The parent process could update the TaskStatus when it realizes the child has been killed. I'm not sure if there is an opportunity to add actions when this type of event occurs (see the sketch after this list).
2. pulp_celerybeat could update the TaskStatus when it observes a worker task has been killed. I'm not sure if celery emits a signal for this or not.
One less attractive option, because it doesn't handle all cases:
3. The release_resource task, which runs just after a sync, publish, etc., could check whether the task associated with the reservation is still in the RUNNING state and, if so, update it to canceled.
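A minimal sketch of option 1, assuming celery's task_failure signal also fires in the parent process for WorkerLostError (that assumption would need verifying), and using a hypothetical mark_task_canceled() stand-in for the real TaskStatus update:

# Sketch only: the signal-for-WorkerLostError behavior and the
# mark_task_canceled() helper are assumptions, not existing Pulp APIs.
from billiard.exceptions import WorkerLostError
from celery.signals import task_failure


def mark_task_canceled(task_id):
    # Placeholder for the real TaskStatus update in Pulp's database layer.
    pass


@task_failure.connect
def clean_up_after_worker_loss(sender=None, task_id=None, exception=None, **kwargs):
    # The segfaulted child never got to update its own TaskStatus, so the
    # parent process does it here when it notices the worker was lost.
    if isinstance(exception, WorkerLostError):
        mark_task_canceled(task_id)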
Updated by bmbouter over 8 years ago
- Status changed from NEW to CLOSED - DUPLICATE
- Triaged changed from No to Yes
Updated by bmbouter over 8 years ago
- Related to Issue #1673: Pulp's worker watcher does not notice workers that got killed by OOM killer and their tasks stay "running" forever added