Issue #2045

Task stuck at waiting if child process segfaults

Added by bmbouter almost 6 years ago. Updated 6 months ago.

Status: CLOSED - DUPLICATE
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 3. High
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

1. Start a sync or publish
2. While the sync or publish is running, cause the child celery worker process to segfault
3. Observe a traceback like this one in the logs:

Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_run_in_thread: assertion 'G_IS_SIMPLE_ASYNC_RESULT (simple)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_new: assertion '!source_object || G_IS_OBJECT (source_object)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_set_op_res_gpointer: assertion 'G_IS_SIMPLE_ASYNC_RESULT (simple)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: (process:557): GLib-GIO-CRITICAL **: g_simple_async_result_run_in_thread: assertion 'G_IS_SIMPLE_ASYNC_RESULT (simple)' failed
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) Task pulp.server.managers.repo.sync.sync[252ad894-d037-4bdb-bcd6-cc1b623fcc5e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) Traceback (most recent call last):
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808)   File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1169, in mark_as_worker_lost
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808)     human_status(exitcode)),
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 pulp: celery.worker.job:ERROR: (32700-91808) WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
Jun 28 10:42:01 hp-dl380pgen8-02-vm-9 celery: reserved_resource_worker-1@hp-dl380pgen8-02-vm-9.lab.bos.redhat.com ready.

4. Observe that the parent celery process spawns a replacement worker and continues processing new tasks as normal
5. Observe that the task which was running when the segfault occurred never leaves the RUNNING state, and it never will
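To reproduce step 2, any mechanism that delivers SIGSEGV to the child works. A minimal sketch (using a plain Python subprocess as a stand-in for the celery worker child, not Pulp's actual task code): dereferencing a NULL pointer via ctypes produces the same signal 11 reported in the WorkerLostError above.

```python
import signal
import subprocess
import sys

# Dereference address 0 from inside Python -> SIGSEGV in the child process.
crasher = "import ctypes; ctypes.string_at(0)"
proc = subprocess.run([sys.executable, "-c", crasher])

# On POSIX, a process killed by a signal reports -signum as its return code,
# so a segfaulted child shows -11 (i.e. -signal.SIGSEGV).
print(proc.returncode)
```

Running this against a real celery worker child (e.g. with `kill -SEGV <pid>`) triggers the WorkerLostError path shown in the logs.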


Related issues

Related to Pulp - Issue #1673: Pulp's worker watcher does not notice workers that got killed by OOM killer and their tasks stay "running" forever (CLOSED - CURRENTRELEASE, bmbouter)
#1

Updated by bmbouter almost 6 years ago

I can think of three places for us to put some recovery logic for this.

Two good options:
1. The parent process could update the TaskStatus when it realizes the child has been killed. I'm not sure if there is an opportunity to add actions when this type of event occurs.
2. pulp_celerybeat could update the TaskStatus when it observes a worker task has been killed. I'm not sure if celery emits a signal for this or not.

A less attractive option, because it doesn't handle all cases:
3. The release_resource task, which runs just after a sync, publish, etc., could check whether the task associated with the reservation is still in the RUNNING state and, if so, update it to canceled.
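Option 3 above can be sketched roughly as follows. This is a hedged illustration only: the `TASKS` dict and the state strings are stand-ins for Pulp's TaskStatus collection, not its actual API.

```python
# Illustrative stand-ins, not Pulp's real TaskStatus model.
RUNNING = "running"
CANCELED = "canceled"

TASKS = {}  # task_id -> state; stand-in for the TaskStatus collection


def release_resource(task_id):
    """Cleanup hook run after a reservation is released.

    If the reserved task is still marked RUNNING at this point, the child
    process must have died without updating its own status (e.g. SIGSEGV),
    so flip the record to CANCELED instead of leaving it stuck forever.
    """
    if TASKS.get(task_id) == RUNNING:
        TASKS[task_id] = CANCELED


# Usage: a task that segfaulted never updated its own state...
TASKS["252ad894"] = RUNNING
release_resource("252ad894")
print(TASKS["252ad894"])  # canceled
```

As noted, this doesn't handle all cases: it only fires when release_resource actually runs for that reservation, which is why the parent-process or pulp_celerybeat approaches were considered preferable.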

#2

Updated by bmbouter almost 6 years ago

  • Status changed from NEW to CLOSED - DUPLICATE
  • Triaged changed from No to Yes
#3

Updated by bmbouter almost 6 years ago

  • Related to Issue #1673: Pulp's worker watcher does not notice workers that got killed by OOM killer and their tasks stay "running" forever added
#4

Updated by bmbouter about 3 years ago

  • Tags Pulp 2 added
