Issue #2849

Worker terminates abnormally while processing task

Added by ipanova@redhat.com over 2 years ago. Updated 7 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Category:
-
Sprint/Milestone:
Start date:
Due date:
Severity:
3. High
Version - Docker:
Platform Release:
2.18.0
Blocks Release:
Target Release - Docker:
OS:
Backwards Incompatible:
No
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified:
Yes
Verification Required:
No
Sprint:
Sprint 45

Description

When I sync manifests that have foreign layers, the worker eventually gets killed.


$ pulp-admin docker repo  sync run --repo-id bz
Warning: path should have mode 0700 because it may contain sensitive information: /home/ipanova/.pulp/

+----------------------------------------------------------------------+
                     Synchronizing Repository [bz]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.

Downloading manifests
[\]
... completed

Copying units already in pulp
[-]
... completed

Copying units already in pulp
[-]
... completed

Downloading remote files
[=                                                 ] 3%
8 of 221 items

Task Failed

Worker terminated abnormally while processing task
42fabc46-17b3-419f-a79c-ccd0bcba1474.  Check the logs for details

Pulp starts to panic when it hits a 404.
We need to check whether the registry closes the connection prematurely and investigate what causes the worker termination. It is not an OOM kill.

Jun 28 12:34:14 ina pulp: nectar.downloaders.threaded:INFO: Download failed: Download of https://registry-1.docker.io/v2/portainer/portainer/blobs/sha256:6330793656b1565cd8b5d5c1e2a736a351dec39f7250daf31eb91e3a76cc872b failed with code 404: Not Found
Jun 28 12:34:14 ina pulp: celery.worker.job:ERROR: (11813-08608) Task pulp.server.managers.repo.sync.sync[b28a4117-b647-484f-a9da-45a15e06f09e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
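The eventual fix (see the associated revisions below) makes the sync fail gracefully instead of letting the worker die. A minimal sketch of the idea, with illustrative names that are not the actual pulp_docker code: collect failed downloads in an event listener, then raise an ordinary exception after downloading finishes, so celery marks the task as failed while the worker process survives.

```python
# Illustrative sketch only; class and function names are hypothetical,
# not the real pulp_docker/nectar implementation.

class DownloadListener(object):
    """Records download failures (e.g. a 404 for a blob) instead of crashing."""

    def __init__(self):
        self.failed_urls = []

    def download_failed(self, report):
        # Called once per failed download; report.url identifies the blob.
        self.failed_urls.append(report.url)


def sync_blobs(downloader, requests, listener):
    """Run the downloads, then fail the task cleanly if anything failed."""
    downloader.download(requests)
    if listener.failed_urls:
        # A regular exception fails the sync task without killing the
        # worker process, and leaves the repository unmodified.
        raise RuntimeError(
            'Image download(s) from %s failed. Sync task has failed to '
            'prevent a corrupted repository.' % ', '.join(listener.failed_urls))
```

The key design point is that a missing blob becomes a task-level error with a readable message, rather than an unhandled condition that takes down the worker with SIGKILL.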

Related issues

Related to Docker Support - Issue #2966: Unable to sync docker repo because worker dies CLOSED - CURRENTRELEASE Actions
Related to Docker Support - Test #4128: Test sync of a repository that is missing blobs CLOSED - COMPLETE Actions

Associated revisions

Revision c81bc353 View on GitHub
Added by amacdona@redhat.com 12 months ago

Fail gracefully when image download fails.

https://pulp.plan.io/issues/2849
fixes #2849

https://pulp.plan.io/issues/2966
fixes #2966

(cherry picked from commit b09801b1de94ea7a542b565e6c9caa456aae8619)

History

#1 Updated by ipanova@redhat.com over 2 years ago

  • Subject changed from | to Worker terminates abnormally while processing task

#2 Updated by bmbouter over 2 years ago

If it's not being killed by the OOM killer, is it segfaulting? I'm not 100% sure how to do this, but you can configure Linux to create a core dump for any process that segfaults.
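One way to approximate this suggestion from within a process: the stdlib `resource` module can raise the core-file size limit so a later segfault actually writes a core dump (where the core file lands is controlled system-wide by /proc/sys/kernel/core_pattern). The function name below is illustrative; this is a sketch, not part of Pulp.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    A soft limit of 0 (common default) suppresses core dumps entirely;
    raising it up to the hard limit is always permitted for a process.
    Returns the (soft, hard) limits after the change.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Child processes inherit the limit, so calling this early in a worker's startup would cover the forked task processes as well.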

#3 Updated by ttereshc over 2 years ago

  • Priority changed from Normal to High
  • Severity changed from 2. Medium to 3. High
  • Triaged changed from No to Yes

#4 Updated by kdelee@redhat.com over 2 years ago

  • Related to Issue #2966: Unable to sync docker repo because worker dies added

#6 Updated by daviddavis about 1 year ago

  • Sprint set to Sprint 44

There's another BZ opened against this bug downstream, and I just worked with a customer who reported this issue today. This seems to be happening regularly. I'm adding it to the current sprint in the hope that we can get a fix into pulp 2.18.

#7 Updated by amacdona@redhat.com about 1 year ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to amacdona@redhat.com

#8 Updated by amacdona@redhat.com about 1 year ago

  • Status changed from ASSIGNED to POST

#9 Updated by daviddavis about 1 year ago

  • Platform Release set to 2.18.0

#10 Updated by daviddavis about 1 year ago

  • Sprint/Milestone set to 2.18.0
  • Platform Release deleted (2.18.0)

#11 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 44 to Sprint 45

#12 Updated by amacdona@redhat.com about 1 year ago

The patch from https://github.com/pulp/pulp_docker/pull/275 is against latest 2-master.

To reproduce this issue, I used pulp 2.13.4 and pulp_docker 2.4.1, syncing portainer from docker hub. The problem was not related to OOM, but was just a mishandled 404.

The cherry-pick did not apply cleanly, but the conflict was easily resolved. After the cherry-pick was applied, the worker was no longer killed; the newly coded exception was raised and the sync failed. This confirms that the patch correctly handles 404s.

#13 Updated by amacdona@redhat.com about 1 year ago

  • Related to Test #4128: Test sync of a repository that is missing blobs added

#14 Updated by amacdona@redhat.com about 1 year ago

I was able to reproduce this more rigorously against the latest 2-master and prove the correctness of the patch. I set up a user installation of pulp on a VM:

User Pulp (to sync from)

[puser@omgwut ~]$ rpm -qa | grep pulp
pulp-admin-client-2.17.1-1.fc27.noarch
python-pulp-common-2.17.1-1.fc27.noarch
pulp-selinux-2.17.1-1.fc27.noarch
pulp-docker-admin-extensions-3.2.1-1.fc27.noarch
python-pulp-docker-common-3.2.1-1.fc27.noarch
python-pulp-repoauth-2.17.1-1.fc27.noarch
pulp-server-2.17.1-1.fc27.noarch
python-pulp-bindings-2.17.1-1.fc27.noarch
python-pulp-client-lib-2.17.1-1.fc27.noarch
pulp-docker-plugins-3.2.1-1.fc27.noarch
[puser@omgwut ~]$ rpm -qa | grep crane
python-crane-3.2.0-1.fc27.noarch

On the user installation box, I synced the busybox repository, published it, and served it with crane.

From a developer installation running latest 2-master pulp and 2-master pulp_docker, I was able to sync from the user installation (sanity check).

Back on the user installation box, I moved one of the symlinks for a blob, stored in

/var/lib/pulp/published/docker/v2/web/synctest2/blobs

After wiping the db, I did a --force-full sync from the 2-master installation, which failed and killed the worker.

Nov 09 15:34:39 pulp2.dev pulp[3228]: pulp.server.async.tasks:INFO: [5f1bd81c] Task failed : [67a41d34-faed-4680-9f57-0a77b4a3178e] : Worker terminated abnormally while processing task 67a41d34-faed-4680-9f57-0a77b4a3178e.  Check the logs for details
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120) Task pulp.server.async.tasks._release_resource[5f1bd81c-1644-4743-a093-6a119d6f1be7] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120) Traceback (most recent call last):
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)     R = retval = fun(*args, **kwargs)
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 107, in __call__
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)     return super(PulpTask, self).__call__(*args, **kwargs)
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)     return self.run(*args, **kwargs)
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 296, in _release_resource
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 602, in on_failure
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)     if not self.request.called_directly:
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 978, in _get_request
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120)     req = self.request_stack.top
Nov 09 15:34:39 pulp2.dev pulp[3228]: celery.app.trace:ERROR: [5f1bd81c] (3228-81120) AttributeError: 'NoneType' object has no attribute 'top'
Nov 09 15:34:49 pulp2.dev pulp[1608]: celery.worker.request:ERROR: (1608-81120) Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Nov 09 15:34:49 pulp2.dev pulp[1608]: celery.worker.request:ERROR: (1608-81120) Traceback (most recent call last):
Nov 09 15:34:49 pulp2.dev pulp[1608]: celery.worker.request:ERROR: (1608-81120)   File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
Nov 09 15:34:49 pulp2.dev pulp[1608]: celery.worker.request:ERROR: (1608-81120)     human_status(exitcode)),
Nov 09 15:34:49 pulp2.dev pulp[1608]: celery.worker.request:ERROR: (1608-81120) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

On the devel box, I changed branches to include this patch, cleaned the db, and again triggered a --force-full sync, which failed but did not kill the worker.

[vagrant@pulp2 pulp_docker]$ pulp-admin docker repo create --repo-id fixture --upstream-name synctest2 --feed http://omgwut:5000
Repository [fixture] successfully created

[vagrant@pulp2 pulp_docker]$ pulp-admin docker repo sync run --repo-id fixture --force-full
+----------------------------------------------------------------------+
                   Synchronizing Repository [fixture]
+----------------------------------------------------------------------+

The following sync configuration options will be used:

Force Full:  True

This command may be exited via ctrl+c without affecting the request.

Downloading manifests
[\]
... completed

Copying units already in pulp
[-]
... completed

Copying units already in pulp
[-]
... completed

Downloading remote files
[================================================= ] 99%
373 of 375 items
... failed

Task Failed

Image download(s) from
http://omgwut:5000/v2/synctest2/blobs/sha256:edc53726e077686a24352d99aa150c01d47
24637541162238195e72b86739adf failed. Sync task has failed to prevent a
corrupted repository.

The worker did not die, and the logs were more reasonable.

Nov 09 15:46:17 pulp2.dev pulp[8658]: nectar.downloaders.threaded:INFO: Download of http://omgwut:5000/v2/synctest2/blobs/sha256:2de56e1a41335b794e4c830d4da72e440f4551d0ab50174b6d634abd9f4502d6 was cancelled
Nov 09 15:46:17 pulp2.dev pulp[8658]: pulp.server.async.tasks:INFO: [d16f4e65] Task failed : [d16f4e65-7876-4a4e-9385-566f34ad12c7] : Image download(s) from http://omgwut:5000/v2/synctest2/blobs/sha256:edc53726e077686a24352d99aa150c01d4724637541162238195e72b86739adf failed. Sync task has failed to prevent a corrupted repository.
Nov 09 15:46:17 pulp2.dev pulp[8658]: celery.app.trace:INFO: [d16f4e65] Task pulp.server.managers.repo.sync.sync[d16f4e65-7876-4a4e-9385-566f34ad12c7] raised expected: PulpCodedException()

#15 Updated by amacdona@redhat.com 12 months ago

  • Status changed from POST to MODIFIED

#16 Updated by ttereshc 12 months ago

  • Platform Release set to 2.18.0

#18 Updated by ttereshc 12 months ago

  • Status changed from MODIFIED to ON_QA

#19 Updated by bherring 11 months ago

  • Verified changed from No to Yes

See #4128 for more information about verification.

#20 Updated by ttereshc 11 months ago

  • Status changed from ON_QA to CLOSED - CURRENTRELEASE

#21 Updated by bmbouter 7 months ago

  • Tags Pulp 2 added
