Project

Profile

Help

Issue #8708

closed

Issue #8912: [EPIC] Issues with the traditional tasking system

Tasking: Uploading many advisories in a row can have intermittent failures

Added by ggainey almost 3 years ago. Updated almost 3 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

Uploading ~350 advisories at the same time, and 'a few' of them fail with tracebacks like the following:

May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[59340]: pulp [None]: rq.worker:INFO: resource-manager: 69556e0b-19a6-4de8-98c2-e3731333b2bc
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com gunicorn[59342]: pulp [d620ef055ae940a894615c1c7989da44]: 127.0.0.1 - admin [05/May/2021:20:41:00 +0000] "POST /pulp/api/v3/content/rpm/adviso
ries/ HTTP/1.0" 202 67 "-" "HTTPie/2.4.0"
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulp [d620ef055ae940a894615c1c7989da44]: rq.worker:ERROR: Traceback (most recent call last):
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/worker.py", line 1008, in perform_job
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     rv = job.perform()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 706, in perform
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     self._result = self._execute()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 729, in _execute
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     result = self.func(*self.args, **self.kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     reservation = worker.reservations.get(resource=resource)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/manager.py", line 82, in manager_method
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     return getattr(self.get_queryset(), name)(*args, **kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 406, in get
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     raise self.model.DoesNotExist(
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: Traceback (most recent call last):
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/worker.py", line 1008, in perform_job
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     rv = job.perform()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 706, in perform
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     self._result = self._execute()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 729, in _execute
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     result = self.func(*self.args, **self.kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     reservation = worker.reservations.get(resource=resource)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/manager.py", line 82, in manager_method
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     return getattr(self.get_queryset(), name)(*args, **kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 406, in get
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     raise self.model.DoesNotExist(
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.

Using the advisories in this repo: https://github.com/ggainey/pulp_startup/tree/main/lero_tests/tmp/bla

And this to do the uploads:

cd lero_tests/tmp/bla
REPO_HREF=`pulp rpm repository create --name bar | jq -r '.pulp_href'`
for x in *; do http --form POST :/pulp/api/v3/content/rpm/advisories/ file@./$x repository=$REPO_HREF; done

Each advisory-upload is its own task. Occasionally, in the logs, you'll see the traceback above. Also, check the count of advisories in the repository - it should be 356, and will occasionally be less:

pulp rpm repository version show --repository bar | jq '.content_summary.present."rpm.advisory".count'
352

NOTE: uploading an advisory, that already exists, will not only fail - it will make it impossible to upload further advisories, until https://pulp.plan.io/issues/8683 gets fixed. Addressing this issue probably can/needs-to wait until after 8683 is fixed/merged.


Related issues

Related to RPM Support - Issue #8683: re-uploading an existing advisory makes new advisory upload failsCLOSED - CURRENTRELEASEggaineyActions
Related to RPM Support - Issue #8716: Advisory-digest-calculation can result in the 'same' advisory having different digestsCLOSED - CURRENTRELEASEggaineyActions
Actions #1

Updated by pulpbot almost 3 years ago

  • Status changed from NEW to POST
Actions #2

Updated by ggainey almost 3 years ago

  • Project changed from Pulp to RPM Support
Actions #3

Updated by ggainey almost 3 years ago

  • Status changed from POST to NEW

I associated this with PR#1982 INCORRECTLY - there is no fix for this, yet!

Actions #4

Updated by ggainey almost 3 years ago

  • Related to Issue #8683: re-uploading an existing advisory makes new advisory upload fails added
Actions #5

Updated by ggainey almost 3 years ago

  • Related to Issue #8716: Advisory-digest-calculation can result in the 'same' advisory having different digests added
Actions #6

Updated by mdellweg almost 3 years ago

I was not able to reproduce running the steps described above. The machine is used is pulp3-sandbox-debian-testing with core=3.12.2 and rpm=3.12.0.

Actions #7

Updated by ggainey almost 3 years ago

It took me a couple of tries, but I was able to see it happen (once) on pulp2-nightly-pulp3-source-centos7

[vagrant@pulp2-nightly-pulp3-source-centos7 ~]$ sudo journalctl -u pulp\* -f | grep ReservedResource.DoesNotExist
Jun 02 21:29:54 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com pulpcore-worker[2676]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
...
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 bla]$ pulp rpm repository version show --repository bar | jq '.content_summary.present."rpm.advisory".count'
355

It's possible the window is smaller than it used to be - I used to hit this Every Single Time - but whatever this window is, it's still there.

Actions #8

Updated by knzivid almost 3 years ago

Slightly unrelated because this comes from the pulp_deb plugin.

I triggered a sync on 8 repositories in parallel and I have four workers. One task was still stuck in waiting state. When I checked the resource manager logs, I see a similar error.

Even though the workers eventually became free, the task was stuck waiting.

resource-manager_1  | 07:15:56 Traceback (most recent call last):
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/rq/worker.py", line 1013, in perform_job
resource-manager_1  |     rv = job.perform()
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/rq/job.py", line 709, in perform
resource-manager_1  |     self._result = self._execute()
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/rq/job.py", line 732, in _execute
resource-manager_1  |     result = self.func(*self.args, **self.kwargs)
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
resource-manager_1  |     reservation = worker.reservations.get(resource=resource)
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
resource-manager_1  |     return getattr(self.get_queryset(), name)(*args, **kwargs)
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/django/db/models/query.py", line 408, in get
resource-manager_1  |     self.model._meta.object_name
resource-manager_1  | pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.

Package versions

pulp-deb                        2.13.0
pulp-file                       1.7.0
pulpcore                        3.13.0
Actions #9

Updated by dalley almost 3 years ago

  • Project changed from RPM Support to Pulp
  • Triaged changed from No to Yes
Actions #10

Updated by dalley almost 3 years ago

  • Subject changed from Uploading many advisories in a row can have intermittent failures to Tasking: Uploading many advisories in a row can have intermittent failures
Actions #11

Updated by ggainey almost 3 years ago

Adding data about my env here - I'm running the pulp2-nightly-pulp3-source-centos7 vagrant box. I allocate 16Gb to it. The VM is running on spinning disk. The host box is F33, running on Intel® Core™ i7-6700K CPU @ 4.00GHz × 8 , 32Gb.

Aaand I just ran that recreate-script up there 6 times, and had ZERO failures. Of course. So mdellweg - "trouble reproducing" is not just you. I'm wondering if the fixes applied to advisories recently have changed the timings enough that it's much harder to recreate on this path...

Actions #12

Updated by ggainey almost 3 years ago

And here's another data-point:

As noted about, running on master, I could not reproduce the failure.

Running on pulpcore-3.11-compatible code, I hit it four times on my first run.

To run on core-3.11, I used the following branches:

ROOT=/home/ggainey/github/Pulp3
cd $ROOT/pulpcore; git checkout 3.11
cd $ROOT/pulp_file; git checkout 1.6
cd $ROOT/pulp_rpm; git checkout 3.10
cd $ROOT/pulp_container; git checkout 2.5
cd $ROOT/pulp_deb; git checkout 2.11
cd $ROOT/pulp_installer; git checkout 3.11
cd $ROOT/pulp-certguard; git checkout 1.2
cd $ROOT/pulp-2to3-migration; git checkout 0.11

Then the following on my vagrant box:

cd /home/vagrant/devel
for d in pulpcore/ pulp_file/ pulp_rpm/ pulp_container/ pulp_deb/ pulp-certguard/ pulp-2to3-migration/
do
  echo $d
  pip install -e $d
done
pclean
prestart

Hopefully this will help us figure out wth is going on...

Actions #13

Updated by mdellweg almost 3 years ago

I managed to reproduce it very reliably by injecting a "time.sleep(1)" between the "exists" and the "get" of the stack trace (around lines 142-143) and dispatching a few tasks that run for 1 second each, requiring overlapping resources.

http --auth admin:password POST http://localhost:24817/pulp/api/v3/herminig/tasking_benchmark/ truncate_tasks=true count=5 resources_N=2 resources_K=1 sleep_secs=1

This hit the race condition 5/5 tries with one of those tasks.

Actions #14

Updated by mdellweg almost 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to mdellweg
Actions #15

Updated by pulpbot almost 3 years ago

  • Status changed from ASSIGNED to POST

Added by mdellweg almost 3 years ago

Revision 18978d85 | View on GitHub

Close a race condition with ReservedResource

During a small window between checking it's existance and looking up the ReservedResource, this object may be deleted in another thread. This condition will lead to task remaining in waiting state forever.

fixes #8708 https://pulp.plan.io/issues/8708

Actions #16

Updated by mdellweg almost 3 years ago

  • Status changed from POST to MODIFIED
Actions #17

Updated by mdellweg almost 3 years ago

  • Parent issue set to #8912
Actions #18

Updated by pulpbot almost 3 years ago

  • Sprint/Milestone set to 3.14.0
Actions #19

Updated by pulpbot almost 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF