Project

Profile

Help

Issue #8708

Issue #8912: [EPIC] Issues with the traditional tasking system

Tasking: Uploading many advisories in a row can have intermittent failures

Added by ggainey 5 months ago. Updated 3 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

Uploading ~350 advisories at the same time, and 'a few' of them fail with tracebacks like the following:

May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[59340]: pulp [None]: rq.worker:INFO: resource-manager: 69556e0b-19a6-4de8-98c2-e3731333b2bc
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com gunicorn[59342]: pulp [d620ef055ae940a894615c1c7989da44]: 127.0.0.1 - admin [05/May/2021:20:41:00 +0000] "POST /pulp/api/v3/content/rpm/adviso
ries/ HTTP/1.0" 202 67 "-" "HTTPie/2.4.0"
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulp [d620ef055ae940a894615c1c7989da44]: rq.worker:ERROR: Traceback (most recent call last):
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/worker.py", line 1008, in perform_job
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     rv = job.perform()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 706, in perform
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     self._result = self._execute()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 729, in _execute
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     result = self.func(*self.args, **self.kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     reservation = worker.reservations.get(resource=resource)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/manager.py", line 82, in manager_method
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     return getattr(self.get_queryset(), name)(*args, **kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 406, in get
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     raise self.model.DoesNotExist(
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: Traceback (most recent call last):
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/worker.py", line 1008, in perform_job
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     rv = job.perform()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 706, in perform
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     self._result = self._execute()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 729, in _execute
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     result = self.func(*self.args, **self.kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/home/vagrant/devel/pulpcore/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     reservation = worker.reservations.get(resource=resource)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/manager.py", line 82, in manager_method
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     return getattr(self.get_queryset(), name)(*args, **kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:   File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 406, in get
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]:     raise self.model.DoesNotExist(
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.

Using the advisories in this repo: https://github.com/ggainey/pulp_startup/tree/main/lero_tests/tmp/bla

And this to do the uploads:

cd lero_tests/tmp/bla
REPO_HREF=`pulp rpm repository create --name bar | jq -r '.pulp_href'`
for x in *; do http --form POST :/pulp/api/v3/content/rpm/advisories/ file@./$x repository=$REPO_HREF; done

Each advisory-upload is its own task. Occasionally, in the logs, you'll see the traceback above. Also, check the count of advisories in the repository - it should be 356, and will occasionally be less:

pulp rpm repository version show --repository bar | jq '.content_summary.present."rpm.advisory".count'
352

NOTE: uploading an advisory, that already exists, will not only fail - it will make it impossible to upload further advisories, until https://pulp.plan.io/issues/8683 gets fixed. Addressing this issue probably can/needs-to wait until after 8683 is fixed/merged.


Related issues

Related to RPM Support - Issue #8683: re-uploading an existing advisory makes new advisory upload failsCLOSED - CURRENTRELEASE<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>
Related to RPM Support - Issue #8716: Advisory-digest-calculation can result in the 'same' advisory having different digestsCLOSED - CURRENTRELEASE<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>

Associated revisions

Revision 18978d85 View on GitHub
Added by mdellweg 3 months ago

Close a race condition with ReservedResource

During a small window between checking it's existance and looking up the ReservedResource, this object may be deleted in another thread. This condition will lead to task remaining in waiting state forever.

fixes #8708 https://pulp.plan.io/issues/8708

History

#1 Updated by pulpbot 5 months ago

  • Status changed from NEW to POST

#2 Updated by ggainey 5 months ago

  • Project changed from Pulp to RPM Support

#3 Updated by ggainey 5 months ago

  • Status changed from POST to NEW

I associated this with PR#1982 INCORRECTLY - there is no fix for this, yet!

#4 Updated by ggainey 5 months ago

  • Related to Issue #8683: re-uploading an existing advisory makes new advisory upload fails added

#5 Updated by ggainey 5 months ago

  • Related to Issue #8716: Advisory-digest-calculation can result in the 'same' advisory having different digests added

#6 Updated by mdellweg 4 months ago

I was not able to reproduce running the steps described above. The machine is used is pulp3-sandbox-debian-testing with core=3.12.2 and rpm=3.12.0.

#7 Updated by ggainey 4 months ago

It took me a couple of tries, but I was able to see it happen (once) on pulp2-nightly-pulp3-source-centos7

[vagrant@pulp2-nightly-pulp3-source-centos7 ~]$ sudo journalctl -u pulp\* -f | grep ReservedResource.DoesNotExist
Jun 02 21:29:54 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com pulpcore-worker[2676]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
...
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 bla]$ pulp rpm repository version show --repository bar | jq '.content_summary.present."rpm.advisory".count'
355

It's possible the window is smaller than it used to be - I used to hit this Every Single Time - but whatever this window is, it's still there.

#8 Updated by knzivid 4 months ago

Slightly unrelated because this comes from the pulp_deb plugin.

I triggered a sync on 8 repositories in parallel and I have four workers. One task was still stuck in waiting state. When I checked the resource manager logs, I see a similar error.

Even though the workers eventually became free, the task was stuck waiting.

resource-manager_1  | 07:15:56 Traceback (most recent call last):
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/rq/worker.py", line 1013, in perform_job
resource-manager_1  |     rv = job.perform()
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/rq/job.py", line 709, in perform
resource-manager_1  |     self._result = self._execute()
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/rq/job.py", line 732, in _execute
resource-manager_1  |     result = self.func(*self.args, **self.kwargs)
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
resource-manager_1  |     reservation = worker.reservations.get(resource=resource)
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
resource-manager_1  |     return getattr(self.get_queryset(), name)(*args, **kwargs)
resource-manager_1  |   File "/opt/pulp/lib/python3.7/site-packages/django/db/models/query.py", line 408, in get
resource-manager_1  |     self.model._meta.object_name
resource-manager_1  | pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.

Package versions

pulp-deb                        2.13.0
pulp-file                       1.7.0
pulpcore                        3.13.0

#9 Updated by dalley 4 months ago

  • Project changed from RPM Support to Pulp
  • Triaged changed from No to Yes

#10 Updated by dalley 4 months ago

  • Subject changed from Uploading many advisories in a row can have intermittent failures to Tasking: Uploading many advisories in a row can have intermittent failures

#11 Updated by ggainey 4 months ago

Adding data about my env here - I'm running the pulp2-nightly-pulp3-source-centos7 vagrant box. I allocate 16Gb to it. The VM is running on spinning disk. The host box is F33, running on Intel® Core™ i7-6700K CPU @ 4.00GHz × 8 , 32Gb.

Aaand I just ran that recreate-script up there 6 times, and had ZERO failures. Of course. So mdellweg - "trouble reproducing" is not just you. I'm wondering if the fixes applied to advisories recently have changed the timings enough that it's much harder to recreate on this path...

#12 Updated by ggainey 4 months ago

And here's another data-point:

As noted about, running on master, I could not reproduce the failure.

Running on pulpcore-3.11-compatible code, I hit it four times on my first run.

To run on core-3.11, I used the following branches:

ROOT=/home/ggainey/github/Pulp3
cd $ROOT/pulpcore; git checkout 3.11
cd $ROOT/pulp_file; git checkout 1.6
cd $ROOT/pulp_rpm; git checkout 3.10
cd $ROOT/pulp_container; git checkout 2.5
cd $ROOT/pulp_deb; git checkout 2.11
cd $ROOT/pulp_installer; git checkout 3.11
cd $ROOT/pulp-certguard; git checkout 1.2
cd $ROOT/pulp-2to3-migration; git checkout 0.11

Then the following on my vagrant box:

cd /home/vagrant/devel
for d in pulpcore/ pulp_file/ pulp_rpm/ pulp_container/ pulp_deb/ pulp-certguard/ pulp-2to3-migration/
do
  echo $d
  pip install -e $d
done
pclean
prestart

Hopefully this will help us figure out wth is going on...

#13 Updated by mdellweg 3 months ago

I managed to reproduce it very reliably by injecting a "time.sleep(1)" between the "exists" and the "get" of the stack trace (around lines 142-143) and dispatching a few tasks that run for 1 second each, requiring overlapping resources.

http --auth admin:password POST http://localhost:24817/pulp/api/v3/herminig/tasking_benchmark/ truncate_tasks=true count=5 resources_N=2 resources_K=1 sleep_secs=1

This hit the race condition 5/5 tries with one of those tasks.

#14 Updated by mdellweg 3 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to mdellweg

#15 Updated by pulpbot 3 months ago

  • Status changed from ASSIGNED to POST

#16 Updated by mdellweg 3 months ago

  • Status changed from POST to MODIFIED

#17 Updated by mdellweg 3 months ago

  • Parent task set to #8912

#18 Updated by pulpbot 3 months ago

  • Sprint/Milestone set to 3.14.0

#19 Updated by pulpbot 3 months ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Please register to edit this issue

Also available in: Atom PDF