Issue #8708
closedIssue #8912: [EPIC] Issues with the traditional tasking system
Tasking: Uploading many advisories in a row can have intermittent failures
Description
Uploading ~350 advisories at the same time, and 'a few' of them fail with tracebacks like the following:
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[59340]: pulp [None]: rq.worker:INFO: resource-manager: 69556e0b-19a6-4de8-98c2-e3731333b2bc
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com gunicorn[59342]: pulp [d620ef055ae940a894615c1c7989da44]: 127.0.0.1 - admin [05/May/2021:20:41:00 +0000] "POST /pulp/api/v3/content/rpm/adviso
ries/ HTTP/1.0" 202 67 "-" "HTTPie/2.4.0"
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulp [d620ef055ae940a894615c1c7989da44]: rq.worker:ERROR: Traceback (most recent call last):
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/worker.py", line 1008, in perform_job
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: rv = job.perform()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 706, in perform
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: self._result = self._execute()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 729, in _execute
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: result = self.func(*self.args, **self.kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/home/vagrant/devel/pulpcore/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: reservation = worker.reservations.get(resource=resource)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/manager.py", line 82, in manager_method
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: return getattr(self.get_queryset(), name)(*args, **kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 406, in get
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: raise self.model.DoesNotExist(
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: Traceback (most recent call last):
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/worker.py", line 1008, in perform_job
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: rv = job.perform()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 706, in perform
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: self._result = self._execute()
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/rq/job.py", line 729, in _execute
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: result = self.func(*self.args, **self.kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/home/vagrant/devel/pulpcore/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: reservation = worker.reservations.get(resource=resource)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/manager.py", line 82, in manager_method
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: return getattr(self.get_queryset(), name)(*args, **kwargs)
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: File "/usr/local/lib/pulp/lib64/python3.9/site-packages/django/db/models/query.py", line 406, in get
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: raise self.model.DoesNotExist(
May 05 20:41:00 pulp3-source-fedora33.padre-fedora.example.com rq[62301]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
Using the advisories in this repo: https://github.com/ggainey/pulp_startup/tree/main/lero_tests/tmp/bla
And this to do the uploads:
cd lero_tests/tmp/bla
REPO_HREF=`pulp rpm repository create --name bar | jq -r '.pulp_href'`
for x in *; do http --form POST :/pulp/api/v3/content/rpm/advisories/ file@./$x repository=$REPO_HREF; done
Each advisory-upload is its own task. Occasionally, in the logs, you'll see the traceback above. Also, check the count of advisories in the repository - it should be 356, and will occasionally be less:
pulp rpm repository version show --repository bar | jq '.content_summary.present."rpm.advisory".count'
352
NOTE: uploading an advisory, that already exists, will not only fail - it will make it impossible to upload further advisories, until https://pulp.plan.io/issues/8683 gets fixed. Addressing this issue probably can/needs-to wait until after 8683 is fixed/merged.
Related issues
Updated by ggainey over 3 years ago
- Status changed from POST to NEW
I associated this with PR#1982 INCORRECTLY - there is no fix for this, yet!
Updated by ggainey over 3 years ago
- Related to Issue #8683: re-uploading an existing advisory makes new advisory upload fails added
Updated by ggainey over 3 years ago
- Related to Issue #8716: Advisory-digest-calculation can result in the 'same' advisory having different digests added
Updated by mdellweg over 3 years ago
I was not able to reproduce running the steps described above. The machine is used is pulp3-sandbox-debian-testing with core=3.12.2 and rpm=3.12.0.
Updated by ggainey over 3 years ago
It took me a couple of tries, but I was able to see it happen (once) on pulp2-nightly-pulp3-source-centos7
[vagrant@pulp2-nightly-pulp3-source-centos7 ~]$ sudo journalctl -u pulp\* -f | grep ReservedResource.DoesNotExist
Jun 02 21:29:54 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com pulpcore-worker[2676]: pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
...
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 bla]$ pulp rpm repository version show --repository bar | jq '.content_summary.present."rpm.advisory".count'
355
It's possible the window is smaller than it used to be - I used to hit this Every Single Time - but whatever this window is, it's still there.
Updated by knzivid over 3 years ago
Slightly unrelated because this comes from the pulp_deb
plugin.
I triggered a sync on 8 repositories in parallel and I have four workers. One task was still stuck in waiting
state. When I checked the resource manager logs, I see a similar error.
Even though the workers eventually became free, the task was stuck waiting.
resource-manager_1 | 07:15:56 Traceback (most recent call last):
resource-manager_1 | File "/opt/pulp/lib/python3.7/site-packages/rq/worker.py", line 1013, in perform_job
resource-manager_1 | rv = job.perform()
resource-manager_1 | File "/opt/pulp/lib/python3.7/site-packages/rq/job.py", line 709, in perform
resource-manager_1 | self._result = self._execute()
resource-manager_1 | File "/opt/pulp/lib/python3.7/site-packages/rq/job.py", line 732, in _execute
resource-manager_1 | result = self.func(*self.args, **self.kwargs)
resource-manager_1 | File "/opt/pulp/lib/python3.7/site-packages/pulpcore/tasking/tasks.py", line 144, in _queue_reserved_task
resource-manager_1 | reservation = worker.reservations.get(resource=resource)
resource-manager_1 | File "/opt/pulp/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
resource-manager_1 | return getattr(self.get_queryset(), name)(*args, **kwargs)
resource-manager_1 | File "/opt/pulp/lib/python3.7/site-packages/django/db/models/query.py", line 408, in get
resource-manager_1 | self.model._meta.object_name
resource-manager_1 | pulpcore.app.models.task.ReservedResource.DoesNotExist: ReservedResource matching query does not exist.
Package versions
pulp-deb 2.13.0
pulp-file 1.7.0
pulpcore 3.13.0
Updated by dalley over 3 years ago
- Project changed from RPM Support to Pulp
- Triaged changed from No to Yes
Updated by dalley over 3 years ago
- Subject changed from Uploading many advisories in a row can have intermittent failures to Tasking: Uploading many advisories in a row can have intermittent failures
Updated by ggainey over 3 years ago
Adding data about my env here - I'm running the pulp2-nightly-pulp3-source-centos7 vagrant box. I allocate 16Gb to it. The VM is running on spinning disk. The host box is F33, running on Intel® Core™ i7-6700K CPU @ 4.00GHz × 8 , 32Gb.
Aaand I just ran that recreate-script up there 6 times, and had ZERO failures. Of course. So mdellweg - "trouble reproducing" is not just you. I'm wondering if the fixes applied to advisories recently have changed the timings enough that it's much harder to recreate on this path...
Updated by ggainey over 3 years ago
And here's another data-point:
As noted about, running on master, I could not reproduce the failure.
Running on pulpcore-3.11-compatible code, I hit it four times on my first run.
To run on core-3.11, I used the following branches:
ROOT=/home/ggainey/github/Pulp3
cd $ROOT/pulpcore; git checkout 3.11
cd $ROOT/pulp_file; git checkout 1.6
cd $ROOT/pulp_rpm; git checkout 3.10
cd $ROOT/pulp_container; git checkout 2.5
cd $ROOT/pulp_deb; git checkout 2.11
cd $ROOT/pulp_installer; git checkout 3.11
cd $ROOT/pulp-certguard; git checkout 1.2
cd $ROOT/pulp-2to3-migration; git checkout 0.11
Then the following on my vagrant box:
cd /home/vagrant/devel
for d in pulpcore/ pulp_file/ pulp_rpm/ pulp_container/ pulp_deb/ pulp-certguard/ pulp-2to3-migration/
do
echo $d
pip install -e $d
done
pclean
prestart
Hopefully this will help us figure out wth is going on...
Updated by mdellweg over 3 years ago
I managed to reproduce it very reliably by injecting a "time.sleep(1)" between the "exists" and the "get" of the stack trace (around lines 142-143) and dispatching a few tasks that run for 1 second each, requiring overlapping resources.
http --auth admin:password POST http://localhost:24817/pulp/api/v3/herminig/tasking_benchmark/ truncate_tasks=true count=5 resources_N=2 resources_K=1 sleep_secs=1
This hit the race condition 5/5 tries with one of those tasks.
Updated by mdellweg over 3 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to mdellweg
Updated by pulpbot over 3 years ago
- Status changed from ASSIGNED to POST
Added by mdellweg over 3 years ago
Updated by mdellweg over 3 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|18978d85d510d50b10990f509fae5a6a07d30f67.
Updated by pulpbot over 3 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Close a race condition with ReservedResource
During a small window between checking it's existance and looking up the ReservedResource, this object may be deleted in another thread. This condition will lead to task remaining in waiting state forever.
fixes #8708 https://pulp.plan.io/issues/8708