Issue #9011
batch_regenerate_applicability tasks are never assigned a worker and pulp is stuck until restart (closed)
Description
Ticket moved to GitHub: pulp/pulpcore #2024 (https://github.com/pulp/pulpcore/issues/2024)
Some background: We have observed this on various systems using Pulp via Katello.
We first saw this on Pulp 2.21.4, but that version is not selectable on plan.io. We have also observed the issue on systems running 2.21.5.
Symptoms from Katello:
This happens sporadically with various RPM based repos, and usually not with the same repo twice in a row. However, it happens consistently enough that Katello systems with daily sync plans get stuck essentially every day. From the Katello side, the sync task is simply stuck on Actions::Pulp::Repository::RegenerateApplicability forever. Once Pulp is restarted things are unstuck, and the next round of syncs succeeds.
Symptoms within Pulp:
It looks like the underlying batch_regenerate_applicability Pulp tasks are never assigned to a worker. We can find several instances of the following tasks in mongo:
> db.task_status.find({"group_id": {$ne: null},"state": {$ne: "finished"}}).pretty()
{
    "_id" : ObjectId("60e136cf61272888f8460e49"),
    "task_id" : "280fcb08-db6f-4a96-9f9a-4146f59eb77e",
    "exception" : null,
    "task_type" : "pulp.server.managers.consumer.applicability.batch_regenerate_applicability",
    "tags" : [ ],
    "progress_report" : {
    },
    "worker_name" : null,
    "group_id" : BinData(3,"DaJey7wqRG2XSnLKM8WS3g=="),
    "finish_time" : null,
    "start_time" : null,
    "traceback" : null,
    "spawned_tasks" : [ ],
    "state" : "waiting",
    "result" : null,
    "error" : null
}
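For anyone trying to gauge how many of these orphaned tasks have accumulated, here is a minimal pymongo sketch. It assumes Pulp 2's default database name pulp_database and a locally reachable mongod, so adjust both as needed:

# Sketch: count tasks that are "waiting" but have no worker assigned, per task type.
# Assumptions: database name "pulp_database" (Pulp 2 default) and mongod on localhost.
from collections import Counter
from pymongo import MongoClient

tasks = MongoClient("localhost", 27017)["pulp_database"]["task_status"]
stuck = tasks.find({"state": "waiting", "worker_name": None})
print(Counter(doc["task_type"] for doc in stuck))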
Updated by quba42 over 3 years ago
Some more information:
It looks like the task actually blocking a worker is a sync task (and not a regenerate applicability task). We found this by running celery -A pulp.server.async.app inspect active on a stuck system.
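The same check can be scripted. The sketch below uses Celery's inspect API rather than the CLI; the module path is taken from the -A argument above, but the attribute that actually holds the Celery application object is an assumption on our part:

# Sketch: list active tasks per worker via Celery's inspect API.
# Assumption: the module given to "-A" exposes the Celery application object;
# the attribute name ("celery" or "app") is guessed, not confirmed.
from pulp.server.async import app as pulp_async_app

celery_app = getattr(pulp_async_app, "celery", None) or getattr(pulp_async_app, "app")

active = celery_app.control.inspect().active() or {}
for worker, tasks in active.items():
    for task in tasks:
        print(worker, task.get("id"), task.get("name"))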
When we then look up the stuck task in mongo, we see the following:
> db.task_status.find({"task_id": "b508b78d-ac87-41e8-851a-16a2d398155d"}).pretty()
{
    "_id" : ObjectId("60e00e8861272888f8feff1e"),
    "task_id" : "b508b78d-ac87-41e8-851a-16a2d398155d",
    "exception" : null,
    "task_type" : "pulp.server.managers.repo.sync.sync",
    "tags" : [
        "pulp:repository:612b1c28-35f9-4c46-9c89-ed9f3556829c",
        "pulp:action:sync"
    ],
    "finish_time" : null,
    "traceback" : null,
    "spawned_tasks" : [ ],
    "progress_report" : {
        "yum_importer" : {
            "modules" : {
                "state" : "NOT_STARTED"
            },
            "content" : {
                "size_total" : 21250313,
                "items_left" : 5,
                "items_total" : 6,
                "state" : "IN_PROGRESS",
                "size_left" : 13868180,
                "details" : {
                    "rpm_total" : 2,
                    "rpm_done" : 1,
                    "drpm_total" : 4,
                    "drpm_done" : 0
                },
                "error_details" : [
                    {
                        "url" : "https://updates.suse.com/SUSE/Updates/SLE-Module-Legacy/15-SP2/x86_64/update/src/ntp-4.2.8p15-4.19.1.src.rpm?AUTH_TOKEN_REDACTED!",
                        "errors" : [
                            null
                        ]
                    }
                ]
            },
            "comps" : {
                "state" : "NOT_STARTED"
            },
            "purge_duplicates" : {
                "state" : "NOT_STARTED"
            },
            "distribution" : {
                "items_total" : 0,
                "state" : "NOT_STARTED",
                "error_details" : [ ],
                "items_left" : 0
            },
            "errata" : {
                "state" : "NOT_STARTED"
            },
            "metadata" : {
                "state" : "FINISHED"
            }
        }
    },
    "worker_name" : "reserved_resource_worker-5@host002.example.com",
    "result" : null,
    "error" : null,
    "group_id" : null,
    "state" : "canceled",
    "start_time" : "2021-07-03T07:15:42Z"
}
It looks as though a package failed to download, but instead of this failure being handled, everything is stuck until the worker is restarted...
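A hedged pymongo sketch for spotting other tasks wedged in the same odd state as the sync task above (already marked canceled, assigned to a worker, started but never finished); the database name is assumed as before:

# Sketch: find tasks that are marked canceled, have a worker and a start time,
# but never got a finish time, like the sync task shown above.
# Assumption: database name "pulp_database" (Pulp 2 default), mongod on localhost.
from pymongo import MongoClient

tasks = MongoClient("localhost", 27017)["pulp_database"]["task_status"]
query = {
    "state": "canceled",
    "worker_name": {"$ne": None},
    "start_time": {"$ne": None},
    "finish_time": None,
}
for doc in tasks.find(query):
    print(doc["worker_name"], doc["task_id"], doc["task_type"])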
Updated by mbucher over 3 years ago
Also, celery -A pulp.server.async.app inspect reserved showed a lot of pulp.server.controllers.repository.download_deferred tasks for the worker in question, as well as various pulp.server.managers.consumer.applicability.batch_regenerate_applicability tasks. The latter are presumably the ones shown as hanging in the ForemanTasks UI. We assume they are somehow blocked by the one hanging task shown above (https://pulp.plan.io/issues/9011#note-1).
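To make that backlog visible in one place, the following sketch (same import assumptions as the earlier one) compares what each worker is actively running with what it has reserved; a single wedged sync task shows up as one long-running active entry with a pile of reserved tasks queued behind it:

# Sketch: compare active vs. reserved tasks per worker.
# Assumption: same guessed attribute name for the Celery application object as above.
from pulp.server.async import app as pulp_async_app

celery_app = getattr(pulp_async_app, "celery", None) or getattr(pulp_async_app, "app")
inspector = celery_app.control.inspect()

active = inspector.active() or {}
reserved = inspector.reserved() or {}
for worker in sorted(set(active) | set(reserved)):
    running = [t.get("name") for t in active.get(worker, [])]
    backlog = [t.get("name") for t in reserved.get(worker, [])]
    print(worker)
    print("  active  :", running)
    print("  reserved:", len(backlog), "task(s), e.g.", backlog[:3])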
Updated by quba42 over 3 years ago
I just wanted to add another update from our end:
We have now been able to manage the problem by rigorously spreading out our Katello sync plans (the affected system went from daily occurrences to none).
The probability of running into this issue appears to depend heavily on starting a lot of Pulp tasks within a short time. We are not aware of system load or resources (e.g. running out of memory) being a factor.
Our most likely course of action is to keep "managing the symptoms" until we switch to Pulp 3.
Updated by pulpbot almost 3 years ago
- Description updated (diff)
- Status changed from NEW to CLOSED - DUPLICATE