Issue #9011 (closed)

batch_regenerate_applicability tasks are never assigned a worker and pulp is stuck until restart

Added by quba42 over 3 years ago. Updated almost 3 years ago.

Status: CLOSED - DUPLICATE
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version: 2.21.1
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

Ticket moved to GitHub: pulp/pulpcore#2024 (https://github.com/pulp/pulpcore/issues/2024)


Some background information: We have observed this on various systems using Pulp via Katello.

I actually saw this in Pulp 2.21.4, but that version is not available on plan.io. We have also observed the issue on systems running 2.21.5.

Symptoms from Katello:

This happens sporadically with various RPM based repos, and usually not with the same repo twice in a row. However, it happens consistently enough that Katello systems with daily sync plans get stuck essentially every day. From the Katello side, the Katello sync task is simply stuck on Actions::Pulp::Repository::RegenerateApplicability forever. Once Pulp is restarted things are unstuck, and the next round of syncs succeeds.

Symptoms within Pulp:

It looks like the underlying batch_regenerate_applicability Pulp tasks are never assigned to a worker. We can find several instances of tasks like the following in mongo:

> db.task_status.find({"group_id": {$ne: null}, "state": {$ne: "finished"}}).pretty()
{
        "_id" : ObjectId("60e136cf61272888f8460e49"),
        "task_id" : "280fcb08-db6f-4a96-9f9a-4146f59eb77e",
        "exception" : null,
        "task_type" : "pulp.server.managers.consumer.applicability.batch_regenerate_applicability",
        "tags" : [ ],
        "progress_report" : {

        },
        "worker_name" : null,
        "group_id" : BinData(3,"DaJey7wqRG2XSnLKM8WS3g=="),
        "finish_time" : null,
        "start_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "state" : "waiting",
        "result" : null,
        "error" : null
}
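
A quick way to gauge how many of these never-assigned group tasks have accumulated is a count along these lines (a sketch that reuses the filter from the query above, narrowed to tasks that still have no worker; the exact set of non-terminal states may differ on other installations):

> db.task_status.count({"group_id": {$ne: null}, "worker_name": null, "state": "waiting"})
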
#1

Updated by quba42 over 3 years ago

Some more information:

It looks like the task actually blocking a worker is a sync task (and not regenerate applicability). We found this by running celery -A pulp.server.async.app inspect active on a stuck system.

When we then look up the stuck task in mongo, we see the following:

> db.task_status.find({"task_id": "b508b78d-ac87-41e8-851a-16a2d398155d"}).pretty()
{
        "_id" : ObjectId("60e00e8861272888f8feff1e"),
        "task_id" : "b508b78d-ac87-41e8-851a-16a2d398155d",
        "exception" : null,
        "task_type" : "pulp.server.managers.repo.sync.sync",
        "tags" : [
                "pulp:repository:612b1c28-35f9-4c46-9c89-ed9f3556829c",
                "pulp:action:sync"
        ],
        "finish_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "progress_report" : {
                "yum_importer" : {
                        "modules" : {
                                "state" : "NOT_STARTED"
                        },
                        "content" : {
                                "size_total" : 21250313,
                                "items_left" : 5,
                                "items_total" : 6,
                                "state" : "IN_PROGRESS",
                                "size_left" : 13868180,
                                "details" : {
                                        "rpm_total" : 2,
                                        "rpm_done" : 1,
                                        "drpm_total" : 4,
                                        "drpm_done" : 0
                                },
                                "error_details" : [
                                        {
                                                "url" : "https://updates.suse.com/SUSE/Updates/SLE-Module-Legacy/15-SP2/x86_64/update/src/ntp-4.2.8p15-4.19.1.src.rpm?AUTH_TOKEN_REDACTED!",
                                                "errors" : [
                                                        null
                                                ]
                                        }
                                ]
                        },
                        "comps" : {
                                "state" : "NOT_STARTED"
                        },
                        "purge_duplicates" : {
                                "state" : "NOT_STARTED"
                        },
                        "distribution" : {
                                "items_total" : 0,
                                "state" : "NOT_STARTED",
                                "error_details" : [ ],
                                "items_left" : 0
                        },
                        "errata" : {
                                "state" : "NOT_STARTED"
                        },
                        "metadata" : {
                                "state" : "FINISHED"
                        }
                }
        },
        "worker_name" : "reserved_resource_worker-5@host002.example.com",
        "result" : null,
        "error" : null,
        "group_id" : null,
        "state" : "canceled",
        "start_time" : "2021-07-03T07:15:42Z"

It looks a bit like a package failed to download, but instead of this error being handled, everything stays stuck until the worker is restarted...
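
For anyone hitting the same symptom, a query along these lines should surface tasks that are still pinned to a worker despite never having finished (a sketch based on the task_status fields shown above, not an official diagnostic; note that the canceled sync task above matches it, since its finish_time is still null even though its state is "canceled"):

> db.task_status.find({"worker_name": {$ne: null}, "finish_time": null, "state": {$nin: ["finished", "error"]}}, {"task_id": 1, "task_type": 1, "state": 1, "worker_name": 1, "start_time": 1}).pretty()
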

#2

Updated by mbucher over 3 years ago

Also, celery -A pulp.server.async.app inspect reserved showed a lot of pulp.server.controllers.repository.download_deferred tasks for the worker in question, as well as various pulp.server.managers.consumer.applicability.batch_regenerate_applicability tasks. The latter appear to be the ones shown as hanging in the ForemanTasks UI.

We assume they are somehow blocked by the one hanging task shown in https://pulp.plan.io/issues/9011#note-1 above.
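
To see how the unfinished tasks distribute across workers (and whether they all pile up behind the one stuck worker), an aggregation like the following can help; this is only a sketch based on the fields visible above, where an _id of null means no worker has been assigned yet:

> db.task_status.aggregate([{$match: {"state": {$nin: ["finished", "error", "canceled"]}}}, {$group: {"_id": "$worker_name", "count": {$sum: 1}}}])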

#3

Updated by dkliban@redhat.com over 3 years ago

  • Triaged changed from No to Yes
#4

Updated by quba42 over 3 years ago

I just wanted to add another update from our end:

We have now been able to manage the problem by rigorously spreading out our Katello sync plans. (The affected system went from daily occurrences to no occurrences at all.)

The probability of running into this issue appears to depend heavily on starting a lot of Pulp tasks within a short time. We are not aware of system load or resources (e.g. running out of memory) being a factor.

The most likely course of action is to keep "managing the symptoms" until we switch to Pulp 3.

#5

Updated by pulpbot almost 3 years ago

  • Description updated (diff)
  • Status changed from NEW to CLOSED - DUPLICATE
