Issue #9011

batch_regenerate_applicability tasks are never assigned a worker and pulp is stuck until restart

Added by quba42 3 months ago. Updated 2 months ago.

Status: NEW
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version: 2.21.1
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

Some background: we have observed this on various systems using Pulp via Katello.

We actually first saw this in Pulp 2.21.4, but that version is not selectable on plan.io. We have also observed the issue on systems running 2.21.5.

Symptoms from Katello:

This happens sporadically with various RPM-based repos, and usually not with the same repo twice in a row. However, it happens consistently enough that Katello systems with daily sync plans get stuck essentially every day. From the Katello side, the sync task is simply stuck on Actions::Pulp::Repository::RegenerateApplicability forever. Once Pulp is restarted, things are unstuck and the next round of syncs succeeds.

Symptoms within Pulp:

It looks like the underlying batch_regenerate_applicability Pulp tasks are never assigned to a worker. We can find several instances of the following task documents in mongo:

> db.task_status.find({"group_id": {$ne: null}, "state": {$ne: "finished"}}).pretty()
{
        "_id" : ObjectId("60e136cf61272888f8460e49"),
        "task_id" : "280fcb08-db6f-4a96-9f9a-4146f59eb77e",
        "exception" : null,
        "task_type" : "pulp.server.managers.consumer.applicability.batch_regenerate_applicability",
        "tags" : [ ],
        "progress_report" : {

        },
        "worker_name" : null,
        "group_id" : BinData(3,"DaJey7wqRG2XSnLKM8WS3g=="),
        "finish_time" : null,
        "start_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "state" : "waiting",
        "result" : null,
        "error" : null
}
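The stuck tasks share a recognizable shape: state "waiting", a non-null group_id, and no worker_name. A quick way to spot them outside of the mongo shell is to filter the documents in plain Python (a minimal sketch; find_unassigned_tasks is a hypothetical helper, not part of Pulp, and the document below is abbreviated from the dump above):

```python
# Hypothetical helper: given task_status documents (as dicts, shaped like the
# one above), pick out group tasks that were never assigned a worker.

def find_unassigned_tasks(task_docs):
    """Return tasks still 'waiting' that belong to a group but have no worker."""
    return [
        doc for doc in task_docs
        if doc.get("state") == "waiting"
        and doc.get("worker_name") is None
        and doc.get("group_id") is not None
    ]

stuck = find_unassigned_tasks([
    {
        "task_id": "280fcb08-db6f-4a96-9f9a-4146f59eb77e",
        "task_type": "pulp.server.managers.consumer.applicability."
                     "batch_regenerate_applicability",
        "state": "waiting",
        "worker_name": None,
        "group_id": "DaJey7wqRG2XSnLKM8WS3g==",
    },
])
print([t["task_id"] for t in stuck])
# → ['280fcb08-db6f-4a96-9f9a-4146f59eb77e']
```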

History

#1 Updated by quba42 3 months ago

Some more information:

It looks like the task actually blocking a worker is a sync task (not a regenerate applicability task). We found this by running celery -A pulp.server.async.app inspect active on a stuck system.

When we then look for the stuck path in mongo we see the following:

> db.task_status.find({"task_id": "b508b78d-ac87-41e8-851a-16a2d398155d"}).pretty()
{
        "_id" : ObjectId("60e00e8861272888f8feff1e"),
        "task_id" : "b508b78d-ac87-41e8-851a-16a2d398155d",
        "exception" : null,
        "task_type" : "pulp.server.managers.repo.sync.sync",
        "tags" : [
                "pulp:repository:612b1c28-35f9-4c46-9c89-ed9f3556829c",
                "pulp:action:sync"
        ],
        "finish_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "progress_report" : {
                "yum_importer" : {
                        "modules" : {
                                "state" : "NOT_STARTED"
                        },
                        "content" : {
                                "size_total" : 21250313,
                                "items_left" : 5,
                                "items_total" : 6,
                                "state" : "IN_PROGRESS",
                                "size_left" : 13868180,
                                "details" : {
                                        "rpm_total" : 2,
                                        "rpm_done" : 1,
                                        "drpm_total" : 4,
                                        "drpm_done" : 0
                                },
                                "error_details" : [
                                        {
                                                "url" : "https://updates.suse.com/SUSE/Updates/SLE-Module-Legacy/15-SP2/x86_64/update/src/ntp-4.2.8p15-4.19.1.src.rpm?AUTH_TOKEN_REDACTED!",
                                                "errors" : [
                                                        null
                                                ]
                                        }
                                ]
                        },
                        "comps" : {
                                "state" : "NOT_STARTED"
                        },
                        "purge_duplicates" : {
                                "state" : "NOT_STARTED"
                        },
                        "distribution" : {
                                "items_total" : 0,
                                "state" : "NOT_STARTED",
                                "error_details" : [ ],
                                "items_left" : 0
                        },
                        "errata" : {
                                "state" : "NOT_STARTED"
                        },
                        "metadata" : {
                                "state" : "FINISHED"
                        }
                }
        },
        "worker_name" : "reserved_resource_worker-5@host002.example.com",
        "result" : null,
        "error" : null,
        "group_id" : null,
        "state" : "canceled",
        "start_time" : "2021-07-03T07:15:42Z"
}
It looks as if a package failed to download, but instead of this error being handled, everything stays stuck until the worker is restarted...
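The suspicious part of the document above is the combination of fields: state is already "canceled", yet worker_name is still set and finish_time is null, i.e. the worker slot was apparently never released. A hedged sketch of that check (find_zombie_tasks is a hypothetical helper, not part of Pulp):

```python
# Hypothetical check, assuming task_status documents shaped like the one above:
# a "canceled" task that still holds a worker and has no finish_time suggests
# the worker never actually released the slot.

def find_zombie_tasks(task_docs):
    """Return canceled tasks that still hold a worker and never finished."""
    return [
        doc for doc in task_docs
        if doc.get("state") == "canceled"
        and doc.get("worker_name") is not None
        and doc.get("finish_time") is None
    ]

zombies = find_zombie_tasks([
    {
        "task_id": "b508b78d-ac87-41e8-851a-16a2d398155d",
        "task_type": "pulp.server.managers.repo.sync.sync",
        "state": "canceled",
        "worker_name": "reserved_resource_worker-5@host002.example.com",
        "finish_time": None,
    },
])
print([t["worker_name"] for t in zombies])
# → ['reserved_resource_worker-5@host002.example.com']
```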

#2 Updated by mbucher 3 months ago

Also, celery -A pulp.server.async.app inspect reserved showed a lot of pulp.server.controllers.repository.download_deferred tasks for the worker in question, as well as various pulp.server.managers.consumer.applicability.batch_regenerate_applicability tasks. The latter appear to be the ones shown as hanging in the ForemanTasks UI.

We assume they are somehow blocked by the one hanging task shown above: https://pulp.plan.io/issues/9011#note-1
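To see how many tasks are piled up behind a single worker, the inspect reserved output can be summarized per task type. A minimal sketch: the structure below (worker name mapping to a list of task dicts with a "name" key) mirrors what celery's inspect API returns, but the data itself is made up for illustration:

```python
from collections import Counter

def summarize_reserved(reserved):
    """Count reserved tasks per task type for each worker."""
    return {
        worker: Counter(task["name"] for task in tasks)
        for worker, tasks in reserved.items()
    }

# Fabricated example data in the shape of `inspect reserved` output.
summary = summarize_reserved({
    "reserved_resource_worker-5@host002.example.com": [
        {"name": "pulp.server.controllers.repository.download_deferred"},
        {"name": "pulp.server.managers.consumer.applicability."
                 "batch_regenerate_applicability"},
        {"name": "pulp.server.managers.consumer.applicability."
                 "batch_regenerate_applicability"},
    ],
})
print(summary)
```

A long queue behind one worker, while the worker's only active task is the canceled-but-stuck sync, would match the behavior we observed.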

#3 Updated by dkliban@redhat.com 2 months ago

  • Triaged changed from No to Yes

#4 Updated by quba42 2 months ago

I just wanted to add another update from our end:

We were able to manage the problem by rigorously spreading out our Katello sync plans (the affected system went from daily occurrences to no more occurrences).

The probability of running into this issue appears to depend heavily on starting a lot of Pulp tasks within a short time. We have no indication that system load or resources (e.g. running out of memory) were a factor.

The most likely course of action for us is to keep "managing the symptoms" until we switch to Pulp 3.
