Issue #9011

batch_regenerate_applicability tasks are never assigned a worker and pulp is stuck until restart

Added by quba42 3 months ago. Updated 2 months ago.

Status: NEW
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version: 2.21.1
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

Some background: we have observed this on various systems using Pulp via Katello.

We actually first saw this in Pulp 2.21.4, but that version is not selectable on plan.io. We have also observed the issue on systems running 2.21.5.

Symptoms from Katello:

This happens sporadically with various RPM-based repos, and usually not with the same repo twice in a row. However, it happens consistently enough that Katello systems with daily sync plans get stuck essentially every day. From the Katello side, the sync task is simply stuck on Actions::Pulp::Repository::RegenerateApplicability forever. Once Pulp is restarted, things are unstuck and the next round of syncs succeeds.

Symptoms within Pulp:

It looks like the underlying batch_regenerate_applicability Pulp tasks are never assigned to a worker. We can find several instances of the following task documents in mongo:

> db.task_status.find({"group_id": {$ne: null}, "state": {$ne: "finished"}}).pretty()
{
        "_id" : ObjectId("60e136cf61272888f8460e49"),
        "task_id" : "280fcb08-db6f-4a96-9f9a-4146f59eb77e",
        "exception" : null,
        "task_type" : "pulp.server.managers.consumer.applicability.batch_regenerate_applicability",
        "tags" : [ ],
        "progress_report" : {

        },
        "worker_name" : null,
        "group_id" : BinData(3,"DaJey7wqRG2XSnLKM8WS3g=="),
        "finish_time" : null,
        "start_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "state" : "waiting",
        "result" : null,
        "error" : null
}
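The stuck tasks share a recognizable shape: state "waiting", a non-null group_id, and no worker_name. A quick way to spot them outside of the mongo shell is to filter the documents in plain Python (a minimal sketch; find_unassigned_tasks is a hypothetical helper, not part of Pulp, and the document below is abbreviated from the dump above):

```python
# Hypothetical helper: given task_status documents (as dicts, shaped like the
# one above), pick out group tasks that were never assigned a worker.

def find_unassigned_tasks(task_docs):
    """Return tasks still 'waiting' that belong to a group but have no worker."""
    return [
        doc for doc in task_docs
        if doc.get("state") == "waiting"
        and doc.get("worker_name") is None
        and doc.get("group_id") is not None
    ]

stuck = find_unassigned_tasks([
    {
        "task_id": "280fcb08-db6f-4a96-9f9a-4146f59eb77e",
        "task_type": "pulp.server.managers.consumer.applicability."
                     "batch_regenerate_applicability",
        "state": "waiting",
        "worker_name": None,
        "group_id": "DaJey7wqRG2XSnLKM8WS3g==",
    },
])
print([t["task_id"] for t in stuck])
# → ['280fcb08-db6f-4a96-9f9a-4146f59eb77e']
```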

History

#1 Updated by quba42 3 months ago

Some more information:

It looks like the task actually blocking a worker is a sync task (not a regenerate applicability task). We found this by running celery -A pulp.server.async.app inspect active on a stuck system.

When we then look for the stuck path in mongo we see the following:

> db.task_status.find({"task_id": "b508b78d-ac87-41e8-851a-16a2d398155d"}).pretty()
{
        "_id" : ObjectId("60e00e8861272888f8feff1e"),
        "task_id" : "b508b78d-ac87-41e8-851a-16a2d398155d",
        "exception" : null,
        "task_type" : "pulp.server.managers.repo.sync.sync",
        "tags" : [
                "pulp:repository:612b1c28-35f9-4c46-9c89-ed9f3556829c",
                "pulp:action:sync"
        ],
        "finish_time" : null,
        "traceback" : null,
        "spawned_tasks" : [ ],
        "progress_report" : {
                "yum_importer" : {
                        "modules" : {
                                "state" : "NOT_STARTED"
                        },
                        "content" : {
                                "size_total" : 21250313,
                                "items_left" : 5,
                                "items_total" : 6,
                                "state" : "IN_PROGRESS",
                                "size_left" : 13868180,
                                "details" : {
                                        "rpm_total" : 2,
                                        "rpm_done" : 1,
                                        "drpm_total" : 4,
                                        "drpm_done" : 0
                                },
                                "error_details" : [
                                        {
                                                "url" : "https://updates.suse.com/SUSE/Updates/SLE-Module-Legacy/15-SP2/x86_64/update/src/ntp-4.2.8p15-4.19.1.src.rpm?AUTH_TOKEN_REDACTED!",
                                                "errors" : [
                                                        null
                                                ]
                                        }
                                ]
                        },
                        "comps" : {
                                "state" : "NOT_STARTED"
                        },
                        "purge_duplicates" : {
                                "state" : "NOT_STARTED"
                        },
                        "distribution" : {
                                "items_total" : 0,
                                "state" : "NOT_STARTED",
                                "error_details" : [ ],
                                "items_left" : 0
                        },
                        "errata" : {
                                "state" : "NOT_STARTED"
                        },
                        "metadata" : {
                                "state" : "FINISHED"
                        }
                }
        },
        "worker_name" : "reserved_resource_worker-5@host002.example.com",
        "result" : null,
        "error" : null,
        "group_id" : null,
        "state" : "canceled",
        "start_time" : "2021-07-03T07:15:42Z"
}
It looks as if a package failed to download, but instead of this error being handled, everything stays stuck until the worker is restarted...
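The suspicious part of the document above is the combination of fields: state is already "canceled", yet worker_name is still set and finish_time is null, i.e. the worker slot was apparently never released. A hedged sketch of that check (find_zombie_tasks is a hypothetical helper, not part of Pulp):

```python
# Hypothetical check, assuming task_status documents shaped like the one above:
# a "canceled" task that still holds a worker and has no finish_time suggests
# the worker never actually released the slot.

def find_zombie_tasks(task_docs):
    """Return canceled tasks that still hold a worker and never finished."""
    return [
        doc for doc in task_docs
        if doc.get("state") == "canceled"
        and doc.get("worker_name") is not None
        and doc.get("finish_time") is None
    ]

zombies = find_zombie_tasks([
    {
        "task_id": "b508b78d-ac87-41e8-851a-16a2d398155d",
        "task_type": "pulp.server.managers.repo.sync.sync",
        "state": "canceled",
        "worker_name": "reserved_resource_worker-5@host002.example.com",
        "finish_time": None,
    },
])
print([t["worker_name"] for t in zombies])
# → ['reserved_resource_worker-5@host002.example.com']
```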

#2 Updated by mbucher 3 months ago

Also, celery -A pulp.server.async.app inspect reserved showed a lot of pulp.server.controllers.repository.download_deferred tasks for the worker in question, as well as various pulp.server.managers.consumer.applicability.batch_regenerate_applicability tasks. The latter appear to be the ones shown as hanging in the ForemanTasks UI.

We assume they are somehow blocked by the one hanging task shown above: https://pulp.plan.io/issues/9011#note-1
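To see how many tasks are piled up behind a single worker, the inspect reserved output can be summarized per task type. A minimal sketch: the structure below (worker name mapping to a list of task dicts with a "name" key) mirrors what celery's inspect API returns, but the data itself is made up for illustration:

```python
from collections import Counter

def summarize_reserved(reserved):
    """Count reserved tasks per task type for each worker."""
    return {
        worker: Counter(task["name"] for task in tasks)
        for worker, tasks in reserved.items()
    }

# Fabricated example data in the shape of `inspect reserved` output.
summary = summarize_reserved({
    "reserved_resource_worker-5@host002.example.com": [
        {"name": "pulp.server.controllers.repository.download_deferred"},
        {"name": "pulp.server.managers.consumer.applicability."
                 "batch_regenerate_applicability"},
        {"name": "pulp.server.managers.consumer.applicability."
                 "batch_regenerate_applicability"},
    ],
})
print(summary)
```

A long queue behind one worker, while the worker's only active task is the canceled-but-stuck sync, would match the behavior we observed.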

#3 Updated by dkliban@redhat.com 2 months ago

  • Triaged changed from No to Yes

#4 Updated by quba42 2 months ago

I just wanted to add another update from our end:

We were able to manage the problem by rigorously spreading out our Katello sync plans (the affected system went from daily occurrences to no more occurrences).

The probability of running into this issue appears to depend heavily on starting a lot of Pulp tasks within a short time. We have no indication that system load or resources (e.g. running out of memory) were a factor.

The most likely course of action for us is to keep "managing the symptoms" until we switch to Pulp 3.
