Issue #4263
Tasks Stuck in Running (closed)
Description
My infrastructure has three independent Pulp servers, one of which is the "Master" and is where new packages are uploaded and promoted, as well as the point where external repositories are synced down from the internet. The two other Pulp servers reside in different data centers, and sync nightly from this master server. All three servers are virtual machines running CentOS 7.2 on KVM hosts. They have 8 vCPUs and 32GB of memory each, and they serve both mirrors of external repositories (e.g. CentOS, EPEL) and internal-only repositories for our custom-developed packages. Each data center is pointed to its appropriate Pulp server for Yum using DNS.
I recently upgraded all three servers to Pulp 2.18. The Master server is operating normally, but the two "Slave" servers seem to be unable to complete Sync tasks. The tasks will queue up, but they never complete. As a result, the Pulp server enters what I would call a death spiral where tasks build up endlessly and system resources are consumed by MongoDB.
The tasks that are stuck in a Running state typically look like the following:
[root@app01.pulp.prod.sea02 jrogers] # pulp-admin tasks details --task-id 17979b7c-eb2c-4779-9d51-9429239408eb
--------------------------------------------------------------------
Task Details
--------------------------------------------------------------------
Operations: sync
Resources: signalsciences-6-unstable (repository)
State: Running
Start Time: 2018-12-11T17:00:17Z
Finish Time: Incomplete
Result: Incomplete
Task Id: 17979b7c-eb2c-4779-9d51-9429239408eb
Worker Name: reserved_resource_worker-7@app01.pulp.prod.sea02.rsglab.com
Progress Report:
  Yum Importer:
    Comps:
      State: NOT_STARTED
    Content:
      Details:
        Drpm Done: 0
        Drpm Total: 0
        Rpm Done: 0
        Rpm Total: 0
      Error Details:
      Items Left: 0
      Items Total: 0
      Size Left: 0
      Size Total: 0
      State: FINISHED
    Distribution:
      Error Details:
      Items Left: 0
      Items Total: 0
      State: FINISHED
    Errata:
      State: NOT_STARTED
    Metadata:
      State: FINISHED
    Modules:
      State: IN_PROGRESS
    Purge Duplicates:
      State: NOT_STARTED
Here is some output from /var/log/messages (all Pulp logging is set up to go to syslog) regarding the above task:
Dec 11 16:54:54 app01 pulp: celery.app.trace:INFO: [03c5521c] Task pulp.server.tasks.repository.sync_with_auto_publish[03c5521c-7365-40e6-a8fc-108da065e1fe] succeeded in 0.0105378520093s: <AsyncResult: 17979b7c-eb2c-4779-9d51-9429239408eb>
Dec 11 17:00:17 app01 pulp: celery.worker.strategy:INFO: Received task: pulp.server.managers.repo.sync.sync[17979b7c-eb2c-4779-9d51-9429239408eb]
Dec 11 17:00:17 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading metadata from http://yum.rsglab.com//signalsciences/6/x86_64/unstable/.
Dec 11 17:00:18 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Parsing metadata.
Dec 11 17:00:18 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading metadata from http://yum.rsglab.com//signalsciences/6/x86_64/unstable/.
Dec 11 17:00:19 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Parsing metadata.
Dec 11 17:00:19 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading metadata from http://yum.rsglab.com//signalsciences/6/x86_64/unstable/.
Dec 11 17:00:20 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Parsing metadata.
Dec 11 17:00:20 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading metadata from http://yum.rsglab.com//signalsciences/6/x86_64/unstable/.
Dec 11 17:00:21 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Parsing metadata.
Dec 11 17:00:21 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading metadata files.
Dec 11 17:00:22 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Generating metadata databases.
Dec 11 17:00:22 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Determining which units need to be downloaded.
Dec 11 17:00:27 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading 0 RPMs.
Dec 11 17:00:27 app01 pulp: pulp_rpm.plugins.importers.yum.repomd.alternate:INFO: [17979b7c] The content container reported: {'downloads': {}, 'total_sources': 0} for base URL: http://yum.rsglab.com//signalsciences/6/x86_64/unstable/
Dec 11 17:00:27 app01 pulp: pulp_rpm.plugins.importers.yum.purge:INFO: [17979b7c] Removing missing units.
Dec 11 17:00:28 app01 pulp: pulp_rpm.plugins.importers.yum.sync:INFO: [17979b7c] Downloading additional units.
Dec 11 17:00:31 app01 pulp: pulp.server.async.tasks:INFO: [e5e8b4ce] Task failed : [17979b7c-eb2c-4779-9d51-9429239408eb] : Worker terminated abnormally while processing task 17979b7c-eb2c-4779-9d51-9429239408eb. Check the logs for details
Dec 11 20:48:22 app01 pulp: pulp.server.async.tasks:INFO: Task canceled: 17979b7c-eb2c-4779-9d51-9429239408eb.
Note that in this case, there was no change to the upstream repository (yum.rsglab.com points to the Master Pulp server), so there was nothing to do, but the task failed anyway. Also of note: although the log says the task was canceled, it was still in the queue.
Updated by quantumfoam almost 6 years ago
I also discovered that this error is happening:
Dec 12 16:06:27 app01 pulp: celery.worker.strategy:INFO: Received task: pulp.server.tasks.repository.sync_with_auto_publish[1705af4f-33ca-407e-8787-a9d8bb42ec97]
Dec 12 16:06:27 app01 pulp: pulp.server.db.connection:INFO: Attempting to connect to localhost:27017
Dec 12 16:06:27 app01 pulp: pulp.server.db.connection:INFO: Attempting to connect to localhost:27017
Dec 12 16:06:28 app01 pulp: pulp.server.db.connection:INFO: Write concern for Mongo connection: {}
Dec 12 16:06:28 app01 pulp: py.warnings:WARNING: [6b7aa1f9] (6849-00416) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#using-pymongo-with-multiprocessing>
Dec 12 16:06:28 app01 pulp: py.warnings:WARNING: [6b7aa1f9] (6849-00416) "MongoClient opened before fork. Create MongoClient "
Dec 12 16:06:28 app01 pulp: py.warnings:WARNING: [6b7aa1f9] (6849-00416)
Dec 12 16:06:28 app01 pulp: pulp.server.async.tasks:INFO: [6b7aa1f9] Task failed : [61097499-5130-4e24-a617-0d041d94da4c] : Worker terminated abnormally while processing task 61097499-5130-4e24-a617-0d041d94da4c. Check the logs for details
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) Task pulp.server.async.tasks._release_resource[6b7aa1f9-687e-4f88-a728-99925eb0a83d] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) Traceback (most recent call last):
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 382, in trace_task
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) R = retval = fun(*args, **kwargs)
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 107, in __call__
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) return super(PulpTask, self).__call__(*args, **kwargs)
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 641, in __protected_call__
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) return self.run(*args, **kwargs)
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 296, in _release_resource
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 602, in on_failure
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) if not self.request.called_directly:
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 973, in _get_request
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) req = self.request_stack.top
Dec 12 16:06:28 app01 pulp: celery.app.trace:ERROR: [6b7aa1f9] (6849-00416) AttributeError: 'NoneType' object has no attribute 'top'
Updated by dalley almost 6 years ago
While I don't think this is "related", I will point out that I've seen the "AttributeError: 'NoneType' object has no attribute 'top'" traceback before.
https://pulp.plan.io/issues/2990
As far as I can tell, it just happens when tasks are exited uncleanly sometimes, and probably isn't related to this specific issue.
To me, this looks like 2 bugs:
- The task clearly failed as per the logs, but its state in Pulp remains "Running" indefinitely, causing the tasks to back up
- Whatever caused the task to terminate abnormally in the first place, seemingly something in the modules portion of the sync
And regarding the second issue, it seems to only affect Pulp syncing from other instances of Pulp, so the issue might not be with sync itself; it could be that Pulp is publishing something incorrectly, and then failing on the subsequent Pulp-to-Pulp sync.
Updated by dalley almost 6 years ago
- Triaged changed from No to Yes
- Sprint Candidate changed from No to Yes
- Sprint set to Sprint 46
Updated by dalley almost 6 years ago
I've been attempting to reproduce this for a few hours and haven't been successful thus far. The main theory for problem #2 is that if a repo has no modules metadata, then Pulp might create an empty modules archive which could cause issues when being re-ingested. I confirmed that it's creating an empty archive for modules.yaml, but this didn't cause any problems when syncing onto the second box.
I did notice that re-publishing the same repository, even if nothing has changed, will create multiple modules archives, which is almost certainly incorrect behavior; I'll file a bug for that. But likewise, I wasn't able to make syncing such a repository fail.
@quantumfoam, do you happen to know if this is the case for any/all repositories you try to sync, or only this one in particular? Could you verify the version of the libmodulemd package you have installed?
Could you look at the "master" Pulp repo metadata for the repo which is being problematic, and attach any modules.yaml.gz metadata files?
edit: in order to find out if it's multiple repositories, you would need to unblock one of the "slave" boxes by cancelling the stuck task
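For reference, something along these lines should surface what's being asked for above. This is only a sketch: the publish path assumes a default Pulp 2 layout and the repo's relative URL from the logs, and the task id is the one from the description.

# On the "master" box, inspect the published module metadata for the repo in question
# (path is an assumption based on a default Pulp 2 publish layout; adjust as needed):
grep -A 3 'type="modules"' /var/lib/pulp/published/yum/https/repos/signalsciences/6/x86_64/unstable/repodata/repomd.xml
zcat /var/lib/pulp/published/yum/https/repos/signalsciences/6/x86_64/unstable/repodata/*modules.yaml.gz | head

# On a "slave" box, report the installed libmodulemd version and unblock the worker
# by cancelling the stuck sync task:
rpm -q libmodulemd
pulp-admin tasks cancel --task-id 17979b7c-eb2c-4779-9d51-9429239408eb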
Updated by dalley almost 6 years ago
Here's the duplicate modules metadata archives issue I mentioned https://pulp.plan.io/issues/4268
Updated by milan almost 6 years ago
@quantumfoam I witnessed the "'NoneType' object has no attribute 'top'" error when I managed to induce a segfault in the libsolv library (which Pulp has used since 2.17). It may not be relevant, but another thing to try: enable core dumps (through systemd). I found some Stack Overflow instructions ...
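A rough sketch of the core-dump suggestion, assuming the workers run under the pulp_workers systemd unit; the drop-in and core_pattern paths below are illustrative, not taken from this setup:

# Allow the worker processes to dump core (unit name is an assumption):
mkdir -p /etc/systemd/system/pulp_workers.service.d
printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/pulp_workers.service.d/core.conf
# Write cores somewhere predictable (hypothetical pattern):
echo 'kernel.core_pattern=/var/tmp/core.%e.%p' > /etc/sysctl.d/90-coredump.conf
sysctl --system
systemctl daemon-reload && systemctl restart pulp_workers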
Updated by quantumfoam almost 6 years ago
It appears it is only happening with certain repositories (or at the very least, I am seeing some tasks succeed in the log). My theory is that we had repositories that were failing to sync properly before, but this was not causing any problems because the tasks would terminate, and then clear. The tasks are not clearing now, unless they are manually cancelled. The two "slave" Pulp servers are in fact fully self-contained servers that simply have the "master" set as their upstream for their repositories, with a nightly sync that pulls down updated repo content. In the event that someone uploads a new package and needs to promote it and re-publish the repository on each server, they run a manual sync on that repo. That is currently not working. Running a manual sync on seemingly any repo from one of the slave servers results in the sync job getting stuck on the Modules section.
Notably, the "master" server is functioning normally. It is pulling down content from internet repositories. It can accept new packages, promote them, and publish repositories. Therefore, it does seem that this is isolated to other Pulp servers syncing down from it.
Updated by dalley almost 6 years ago
@quantumfoam Interesting. I should have explicitly mentioned this but I was trying to reproduce with 2 Pulp boxes, one syncing from the other.
Do some repositories consistently work and others consistently fail?
Updated by quantumfoam almost 6 years ago
I see tasks flowing through that are completing, but it just occurred to me that most (all?) of what I was seeing were publish tasks. Let me go dig some more and see what I can get for you.
Updated by dalley almost 6 years ago
@quantumfoam, is this similar to the issue you're experiencing? https://pulp.plan.io/issues/4333
Updated by dalley almost 6 years ago
- Related to Issue #4333: repo syncs fail causing worker to crash with /lib64/libmodulemd.so.1: undefined symbol: g_log_structured_standard added
Updated by dalley almost 6 years ago
@quantumfoam, according to @jsherrill, the solution is to install libmodulemd 1.7.0 from here. One other Katello dev also experienced this problem and verified the fix:
https://kojipkgs.fedoraproject.org/packages/libmodulemd/1.7.0/1.el7/x86_64/
sudo dnf install https://kojipkgs.fedoraproject.org/packages/libmodulemd/1.7.0/1.el7/x86_64/libmodulemd-1.7.0-1.el7.x86_64.rpm
Could you let me know if that also works for you?
We will fix this upstream in 2.18.1
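In case it helps, a hedged way to confirm on an affected box that the new library resolves the symbol named in #4333 (the ldd check is just a sketch, not a verified procedure):

rpm -q libmodulemd
# Before the fix this prints an "undefined symbol: g_log_structured_standard" line;
# after installing 1.7.0 it should print nothing:
ldd -r /usr/lib64/libmodulemd.so.1 2>&1 | grep g_log_structured_standard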
Updated by dalley almost 6 years ago
- Category deleted (15)
- Status changed from ASSIGNED to CLOSED - DUPLICATE
- Assignee deleted (dalley)
Closing this one because the new issue contains more actionable information
Updated by dalley almost 6 years ago
- Related to deleted (Issue #4333: repo syncs fail causing worker to crash with /lib64/libmodulemd.so.1: undefined symbol: g_log_structured_standard)
Updated by dalley almost 6 years ago
- Has duplicate Issue #4333: repo syncs fail causing worker to crash with /lib64/libmodulemd.so.1: undefined symbol: g_log_structured_standard added