Issue #2584
closedMongo cursor times out during task pulp.server.managers.content.orphan.delete_all_orphans
Description
On a very busy system, a user saw the cursor timeout:
Feb 2 04:13:52 satellite1 pulp: py.warnings:WARNING: (1955-62752) /usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py:228: DeprecationWarning: remove is deprecated. Use delete_one or delete_many instead.
Feb 2 04:13:52 satellite1 pulp: py.warnings:WARNING: (1955-62752) content_units_collection.remove(content_unit['_id'])
Feb 2 04:13:52 satellite1 pulp: py.warnings:WARNING: (1955-62752)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) Task pulp.server.managers.content.orphan.delete_all_orphans[ece93a94-907b-4c1a-b4a4-474a99893be2] raised unexpected: CursorNotFound('Cursor not found, cursor id: 163452381930',)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) Traceback (most recent call last):
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) R = retval = fun(*args, **kwargs)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 473, in __call__
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) return super(Task, self).__call__(*args, **kwargs)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 103, in __call__
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) return super(PulpTask, self).__call__(*args, **kwargs)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) return self.run(*args, **kwargs)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py", line 170, in delete_all_orphans
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) OrphanManager.delete_orphans_by_type(content_type_id)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py", line 219, in delete_orphans_by_type
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) fields=['_id', '_storage_path']):
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib/python2.7/site-packages/pulp/server/managers/content/orphan.py", line 106, in generate_orphans_by_type
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) for content_unit in content_units_collection.find({}, projection=fields):
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 1097, in next
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) if len(self.__data) or self._refresh():
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 1039, in _refresh
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) self.__max_await_time_ms))
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 903, in __send_message
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) codec_options=self.__codec_options)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) File "/usr/lib64/python2.7/site-packages/pymongo/helpers.py", line 114, in _unpack_response
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) raise CursorNotFound(msg, 43, errobj)
Feb 2 04:28:43 satellite1 pulp: celery.worker.job:ERROR: (1358-62752) CursorNotFound: Cursor not found, cursor id: 163452381930
Related issues
Updated by mhrivnak about 7 years ago
While investigating options, I found that setting a batch size of 20 caused about a 10% slowdown of the task when iterating over about 50,000 RPMs, and not finding any orphans. That performance penalty might be worth it for the safety.
Otherwise, we could run with no batch size by default, catch the exception, and re-start the iteration with a batch size in case the cursor is lost. That would be more complex, but still fairly straight-forward. It would also mean the task taking a lot more time when encountering the error, but less time in the usual case when there is no error.
Updated by bizhang about 7 years ago
- Sprint/Milestone set to 33
- Triaged changed from No to Yes
Updated by mhrivnak about 7 years ago
Using a batch size of 100, I see virtually no measurable performance impact vs. not setting a batch size. Assuming the default 10 minute cursor timeout, that allows 6 seconds per loop iteration, which should be plenty. That would be an easy way to resolve this issue.
Updated by ttereshc about 7 years ago
+1 to batch size of 100. I had the similar experience in terms of performance while iterating over unit collections cursor with such batch size.
Updated by ttereshc about 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to ttereshc
Added by ttereshc about 7 years ago
Added by ttereshc about 7 years ago
Revision d85f6647 | View on GitHub
Set batch_size for the cursor in orphan removal tasks
On a busy system a cursor for iterating over all units in a collection could be timed out during orphan removal. The batch size is set to 100 to make sure cursor is active, it also should not introduce any noticable performance impact.
Updated by ttereshc about 7 years ago
- Status changed from ASSIGNED to POST
Updated by ttereshc about 7 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|d85f66472be63ed1a25bcb6166bc384246013c41.
Updated by bizhang about 7 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Updated by daviddavis almost 6 years ago
- Related to Issue #3854: Mongo cursor times out during task pulp.server.managers.content.orphan.delete_all_orphans added
Set batch_size for the cursor in orphan removal tasks
On a busy system a cursor for iterating over all units in a collection could be timed out during orphan removal. The batch size is set to 100 to make sure cursor is active, it also should not introduce any noticable performance impact.
closes #2584 https://pulp.plan.io/issues/2584