Issue #2959
closedTask results can be lost due to race condition
Description
There seems to be a bug in pulp.server.async.tasks which allows some fields on TaskStatus to be lost in certain rare cases.
We've been seeing repo associate tasks sometimes complete successfully but with certain fields unexpectedly null (e.g. finish_time, result). After running mongo with detailed logging enabled we captured following query logs (annotated):
# conn83: httpd inserted new TaskStatus object
2017-08-06T04:52:16.456+0000 I WRITE [conn83] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { exception: null, task_type: "pulp.server.managers.repo.unit_association.associate_from_repo", tags: [ "pulp:repository:rhel-6-for-power-optional-rpms__6Server__ppc64", "pulp:repository:test_default_42833840", "pulp:action:associate" ], finish_time: null, _ns: "task_status", traceback: null, spawned_tasks: [], progress_report: {}, worker_name: null, result: null, error: null, id: null }, $setOnInsert: { state: "waiting", start_time: null } } nscanned:0 nscannedObjects:0 nMatched:1 nModified:1 upsert:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms
# conn80: resource manager assigns task to a worker
2017-08-06T04:52:16.537+0000 I WRITE [conn80] update pulp_database.reserved_resources query: { _id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { _id: "dcc17e72-a512-4845-9c78-89075fe52db4", worker_name: "reserved_resource_worker-0@4ac31dbd3485", resource_id: "repository:rhel-6-for-power-optional-rpms__6Server__ppc64", _ns: "reserved_resources" } nscanned:0 nscannedObjects:0 nMatched:1 nModified:1 upsert:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms
# conn82: worker sets state to running
2017-08-06T04:52:16.568+0000 I WRITE [conn82] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { start_time: "2017-08-06T04:52:16Z", state: "running" } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:1 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms
# conn82: worker sets state to finished, with a result and finish_time
2017-08-06T04:52:16.582+0000 I WRITE [conn82] update pulp_database.task_status query: { _id: ObjectId('5986a080d4e44b661e89f944') } update: { $set: { finish_time: "2017-08-06T04:52:16Z", state: "finished", result: { units_successful: [ { unit_key: { id: "RHBA-2013:1738" }, type_id: "erratum" } ] } } } nscanned:1 nscannedObjects:1 nmoved:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms
# conn80: resource manager sets result, finish_time, other fields to null/empty
2017-08-06T04:52:16.584+0000 I WRITE [conn80] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { exception: null, task_type: "pulp.server.managers.repo.unit_association.associate_from_repo", tags: [ "pulp:repository:rhel-6-for-power-optional-rpms__6Server__ppc64", "pulp:repository:test_default_42833840", "pulp:action:associate" ], finish_time: null, _ns: "task_status", traceback: null, spawned_tasks: [], progress_report: {}, worker_name: "reserved_resource_worker-0@4ac31dbd3485", result: null, error: null, id: null }, $setOnInsert: { state: "waiting", start_time: null } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms
After the above sequence of queries, the task was left finished but with null finish_time and result fields.
It looks like this happens when a worker picks up and completes a task very quickly.
There seems to be a race here (annotated with ####):
class Task(PulpTask, ReservedTaskMixin):
...
def apply_async(self, *args, **kwargs):
....
#### This is running in resource_manager.
#### This enqueues the task for a worker in qpid:
try:
async_result = super(Task, self).apply_async(*args, **kwargs)
except:
if 'task_id' in kwargs:
TaskStatus.objects(task_id=kwargs['task_id']).update(
state=constants.CALL_ERROR_STATE
)
raise
#### The assigned worker is now eligible to pick up the task at any time,
#### therefore is now racing with resource_manager.
#### The task might be completed already by now!
...
# Create a new task status with the task id and tags.
task_status = TaskStatus(
task_id=async_result.id, task_type=self.name,
state=constants.CALL_WAITING_STATE, worker_name=routing_key, tags=tag_list,
group_id=group_id)
# To avoid the race condition where __call__ method below is called before
# this change is propagated to all db nodes, using an 'upsert' here and setting
# the task state to 'waiting' only on an insert.
task_status.save_with_set_on_insert(fields_to_set_on_insert=['state', 'start_time'])
#### BUG! If worker already completed its task before the above call to save_with_set_on_insert,
#### then finish_time and result have been overwritten with null.
It's interesting that there's already a comment about avoiding a race condition where call is invoked before the DB update is done, but this scenario - where call and on_success both complete before the DB update is done - seems to be missed.
This was found in Pulp 2.7, but by code inspection, 2.14-dev as at e2c7152a41dfae1 still seems to have the problem.
Fix task results lost due to race condition
A task might have completed or failed by the time apply_async attempts to upsert a TaskStatus record. We have to avoid touching the fields set on success/failure (such as finish_time, result) during update, otherwise we'll overwrite them with empty values.
closes #2959 https://pulp.plan.io/issues/2959