Issue #2959 (closed)

Task results can be lost due to race condition

Added by rmcgover over 6 years ago. Updated almost 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release: 2.14.1
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

There seems to be a bug in pulp.server.async.tasks which allows some fields on TaskStatus to be lost in certain rare cases.

We've been seeing repo associate tasks sometimes complete successfully but with certain fields unexpectedly null (e.g. finish_time, result). After running mongo with detailed logging enabled, we captured the following query logs (annotated):


# conn83: httpd inserts a new TaskStatus object

2017-08-06T04:52:16.456+0000 I WRITE    [conn83] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { exception: null, task_type: "pulp.server.managers.repo.unit_association.associate_from_repo", tags: [ "pulp:repository:rhel-6-for-power-optional-rpms__6Server__ppc64", "pulp:repository:test_default_42833840", "pulp:action:associate" ], finish_time: null, _ns: "task_status", traceback: null, spawned_tasks: [], progress_report: {}, worker_name: null, result: null, error: null, id: null }, $setOnInsert: { state: "waiting", start_time: null } } nscanned:0 nscannedObjects:0 nMatched:1 nModified:1 upsert:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn80: resource manager assigns task to a worker

2017-08-06T04:52:16.537+0000 I WRITE    [conn80] update pulp_database.reserved_resources query: { _id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { _id: "dcc17e72-a512-4845-9c78-89075fe52db4", worker_name: "reserved_resource_worker-0@4ac31dbd3485", resource_id: "repository:rhel-6-for-power-optional-rpms__6Server__ppc64", _ns: "reserved_resources" } nscanned:0 nscannedObjects:0 nMatched:1 nModified:1 upsert:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn82: worker sets state to running

2017-08-06T04:52:16.568+0000 I WRITE    [conn82] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { start_time: "2017-08-06T04:52:16Z", state: "running" } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:1 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn82: worker sets state to finished, with a result and finish_time

2017-08-06T04:52:16.582+0000 I WRITE    [conn82] update pulp_database.task_status query: { _id: ObjectId('5986a080d4e44b661e89f944') } update: { $set: { finish_time: "2017-08-06T04:52:16Z", state: "finished", result: { units_successful: [ { unit_key: { id: "RHBA-2013:1738" }, type_id: "erratum" } ] } } } nscanned:1 nscannedObjects:1 nmoved:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn80: resource manager sets result, finish_time, and other fields to null/empty

2017-08-06T04:52:16.584+0000 I WRITE    [conn80] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { exception: null, task_type: "pulp.server.managers.repo.unit_association.associate_from_repo", tags: [ "pulp:repository:rhel-6-for-power-optional-rpms__6Server__ppc64", "pulp:repository:test_default_42833840", "pulp:action:associate" ], finish_time: null, _ns: "task_status", traceback: null, spawned_tasks: [], progress_report: {}, worker_name: "reserved_resource_worker-0@4ac31dbd3485", result: null, error: null, id: null }, $setOnInsert: { state: "waiting", start_time: null } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

After the above sequence of queries, the task was left finished but with null finish_time and result fields.

It looks like this happens when a worker picks up and completes a task very quickly.
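
The clobbering itself can be reproduced outside of Pulp with a plain pymongo upsert. The sketch below uses a hypothetical demo collection (task_status_demo), task_id ("abc") and worker name; it only illustrates that on an existing document the fields under $set are overwritten while only the fields under $setOnInsert are protected:

from pymongo import MongoClient

# Hypothetical demo collection, not Pulp's real task_status collection.
coll = MongoClient()["pulp_database"]["task_status_demo"]

# The worker finishes the task first.
coll.update_one(
    {"task_id": "abc"},
    {"$set": {"state": "finished",
              "finish_time": "2017-08-06T04:52:16Z",
              "result": {"units_successful": []}}},
    upsert=True,
)

# The resource manager's save_with_set_on_insert-style upsert arrives late.
# Only state/start_time are under $setOnInsert, so they are left alone on the
# existing document, but finish_time and result are under $set and get reset.
coll.update_one(
    {"task_id": "abc"},
    {"$set": {"worker_name": "worker-0", "finish_time": None, "result": None},
     "$setOnInsert": {"state": "waiting", "start_time": None}},
    upsert=True,
)

print(coll.find_one({"task_id": "abc"}))
# state is still "finished", but finish_time and result are now None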

There seems to be a race here (annotated with ####):

class Task(PulpTask, ReservedTaskMixin):
    ...
    def apply_async(self, *args, **kwargs):
        ....

        #### This is running in resource_manager.
        #### This enqueues the task for a worker in qpid:
        try:
            async_result = super(Task, self).apply_async(*args, **kwargs)
        except:
            if 'task_id' in kwargs:
                TaskStatus.objects(task_id=kwargs['task_id']).update(
                    state=constants.CALL_ERROR_STATE
                )
            raise

        #### The assigned worker is now eligible to pick up the task at any time,
        #### therefore is now racing with resource_manager.
        #### The task might be completed already by now!

        ...

        # Create a new task status with the task id and tags.
        task_status = TaskStatus(
            task_id=async_result.id, task_type=self.name,
            state=constants.CALL_WAITING_STATE, worker_name=routing_key, tags=tag_list,
            group_id=group_id)
        # To avoid the race condition where __call__ method below is called before
        # this change is propagated to all db nodes, using an 'upsert' here and setting
        # the task state to 'waiting' only on an insert.
        task_status.save_with_set_on_insert(fields_to_set_on_insert=['state', 'start_time'])

        #### BUG!  If worker already completed its task before the above call to save_with_set_on_insert,
        #### then finish_time and result have been overwritten with null.

It's interesting that there's already a comment about avoiding a race condition where __call__ is invoked before the DB update is done, but this scenario, where __call__ and on_success both complete before the DB update is done, seems to have been missed.
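
One possible direction (just a sketch, not a tested patch) would be for the resource manager's write to keep every field the worker may already have written under $setOnInsert, so the late upsert can only create the document or fill in metadata, never overwrite completed values. In the same plain pymongo terms as the demo above:

# Sketch of a non-destructive resource-manager write: static metadata stays
# under $set, while everything the worker touches is only set when the document
# is first inserted. Collection/values are the hypothetical ones from the demo
# above; field names follow the TaskStatus document shown in the logs.
coll.update_one(
    {"task_id": "abc"},
    {"$set": {"task_type": "pulp.server.managers.repo.unit_association.associate_from_repo",
              "tags": ["pulp:action:associate"],
              "worker_name": "worker-0"},
     "$setOnInsert": {"state": "waiting", "start_time": None,
                      "finish_time": None, "result": None,
                      "error": None, "traceback": None, "exception": None,
                      "spawned_tasks": [], "progress_report": {}}},
    upsert=True,
)

If the worker has already finished, this write matches the existing document and leaves finish_time/result untouched; if it runs first, it creates the document in the waiting state as the current code intends.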

This was found in Pulp 2.7, but by code inspection, 2.14-dev as of e2c7152a41dfae1 still seems to have the problem.
