Project

Profile

Help

Issue #2959

closed

Task results can be lost due to race condition

Added by rmcgover over 6 years ago. Updated almost 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
2.14.1
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

There seems to be a bug in pulp.server.async.tasks which allows some fields on TaskStatus to be lost in certain rare cases.

We've been seeing repo associate tasks sometimes complete successfully but with certain fields unexpectedly null (e.g. finish_time, result). After running mongo with detailed logging enabled we captured following query logs (annotated):


# conn83: httpd inserted new TaskStatus object

2017-08-06T04:52:16.456+0000 I WRITE    [conn83] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { exception: null, task_type: "pulp.server.managers.repo.unit_association.associate_from_repo", tags: [ "pulp:repository:rhel-6-for-power-optional-rpms__6Server__ppc64", "pulp:repository:test_default_42833840", "pulp:action:associate" ], finish_time: null, _ns: "task_status", traceback: null, spawned_tasks: [], progress_report: {}, worker_name: null, result: null, error: null, id: null }, $setOnInsert: { state: "waiting", start_time: null } } nscanned:0 nscannedObjects:0 nMatched:1 nModified:1 upsert:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn80: resource manager assigns task to a worker

2017-08-06T04:52:16.537+0000 I WRITE    [conn80] update pulp_database.reserved_resources query: { _id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { _id: "dcc17e72-a512-4845-9c78-89075fe52db4", worker_name: "reserved_resource_worker-0@4ac31dbd3485", resource_id: "repository:rhel-6-for-power-optional-rpms__6Server__ppc64", _ns: "reserved_resources" } nscanned:0 nscannedObjects:0 nMatched:1 nModified:1 upsert:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn82: worker sets state to running

2017-08-06T04:52:16.568+0000 I WRITE    [conn82] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { start_time: "2017-08-06T04:52:16Z", state: "running" } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:1 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn82: worker sets state to finished, with a result and finish_time

2017-08-06T04:52:16.582+0000 I WRITE    [conn82] update pulp_database.task_status query: { _id: ObjectId('5986a080d4e44b661e89f944') } update: { $set: { finish_time: "2017-08-06T04:52:16Z", state: "finished", result: { units_successful: [ { unit_key: { id: "RHBA-2013:1738" }, type_id: "erratum" } ] } } } nscanned:1 nscannedObjects:1 nmoved:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

# conn80: resource manager sets result, finish_time, other fields to null/empty 

2017-08-06T04:52:16.584+0000 I WRITE    [conn80] update pulp_database.task_status query: { task_id: "dcc17e72-a512-4845-9c78-89075fe52db4" } update: { $set: { exception: null, task_type: "pulp.server.managers.repo.unit_association.associate_from_repo", tags: [ "pulp:repository:rhel-6-for-power-optional-rpms__6Server__ppc64", "pulp:repository:test_default_42833840", "pulp:action:associate" ], finish_time: null, _ns: "task_status", traceback: null, spawned_tasks: [], progress_report: {}, worker_name: "reserved_resource_worker-0@4ac31dbd3485", result: null, error: null, id: null }, $setOnInsert: { state: "waiting", start_time: null } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } 0ms

After the above sequence of queries, the task was left finished but with null finish_time and result fields.

It looks like this happens when a worker picks up and completes a task very quickly.

There seems to be a race here (annotated with ####):

class Task(PulpTask, ReservedTaskMixin):
    ...
    def apply_async(self, *args, **kwargs):
        ....

        #### This is running in resource_manager.
        #### This enqueues the task for a worker in qpid:
        try:
            async_result = super(Task, self).apply_async(*args, **kwargs)
        except:
            if 'task_id' in kwargs:
                TaskStatus.objects(task_id=kwargs['task_id']).update(
                    state=constants.CALL_ERROR_STATE
                )
            raise

        #### The assigned worker is now eligible to pick up the task at any time,
        #### therefore is now racing with resource_manager.
        #### The task might be completed already by now!

        ...

        # Create a new task status with the task id and tags.
        task_status = TaskStatus(
            task_id=async_result.id, task_type=self.name,
            state=constants.CALL_WAITING_STATE, worker_name=routing_key, tags=tag_list,
            group_id=group_id)
        # To avoid the race condition where __call__ method below is called before
        # this change is propagated to all db nodes, using an 'upsert' here and setting
        # the task state to 'waiting' only on an insert.
        task_status.save_with_set_on_insert(fields_to_set_on_insert=['state', 'start_time'])

        #### BUG!  If worker already completed its task before the above call to save_with_set_on_insert,
        #### then finish_time and result have been overwritten with null.

It's interesting that there's already a comment about avoiding a race condition where call is invoked before the DB update is done, but this scenario - where call and on_success both complete before the DB update is done - seems to be missed.

This was found in Pulp 2.7, but by code inspection, 2.14-dev as at e2c7152a41dfae1 still seems to have the problem.

Actions #1

Updated by ttereshc over 6 years ago

  • Triaged changed from No to Yes
Actions #3

Updated by bmbouter over 6 years ago

The reason this code exists is because in some cases (when apply_async) is called directly it is supposed to create the TaskStatus record. When apply_async is called indirectly as a result of an apply_async_with_reservation() call, it should not be adding data to the TaskStatus record at all.

To fix this, I recommend removing save_with_set_on_insert() altogether. With that removed, the save will be a normal save and not an upsert which causes the save() to overwrite data unexpectedly. In cases where apply_async() is supposed to create the record this will work correctly. In cases where apply_async() is not supposed to write that record, the save() will emit a not unique exception due to the task.id already existing and that field being required as unique. That exception can be caught and ignored. This is very similar to the pulp3 implementation here.

Added by rmcgover over 6 years ago

Revision 7a5a5295 | View on GitHub

Fix task results lost due to race condition

A task might have completed or failed by the time apply_async attempts to upsert a TaskStatus record. We have to avoid touching the fields set on success/failure (such as finish_time, result) during update, otherwise we'll overwrite them with empty values.

closes #2959 https://pulp.plan.io/issues/2959

Actions #5

Updated by ipanova@redhat.com over 6 years ago

  • Status changed from NEW to POST
  • Assignee set to rmcgover
Actions #6

Updated by rmcgover over 6 years ago

  • Status changed from POST to MODIFIED
Actions #7

Updated by pcreech over 6 years ago

  • Platform Release set to 2.14.1
Actions #8

Updated by pcreech over 6 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Actions #9

Updated by bmbouter almost 5 years ago

  • Tags Pulp 2 added

Also available in: Atom PDF