Story #898


As a user I can run multiple pulp_resource_managers concurrently with all of them actively participating

Added by bmbouter almost 9 years ago. Updated almost 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
2.10.0
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 5
Quarter:

Description

This story involves some database schema changes, a small amount of coding, and a good deal of testing. The FCFS ordering should be tested before any of the coding changes happen.

Current ReservedResource Design

The ReservedResource collection needs its database schema adjusted so that we can interact with it in a race-condition-free way. Currently the mongoengine model defines:

<task_id, worker_name, resource_id>

The only restriction is that task_id needs to be unique. Generally one record is created for each task id.

Schema Changes

Having multiple records per resource prevents us from adding the indexes that would make writing atomic code for this collection feasible. Consider this schema:

class ReservedResource(Document):
    # The primary key ensures each resource can have only one reservation
    # across all workers.
    resource_id = StringField(primary_key=True, unique=True)
    # Ensures that each worker can only receive one reservation at a time.
    worker_name = StringField(required=True)
    task_ids = ListField()  # notice, this is now a list

    # For backward compatibility
    _ns = StringField(default='reserved_resources')

    meta = {'collection': 'reserved_resources', 'allow_inheritance': False}

The existing indexes would all be removed.

Coding Deliverables

while True:
    # assume we need to update an existing reservation
    query for an existing ReservedResource by the resource_id
    if one is found:
        attempt to add the task_id needing reservation to it, but do not allow upsert
        assert that the number of documents updated is either 0 or 1;
            if not, raise a loud exception
        if 1 document was updated:
            query for the worker_name that goes with the reservation just updated
    if no reservation was found or no document was updated:
        # find an available worker, relying on the uniqueness constraint
        # to make the operation atomic
        query for the list of workers
        for each worker in the list:
            try:
                ReservedResource(worker_name, resource_id,
                                 task_ids=[task_id]).save()
                break
            except ConstraintViolation:
                continue
    if a ReservedResource was updated or created:
        break
    else:
        sleep(0.25)
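The loop above can be sketched in Python. This is a minimal in-memory simulation, not Pulp code: `FakeReservedResources` and `ConstraintViolation` are hypothetical stand-ins for the Mongo collection and its unique-index violation, and the per-worker availability check is elided.

```python
import time


class ConstraintViolation(Exception):
    """Stands in for the database's unique-index violation."""


class FakeReservedResources:
    """In-memory stand-in for the ReservedResource collection.

    The dict key is resource_id, mirroring the proposed primary key, so a
    second insert for the same resource raises ConstraintViolation.
    """

    def __init__(self):
        self._docs = {}  # resource_id -> {'worker_name': ..., 'task_ids': [...]}

    def insert(self, resource_id, worker_name, task_id):
        if resource_id in self._docs:
            raise ConstraintViolation(resource_id)
        self._docs[resource_id] = {'worker_name': worker_name,
                                   'task_ids': [task_id]}

    def append_task(self, resource_id, task_id):
        """Add task_id to an existing reservation (no upsert); returns 0 or 1."""
        doc = self._docs.get(resource_id)
        if doc is None:
            return 0
        doc['task_ids'].append(task_id)
        return 1

    def worker_for(self, resource_id):
        return self._docs[resource_id]['worker_name']


def reserve(collection, workers, resource_id, task_id, poll=0.25):
    """Loop until task_id is attached to a reservation; return the worker name."""
    while True:
        # First try to update an existing reservation for this resource.
        if collection.append_task(resource_id, task_id) == 1:
            return collection.worker_for(resource_id)
        # Otherwise try each worker; uniqueness makes the insert atomic.
        for worker_name in workers:
            try:
                collection.insert(resource_id, worker_name, task_id)
                return worker_name
            except ConstraintViolation:
                continue
        time.sleep(poll)
```

With this sketch, a second task for the same resource lands on the same worker via the update path, never via a second insert, which is the property the unique primary key is meant to guarantee.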

Testing

This needs to be tested to ensure two main things:

1. That a highly concurrent environment calling reserve resource and release resource maintains correctness. This should be done with a simple concurrency driver left running in many processes in a test environment for a few days.

2. That the FCFS ordering is strictly preserved even when failures occur.
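A sketch of the kind of concurrency driver point 1 describes. This is hypothetical: threads stand in for the processes, and a lock-protected dict stands in for MongoDB; the invariant checked is that every reservation is eventually released and no resource is left held.

```python
import random
import threading


class Reservations:
    """Lock-protected map of resource_id -> worker_name.

    The invariant under test: a resource is never reserved by two workers
    at once, and everything reserved is eventually released.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._held = {}

    def reserve(self, resource_id, worker_name):
        with self._lock:
            holder = self._held.get(resource_id)
            if holder is not None and holder != worker_name:
                return False  # another worker holds it
            self._held[resource_id] = worker_name
            return True

    def release(self, resource_id, worker_name):
        with self._lock:
            if self._held.get(resource_id) == worker_name:
                del self._held[resource_id]


def hammer(res, worker_name, iterations=1000):
    """Randomly reserve and release a small pool of contended resources."""
    for _ in range(iterations):
        rid = 'resource-%d' % random.randrange(4)
        if res.reserve(rid, worker_name):
            res.release(rid, worker_name)


def run_driver(n_workers=8):
    """Run several hammering workers; return the leftover reservations."""
    res = Reservations()
    threads = [threading.Thread(target=hammer, args=(res, 'worker-%d' % i))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return res._held  # should be empty if correctness held
```

A real driver would run against the actual reserve/release task APIs across multiple processes and hosts, but the shape is the same: many concurrent actors, a small contended resource pool, and an invariant checked at the end.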


Related issues

Blocks Pulp - Issue #1114: If an instance of pulp_resource_manager dies unexpectedly, Pulp incorrectly tries to "cancel all tasks in its queue" (CLOSED - WONTFIX)
Actions #1

Updated by bmbouter almost 9 years ago

  • Subject changed from [WIP] As a user I can run multiple pulp_resource_managers concurrently with all of them actively participating to As a user I can run multiple pulp_resource_managers concurrently with all of them actively participating
  • Description updated (diff)
Actions #2

Updated by bmbouter almost 9 years ago

  • Tags Sprint Candidate added
Actions #3

Updated by bmbouter almost 9 years ago

  • Sprint Candidate set to Yes
  • Tags deleted (Sprint Candidate)
Actions #4

Updated by bmbouter almost 9 years ago

  • Groomed set to No
Actions #5

Updated by mhrivnak over 8 years ago

  • Groomed changed from No to Yes
Actions #6

Updated by sbhawsin over 8 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to sbhawsin
  • Platform Release set to master
Actions #7

Updated by bmbouter over 8 years ago

  • Assignee changed from sbhawsin to bmbouter
Actions #8

Updated by bmbouter over 8 years ago

  • Blocks Issue #1114: If an instance of pulp_resource_manager dies unexpectedly, Pulp incorrectly tries to "cancel all tasks in its queue" added
Actions #9

Updated by mhrivnak about 8 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (bmbouter)
  • Platform Release deleted (master)
  • Tags Easy Fix added
Actions #10

Updated by bmbouter about 8 years ago

  • Description updated (diff)
Actions #11

Updated by sbhawsin about 8 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to sbhawsin
  • Tags deleted (Easy Fix)
Actions #12

Updated by bmbouter almost 8 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (sbhawsin)
Actions #13

Updated by mhrivnak almost 8 years ago

  • Sprint Candidate changed from Yes to No
Actions #14

Updated by bmbouter almost 8 years ago

  • Sprint Candidate changed from No to Yes
Actions #15

Updated by mhrivnak almost 8 years ago

  • Sprint/Milestone set to 21
Actions #16

Updated by mhrivnak almost 8 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
Actions #17

Updated by dalley over 7 years ago

Implementation Plan
===================

The resource manager entry point - server.async.app - will be modified such that it checks whether an instance of the pulp_resource_manager already exists. If one does not exist, it will start normally. If one does exist, it will enter a loop in which it periodically checks whether the pre-existing resource manager has died, using the same CELERY_CHECK_INTERVAL constant as the Celery process monitor. Should that happen, it will exit the loop and take over for the dead resource manager.
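The takeover loop might look like the following. This is an illustrative simulation only: `HeartbeatRegistry` and `wait_for_takeover` are hypothetical names, a short interval stands in for CELERY_CHECK_INTERVAL, and liveness is judged by heartbeat age as the plan describes.

```python
import time

CHECK_INTERVAL = 0.1  # stand-in for the real CELERY_CHECK_INTERVAL


class HeartbeatRegistry:
    """In-memory stand-in for the workers collection: name -> last heartbeat."""

    def __init__(self):
        self._beats = {}

    def beat(self, name):
        self._beats[name] = time.monotonic()

    def is_alive(self, name, timeout):
        last = self._beats.get(name)
        return last is not None and (time.monotonic() - last) < timeout


def wait_for_takeover(registry, name='resource_manager',
                      timeout=3 * CHECK_INTERVAL, max_wait=5.0):
    """Block until the pre-existing resource manager stops heartbeating,
    then return so this instance can take over. max_wait bounds the
    simulation so a test cannot hang forever."""
    deadline = time.monotonic() + max_wait
    while registry.is_alive(name, timeout):
        if time.monotonic() > deadline:
            raise TimeoutError('existing resource manager never died')
        time.sleep(CHECK_INTERVAL)
```

If no heartbeat arrives within the timeout, the waiting instance falls through the loop and starts up in place of the dead one.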

Actions #18

Updated by dkliban@redhat.com over 7 years ago

The workers collection needs to have a uniqueness constraint added so that 'pulp_resource_manager' can only have one entry in it. This may require adding a new field to the document. The entry point for 'pulp_resource_manager' will then attempt to write to the 'workers' collection. When the write succeeds, the celery process will continue loading. However, if an exception about the uniqueness constraint is raised, the pulp_resource_manager will sleep and try again in 30 seconds. This retry logic needs to be added to the correct celery signal handler. Celery emits multiple signals throughout its life cycle, and it is necessary to have this handled as early as possible.

Actions #19

Updated by dalley over 7 years ago

  • Assignee changed from dkliban@redhat.com to dalley
Actions #20

Updated by bmbouter over 7 years ago

Having 1 entry for each resource_manager would be good because it's those entries which drive the /status/ API.

Would the code added to the entry point (server.async.app) run at import time? It would be good if no code ran at import time, although this design may require it. We should think about where in server.async.app the halting if-statement could go to avoid having it run at import time.

Actions #21

Updated by dkliban@redhat.com over 7 years ago

A new model called ResourceManagerLock needs to be added. This model should be very similar to the CelerybeatLock model we already have.

As long as celery doesn't mind that a signal handler is taking a long time to return we should be able to add the following logic to the celeryd_after_setup signal handler:

While the ResourceManagerLock has not been acquired, try to acquire it. If successful, return, and the resource_manager runs as intended. If the lock cannot be acquired, update the Worker collection in Mongo for this resource_manager, sleep for 60 seconds, and then try to acquire the lock again. Continue indefinitely until the lock is acquired and the signal handler finally returns.
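A sketch of that acquire-or-wait loop. This is a simulation under stated assumptions: `FakeResourceManagerLock` and `LockTaken` are in-memory stand-ins for the proposed ResourceManagerLock model and mongoengine's uniqueness error, the retry interval is shortened from 60 seconds, and the Worker-collection update is reduced to a comment.

```python
import threading
import time


class LockTaken(Exception):
    """Stands in for the uniqueness error raised when the lock document exists."""


class FakeResourceManagerLock:
    """In-memory stand-in for a ResourceManagerLock collection that can hold
    at most one document, like a unique-indexed Mongo collection."""

    def __init__(self):
        self._guard = threading.Lock()
        self.holder = None

    def acquire(self, name):
        with self._guard:
            if self.holder is not None:
                raise LockTaken(self.holder)
            self.holder = name

    def release(self, name):
        with self._guard:
            if self.holder == name:
                self.holder = None


def acquire_resource_manager_lock(lock, name, retry_interval=0.05):
    """Loop until the lock is acquired, sleeping between attempts, as the
    comment above describes with a 60-second interval."""
    while True:
        try:
            lock.acquire(name)
            return  # lock held: this resource manager runs as intended
        except LockTaken:
            # would also refresh this manager's entry in the Worker collection
            time.sleep(retry_interval)
```

In the real design this loop would live in the celeryd_after_setup signal handler, so a hot-standby resource manager blocks there until the active one releases (or loses) the lock.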

Actions #22

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone changed from 21 to 22
Actions #23

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone changed from 22 to 23
Actions #24

Updated by dalley over 7 years ago

  • Status changed from ASSIGNED to POST
Actions #25

Updated by dalley over 7 years ago

Actions #26

Updated by dalley over 7 years ago

Added by dalley over 7 years ago

Revision 0128ae2c | View on GitHub

Added mechanism to prevent two resource managers from being active at once - duplicate resource managers will wait until the original dies off. closes #898

Actions #27

Updated by dalley over 7 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #28

Updated by mhrivnak over 7 years ago

  • Platform Release set to 2.10.0
Actions #29

Updated by dalley over 7 years ago

Actions #30

Updated by dalley over 7 years ago

Actions #31

Updated by semyers over 7 years ago

  • Status changed from MODIFIED to 5
Actions #32

Updated by semyers over 7 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #33

Updated by bmbouter almost 6 years ago

  • Sprint set to Sprint 5
Actions #34

Updated by bmbouter almost 6 years ago

  • Sprint/Milestone deleted (23)
Actions #35

Updated by bmbouter almost 5 years ago

  • Tags Pulp 2 added
