Story #898

Updated by bmbouter over 9 years ago

This story has some database schema changes, a small coding part, and a good deal of testing. The FCFS (first come, first served) ordering should be tested first, before any of the coding changes happen. 

 h2. Current ReservedResource Design 

 The ReservedResource collection needs to have its database schema adjusted so that we can interact with it in a race-condition-free way. Currently the mongoengine model defines: <pre><task_id, worker_name, resource_id></pre> 

 The only restriction is that task_id needs to be unique. Generally one record is created for each task id. 


 h2. Schema Changes 

 Having multiple records entered prevents us from adding the necessary indexes which would make writing atomic code for this collection more feasible. Consider this schema: 

 <pre> 
 from mongoengine import Document, ListField, StringField 

 class ReservedResource(Document): 
     resource_id = StringField(primary_key=True, unique=True) # ensures correctness that each resource_type can only have one reservation across all workers 
     worker_name = StringField(required=True) # ensures that each worker can only receive one reservation at a time 
     task_ids = ListField() # notice, this is now a list 

     # For backward compatibility 
     _ns = StringField(default='reserved_resources') 

     meta = {'collection': 'reserved_resources', 'allow_inheritance': False} 
 </pre> 

 The existing indexes would all be removed. 
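
 To illustrate why the key choice matters, here is a minimal sketch that uses plain dictionaries as a stand-in for the collection. The names @insert_unique@ and @DuplicateKey@ are hypothetical and only model what a unique index does on insert; nothing here talks to MongoDB.

```python
class DuplicateKey(Exception):
    """Stand-in for the error a unique index raises on a conflicting insert."""


def insert_unique(collection, key, record):
    """Insert record under key, refusing to overwrite an existing key."""
    if key in collection:
        raise DuplicateKey(key)
    collection[key] = record


# Current design: records are keyed by task_id only, so two tasks can reserve
# the same resource on different workers and no constraint rejects it.
by_task = {}
insert_unique(by_task, "task-1", {"worker_name": "worker-1", "resource_id": "repo-a"})
insert_unique(by_task, "task-2", {"worker_name": "worker-2", "resource_id": "repo-a"})
workers_for_repo_a = {
    record["worker_name"]
    for record in by_task.values()
    if record["resource_id"] == "repo-a"
}

# Proposed design: records are keyed by resource_id, so a second reservation
# for the same resource is rejected by the key itself.
by_resource = {}
insert_unique(by_resource, "repo-a", {"worker_name": "worker-1", "task_ids": ["task-1"]})
try:
    insert_unique(by_resource, "repo-a", {"worker_name": "worker-2", "task_ids": ["task-2"]})
    conflict_detected = False
except DuplicateKey:
    conflict_detected = True
```

 With the task-keyed layout the resource ends up spread over two workers; with the resource-keyed layout the conflicting insert fails and the caller can react to it.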


 h2. Coding Deliverables 

 * Rewrite the "reservation creation code":https://github.com/pulp/pulp/blob/fdb470b2fd7b6b14e594418a343a3b783580c598/server/pulp/server/async/tasks.py#L58-L76 so that it has roughly the following logic: 

 <pre> 
 worker_name = None 
 while True: 
     # assume we need to update an existing reservation 
     find an existing worker by querying for an existing ReservedResource by the resource_id 
     if a worker is found: 
         attempt to add the task_id that needs reservation to the existing record, but do not allow upsert 
         check the number of documents updated. It needs to be either 0 or 1. If not, raise a loud exception 
         if 1: 
             query for the worker name that goes with the reservation just updated 
     if no worker was found or no record was updated: 
         # find an available worker, relying on uniqueness constraints to ensure atomic operation 
         query for the list of workers 
         for each worker in the list: 
             try: 
                 create a new ReservedResource(worker_name, resource_id, task_ids=[task_id]).save() 
             except ConstraintViolation: 
                 continue 
             else: 
                 query for the worker name that goes with the reservation 
                 break 
     if worker_name is not None: 
         break 
     else: 
         sleep(0.25) 
 </pre> 

 * Verify that the updates produced by the mongoengine code translate to atomic pymongo operations only 

 * Delete the "NoWorkers exception":https://github.com/pulp/pulp/blob/fdb470b2fd7b6b14e594418a343a3b783580c598/server/pulp/server/exceptions.py#L222 from the codebase. There is no way it can be raised to the user since 2.5. 

 * Add a release note about this great improvement 

 * Update the docs throughout to say that it is safe to run multiple pulp_resource_manager processes 

 * Update the release resource code so that it uses an atomic operation 
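
 The reservation loop and the atomic release described above can be sketched roughly as follows. This is not the Pulp implementation: @ReservationStore@ is a hypothetical in-memory stand-in for the reserved_resources collection, a lock stands in for the atomicity of a single database operation, and the sketch assumes that both @resource_id@ and @worker_name@ are unique. @reserve@, @release@, @push_task@, @insert@, @worker_for@ and @pull_task@ are illustrative names only.

```python
import threading
import time


class ConstraintViolation(Exception):
    """Raised when an insert would break a uniqueness constraint."""


class ReservationStore:
    """Hypothetical in-memory stand-in for the reserved_resources collection.

    One lock makes each method atomic, standing in for a single atomic
    database operation. resource_id and worker_name are both treated as unique.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # resource_id -> {"worker_name": str, "task_ids": list}

    def push_task(self, resource_id, task_id):
        """Append task_id to an existing record without upsert; return 0 or 1."""
        with self._lock:
            record = self._records.get(resource_id)
            if record is None:
                return 0
            record["task_ids"].append(task_id)
            return 1

    def insert(self, worker_name, resource_id, task_id):
        """Create a record, or raise if the resource or the worker is taken."""
        with self._lock:
            taken = {r["worker_name"] for r in self._records.values()}
            if resource_id in self._records or worker_name in taken:
                raise ConstraintViolation((worker_name, resource_id))
            self._records[resource_id] = {
                "worker_name": worker_name,
                "task_ids": [task_id],
            }

    def worker_for(self, resource_id):
        """Return the worker holding resource_id, or None."""
        with self._lock:
            record = self._records.get(resource_id)
            return record["worker_name"] if record else None

    def pull_task(self, resource_id, task_id):
        """Remove task_id and drop the record once its list is empty; return 0 or 1."""
        with self._lock:
            record = self._records.get(resource_id)
            if record is None or task_id not in record["task_ids"]:
                return 0
            record["task_ids"].remove(task_id)
            if not record["task_ids"]:
                del self._records[resource_id]
            return 1


def reserve(store, workers, resource_id, task_id, retry_delay=0.25):
    """Return the worker that task_id was queued on, retrying until one is found."""
    while True:
        worker_name = None
        # Assume first that a reservation already exists and try to join it.
        updated = store.push_task(resource_id, task_id)
        if updated not in (0, 1):
            raise RuntimeError("expected 0 or 1 updated records, got %r" % updated)
        if updated == 1:
            worker_name = store.worker_for(resource_id)
        else:
            # No reservation yet: let the uniqueness constraints pick a free worker.
            for candidate in workers:
                try:
                    store.insert(candidate, resource_id, task_id)
                except ConstraintViolation:
                    continue
                worker_name = candidate
                break
        if worker_name is not None:
            return worker_name
        time.sleep(retry_delay)


def release(store, resource_id, task_id):
    """Release one task's hold on resource_id in a single atomic step."""
    return store.pull_task(resource_id, task_id)
```

 For example, @reserve(store, ["worker-1", "worker-2"], "repo-a", "task-1")@ followed by a second call for the same resource should return the same worker both times, and the record should disappear only after both tasks are released.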


 h2. Testing 

 This needs to be tested to ensure two main things: 

 1. That a highly concurrent environment calling reserve resource and release resource maintains correctness. This should be done with a simple concurrency driver which would be started in a bunch of processes on a test machine and left to run for a few days. 

 2. That the FCFS ordering is strictly preserved even when failures occur. 
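
 A concurrency driver for the first point could look something like the sketch below. It is only an illustration: it uses threads rather than processes, and @Store@ is a hypothetical in-memory stand-in whose @reserve@ and @release@ would be replaced by the real calls in an actual test. The invariant it checks is that all tasks holding the same resource at the same time were given the same worker.

```python
import random
import threading


class Store:
    """Hypothetical in-memory stand-in for the reserved_resources collection."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # resource_id -> (worker_name, [task_ids])

    def reserve(self, workers, resource_id, task_id):
        """Return the worker holding resource_id, or None if no worker is free."""
        with self._lock:
            if resource_id in self._records:
                worker, task_ids = self._records[resource_id]
                task_ids.append(task_id)
                return worker
            busy = {worker for worker, _ in self._records.values()}
            for worker in workers:
                if worker not in busy:
                    self._records[resource_id] = (worker, [task_id])
                    return worker
            return None

    def release(self, resource_id, task_id):
        """Drop task_id and remove the record once no task holds the resource."""
        with self._lock:
            _, task_ids = self._records[resource_id]
            task_ids.remove(task_id)
            if not task_ids:
                del self._records[resource_id]


def run_driver(thread_count=8, iterations=500):
    """Hammer reserve/release from several threads; return (violations, store)."""
    store = Store()
    workers = ["worker-1", "worker-2", "worker-3"]
    resources = ["repo-a", "repo-b", "repo-c", "repo-d"]
    active = {}  # resource_id -> {task_id: worker_name}, as seen by the callers
    active_lock = threading.Lock()
    violations = []

    def drive(seed):
        rng = random.Random(seed)
        for i in range(iterations):
            resource_id = rng.choice(resources)
            task_id = "task-%d-%d" % (seed, i)
            worker = store.reserve(workers, resource_id, task_id)
            if worker is None:
                continue  # every worker is busy; a real caller would sleep and retry
            with active_lock:
                holders = active.setdefault(resource_id, {})
                # Every task currently holding this resource must share one worker.
                if any(other != worker for other in holders.values()):
                    violations.append((resource_id, task_id, worker, dict(holders)))
                holders[task_id] = worker
            with active_lock:
                del active[resource_id][task_id]
            store.release(resource_id, task_id)

    threads = [threading.Thread(target=drive, args=(n,)) for n in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return violations, store
```

 A run is considered clean when the returned list of violations is empty and no reservation is left behind once every thread has finished.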
