Story #7824

Updated by pulpbot over 1 year ago


 **Ticket moved to GitHub**: pulp/pulpcore/1946


 Django currently disallows `bulk_create()` on multi-table inherited models because of the technical challenges involved. However, it is not necessarily impossible in theory.

 As Pulp developers, if we can make assumptions about how it will be used, we can avoid most of the problems that prevent a generic implementation. For instance, we can assume only one level of inheritance.

 I developed a proof-of-concept strategy, which is unfortunately made more difficult by Django's proxy model behavior and the fact that there is no class that represents *just* the subclass table. So this code emulates how multi-table inherited models set up their internal relationships, but doesn't actually use model inheritance.


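 * Foreign key integrity is not checked until the transaction commits (Django creates its foreign key constraints as deferred on PostgreSQL)
 * So, in bulk, save the child table rows first, each with a random UUID as its `content_ptr`
 * Then go back and save the parent model rows with matching primary keys, also in bulk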
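
The ordering trick in the list above can be tried outside Django. The following is only a minimal sketch using the standard-library `sqlite3` module as a stand-in for PostgreSQL; the `parent` and `child` tables are hypothetical and merely mirror the parent/child relationship described here. It assumes the foreign key is declared `DEFERRABLE INITIALLY DEFERRED`, so the check runs at commit time.

```python
import sqlite3
import uuid

# Autocommit mode, so the transactions below are managed explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (pulp_id TEXT PRIMARY KEY, pulp_type TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE child ("
    " content_ptr_id TEXT PRIMARY KEY"
    " REFERENCES parent (pulp_id) DEFERRABLE INITIALLY DEFERRED,"
    " relative_path TEXT NOT NULL)"
)

# Generate the keys up front, like a uuid4 default on content_ptr.
rows = [(str(uuid.uuid4()), f"file-{i}.txt") for i in range(3)]

# Child rows first, parent rows second: allowed because the check is deferred.
conn.execute("BEGIN")
conn.executemany("INSERT INTO child VALUES (?, ?)", rows)
conn.executemany("INSERT INTO parent VALUES (?, ?)", [(pk, "file.file") for pk, _ in rows])
conn.execute("COMMIT")

# A child row whose parent never arrives is rejected when the commit is attempted.
conn.execute("BEGIN")
conn.execute("INSERT INTO child VALUES (?, ?)", (str(uuid.uuid4()), "orphan.txt"))
try:
    conn.execute("COMMIT")
    committed = True
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    committed = False
```

The second transaction shows why both bulk inserts need to share one transaction: the orphaned child row only fails once the commit is attempted.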

 ~~~ python 

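 import uuid
 from collections import defaultdict
 from itertools import chain

 from django.db import models, transaction


 class PulpBase(models.Model):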
     pulp_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) 
     pulp_created = models.DateTimeField(auto_now_add=True) 
     pulp_type = models.TextField(null=False, default=None) 

     class Meta: 
         abstract = True 

 class NewContent(PulpBase): 
     pulp_type = models.TextField(null=False, default=None) 

 class NewPulpManager(models.Manager): 
     # Ignore the hacky workarounds for stuff like pulp_type, I was trying to keep things simple 
     def bulk_get_or_create(self, objs, batch_size=None): 
         with transaction.atomic(): 
             pulp_type = self.model.get_pulp_type() 

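             q = models.Q(pk__in=[])
             unsaved_idxs_by_nat_key = defaultdict(list)
             for idx, obj in enumerate(objs):
                 content_already_saved = not obj._state.adding
                 if not content_already_saved:
                     # Remember every position that holds this natural key
                     unsaved_idxs_by_nat_key[obj.natural_key()].append(idx)
                     q |= models.Q(**obj.natural_key_dict())

             # Swap in the rows that already exist in the database
             existing_objs = self.model.objects.filter(q)
             for existing_obj in existing_objs.iterator(chunk_size=batch_size or 2000):
                 for idx in unsaved_idxs_by_nat_key[existing_obj.natural_key()]:
                     objs[idx] = existing_obj

             new_objs = []
             new_base_content = []
             for obj in objs:
                 content_already_saved = not obj._state.adding
                 if not content_already_saved:
                     # Assumption: the parent row reuses the child's content_ptr value as its pk
                     new_objs.append(obj)
                     new_base_content.append(
                         NewContent(pulp_id=obj.content_ptr_id, pulp_type=pulp_type)
                     )

             # Child rows first, then the parent rows they point to
             self.bulk_create(new_objs, batch_size=batch_size)
             NewContent.objects.bulk_create(new_base_content, batch_size=batch_size)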

         return objs 

 class ContentBase(models.Model): 
     content_ptr = models.OneToOneField(NewContent, primary_key=True, default=uuid.uuid4, on_delete=models.CASCADE) 

     objects = NewPulpManager() 

     @classmethod
     def get_pulp_type(cls):
         return cls.TYPE

     @classmethod
     def natural_key_fields(cls):
         """
         Returns a tuple of the natural key fields which usually equates to unique_together fields
         """
         return tuple(chain.from_iterable(cls._meta.unique_together))

     def natural_key(self):
         """
         Get the model's natural key based on natural_key_fields.

         Returns:
             tuple: The natural key.
         """
         return tuple(getattr(self, f) for f in self.natural_key_fields())

     def natural_key_dict(self):
         """
         Get the model's natural key as a dictionary of keys and values.
         """
         to_return = {}
         for key in self.natural_key_fields():
             to_return[key] = getattr(self, key)
         return to_return

     class Meta:
         abstract = True
 class NewFileContent(ContentBase): 
     TYPE = "file.file" 

     relative_path = models.TextField(null=False) 
     digest = models.CharField(max_length=64, null=False) 

     class Meta: 
         unique_together = ("relative_path", "digest")
 ~~~
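
The matching step in `bulk_get_or_create` can also be followed without a database. This is only an illustrative sketch, not the Pulp implementation: `FileRecord`, its `saved` flag (standing in for `not obj._state.adding`), and the `stored` dict (standing in for the table) are all hypothetical names introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FileRecord:
    # Hypothetical stand-in for a content model.
    relative_path: str
    digest: str
    saved: bool = False  # mimics `not obj._state.adding`

    def natural_key(self):
        return (self.relative_path, self.digest)


def bulk_get_or_create(objs, stored):
    """Swap unsaved records for stored ones with the same natural key, then store the rest.

    `stored` is a dict keyed by natural key that plays the role of the database table.
    """
    # Record every position at which each unsaved natural key occurs.
    unsaved_idxs_by_nat_key = defaultdict(list)
    for idx, obj in enumerate(objs):
        if not obj.saved:
            unsaved_idxs_by_nat_key[obj.natural_key()].append(idx)

    # One pass over the matches (the role played by the single OR'ed query).
    for key, idxs in unsaved_idxs_by_nat_key.items():
        if key in stored:
            for idx in idxs:
                objs[idx] = stored[key]

    # Whatever is still unsaved is new; "insert" it in one batch.
    new_objs = [obj for obj in objs if not obj.saved]
    for obj in new_objs:
        obj.saved = True
        stored[obj.natural_key()] = obj
    return objs


existing = FileRecord("a.txt", "abc", saved=True)
stored = {existing.natural_key(): existing}
batch = [FileRecord("a.txt", "abc"), FileRecord("b.txt", "def")]
result = bulk_get_or_create(batch, stored)
```

After the call, `result[0]` is the already-stored record and `result[1]` is the newly stored one. Like the proof of concept, the sketch does not deduplicate unsaved records that share a natural key within one batch.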


 In the timings below, the speedup is roughly 2.5x in wall time (1.77 s vs. 694 ms) compared with what we currently do.


 In [5]: %time PulpFileContent.objects.bulk_get_or_create(old_content) 
 CPU times: user 555 ms, sys: 88.6 ms, total: 643 ms 
 Wall time: 1.77 s 



 In [7]: %time NewFileContent.objects.bulk_get_or_create(new_content) 
 CPU times: user 648 ms, sys: 1.39 ms, total: 649 ms 
 Wall time: 694 ms