Story #7824
Updated by pulpbot over 1 year ago
**Ticket moved to GitHub**: "pulp/pulpcore/1946":https://github.com/pulp/pulpcore/issues/1946 ---- It is currently disallowed to bulk_create() multi-table inherited Django models, due to the technical challenges with doing so, however it is not necessarily impossible in theory. https://code.djangoproject.com/ticket/28821 As a Pulp developer, if we are able to make assumptions about the way it will be used, then we can avoid most of the problems that prevent a generic implementation. For instance, only one level of inheritance. I developed a proof of concept strategy that is unfortunately made more difficult by Django's proxy model behavior and the fact that there's no class which represents *just* the subclass table. So, this code emulates how multi-table inherited models set up their internal relationships, but doesn't actually use model inheritance. Strategy: * Transactions don't check the integrity of foreign keys until they are committed * So save our child table first with a random uuid as the content_ptr, in bulk * Then go back and save the parent model, in bulk ~~~ python class PulpBase(models.Model): pulp_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) pulp_created = models.DateTimeField(auto_now_add=True) pulp_type = models.TextField(null=False, default=None) class Meta: abstract = True class NewContent(PulpBase): pulp_type = models.TextField(null=False, default=None) class NewPulpManager(models.Manager): # Ignore the hacky workarounds for stuff like pulp_type, I was trying to keep things simple def bulk_get_or_create(self, objs, batch_size=None): with transaction.atomic(): pulp_type = self.model.get_pulp_type() q = models.Q(pk__in=[]) unsaved_idxs_by_nat_key = defaultdict(list) for idx, obj in enumerate(objs): content_already_saved = not obj._state.adding if not content_already_saved: unsaved_idxs_by_nat_key[obj.natural_key()].append(idx) q |= models.Q(**obj.natural_key_dict()) existing_objs = self.model.objects.filter(q) for existing_obj in existing_objs.iterator(chunk_size=batch_size or 2000): for idx in unsaved_idxs_by_nat_key[existing_objs.natural_key()]: objs[idx] = existing_obj new_base_content = [] for obj in objs: content_already_saved = not obj._state.adding if not content_already_saved: new_base_content.append(NewContent(pulp_type=pulp_type, pulp_id=obj.pk)) self.bulk_create(objs, batch_size=batch_size) NewContent.objects.bulk_create(new_base_content, batch_size=batch_size) return objs class ContentBase(models.Model): content_ptr = models.OneToOneField(NewContent, primary_key=True, default=uuid.uuid4, on_delete=models.CASCADE) objects = NewPulpManager() @classmethod def get_pulp_type(cls): return cls.TYPE @classmethod def natural_key_fields(cls): """ Returns a tuple of the natural key fields which usually equates to unique_together fields """ return tuple(chain.from_iterable(cls._meta.unique_together)) def natural_key(self): """ Get the model's natural key based on natural_key_fields. Returns: tuple: The natural key. """ return tuple(getattr(self, f) for f in self.natural_key_fields()) def natural_key_dict(self): """ Get the model's natural key as a dictionary of keys and values. """ to_return = {} for key in self.natural_key_fields(): to_return[key] = getattr(self, key) return to_return class Meta: abstract=True class NewFileContent(ContentBase): TYPE = "file.file" relative_path = models.TextField(null=False) digest = models.CharField(max_length=64, null=False) class Meta: unique_together = ("relative_path", "digest") ~~~ The speedup is about 3x vs what we currently do. Old ~~~ In [5]: %time PulpFileContent.objects.bulk_get_or_create(old_content) CPU times: user 555 ms, sys: 88.6 ms, total: 643 ms Wall time: 1.77 s ~~~ New ~~~ In [7]: %time NewFileContent.objects.bulk_get_or_create(new_content) CPU times: user 648 ms, sys: 1.39 ms, total: 649 ms Wall time: 694 ms ~~~