Project

Profile

Help

Story #7824

As a developer, I can bulk_create subclasses of Content

Added by dalley 12 months ago. Updated 9 months ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Wishlist
Sprint:
Quarter:

Description

It is currently disallowed to bulk_create() multi-table inherited Django models, due to the technical challenges with doing so, however it is not necessarily impossible in theory.

https://code.djangoproject.com/ticket/28821

As a Pulp developer, if we are able to make assumptions about the way it will be used, then we can avoid most of the problems that prevent a generic implementation. For instance, only one level of inheritance.

I developed a proof of concept strategy that is unfortunately made more difficult by Django's proxy model behavior and the fact that there's no class which represents just the subclass table. So, this code emulates how multi-table inherited models set up their internal relationships, but doesn't actually use model inheritance.

Strategy:

  • Transactions don't check the integrity of foreign keys until they are committed
  • So save our child table first with a random uuid as the content_ptr, in bulk
  • Then go back and save the parent model, in bulk

class PulpBase(models.Model):
    pulp_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    pulp_created = models.DateTimeField(auto_now_add=True)
    pulp_type = models.TextField(null=False, default=None)

    class Meta:
        abstract = True

class NewContent(PulpBase):
    pulp_type = models.TextField(null=False, default=None)

class NewPulpManager(models.Manager):
    # Ignore the hacky workarounds for stuff like pulp_type, I was trying to keep things simple
    def bulk_get_or_create(self, objs, batch_size=None):
        with transaction.atomic():
            pulp_type = self.model.get_pulp_type()

            q = models.Q(pk__in=[])
            unsaved_idxs_by_nat_key = defaultdict(list)
            for idx, obj in enumerate(objs):
                content_already_saved = not obj._state.adding
                if not content_already_saved:
                    unsaved_idxs_by_nat_key[obj.natural_key()].append(idx)
                    q |= models.Q(**obj.natural_key_dict())

            existing_objs = self.model.objects.filter(q)
            for existing_obj in existing_objs.iterator(chunk_size=batch_size or 2000):
                for idx in unsaved_idxs_by_nat_key[existing_objs.natural_key()]:
                    objs[idx] = existing_obj

            new_base_content = []
            for obj in objs:
                content_already_saved = not obj._state.adding
                if not content_already_saved:
                    new_base_content.append(NewContent(pulp_type=pulp_type, pulp_id=obj.pk))

            self.bulk_create(objs, batch_size=batch_size)
            NewContent.objects.bulk_create(new_base_content, batch_size=batch_size)

        return objs


class ContentBase(models.Model):
    content_ptr = models.OneToOneField(NewContent, primary_key=True, default=uuid.uuid4, on_delete=models.CASCADE)

    objects = NewPulpManager()

    @classmethod
    def get_pulp_type(cls):
        return cls.TYPE

    @classmethod
    def natural_key_fields(cls):
        """
        Returns a tuple of the natural key fields which usually equates to unique_together fields
        """
        return tuple(chain.from_iterable(cls._meta.unique_together))

    def natural_key(self):
        """
        Get the model's natural key based on natural_key_fields.

        Returns:
            tuple: The natural key.
        """
        return tuple(getattr(self, f) for f in self.natural_key_fields())

    def natural_key_dict(self):
        """
        Get the model's natural key as a dictionary of keys and values.
        """
        to_return = {}
        for key in self.natural_key_fields():
            to_return[key] = getattr(self, key)
        return to_return


    class Meta:
        abstract=True

class NewFileContent(ContentBase):
    TYPE = "file.file"

    relative_path = models.TextField(null=False)
    digest = models.CharField(max_length=64, null=False)

    class Meta:
        unique_together = ("relative_path", "digest")

The speedup is about 3x vs what we currently do.

Old

In [5]: %time PulpFileContent.objects.bulk_get_or_create(old_content)
CPU times: user 555 ms, sys: 88.6 ms, total: 643 ms
Wall time: 1.77 s

New

In [7]: %time NewFileContent.objects.bulk_get_or_create(new_content)
CPU times: user 648 ms, sys: 1.39 ms, total: 649 ms
Wall time: 694 ms

History

#1 Updated by dkliban@redhat.com 10 months ago

  • Status changed from NEW to CLOSED - WONTFIX

#2 Updated by dalley 9 months ago

  • Status changed from CLOSED - WONTFIX to NEW
  • Tags Wishlist added

Please register to edit this issue

Also available in: Atom PDF