Task #1872


Profile Django ORM instantiation cost

Added by semyers over 8 years ago. Updated over 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

We've experienced a surprising performance regression when moving from pymongo to MongoEngine related to the "hydration" of a mongo result row into a MongoEngine model instance. We would like to profile Django to similarly measure the cost of hydrating model instances with a relational backend.


Related issues

Related to Pulp - Task #1803: Plan replacement of mongodb with postgres (CLOSED - CURRENTRELEASE, semyers)

Actions #1

Updated by semyers over 8 years ago

  • Related to Task #1803: Plan replacement of mongodb with postgres added
Actions #2

Updated by semyers over 8 years ago

This was inspired by the apparently high cost of model instance instantiation in MongoEngine, recorded in #1714.

To test this, I'll probably end up creating a little combo project of Django and MongoEngine, with each backend populated with a few thousand rows of data that are as identical as possible, and timing how long it takes to make MongoEngine objects, pymongo dicts, Django objects, and Django dicts (presumably using QuerySet.values). Thanks to jortel for some ideas about testing this.
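The comparison described above could be sketched as a small harness that times each retrieval style over the same row count. This is only a sketch of the plan, not code from the ticket; the fetch_* functions are hypothetical stand-ins for the real backends (e.g. RPM._get_collection().find() for pymongo, RPM.objects.all() for Django), which require a running database and so aren't shown here:

```python
import time

def timed_run(fetch):
    """Time one full iteration over whatever fetch() returns."""
    start = time.time()
    total = sum(1 for _ in fetch())  # consume the iterable, as the ticket's loops do
    elapsed = time.time() - start
    return total, elapsed

def time_fetch(label, fetch, repeats=3):
    """Run fetch() several times and report the best wall-clock time."""
    best = min(timed_run(fetch)[1] for _ in range(repeats))
    print('{0}: {1:.4f} seconds'.format(label, best))
    return best

# Hypothetical stand-ins for the real backend queries:
def fetch_dicts():
    # analogous to pymongo find() or Django .values()
    return ({'name': 'rpm-{0}'.format(i)} for i in range(10000))

def fetch_objects():
    # analogous to MongoEngine Documents or Django model instances
    class Unit(object):
        def __init__(self, name):
            self.name = name
    return (Unit('rpm-{0}'.format(i)) for i in range(10000))

time_fetch('dict rows', fetch_dicts)
time_fetch('model instances', fetch_objects)
```

Running each case several times and keeping the best time helps smooth out noise from caching and GC, which matters when the absolute times are fractions of a second.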

Actions #3

Updated by semyers over 8 years ago

  • Status changed from ASSIGNED to CLOSED - CURRENTRELEASE

In order to do this, I timed some of the queries seen in #1714 on the models currently in the relational pulp project, using 10k RPMs. I haven't reproduced the Mongo end of these tests on the same hardware as the Django tests, so the Mongo vs. Django stats aren't comparable.

These tests, from https://pulp.plan.io/issues/1714#note-11, focused simply on instantiating MongoEngine Documents versus returning pymongo dicts sans MongoEngine. Instantiating MongoEngine RPM unit Documents with 12k RPMs was reported to take 24 seconds. Returning dicts from pymongo on the same unit set was reported to take 3 seconds, making the MongoEngine instantiation cost roughly a factor of 8 based on the results in that comment.

# pymongo returns
total = 0
for rpm in RPM._get_collection().find():
    total += 1
print '{0} RPMs found'.format(total)
# 3 seconds reported

# RPM Document returns
total = 0
for rpm in RPM.objects.all():
    total += 1
print '{0} RPMs found'.format(total)
# 24 seconds reported

In Django:

with timer:
    total = 0
    for rpm in RPM.objects.all():
        total += 1
    print(total)
10000
time: 0.19265484809875488 seconds
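The timer object used in these blocks isn't defined anywhere in the ticket. A minimal context-manager sketch that would produce the "time: ... seconds" output looks like the following; this is an assumption about the original helper, not its actual code:

```python
import time

class Timer(object):
    """Reusable context manager that prints elapsed wall-clock time."""

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc_info):
        self.elapsed = time.time() - self.start
        print('time: {0} seconds'.format(self.elapsed))

timer = Timer()

# Same shape as the timed blocks in this comment:
with timer:
    total = 0
    for rpm in range(10000):
        total += 1
    print(total)
```

A class-based context manager (rather than contextlib.contextmanager) is used here because the ticket reuses the same timer object across several with blocks, and a generator-based context manager can only be entered once.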

Django doesn't really expose a way to get at the "raw" return from the DB through a Model, but it does provide a mechanism to only retrieve specific fields, and a way to get the field names for a Model, so with those two bits combined we can quickly create a dict representation of DB rows, similar to a pymongo return value:

fieldnames = [f.name for f in RPM._meta.fields]
with timer:
    total = 0
    for rpm in RPM.objects.all().values(*fieldnames):
        total += 1
    print(total)
10000
time: 0.10883545875549316 seconds

From this, we can (sorta) estimate Django's instantiation cost versus just returning dicts: it takes about 1.8 times longer to instantiate a Django model instance than to return a dict of {field: value} mappings. Compared to MongoEngine's factor of 8, Django looks to be roughly four times faster at instantiating model instances. Because we never really identified what specific behavior was triggering MongoEngine's slowdown, it's difficult at this point to say whether future additions to our Django ContentUnit will slow it down for similar reasons.
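As a quick arithmetic check, the ratios quoted in this ticket work out as follows (the raw numbers come from the timings above and the ones reported in #1714):

```python
# MongoEngine: 24s for Documents vs 3s for pymongo dicts (12k RPMs, from #1714)
mongo_ratio = 24.0 / 3.0  # factor of 8

# Django: 0.1927s for model instances vs 0.1088s for dict rows (10k RPMs, above)
django_ratio = 0.19265484809875488 / 0.10883545875549316  # roughly 1.8

# Relative instantiation overhead: how much smaller Django's penalty is
print(round(mongo_ratio, 1), round(django_ratio, 2), round(mongo_ratio / django_ratio, 1))
```

Note the two ratios were measured on different hardware and different row counts (12k vs 10k), so dividing one by the other only gives a rough sense of scale, consistent with the "roughly four times" estimate above.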

I think all of these numbers are pretty sketchy, but at the very least we can conclude that there is (of course) a cost to instantiating Django Model instances from a DB row, and that cost is apparently lower than the cost of instantiating MongoEngine ContentUnit Documents from a DB row in Pulp 2.

While the Mongo vs. Django numbers are not directly comparable, it would be worth setting up tests that run both platforms on the same hardware to dig into the apparent speed difference, since Django in general looks like it might be faster than MongoEngine at retrieving objects.

Some notes

It was tricky to find the best behavioral analogs from one framework to the other. I consider Django's .values() method the best alternative to the dicts returned by pymongo's find method, rather than going to the DB cursor directly, because it kindly converts the results to dicts for us; I'm treating that conversion as a stand-in for pymongo's BSON -> Python serialization cost. For reference, here's a representative sample using the raw DB cursor, which returns tuples:

with timer:
    total = 0
    cursor.execute('SELECT * FROM "pulp_rpm_rpm" INNER JOIN "pulp_contentunit" ON ( "pulp_rpm_rpm"."contentunit_ptr_id" = "pulp_contentunit"."uuid" )')
    for row in cursor.fetchall():
        total += 1
    print(total)
10000
time: 0.10186195373535156 seconds

The JOIN is needed to composite the RPM Model with its ContentUnit base. This example is a little faster than the Django .values() example above. It is a representative sample; the minimum and maximum observed times were similarly very close to the min/max of timed calls to Django's .values() method, with neither approach being clearly better in my testing (...so that's pretty cool). I apologize for not collecting the data into something graphable.

Finally, I accidentally left the print statement in my timer blocks above. Here's a representative sample of its impact on the time results in my test environment:

with timer:
    print(total)
10000
time: 0.00010037422180175781 seconds
Actions #4

Updated by bmbouter over 8 years ago

I really enjoyed reading this. @smyers, great job!

Actions #5

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
