Task #4290

Investigate performance of Pulp with UUID PKs

Added by dalley 10 months ago. Updated 6 months ago.

Status:
CLOSED - COMPLETE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

0%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:
Sprint 49

Description

On the pulp-dev mailing list, we discussed how the way we use bulk_create() works only on certain databases, such as PostgreSQL, and not on others, such as MySQL.

https://www.redhat.com/archives/pulp-dev/2018-November/msg00019.html

The summary of this issue: when we use bulk_create(), we need to get back the PKs of the created objects. With database-generated autoincrement integer IDs, PostgreSQL can return them but MySQL cannot.

A proposed solution to this problem is to use UUID PKs, which can be generated in the application and therefore do not need to be returned from the database, because we already have them.
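The core of that idea can be sketched in plain Python (illustrative only, not Pulp's actual models): if the application assigns each object a uuid4 primary key before insert, the PKs are known up front regardless of what the database returns from a bulk insert.

```python
import uuid

# Hypothetical content rows for illustration; in Django the equivalent is a
# UUIDField primary key with default=uuid.uuid4, generated client-side.
rows = [{"pk": uuid.uuid4(), "relative_path": str(i)} for i in range(3)]

# The PKs exist before any INSERT is issued, so a bulk insert does not
# need the database to return them (unlike autoincrement integer IDs).
known_pks = [row["pk"] for row in rows]
assert all(isinstance(pk, uuid.UUID) for pk in known_pks)
assert len(set(known_pks)) == len(known_pks)  # all distinct
```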

However, using UUIDs as PKs in a database often carries performance implications. It inflates the size of the database (a UUID is 128 bits, versus a 32- or 64-bit integer), and the randomness and lack of natural ordering make UUIDs difficult to index efficiently.
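The raw storage overhead is easy to quantify in isolation (a quick Python check, not a database benchmark; index bloat and cache behavior need real measurement, which is the point of this task):

```python
import struct
import uuid

# A UUID primary key stores 16 bytes per value; an autoincrement key
# stores 4 (int) or 8 (bigint). Every index that references the PK
# pays the same per-row overhead again.
assert len(uuid.uuid4().bytes) == 16       # 128-bit UUID
assert len(struct.pack("<q", 2 ** 40)) == 8  # 64-bit bigint
assert len(struct.pack("<i", 42)) == 4       # 32-bit int
```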

We should verify that this will not cause performance issues.

We should test the following:

  • Size of the database/database indexes
  • Memory usage while running?
  • Insert performance, and the degree to which it gets progressively worse with database size
  • Query performance, and the degree to which it gets progressively worse with database size

For the following suggested procedure, use additive sync, NOT mirror sync, and lazy sync, NOT immediate sync.

Suggested procedure:

  • Create 5 file fixture repositories, each with 100,000 random file content units, using the script below (modifications needed)
  • Create 5 file fixture repositories, each with 1,000 random file content units, using the same script
  • Create 2 Pulp repositories
  • With the first repository, repeat the following process 4 times, using different fixture repositories each time, and record the results:
      • Sync a 100,000 unit repository
      • Sync a 1,000 unit repository
      • Inside a Python shell, time how long it takes to execute FileContent.objects.get(pk=.....) using the PK of a unit present in the DB
      • Inside a Python shell, time how long it takes to execute the following snippet for the latest repository version: {c['_type']: c['count'] for c in repo_version.content.values('_type').annotate(count=Count('_type'))}
      • Inside a Python shell, time how long it takes to execute Content.objects.filter(pk__in=repo_version.content)
  • Then repeat that process once using the second repository and only the last set of fixtures, and record the results
  • Sync EPEL using the RPM plugin and record the results
  • Measure the total size of the database and record the results
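For the timing steps above, a small helper can stand in for IPython's %timeit in a plain Python shell (a sketch; FileContent, repo_version, etc. exist only inside the Pulp shell, so they appear below as comments):

```python
import time

def timed(label, fn, repeat=5):
    """Rough stand-in for %timeit: call fn() `repeat` times and report
    the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best:.6f}s")
    return best

# Inside the Pulp shell, the procedure's queries would be wrapped like:
# timed("pk lookup", lambda: FileContent.objects.get(pk=some_pk))
# timed("filter by version", lambda: list(
#     Content.objects.filter(pk__in=repo_version.content)))
elapsed = timed("demo", lambda: sum(range(1000)))  # self-contained demo
```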

If neither insert/query/storage performance is impacted significantly enough to be a blocking issue, there are additional tests we can run as well.

Bonus Task: Perform these same tests on MySQL / MariaDB, to see how performance compares against PostgreSQL.

Bonus Task: UUID PKs may make it possible to implement a smarter version of "bulk_create()" which can speed up saving Content models with multi-table inheritance. We could attempt to implement this.
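The intuition behind that smarter bulk_create() can be sketched in plain Python (a hypothetical table layout, not Django's actual API): with multi-table inheritance each child row shares its PK with a parent row, and app-generated UUIDs let both tables' rows be built up front and inserted with one bulk statement per table instead of one round trip per object.

```python
import uuid

def build_bulk_rows(items):
    """Build parent ("content") and child ("filecontent") rows sharing
    app-generated UUID PKs, so each table needs only one bulk INSERT.
    Column names here are hypothetical, for illustration only."""
    parent_rows, child_rows = [], []
    for item in items:
        pk = uuid.uuid4()  # known before any INSERT runs
        parent_rows.append({"pk": pk, "type": "file"})
        child_rows.append({"content_ptr_id": pk, "relative_path": item["path"]})
    return parent_rows, child_rows

parents, children = build_bulk_rows([{"path": "a"}, {"path": "b"}])
# The FK linkage is already correct without reading PKs back from the DB.
assert [p["pk"] for p in parents] == [c["content_ptr_id"] for c in children]
```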

Basic testing script

sudo dnf install nginx
sudo systemctl start nginx
cd /usr/share/nginx/html
mkdir large_repository
chmod -R 777 large_repository
cd large_repository

for i in {1..70000}; do echo `uuidgen` > $i; done
for file in `ls`; do echo $file,`sha256sum $file | awk '{ print $1 }'`,`stat -L -c '%s' $file`; done > PULP_MANIFEST

# open up PULP_MANIFEST and remove the line for PULP_MANIFEST
vim PULP_MANIFEST

# now try to fetch PULP_MANIFEST just to make sure everything's working
curl http://localhost/large_repository/PULP_MANIFEST

http POST :8000/pulp/api/v3/repositories/ name=foo
http POST :8000/pulp/api/v3/remotes/file/ name=bar url=http://localhost/large_repository/PULP_MANIFEST policy='on_demand'
http POST :8000/pulp/api/v3/remotes/file/1/sync/ repository=/pulp/api/v3/repositories/1/
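A Python variant of the fixture-generation loop above (a sketch assuming the same name,sha256,size manifest layout) that excludes PULP_MANIFEST from its own listing, so the manual vim edit isn't needed:

```python
import hashlib
import os
import tempfile
import uuid

def write_fixture_repo(directory, count):
    """Create `count` files with random contents plus a PULP_MANIFEST in
    the name,sha256,size CSV layout produced by the shell loop above.
    The manifest deliberately skips its own entry, so no manual edit."""
    lines = []
    for i in range(1, count + 1):
        data = (uuid.uuid4().hex + "\n").encode()
        with open(os.path.join(directory, str(i)), "wb") as f:
            f.write(data)
        lines.append(f"{i},{hashlib.sha256(data).hexdigest()},{len(data)}")
    with open(os.path.join(directory, "PULP_MANIFEST"), "w") as f:
        f.write("\n".join(lines) + "\n")

# quick self-check in a throwaway directory
with tempfile.TemporaryDirectory() as d:
    write_fixture_repo(d, 5)
```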

benchmarking_results_uuid.txt (5.76 KB) dalley, 02/26/2019 07:29 PM
benchmarking_results_autoincrement.txt (7.62 KB) dalley, 02/26/2019 07:29 PM
benchmark_steps.txt (5.33 KB) dalley, 02/26/2019 07:29 PM
benchmark_steps_uuid.txt (5.65 KB) dalley, 02/26/2019 07:29 PM
benchmarking_results_uuid_mysql.txt (1.92 KB) dalley, 02/26/2019 11:36 PM

Related issues

Blocks Pulp - Task #4270: Add support for MariaDB and MySQL (MODIFIED)

History

#1 Updated by dalley 10 months ago

  • Tracker changed from Issue to Task
  • % Done set to 0

#2 Updated by daviddavis 10 months ago

  • Blocked by Issue #4288: _mysql_exceptions.OperationalError: (1170, "BLOB/TEXT column 'name' used in key specification without a key length") added

#3 Updated by dkliban@redhat.com 10 months ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

#4 Updated by bmbouter 9 months ago

  • Sprint set to Sprint 47

#5 Updated by dalley 9 months ago

  • Subject changed from Investigate performance of Pulp with UUID PKs on MariaDB/MySQL to Investigate performance of Pulp with UUID PKs
  • Description updated (diff)

#6 Updated by dalley 9 months ago

  • Description updated (diff)

#7 Updated by dalley 9 months ago

  • Description updated (diff)

#8 Updated by dalley 9 months ago

  • Description updated (diff)

#9 Updated by bmbouter 9 months ago

I think the rewrite of this is great. Also can we remove the blocking relationship?

#10 Updated by dalley 9 months ago

  • Blocked by deleted (Issue #4288: _mysql_exceptions.OperationalError: (1170, "BLOB/TEXT column 'name' used in key specification without a key length"))

#11 Updated by dalley 9 months ago

  • Description updated (diff)

#12 Updated by dalley 9 months ago

  • Description updated (diff)

#13 Updated by daviddavis 9 months ago

  • Tags Pulp 3 RC Blocker added

#14 Updated by dalley 9 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley

#15 Updated by rchan 9 months ago

  • Sprint changed from Sprint 47 to Sprint 48

#16 Updated by rchan 8 months ago

  • Sprint changed from Sprint 48 to Sprint 49

#17 Updated by daviddavis 8 months ago

  • Blocks Task #4270: Add support for MariaDB and MySQL added

#18 Updated by dalley 8 months ago

Procedures and results attached.

Summary:

UUIDs on Postgres are about 30% slower than autoincrement integer IDs at scale. For the first 2 rounds of large syncs there didn't seem to be much of a difference, but in the later rounds a clear gap emerged.

MariaDB behaves somewhat strangely. The first sync was 80% slower than the first sync with PostgreSQL, but every subsequent sync was faster. This is despite some of the important underlying operations being several times slower than on PostgreSQL when testing code snippets.

There was one showstopper performance problem with MariaDB. Whereas

        %timeit {c['_type']: c['count'] for c in repo_version.content.values('_type').annotate(count=Count('_type'))}

completed on Postgres w/ UUID pks in only 129 milliseconds, the same snippet took >20 minutes on MariaDB. I don't know how long it would have taken to complete because I cut it off prematurely.

I'm not sure what caused this. Possibly the database got corrupted somehow? When I tried it with RepoVersion(number=1), it did actually complete, and it took about 1.3 seconds, which is several times slower than PostgreSQL but not anywhere close to the 20+ minute hanging I was seeing with the others.

#19 Updated by daviddavis 6 months ago

  • Sprint/Milestone set to 3.0

#20 Updated by bmbouter 6 months ago

  • Tags deleted (Pulp 3, Pulp 3 RC Blocker)
