Task #4290
Updated by dalley almost 6 years ago
On the pulp-dev mailing list, we discussed how the way we are using bulk_create() only works on certain databases, such as Postgres, and not on others, such as MySQL: https://www.redhat.com/archives/pulp-dev/2018-November/msg00019.html

The summary of the issue is that when we use bulk_create(), we need to get back the PKs of the created objects. With database-generated autoincrement integer IDs, Postgres can do this but MySQL cannot. A proposed solution is to use UUID PKs, which can be generated in the application and therefore do not need to be returned from the database, because we already have them.

However, using UUIDs as PKs in a database often carries performance implications. It inflates the size of the database (a UUID is 128 bits, vs a 32 or 64 bit integer), and the randomness and lack of natural ordering make UUIDs difficult to index properly. In the same thread it was also discussed that MySQL seems to have severe performance issues with UUID PKs due to the way it does indexes, whereas Postgres can index UUID PKs properly.
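For context, the model change being evaluated is small. Below is a minimal sketch (a hypothetical model, not Pulp's actual code) of an application-generated UUID primary key in Django, which is what lets bulk_create() be used without needing the database to hand back generated IDs:

<pre>
import uuid

from django.db import models


class DemoContent(models.Model):
    # Hypothetical model for illustration only.
    # The PK is generated in the application (default=uuid.uuid4), so every
    # instance already has its PK before it is ever sent to the database.
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    relative_path = models.TextField()


# Because the PKs are known up front, bulk_create() does not depend on the
# database returning generated IDs -- this behaves the same on Postgres and MySQL.
units = [DemoContent(relative_path=str(i)) for i in range(1000)]
DemoContent.objects.bulk_create(units)
pks = [unit.pk for unit in units]
</pre>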
Since porting Pulp to use UUID PKs does not require much work, we should verify that doing so will not cause performance issues. We should test the following:

* Size of the database
* Insert performance, and the degree to which it gets progressively worse with database size
* Query performance, and the degree to which it gets progressively worse with database size
* Whether UUID PKs impact performance on Postgres in any significant way (unlikely to be an issue)
* Whether UUID PKs cause significant enough performance degradation on MySQL/MariaDB that the platform is not usable beyond a certain scale. This test should probably try to emulate a medium-to-large installation, with at least 100,000 content units.

For these tests, I recommend taking the following measurements on each platform (Postgres and MariaDB/MySQL):

1. Time required for a lazy sync (NOT an immediate sync) of 100,000 random "file" content units, using the script below (modifications needed)
2. Additive, on top of #1, the time required for lazy syncing "rpm" content from EPEL
3. Additive, on top of #1 and #2, the time required for lazy syncing 50,000 more "file" content units
4. Additive, on top of #1, #2, and #3, the time required for lazy syncing 100 more "file" content units

Perhaps this procedure is unnecessarily complex, but my reasoning is to capture non-linearity in performance. If the first 100,000 units take twice as long to sync, and the final 100 units take 5x as long to sync, then we know that performance gets significantly worse with large installations.

Suggested procedure:

* Create 5 file fixture repositories, each with 100,000 random file content units, using the script below (modifications needed)
* Create 5 file fixture repositories, each with 1,000 random file content units, using the same script
* Create 2 Pulp repositories
* With the first repository, repeat the following process 4 times, using different fixture repositories, and record the results:

> * Sync a 100,000 unit repository
> * Sync a 1,000 unit repository
> * Inside a Python shell, time how long it takes to execute FileContent.objects.get(pk=.....) using the PK of a unit present in the DB
> * Inside a Python shell, time how long it takes to execute the following snippet against the latest repository version

<pre>
from django.db.models import Count

{c['_type']: c['count'] for c in repo_version.content.values('_type').annotate(count=Count('_type'))}
</pre>

> * Inside a Python shell, time how long it takes to execute Content.objects.filter(pk__in=repo_version.content)

* Then repeat that process, but using the second repository, and only for the last set of fixtures
* Sync EPEL using the RPM plugin
* Measure the total size of the database

If neither insert, query, nor storage performance is impacted significantly enough to be a blocking issue, there are additional tests we can run as well.

Bonus Task: Perform these same tests on MySQL / MariaDB, to see how performance compares against PostgreSQL.

Bonus Task: UUID PKs may make it possible to implement a smarter version of "bulk_create()" which can speed up saving models with multi-table inheritance. We could attempt to implement this.

Basic testing script:

<pre>
sudo dnf install nginx
sudo systemctl start nginx
cd /usr/share/nginx/html
mkdir large_repository
chmod -R 777 large_repository
cd large_repository

for i in {1..70000}; do echo `uuidgen` > $i; done
for file in `ls`; do echo $file,`sha256sum $file | awk '{ print $1 }'`,`stat -L -c '%s' $file`; done > PULP_MANIFEST

# open up PULP_MANIFEST and remove the line for PULP_MANIFEST itself
vim PULP_MANIFEST

# now try to fetch PULP_MANIFEST just to make sure everything's working
curl http://localhost/large_repository/PULP_MANIFEST

http POST :8000/pulp/api/v3/repositories/ name=foo
http POST :8000/pulp/api/v3/remotes/file/ name=bar url=http://localhost/large_repository/PULP_MANIFEST policy='on_demand'
http POST :8000/pulp/api/v3/remotes/file/1/sync/ repository=/pulp/api/v3/repositories/1/
</pre>
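The "Inside a Python shell" steps above can be timed with a small script along the following lines. This is only a sketch: it assumes a Django shell with pulpcore and pulp_file importable, and the import paths, the repository version lookup, and the sample PK lookup are assumptions that may need adjusting for the environment under test.

<pre>
import time

from django.db.models import Count

from pulpcore.app.models import Content, RepositoryVersion
from pulp_file.app.models import FileContent

# Assumed lookups -- replace with the repository version and PK you want to measure.
repo_version = RepositoryVersion.objects.latest('number')
some_pk = FileContent.objects.values_list('pk', flat=True).first()

start = time.time()
FileContent.objects.get(pk=some_pk)
print('get() by PK:', time.time() - start)

start = time.time()
counts = {c['_type']: c['count']
          for c in repo_version.content.values('_type').annotate(count=Count('_type'))}
print('content counts per type:', time.time() - start, counts)

start = time.time()
# list() forces the queryset to actually execute.
matched = list(Content.objects.filter(pk__in=repo_version.content))
print('filter(pk__in=...):', time.time() - start, len(matched))
</pre>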