Project

Profile

Help

Task #4290

Updated by dalley almost 6 years ago

On the pulp-dev mailing list, we discussed how the way we are using bulk_create only works on certain databases, such as Postgres, and not on others, such as MySQL. 

 https://www.redhat.com/archives/pulp-dev/2018-November/msg00019.html 

 The summary of this issue is that - when we use bulk_create(), we need to get back the PKs of the created objects.    With database-generated autoincrement integer IDs, Postgres can do this but MySQL cannot. 

 A proposed solution to this problem is to use UUID PKs, which can be generated in the application and therefore do not need to be returned from the database, because we already have them. 

 However, using UUIDs as PKs in a database often carries performance implications.    It inflates the size of the database (due to being 128 bits, vs a 32 or 64 bit integer), and the randomness and lack of natural ordering makes them difficult to properly index. 

 We should verify that this will not cause performance issues. 

 We should test the following: 

 * Size of the database 
 * Insert performance, and the degree to which it gets progressively worse with database size 
 * Query performance, and the degree to which it gets progressively worse with database size 

 For the following suggested procedure, use additive sync, NOT mirror sync, and lazy sync, NOT immediate sync. 

 Suggested procedure: 

 * Create 5 file fixture repositories, each with 100,000 random file content units, using the script below (modifications needed) 
 * Create 5 file fixture repositories, each with 1,000 random file content units, using the same script 
 * Create 2 Pulp repositories 
 * With the first repository, repeat the following process 4 times using different fixture repositories and record the results 
 > * Sync a 100,000 unit repository 
 > * Sync a 1,000 unit repository 
 > * Inside a Python shell, time how long it takes execute FileContent.objects.get(pk=.....) using the PK of a unit present in the DB 
 > * Inside a Python shell, time how long it takes to execute the following script for the latest repository version {c['_type']: c['count'] for c in repo_version.content.values('_type').annotate(count=Count('_type'))} 
 > * Inside a Python shell, time how long it takes to execute Content.objects.filter(pk__in=repo_version.content) 
 * Then repeat that process *once*, but using the second repository, and only the last (one) set of fixtures and record the results 
 * Sync EPEL using the RPM plugin and record the results 
 * Measure the total size of the database and record the results 


 If neither insert/query/storage performance is impacted significantly enough to be a blocking issue, there are additional tests we can run as well. 

 Bonus Task:    Perform these same tests on MySQL / MariaDB, to see how performance compares against PostgreSQL. 

 Bonus Task:    UUID PKs may make it possible to implement a smarter version of "bulk_create()" which can speed up saving Content models with multi-table inheritance.    We could attempt to implement this. 


 Basic testing script 

 <pre> 
 sudo dnf install nginx 
 sudo systemctl start nginx 
 cd /usr/share/nginx/html 
 mkdir large_repository 
 chmod -R 777 large_repository 
 cd large_repository 

 for i in {1..70000}; do echo `uuidgen` > $i; done 
 for file in `ls`; do echo $file,`sha256sum $file | awk '{ print $1 }'`,`stat -L -c '%s' $file`; done > PULP_MANIFEST 

 # open up PULP_MANIFEST and remove the line for PULP_MANIFEST 
 vim PULP_MANIFEST 

 # now try to fetch PULP_MANIFEST just to make sure everything's working 
 curl http://localhost/large_repository/PULP_MANIFEST 

 http POST :8000/pulp/api/v3/repositories/ name=foo 
 http POST :8000/pulp/api/v3/remotes/file/ name=bar url=http://localhost/large_repository/PULP_MANIFEST policy='on_demand' 
 http POST :8000/pulp/api/v3/remotes/file/1/sync/ repository=/pulp/api/v3/repositories/1/ 

 </pre>

Back