Issue #4404
Sync performance degradation with RemoteArtifactSaver stage (closed)
Description
Problem
The new RemoteArtifactSaver stage (see issue #4246) slows down sync, especially when re-syncing an existing repo.
Some performance impact was expected, as the RemoteArtifactSaver stage has to query for existing RemoteArtifacts in all cases (previously, the RemoteArtifacts were saved unconditionally when saving content units).
Measurements
Duration of a lazy sync of Chef Supermarket (ca. 23700 content units & remote artifacts) using pulp_cookbook, measured on a fresh Pulp instance (just started) with an empty database.
Before the RemoteArtifactSaver stage (i.e. before https://github.com/pulp/pulpcore-plugin/pull/36):
Initial sync: ca. 90 seconds
Re-sync: ca. 14 seconds
With the separate RemoteArtifactSaver stage:
Initial sync: ca. 160 seconds
Re-sync: ca. 140 seconds
Initial sync time almost doubles, and re-sync time increases by a factor of 10!
Solution
The root cause is the high number of database operations in the new stage.
In a test (see https://gist.github.com/gmbnomis/07b6c7d13a313dbcfcaa81ff026b96f8), the stage causes ca. 300 DB queries for a batch of 100 content units (to create around 50 RemoteArtifacts).
Using prefetching, the stage can be implemented using 2 DB queries per
batch. The WHERE clauses use pks only. For example:
[{'sql': 'SELECT "pulp_app_contentartifact"."_id", '
'"pulp_app_contentartifact"."_created", '
'"pulp_app_contentartifact"."_last_updated", '
'"pulp_app_contentartifact"."artifact_id", '
'"pulp_app_contentartifact"."content_id", '
'"pulp_app_contentartifact"."relative_path" FROM '
'"pulp_app_contentartifact" WHERE '
'"pulp_app_contentartifact"."content_id" IN (1, 2, 3, 4, 5, 6, 7, 8, '
'9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, '
'26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, '
'43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, '
'60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, '
'77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, '
'94, 95, 96, 97, 98, 99, 100)',
'time': '0.008'},
{'sql': 'SELECT "pulp_app_remoteartifact"."_id", '
'"pulp_app_remoteartifact"."_created", '
'"pulp_app_remoteartifact"."_last_updated", '
'"pulp_app_remoteartifact"."url", "pulp_app_remoteartifact"."size", '
'"pulp_app_remoteartifact"."md5", "pulp_app_remoteartifact"."sha1", '
'"pulp_app_remoteartifact"."sha224", '
'"pulp_app_remoteartifact"."sha256", '
'"pulp_app_remoteartifact"."sha384", '
'"pulp_app_remoteartifact"."sha512", '
'"pulp_app_remoteartifact"."content_artifact_id", '
'"pulp_app_remoteartifact"."remote_id" FROM "pulp_app_remoteartifact" '
'WHERE ("pulp_app_remoteartifact"."remote_id" IN (1, 3) AND '
'"pulp_app_remoteartifact"."content_artifact_id" IN (1, 2, 3, 4, 5, '
'6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, '
'24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, '
'41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, '
'58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, '
'75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, '
'92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, '
'107, 108, 109, 110, 111, 112, 113, 114, 115))',
'time': '0.001'}]
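To illustrate the two-queries-per-batch pattern outside of Pulp, here is a minimal sketch using Python's built-in sqlite3 module. It is not the actual stage implementation: the simplified tables, the helper names placeholders and missing_remote_artifacts, and the test data are assumptions made up for this example. Only the shape of the two SELECT statements (both filtering with IN on keys) follows the queries shown above.

```python
import sqlite3

# Hypothetical, heavily simplified schema: only the columns needed for the lookup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contentartifact (id INTEGER PRIMARY KEY, content_id INTEGER);
    CREATE TABLE remoteartifact (
        id INTEGER PRIMARY KEY, content_artifact_id INTEGER, remote_id INTEGER
    );
""")
conn.executemany("INSERT INTO contentartifact VALUES (?, ?)",
                 [(i, i) for i in range(1, 101)])
# Assumption for the demo: every second content artifact already has a
# remote artifact for remote 1.
conn.executemany(
    "INSERT INTO remoteartifact (content_artifact_id, remote_id) VALUES (?, 1)",
    [(i,) for i in range(1, 101, 2)],
)

queries = []
conn.set_trace_callback(queries.append)  # record each SQL statement from here on


def placeholders(values):
    """Build a '?, ?, ...' list for an IN clause."""
    return ", ".join("?" for _ in values)


def missing_remote_artifacts(conn, content_ids, remote_ids):
    """Return (content_artifact_id, remote_id) pairs without a row, using 2 SELECTs."""
    # Query 1: all content artifacts of the batch.
    ca_ids = [row[0] for row in conn.execute(
        "SELECT id FROM contentartifact "
        f"WHERE content_id IN ({placeholders(content_ids)})",
        content_ids)]
    # Query 2: all existing remote artifacts for the batch's remotes.
    existing = set(conn.execute(
        "SELECT content_artifact_id, remote_id FROM remoteartifact "
        f"WHERE remote_id IN ({placeholders(remote_ids)}) "
        f"AND content_artifact_id IN ({placeholders(ca_ids)})",
        remote_ids + ca_ids))
    # The comparison happens in memory, so no per-unit query is needed.
    return [(ca, r) for ca in ca_ids for r in remote_ids if (ca, r) not in existing]


batch = list(range(1, 101))          # one batch of 100 content units
missing = missing_remote_artifacts(conn, batch, [1])
print(f"{len(queries)} queries, {len(missing)} remote artifacts to create")
```

The number of SELECT statements stays constant regardless of batch size; a per-unit lookup would instead issue a number of queries proportional to the batch size.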
The queries are a little bit broader than before: The query for the
RemoteArtifacts includes RemoteArtifacts for all remotes seen in the current
batch (previously, it included the remotes seen per declarative_content).
For all practical purposes, this difference should be negligible (and there is
no difference at all for batches using a single remote).
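A small sketch of why the broader query is harmless: the batch query may return rows for (content artifact, remote) combinations that no unit in the batch asked for, and these are simply ignored when comparing in memory. The Row type and the sample data below are hypothetical and only serve this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a RemoteArtifact row returned by the batch query.
Row = namedtuple("Row", ["content_artifact_id", "remote_id"])

# (content_artifact_id, remote_id) pairs actually needed by the batch.
needed = {(1, 1), (2, 1), (3, 3)}

# Result of the batch-wide query: any existing row whose remote is in {1, 3}
# and whose content artifact is in {1, 2, 3}.
fetched = [Row(1, 1), Row(1, 3), Row(2, 1), Row(3, 1)]

existing = {(row.content_artifact_id, row.remote_id) for row in fetched}
surplus = existing - needed    # fetched only because the query is broader
to_create = needed - existing  # pairs that still need a RemoteArtifact

print("surplus rows:", sorted(surplus))
print("to create:", sorted(to_create))
```

With a single remote per batch, the set of remotes in the query is the same as per unit, so no surplus rows can occur.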
Performance measurement with the prefetching implementation:
Initial sync: ca. 100 seconds
Re-sync: ca. 23 seconds
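For a quick comparison, the three sets of measurements from this issue can be expressed as factors relative to the timings before the RemoteArtifactSaver stage. The dictionary keys are labels chosen for this sketch; the numbers are the approximate durations reported above.

```python
# Approximate durations in seconds, as reported in this issue:
# (initial sync, re-sync)
timings = {
    "before stage": (90, 14),
    "separate stage": (160, 140),
    "stage with prefetching": (100, 23),
}

base_initial, base_resync = timings["before stage"]

# Slowdown factor of each variant relative to the pre-stage baseline.
factors = {
    name: (round(initial / base_initial, 2), round(resync / base_resync, 2))
    for name, (initial, resync) in timings.items()
}

for name, (initial_factor, resync_factor) in factors.items():
    print(f"{name}: initial x{initial_factor}, re-sync x{resync_factor}")
```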