Issue #4404

Sync performance degradation with RemoteArtifactSaver stage

Added by gmbnomis 8 months ago. Updated 6 months ago.

Status: MODIFIED
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
Severity: 2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible: No
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint: Sprint 48

Description

Problem

The new RemoteArtifactSaver stage (see issue #4246) slows down sync, especially
when re-syncing an existing repo.

Some performance impact was expected, as the RemoteArtifactSaver stage has
to query for existing RemoteArtifacts in all cases (previously, the
RemoteArtifacts were saved unconditionally when saving content units).

Measurements

Duration of a lazy sync of Chef Supermarket (ca. 23700 content units &
remote artifacts) using pulp_cookbook. Each measurement uses a fresh Pulp
instance (just started) with an empty database.

Before RemoteArtifactSaver stage (i.e. before https://github.com/pulp/pulpcore-plugin/pull/36):

Initial sync: ca. 90 seconds
Re-sync: ca. 14 seconds

With separate RemoteArtifactSaver stage:

Initial sync: ca. 160 seconds
Re-sync: ca. 140 seconds

Initial sync time almost doubles, re-sync time increases by a factor of 10!

Solution

The root cause is the high number of database operations in the new stage.
In a test (see https://gist.github.com/gmbnomis/07b6c7d13a313dbcfcaa81ff026b96f8), the stage causes ca. 300 DB queries for a batch of 100 content units (to create around 50 RemoteArtifacts).
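
To illustrate how per-unit lookups add up, here is a minimal sketch that uses an in-memory SQLite database as a stand-in. It is not the test from the gist: the two tables and their columns are simplified, hypothetical versions of the Pulp tables, and the statement count is not meant to reproduce the ca. 300 queries measured there. It only shows that looking up each content unit individually costs a fixed number of queries per unit.

```python
import sqlite3

# Hypothetical, heavily simplified schema (the real Pulp tables have more columns).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE content_artifact (id INTEGER PRIMARY KEY, content_id INTEGER);
    CREATE TABLE remote_artifact (
        id INTEGER PRIMARY KEY,
        content_artifact_id INTEGER,
        remote_id INTEGER
    );
    """
)
# A batch of 100 content units, each with one ContentArtifact.
conn.executemany(
    "INSERT INTO content_artifact (id, content_id) VALUES (?, ?)",
    [(i, i) for i in range(1, 101)],
)
# Half of them already have a RemoteArtifact for remote 1.
conn.executemany(
    "INSERT INTO remote_artifact (content_artifact_id, remote_id) VALUES (?, ?)",
    [(i, 1) for i in range(1, 51)],
)
conn.commit()

statements = []
conn.set_trace_callback(statements.append)  # record every executed statement

missing = []
for content_id in range(1, 101):
    # One query per unit to find its ContentArtifact ...
    (ca_id,) = conn.execute(
        "SELECT id FROM content_artifact WHERE content_id = ?", (content_id,)
    ).fetchone()
    # ... and one more per unit to check for an existing RemoteArtifact.
    row = conn.execute(
        "SELECT id FROM remote_artifact "
        "WHERE content_artifact_id = ? AND remote_id = ?",
        (ca_id, 1),
    ).fetchone()
    if row is None:
        missing.append(ca_id)

conn.set_trace_callback(None)
selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
print(len(selects), "SELECT statements,", len(missing), "RemoteArtifacts to create")
```

In this sketch the number of SELECT statements grows linearly with the batch size, before any INSERT for the missing RemoteArtifacts is issued.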

Using prefetching, the stage can be implemented using 2 DB queries per
batch. The WHERE clauses use pks only. For example:

[{'sql': 'SELECT "pulp_app_contentartifact"."_id", '
         '"pulp_app_contentartifact"."_created", '
         '"pulp_app_contentartifact"."_last_updated", '
         '"pulp_app_contentartifact"."artifact_id", '
         '"pulp_app_contentartifact"."content_id", '
         '"pulp_app_contentartifact"."relative_path" FROM '
         '"pulp_app_contentartifact" WHERE '
         '"pulp_app_contentartifact"."content_id" IN (1, 2, 3, 4, 5, 6, 7, 8, '
         '9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, '
         '26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, '
         '43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, '
         '60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, '
         '77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, '
         '94, 95, 96, 97, 98, 99, 100)',
  'time': '0.008'},
 {'sql': 'SELECT "pulp_app_remoteartifact"."_id", '
         '"pulp_app_remoteartifact"."_created", '
         '"pulp_app_remoteartifact"."_last_updated", '
         '"pulp_app_remoteartifact"."url", "pulp_app_remoteartifact"."size", '
         '"pulp_app_remoteartifact"."md5", "pulp_app_remoteartifact"."sha1", '
         '"pulp_app_remoteartifact"."sha224", '
         '"pulp_app_remoteartifact"."sha256", '
         '"pulp_app_remoteartifact"."sha384", '
         '"pulp_app_remoteartifact"."sha512", '
         '"pulp_app_remoteartifact"."content_artifact_id", '
         '"pulp_app_remoteartifact"."remote_id" FROM "pulp_app_remoteartifact" '
         'WHERE ("pulp_app_remoteartifact"."remote_id" IN (1, 3) AND '
         '"pulp_app_remoteartifact"."content_artifact_id" IN (1, 2, 3, 4, 5, '
         '6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, '
         '24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, '
         '41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, '
         '58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, '
         '75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, '
         '92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, '
         '107, 108, 109, 110, 111, 112, 113, 114, 115))',
  'time': '0.001'}]

The queries are a little bit broader than before: The query for the
RemoteArtifacts includes RemoteArtifacts for all remotes seen in the current
batch (previously, it included the remotes seen per declarative_content).
For all practical purposes, this difference should be negligible (and there
is no difference for batches that use a single remote).
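
The two-queries-per-batch idea can be sketched without Django. The following is not the code from the revision, just an illustration with an in-memory SQLite database; the schema is a simplified, hypothetical version of the Pulp tables, and the batch contents (odd units from remote 1, even units from remote 3) are made up. It also shows the "broader" second query: it fetches RemoteArtifacts for every remote seen in the batch, and the exact (ContentArtifact, remote) match is done in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema (the real Pulp tables have more columns).
conn.executescript(
    """
    CREATE TABLE content_artifact (id INTEGER PRIMARY KEY, content_id INTEGER);
    CREATE TABLE remote_artifact (
        id INTEGER PRIMARY KEY,
        content_artifact_id INTEGER,
        remote_id INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO content_artifact (id, content_id) VALUES (?, ?)",
    [(i, i) for i in range(1, 101)],
)
# RemoteArtifacts for remote 1 already exist for the first 50 ContentArtifacts.
conn.executemany(
    "INSERT INTO remote_artifact (content_artifact_id, remote_id) VALUES (?, ?)",
    [(i, 1) for i in range(1, 51)],
)
conn.commit()

# Made-up batch: (content_id, remote_id) pairs that need a RemoteArtifact.
batch = [(cid, 1 if cid % 2 else 3) for cid in range(1, 101)]


def placeholders(n):
    """Return 'n' comma-separated '?' markers for an IN (...) clause."""
    return ", ".join("?" * n)


statements = []
conn.set_trace_callback(statements.append)  # record every executed statement

# Query 1: all ContentArtifacts of the batch at once.
content_ids = [cid for cid, _ in batch]
ca_by_content = dict(
    conn.execute(
        "SELECT content_id, id FROM content_artifact "
        f"WHERE content_id IN ({placeholders(len(content_ids))})",
        content_ids,
    ).fetchall()
)

# Query 2: all RemoteArtifacts for these ContentArtifacts and for every
# remote seen in the batch (broader than a per-unit lookup).
ca_ids = list(ca_by_content.values())
remote_ids = sorted({rid for _, rid in batch})
existing = set(
    conn.execute(
        "SELECT content_artifact_id, remote_id FROM remote_artifact "
        f"WHERE remote_id IN ({placeholders(len(remote_ids))}) "
        f"AND content_artifact_id IN ({placeholders(len(ca_ids))})",
        remote_ids + ca_ids,
    ).fetchall()
)
conn.set_trace_callback(None)

# The exact match per unit happens in memory, without further queries.
missing = [
    (ca_by_content[cid], rid)
    for cid, rid in batch
    if (ca_by_content[cid], rid) not in existing
]
print(len(statements), "queries,", len(existing), "rows fetched,",
      len(missing), "RemoteArtifacts to create")
```

The second query returns some rows the batch does not need (here, remote 1 entries for units that are synced from remote 3); they are simply ignored by the in-memory match.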

Performance measurement with the prefetching implementation:

Initial sync: ca. 100 seconds
Re-sync: ca. 23 seconds
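
For comparison, the slowdown factors relative to the sync times before the RemoteArtifactSaver stage can be computed from the figures in this issue. The numbers below are the approximate ("ca.") durations quoted above; the dictionary keys are just labels chosen for this sketch.

```python
# Approximate durations in seconds, as reported in this issue.
timings = {
    "before stage": {"initial": 90, "resync": 14},
    "separate stage": {"initial": 160, "resync": 140},
    "prefetching stage": {"initial": 100, "resync": 23},
}

baseline = timings["before stage"]


def slowdown(variant, kind):
    """Duration of 'variant' divided by the duration before the stage existed."""
    return timings[variant][kind] / baseline[kind]


for variant in ("separate stage", "prefetching stage"):
    print(
        f"{variant}: initial sync x{slowdown(variant, 'initial'):.2f}, "
        f"re-sync x{slowdown(variant, 'resync'):.2f}"
    )
```

With these figures, the separate stage is about 1.8 times slower on initial sync and 10 times slower on re-sync, while the prefetching variant is about 1.1 and 1.6 times slower, respectively.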

Associated revisions

Revision 07267715
Added by gmbnomis 8 months ago

Improve RemoteArtifactSaver stage

Using prefetching, the stage can be implemented using 2 DB queries per
batch. The WHERE clauses use pks only. For example:

```
[{'sql': 'SELECT "pulp_app_contentartifact"."_id", '
'"pulp_app_contentartifact"."_created", '
'"pulp_app_contentartifact"."_last_updated", '
'"pulp_app_contentartifact"."artifact_id", '
'"pulp_app_contentartifact"."content_id", '
'"pulp_app_contentartifact"."relative_path" FROM '
'"pulp_app_contentartifact" WHERE '
'"pulp_app_contentartifact"."content_id" IN (1, 2, 3, 4, 5, 6, 7, 8, '
'9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, '
'26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, '
'43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, '
'60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, '
'77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, '
'94, 95, 96, 97, 98, 99, 100)',
'time': '0.008'}, {'sql': 'SELECT "pulp_app_remoteartifact"."_id", '
'"pulp_app_remoteartifact"."_created", '
'"pulp_app_remoteartifact"."_last_updated", '
'"pulp_app_remoteartifact"."url", "pulp_app_remoteartifact"."size", '
'"pulp_app_remoteartifact"."md5", "pulp_app_remoteartifact"."sha1", '
'"pulp_app_remoteartifact"."sha224", '
'"pulp_app_remoteartifact"."sha256", '
'"pulp_app_remoteartifact"."sha384", '
'"pulp_app_remoteartifact"."sha512", '
'"pulp_app_remoteartifact"."content_artifact_id", '
'"pulp_app_remoteartifact"."remote_id" FROM "pulp_app_remoteartifact" '
'WHERE ("pulp_app_remoteartifact"."remote_id" IN (1, 3) AND '
'"pulp_app_remoteartifact"."content_artifact_id" IN (1, 2, 3, 4, 5, '
'6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, '
'24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, '
'41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, '
'58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, '
'75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, '
'92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, '
'107, 108, 109, 110, 111, 112, 113, 114, 115))',
'time': '0.001'}]
```

The queries are a little bit broader than before: The query for the
RemoteArtifacts includes RemoteArtifacts for all remotes seen in the current
batch (previously, it included the remotes seen per declarative_content).
For all practical purposes, this difference should be negligible (and there
is no difference for batches that use a single remote).

closes #4404
https://pulp.plan.io/issues/4404

History

#2 Updated by ttereshc 8 months ago

  • Status changed from NEW to POST

#3 Updated by CodeHeeler 8 months ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 48
  • Tags Pulp 3 RC Blocker added

#4 Updated by gmbnomis 8 months ago

  • Status changed from POST to MODIFIED

#5 Updated by daviddavis 6 months ago

  • Sprint/Milestone set to 3.0

#6 Updated by bmbouter 6 months ago

  • Tags deleted (Pulp 3, Pulp 3 RC Blocker)
