Project

Profile

Help

Issue #6121

closed

Syncs of very large repositories take longer than Pulp 2

Added by dalley over 2 years ago. Updated over 2 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Sprint 67
Quarter:

Description

A Pulp 2 immediate-mode sync of 150,000 files using this repository [0] takes 37.5 minutes

With Pulp 3, the same sync takes about 3x as long (about 2 hours).

An on_demand sync of the same repo in Pulp 3 takes about 90 minutes, still much longer than Pulp 2.

The first 50k files seem sync roughly as fast as they do for Pulp 2, but once the number grows higher to progress starts being made very very noticeably slower than at the beginning.

[0] http://quartet.usersys.redhat.com/pub/fake-repos/very_large_file_150k/


Files

sync-file-perf-cprofile.txt (195 KB) sync-file-perf-cprofile.txt The output is ordered by cumtime lmjachky, 02/21/2020 04:17 PM
sync-file-perf-cprofile.out (244 KB) sync-file-perf-cprofile.out gprof2dot -f pstats sync-file-perf-cprofile.out | dot -Tpng -o output.png lmjachky, 02/21/2020 04:17 PM
feb28-fix.out (240 KB) feb28-fix.out lmjachky, 02/28/2020 11:42 AM
feb28-fix.png (1.04 MB) feb28-fix.png lmjachky, 02/28/2020 11:42 AM
feb28-master.png (585 KB) feb28-master.png lmjachky, 02/28/2020 11:42 AM
feb28-master.out (239 KB) feb28-master.out lmjachky, 02/28/2020 11:42 AM

Related issues

Related to File Support - Issue #6104: pulp_file performance test takes too long for TravisCLOSED - DUPLICATEActions
Actions #1

Updated by dalley over 2 years ago

  • Subject changed from Immediate mode syncs of very large repositories fail to Syncs of very large repositories take longer than Pulp 2
  • Description updated (diff)
Actions #2

Updated by daviddavis over 2 years ago

  • Related to Issue #6104: pulp_file performance test takes too long for Travis added
Actions #3

Updated by daviddavis over 2 years ago

Once we fix this performance problem, we should increase the number of files in the file-perf fixture. We had to decrease the number to 20K due to timeouts in Travis[0].

[0] https://pulp.plan.io/issues/6104

Actions #4

Updated by fao89 over 2 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 66
Actions #5

Updated by fao89 over 2 years ago

I believe the solution for performance problems is to make StagesAPI customized by plugin users, on the past, batches had a maximum size of 100 and a minimum of 50, with this PR I increased the minimum to 500 and maximum to 1000: https://github.com/pulp/pulpcore/pull/440/files For a repo with 150000 files, we will use StagesAPI machinery about 150 times, so I think the batch size should be larger. I think in 2 options:

  • Make it adjustable by the user/plugin
  • Define batch size as a fraction of the total content, like 10%
Actions #6

Updated by bmbouter over 2 years ago

Is my understanding correct that policy='on_demand' or policy='streamed' performance of Pulp3 is faster than Pulp2? And this ticket is for policy='immediate' only?

Actions #7

Updated by bmbouter over 2 years ago

As a next step on this I believe we need to return to profiling a pre-built test. If we needd to load Pulp up with a lot of units first we can do that too, but we need to understand with cprofile where the time is going. Is the increase in units increasing the postgresql query runtime. If so which queries are taking the longest and can we optimize those?

In terms of batching, currently the batch size is configurable so I think it's mostly around setting the right default. The larger the batch the larger the memory usage of that stage so we need to be careful to not load it up too much. This is why I think we should set the value and not make it variable based on the unit counts in the entire repo.

Actions #8

Updated by lmjachky over 2 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to lmjachky
Actions #9

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 66 to Sprint 67
Actions #10

Updated by lmjachky over 2 years ago

I have attached the output of cprofile. I ran performance tests for the plugin pulp_file on my local machine and profiled the sync. There were loaded 20,000 files into Pulp where each of the files had a size of 50 bytes.

This is just a status update. I will continue working on the issue and I will try to increase a number of files to see how the performance changes over time with the additional load.

Added by Lubos Mjachky over 2 years ago

Revision fdaa0718

Fetch excluded content directly from the database

fixes #6121 https://pulp.plan.io/issues/6121

Actions #11

Updated by lmjachky over 2 years ago

I concur that syncs of very large repositories take a tremendous amount of time. I tried to sync a repository which contains 200k units. With the current version of pulpcore, the sync took more than 130 minutes even on a powerful machine.

-> Sync tasks => Waiting time (s): 6.965342 | Service time (s): 8330.099087

-> Resync tasks => Waiting time (s): 0.048282 | Service time (s): 2413.934992

-> Publication tasks => Waiting time (s): 0.047476 | Service time (s): 230.289603

-> Distribution tasks => Waiting time (s): 0.046366 | Service time (s): 0.103166

I observed that the most time is consumed by the method __iter__ of the class FlatValuesListIterable. The problematic part is shown below (https://github.com/pulp/pulpcore/blob/9614b16b795fa3aa0f6ad7ccbc1002b2119a2590/pulpcore/app/models/repository.py#L584-L586):

to_add = set(content.values_list('pk', flat=True))
for existing in batch_qs(self.content.order_by('pk').values_list('pk', flat=True)):
    to_add = to_add - set(existing.all())

I replaced these lines by one database call (https://github.com/pulp/pulpcore/pull/565) and the performance improved drastically.

-> Sync tasks => Waiting time (s): 0.052631 | Service time (s): 2304.212409

-> Resync tasks => Waiting time (s): 0.051633 | Service time (s): 2425.380134

-> Publication tasks => Waiting time (s): 0.047535 | Service time (s): 229.94097

-> Distribution tasks => Waiting time (s): 0.047981 | Service time (s): 0.103984
Actions #12

Updated by dalley over 2 years ago

Great work, that is a huge improvement! It's interesting that resyncs take a little longer than syncs, maybe there's still room to improve that (as part of a separate issue)

Actions #13

Updated by Anonymous over 2 years ago

  • Status changed from ASSIGNED to MODIFIED
Actions #14

Updated by ttereshc over 2 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
  • Sprint/Milestone set to 3.3.0

Also available in: Atom PDF