Project

Profile

Help

Issue #6121

Syncs of very large repositories take longer than Pulp 2

Added by dalley 6 months ago. Updated 4 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Sprint 67

Description

A Pulp 2 immediate-mode sync of 150,000 files using this repository [0] takes 37.5 minutes

With Pulp 3, the same sync takes about 3x as long (about 2 hours).

An on_demand sync of the same repo in Pulp 3 takes about 90 minutes, still much longer than Pulp 2.

The first 50k files seem sync roughly as fast as they do for Pulp 2, but once the number grows higher to progress starts being made very very noticeably slower than at the beginning.

[0] http://quartet.usersys.redhat.com/pub/fake-repos/very_large_file_150k/

sync-file-perf-cprofile.txt (195 KB) sync-file-perf-cprofile.txt The output is ordered by cumtime lmjachky, 02/21/2020 04:17 PM
sync-file-perf-cprofile.out (244 KB) sync-file-perf-cprofile.out gprof2dot -f pstats sync-file-perf-cprofile.out | dot -Tpng -o output.png lmjachky, 02/21/2020 04:17 PM
feb28-fix.out (240 KB) feb28-fix.out lmjachky, 02/28/2020 11:42 AM
feb28-fix.png (1.04 MB) feb28-fix.png lmjachky, 02/28/2020 11:42 AM
feb28-master.png (585 KB) feb28-master.png lmjachky, 02/28/2020 11:42 AM
feb28-master.out (239 KB) feb28-master.out lmjachky, 02/28/2020 11:42 AM
250
250

Related issues

Related to File Support - Issue #6104: pulp_file performance test takes too long for TravisNEW<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>

Associated revisions

Revision fdaa0718 View on GitHub
Added by Lubos Mjachky 6 months ago

Fetch excluded content directly from the database

fixes #6121 https://pulp.plan.io/issues/6121

History

#1 Updated by dalley 6 months ago

  • Subject changed from Immediate mode syncs of very large repositories fail to Syncs of very large repositories take longer than Pulp 2
  • Description updated (diff)

#2 Updated by daviddavis 6 months ago

  • Related to Issue #6104: pulp_file performance test takes too long for Travis added

#3 Updated by daviddavis 6 months ago

Once we fix this performance problem, we should increase the number of files in the file-perf fixture. We had to decrease the number to 20K due to timeouts in Travis[0].

[0] https://pulp.plan.io/issues/6104

#4 Updated by fao89 6 months ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 66

#5 Updated by fao89 6 months ago

I believe the solution for performance problems is to make StagesAPI customized by plugin users, on the past, batches had a maximum size of 100 and a minimum of 50, with this PR I increased the minimum to 500 and maximum to 1000: https://github.com/pulp/pulpcore/pull/440/files For a repo with 150000 files, we will use StagesAPI machinery about 150 times, so I think the batch size should be larger. I think in 2 options:

  • Make it adjustable by the user/plugin
  • Define batch size as a fraction of the total content, like 10%

#6 Updated by bmbouter 6 months ago

Is my understanding correct that policy='on_demand' or policy='streamed' performance of Pulp3 is faster than Pulp2? And this ticket is for policy='immediate' only?

#7 Updated by bmbouter 6 months ago

As a next step on this I believe we need to return to profiling a pre-built test. If we needd to load Pulp up with a lot of units first we can do that too, but we need to understand with cprofile where the time is going. Is the increase in units increasing the postgresql query runtime. If so which queries are taking the longest and can we optimize those?

In terms of batching, currently the batch size is configurable so I think it's mostly around setting the right default. The larger the batch the larger the memory usage of that stage so we need to be careful to not load it up too much. This is why I think we should set the value and not make it variable based on the unit counts in the entire repo.

#8 Updated by lmjachky 6 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to lmjachky

#9 Updated by rchan 6 months ago

  • Sprint changed from Sprint 66 to Sprint 67

#10 Updated by lmjachky 6 months ago

I have attached the output of cprofile. I ran performance tests for the plugin pulp_file on my local machine and profiled the sync. There were loaded 20,000 files into Pulp where each of the files had a size of 50 bytes.

This is just a status update. I will continue working on the issue and I will try to increase a number of files to see how the performance changes over time with the additional load.

#11 Updated by lmjachky 6 months ago

250
250

I concur that syncs of very large repositories take a tremendous amount of time. I tried to sync a repository which contains 200k units. With the current version of pulpcore, the sync took more than 130 minutes even on a powerful machine.

-> Sync tasks => Waiting time (s): 6.965342 | Service time (s): 8330.099087

-> Resync tasks => Waiting time (s): 0.048282 | Service time (s): 2413.934992

-> Publication tasks => Waiting time (s): 0.047476 | Service time (s): 230.289603

-> Distribution tasks => Waiting time (s): 0.046366 | Service time (s): 0.103166

I observed that the most time is consumed by the method __iter__ of the class FlatValuesListIterable. The problematic part is shown below (https://github.com/pulp/pulpcore/blob/9614b16b795fa3aa0f6ad7ccbc1002b2119a2590/pulpcore/app/models/repository.py#L584-L586):

to_add = set(content.values_list('pk', flat=True))
for existing in batch_qs(self.content.order_by('pk').values_list('pk', flat=True)):
    to_add = to_add - set(existing.all())

I replaced these lines by one database call (https://github.com/pulp/pulpcore/pull/565) and the performance improved drastically.

-> Sync tasks => Waiting time (s): 0.052631 | Service time (s): 2304.212409

-> Resync tasks => Waiting time (s): 0.051633 | Service time (s): 2425.380134

-> Publication tasks => Waiting time (s): 0.047535 | Service time (s): 229.94097

-> Distribution tasks => Waiting time (s): 0.047981 | Service time (s): 0.103984

#12 Updated by dalley 5 months ago

Great work, that is a huge improvement! It's interesting that resyncs take a little longer than syncs, maybe there's still room to improve that (as part of a separate issue)

#13 Updated by Anonymous 5 months ago

  • Status changed from ASSIGNED to MODIFIED

#14 Updated by ttereshc 4 months ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
  • Sprint/Milestone set to 3.3.0

Please register to edit this issue

Also available in: Atom PDF