Issue #4151

closed

Improve MetadataStep performance

Added by quba42 over 5 years ago. Updated over 3 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version - Debian:
Platform Release:
Target Release - Debian:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

There appears to be a memory leak relating to the way pulp_deb interacts with python-debpkgr during the MetadataStep.
In particular, the current implementation of the MetadataStep loads the metadata associated with a repository from the database into memory, and then processes it to obtain a list of all packages for each architecture, for each component, for each release in the repository. Notably, packages with architecture = "all" are appended to every other architecture's list (as well as to a list of their own), so these packages appear in multiple lists (and as a result are parsed multiple times).
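The fan-out described above can be sketched as follows. This is a hypothetical illustration of the duplication pattern, not the actual pulp_deb code; all names are made up for the example:

```python
from collections import defaultdict

def build_arch_lists(packages):
    """Group packages by architecture, fanning "all" packages into every list.

    Illustrative sketch only: mimics the duplication described in this issue,
    where an arch-independent package ends up in its own list *and* every
    concrete architecture's list, and is therefore parsed once per list.
    """
    arch_lists = defaultdict(list)
    concrete_archs = {p["architecture"] for p in packages
                      if p["architecture"] != "all"}
    for pkg in packages:
        if pkg["architecture"] == "all":
            arch_lists["all"].append(pkg)
            for arch in concrete_archs:
                arch_lists[arch].append(pkg)
        else:
            arch_lists[pkg["architecture"]].append(pkg)
    return arch_lists

pkgs = [
    {"name": "tool", "architecture": "amd64"},
    {"name": "docs", "architecture": "all"},
    {"name": "lib",  "architecture": "i386"},
]
lists = build_arch_lists(pkgs)
# "docs" lands in three lists (all, amd64, i386), so three packages
# yield five list entries in total.
total = sum(len(v) for v in lists.values())
```

With many architectures and thousands of arch-independent packages, this multiplies the work handed to each downstream debpkgr call.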

Each list is then processed by a call to python-debpkgr, which proceeds to access every package file in the list, mostly to regenerate the metadata we already loaded into memory from our database. For repositories containing thousands or tens of thousands of packages (often within a single debpkgr call), this is a severe resource drain (particularly on memory).

For large Debian repositories (like Ubuntu Xenial or Debian Stretch) this routinely leads to failures, because the kernel kills the celery worker for running out of memory. (This has been observed on systems with 32 GiB of RAM and more, and typically happens after several hours of waiting for a sync.)

In practice this makes pulp_deb unusable (or at least painfully slow) for large repositories on all but the most powerful systems.

See https://community.theforeman.org/t/pulp-deb-with-celery-cannot-allocate-memory-and-out-of-memory/11789 for an example.
