Improve MetadataStep performance
There appears to be a memory leak relating to the way pulp_deb interacts with python-debpkgr during the MetadataStep.
In particular, the current implementation of the MetadataStep loads the metadata associated with a repository from the db into memory, then processes it to build a list of all packages for each architecture, for each component, for each release in the repository. Notably, packages with architecture = "all" are appended to every other architecture's list (as well as to a list of their own), so these packages appear in multiple lists (and are parsed multiple times as a result).
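To make the duplication concrete, here is a minimal sketch of the grouping described above. It is not the pulp_deb code: the `units` records, their field names, and the `group_by_architecture` helper are hypothetical, and a single release and component are assumed for brevity.

```python
from collections import defaultdict

# Hypothetical in-memory records standing in for the metadata loaded from the db.
units = [
    {"name": "foo", "architecture": "amd64"},
    {"name": "bar", "architecture": "i386"},
    {"name": "docs", "architecture": "all"},
]


def group_by_architecture(units):
    """Build one package list per architecture (single release/component assumed)."""
    by_arch = defaultdict(list)
    for unit in units:
        by_arch[unit["architecture"]].append(unit["name"])
    # Packages with architecture "all" are appended to every other list,
    # in addition to keeping a list of their own.
    for arch, names in by_arch.items():
        if arch != "all":
            names.extend(by_arch.get("all", []))
    return dict(by_arch)


lists = group_by_architecture(units)
total_entries = sum(len(names) for names in lists.values())
print(lists)
print("packages:", len(units), "list entries:", total_entries)
```

In this toy case, three packages turn into five list entries. Every additional architecture adds another copy of each "all" package, and each copy is processed again.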
Each list is then processed by a call to python-debpkgr, which proceeds to access every package file in the list, mostly to regenerate the metadata we already loaded into memory from our db. For repositories containing thousands or tens of thousands of packages (often within a single debpkgr call) this is a severe resource drain (particularly memory).
For large Debian repositories (like Ubuntu Xenial or Debian Stretch) this routinely leads to failures because the kernel kills the celery worker. (This has been observed on systems with 32 GiB of RAM or more, and typically happens after several hours of waiting for a sync.)
In practice this makes pulp_deb unusable (or at least painfully slow) for large repositories on all but the most powerful systems.
Updated by quba42 over 3 years ago
The following pull request attempts to fix this problem by removing the python-debpkgr dependency from the MetadataStep.
Instead, some of the functionality is handled directly in pulp, while some of it uses deb822 from python-debian (which python-debpkgr also uses).
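The general idea can be sketched as follows: write each package index stanza from metadata that is already in memory, without opening the package file. This is not the code from the pull request. The `render_packages_stanza` helper, the field selection, and the sample values are hypothetical, and plain string formatting stands in for deb822 here to keep the example dependency-free.

```python
def render_packages_stanza(metadata):
    """Render one Packages-index stanza from a dict of already-loaded metadata."""
    # The field list and its order are illustrative assumptions only.
    fields = ["Package", "Version", "Architecture", "Filename", "Size"]
    lines = ["{}: {}".format(field, metadata[field])
             for field in fields if field in metadata]
    return "\n".join(lines) + "\n"


# Hypothetical metadata as it might be held in memory after loading from the db.
example = {
    "Package": "foo",
    "Version": "1.0-1",
    "Architecture": "amd64",
    "Filename": "pool/main/f/foo/foo_1.0-1_amd64.deb",
    "Size": 1234,
}

print(render_packages_stanza(example))
```

Because nothing here touches the .deb file itself, the cost per package is bounded by the size of its metadata rather than the size of the package.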