Issue #3172 (closed)
Celery worker consumes a large amount of memory when regenerating applicability for a consumer that binds to many repositories with many errata.
Description
The celery worker initially consumes about 60MB of RAM. After regenerating applicability for a consumer bound to 9 repositories, consumption grows to about 350MB+ and the RAM is never freed.
I think the following is the reason for the high memory consumption.
Pulp fetches the pkglist from every repository that a particular erratum is associated with. This is expensive, and the results may contain many duplicate pkglists.
For example, Pulp makes this query:
db.erratum_pkglists.find({"errata_id": "RHBA-2016:1886"}).count()
3
Instead of doing the following:
db.erratum_pkglists.find({"errata_id": "RHBA-2016:1886", "repo_id" : "my_org-Red_Hat_Enterprise_Linux_Server-Red_Hat_Satellite_Tools_6_2_for_RHEL_7_Server_RPMs_x86_64"}).count()
1
After amending the "erratum_pkglists" query to filter errata by repository, memory consumption and runtime are both reduced by about 80%.
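For reference, here is a minimal pymongo sketch of the two lookups above, since Pulp's workers talk to MongoDB from Python. The database name and the connection details are assumptions for illustration; the collection and field names come from the queries in this issue.

from pymongo import MongoClient

# "pulp_database" is Pulp 2's default MongoDB database name; adjust if
# your deployment differs.
db = MongoClient()["pulp_database"]

ERRATUM = "RHBA-2016:1886"
REPO = ("my_org-Red_Hat_Enterprise_Linux_Server-"
        "Red_Hat_Satellite_Tools_6_2_for_RHEL_7_Server_RPMs_x86_64")

# Unfiltered lookup: one pkglist document per repository the erratum is
# associated with (3 in the example above), mostly duplicates.
print(db.erratum_pkglists.count_documents({"errata_id": ERRATUM}))

# Repo-filtered lookup: only the pkglist document for the repository the
# consumer is actually bound to (1 in the example above).
print(db.erratum_pkglists.count_documents({"errata_id": ERRATUM,
                                           "repo_id": REPO}))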
I think I understand why Pulp doesn't filter the pkglist by repository when regenerating applicability: a single repository's entry may not contain the complete pkglist, since an erratum can be copied across repositories.
I made the following change to retrieve only the "nevra" fields of the erratum pkglist when regenerating applicability for a consumer. This patch reduces memory consumption by ~50% (350MB to 150MB) for a consumer with 9 repositories.
https://github.com/hao-yu/pulp_rpm/commit/9f5a52823afee80b31c1e3aa14f4f65fc85f9be9
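A hedged pymongo sketch of the projection idea behind that patch: ask MongoDB to return only the NEVRA fields of each package rather than full pkglist documents. The nested layout assumed here ("collections" holding "packages" with name/epoch/version/release/arch) is for illustration only; the real change is in the commit above.

from pymongo import MongoClient

db = MongoClient()["pulp_database"]  # as in the sketch above

# Project away everything except the NEVRA fields, so the cursor never
# materializes filenames, checksums, or other per-package metadata.
nevra_only = db.erratum_pkglists.find(
    {"errata_id": "RHBA-2016:1886"},
    projection={
        "_id": 0,
        "collections.packages.name": 1,
        "collections.packages.epoch": 1,
        "collections.packages.version": 1,
        "collections.packages.release": 1,
        "collections.packages.arch": 1,
    },
)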
Use aggregation to identify unique errata pkglists
This improves both the performance and the memory consumption of celery workers during applicability regeneration. The serializer for Errata now deals with unique pkglists as well.
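A minimal pymongo sketch of what "use aggregation to identify unique errata pkglists" could look like: group the pkglist documents by their content so that identical pkglists copied across repositories collapse server-side into a single result. The grouping key and field names are assumptions, not the exact pipeline from the merged change.

from pymongo import MongoClient

db = MongoClient()["pulp_database"]  # as in the sketches above

# Documents with identical package collections group to the same _id,
# so the worker never materializes the duplicate copies.
pipeline = [
    {"$match": {"errata_id": "RHBA-2016:1886"}},
    {"$group": {"_id": {"errata_id": "$errata_id",
                        "collections": "$collections"}}},
]
for unique_pkglist in db.erratum_pkglists.aggregate(pipeline):
    print(unique_pkglist["_id"]["errata_id"])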
closes #3172 https://pulp.plan.io/issues/3172