As a user, I can specify the desired maximum amount of memory usage
Ticket moved to GitHub: "pulp/pulpcore/2069":https://github.com/pulp/pulpcore/issues/2069
It would be nice if users could specify a desired maximum amount of RAM to be used during sync. For example, a user could say they only want at most 1500 MB of RAM to be used.
What is already in place
The stages pipeline restricts memory usage by only allowing 1000 declarative content objects between each stage (so for 8-9 stages, that's 8000-9000 declarative content objects in flight). This happens here.
Interestingly, the docstring says this defaults to 100, but it seems to actually be 1000!
Also, the stages perform batching, so they will only take in a limited number of items at a time (the batch size). That happens with minsize.
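As an illustration (this is not pulpcore code; the 1000-item queue bound is the number mentioned above, while the batch size of 500 and the rest are made up), these existing limits are purely count-based:

```python
import asyncio


async def demo():
    # Each inter-stage queue caps the *number* of buffered objects, not bytes:
    q = asyncio.Queue(maxsize=1000)  # the per-queue limit discussed above
    for i in range(3):
        await q.put({"unit": i})     # stand-in for a DeclarativeContent object
    # A stage then drains items in count-limited batches; 500 is illustrative.
    batch = []
    while not q.empty() and len(batch) < 500:
        batch.append(q.get_nowait())
    return batch


batch = asyncio.run(demo())
# Three tiny dicts count exactly the same toward these limits as three
# multi-hundred-MB content units would.
```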
Why this isn't enough
These are count-based mechanisms and don't correspond to actual MB or GB of memory used. Content units vary a lot in how much memory each DeclarativeContent object takes up.
Another, lesser problem is that this doesn't help plugin writers restrict their memory usage in FirstStage.
Add a new param called max_mb to the base Remote, which defaults to None. If specified, the user is specifying the desired maximum number of MB to be used by the syncing process.
Have the queues between the stages, and the batcher implementation, both check the total memory the current process is using and poll with asyncio.sleep() until it goes back down. This should keep the maximum amount used by all objects roughly to that number.
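A minimal sketch of that sleep-polling check, under the assumption that how the process memory is read is an implementation detail (here via the stdlib resource module so the sketch is self-contained; a real implementation would more likely use psutil):

```python
import asyncio
import resource  # Unix-only stdlib module


def current_mb():
    # NOTE: ru_maxrss is the *peak* RSS (kilobytes on Linux, bytes on macOS),
    # used here only as a stdlib stand-in. psutil.Process().memory_info().rss
    # would give the *current* value, which is what the pipeline actually
    # needs for the number to ever go back down.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


async def wait_for_memory(max_mb, poll_seconds=0.1):
    """Sleep-poll until process memory is below the configured ceiling."""
    while max_mb is not None and current_mb() >= max_mb:
        await asyncio.sleep(poll_seconds)


# With a huge ceiling this returns immediately; with a small one it would
# keep polling until memory is released elsewhere in the process.
asyncio.run(wait_for_memory(10 ** 6))
```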
Introduce a new MBSizeQueue, which is a wrapper around the asyncio.Queue used today. It will have the same put() call, but will only wait if the amount of memory in use is greater than what the remote is configured for.
Then introduce the same memory-checking feature in the batcher. I'm not completely sure this second part is needed, though.
We have to be very careful not to deadlock with this feature. For example, we have to account for the base case where even a single item is larger than the desired maximum. Repos in pulp_rpm have had a single unit use more than 1.2 GB if I remember right, so if someone were syncing with 800 MB and we weren't careful to still let that unit flow through the pipeline, we'd deadlock.
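Putting the pieces together, a hedged sketch of what MBSizeQueue might look like. The class name and max_mb come from this proposal; the injected get_used_mb callable (which keeps the sketch testable; a real version would read the current process RSS, e.g. via psutil) and the empty-queue escape hatch against the single-oversized-item deadlock are my assumptions about one way to implement it:

```python
import asyncio


class MBSizeQueue:
    """Sketch: asyncio.Queue wrapper whose put() also waits while the
    process is using more memory than the configured ceiling (max_mb)."""

    def __init__(self, maxsize, max_mb, get_used_mb, poll_seconds=0.1):
        self._queue = asyncio.Queue(maxsize=maxsize)
        self.max_mb = max_mb              # None means "no memory ceiling"
        self._get_used_mb = get_used_mb   # callable returning current MB used
        self._poll = poll_seconds

    async def put(self, item):
        # Escape hatch against deadlock: if the queue is empty, admit the
        # item even when over the ceiling, so a single unit that is by
        # itself larger than max_mb can still flow through the pipeline.
        while (
            self.max_mb is not None
            and self._get_used_mb() > self.max_mb
            and not self._queue.empty()
        ):
            await asyncio.sleep(self._poll)
        await self._queue.put(item)

    async def get(self):
        return await self._queue.get()


async def _demo():
    # Under the ceiling: behaves like a plain queue.
    q = MBSizeQueue(maxsize=4, max_mb=100, get_used_mb=lambda: 50)
    await q.put("unit")
    # Over the ceiling but empty: the escape hatch admits the item anyway.
    big = MBSizeQueue(maxsize=4, max_mb=100, get_used_mb=lambda: 999)
    await big.put("oversized")
    return await q.get(), await big.get()


results = asyncio.run(_demo())
```

Admitting an item whenever the queue is empty guarantees forward progress: the oversized unit is accepted alone, drained by the next stage, and its memory reclaimed before further items are admitted.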
Repos in pulp_rpm have had a single unit use more than 1.2 GB if I remember right, so if someone was syncing with 800 MB and we weren't careful to allow that unit to still flow through the pipeline we'd deadlock.
This is true but:
- the metadata is very messed up - 13 million duplicate "files" are listed for that package.
- the PostgreSQL maximum insert size is 1 GB, so a single content unit exceeding that is a hard limitation regardless of anything else we do. Luckily, I think that would be much less frequent than an entire batch exceeding the limit, which I don't think we've ever seen happen either (but it is still a theoretical issue).