
Story #9635


As a user, I can specify the desired maximum amount of memory usage

Added by bmbouter over 2 years ago. Updated about 2 years ago.

Status:
CLOSED - DUPLICATE
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

Ticket moved to GitHub: pulp/pulpcore#2069 (https://github.com/pulp/pulpcore/issues/2069)


Motivation

It would be nice if users could specify a desired maximum amount of RAM to be used during sync. For example, a user can say I only want 1500 MB of RAM to be used max.

What is already in place

The stages pipeline restricts memory usage by only allowing 1000 declarative content objects between each stage (so for 8-9 stages, that's 8000-9000 declarative content objects in flight). This happens here.

Interestingly, the docstring says this defaults to 100, but it actually seems to be 1000!

Also, the stages perform batching, so they will only take in a limited number of items at a time (the batch size). That happens with minsize.
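
For illustration only (the real pulpcore stage classes and names differ), here is a minimal sketch of how a bounded asyncio.Queue plus batching gives this count-based backpressure between two stages; the queue size and batch size are made-up example values:

```python
import asyncio

# Illustration of count-based backpressure; names and sizes are examples,
# not the actual pulpcore implementation.
QUEUE_MAXSIZE = 1000  # DeclarativeContent objects allowed between two stages
BATCH_SIZE = 500      # items a stage collects before acting (the "minsize" idea)

async def producer(out_q: asyncio.Queue):
    for i in range(5000):
        # put() blocks once QUEUE_MAXSIZE items are waiting, regardless of
        # how many MB each individual item actually occupies in memory.
        await out_q.put({"unit": i})
    await out_q.put(None)  # sentinel: no more items

async def consumer(in_q: asyncio.Queue):
    batch = []
    while True:
        item = await in_q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            batch.clear()  # stand-in for processing a full batch
    # a real stage would also process the trailing partial batch here

async def main():
    q = asyncio.Queue(maxsize=QUEUE_MAXSIZE)
    await asyncio.gather(producer(q), consumer(q))

asyncio.run(main())
```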

Why this isn't enough

These are count-based mechanisms and don't correspond to an actual number of MB or GB of memory used. Content units vary a lot in how much memory each DeclarativeContent object takes up.

Another, lesser problem is that this doesn't help plugin writers restrict their memory usage in FirstStage.

Idea

Add a new param called max_mb to the base Remote, defaulting to None. If specified, it is the desired maximum number of MB to be used by the process performing the sync.

Have the queues between the stages, and the batcher implementation, both check the total memory the current process is using and poll with asyncio.sleep() until it goes back down. This should keep the maximum amount used by all objects roughly to that number.
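
A minimal sketch of such a check, assuming something like psutil is used to read the process's resident memory (the function name and polling interval are made up for the example):

```python
import asyncio

import psutil  # assumption: psutil (or an equivalent RSS lookup) is available

async def wait_for_memory_below(max_mb: int, poll_seconds: float = 0.5) -> None:
    """Poll the current process's resident memory and sleep until it drops below max_mb."""
    process = psutil.Process()
    while process.memory_info().rss / (1024 * 1024) >= max_mb:
        await asyncio.sleep(poll_seconds)
```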

Details

Introduce a new MBSizeQueue, which is a wrapper around the asyncio.Queue used today. It will have the same put() interface, but put() will also wait if the amount of memory in use is greater than what the remote is configured for.

Then introduce the same memory-checking feature in the batcher. I'm not completely sure this second part is needed, though.

We have to be very careful not to deadlock with this feature. For example, we have to account for the base case where even a single item is larger than the desired memory limit. Repos in pulp_rpm have had a single unit use more than 1.2 GB if I remember right, so if someone were syncing with an 800 MB limit and we weren't careful to allow that unit to still flow through the pipeline, we'd deadlock.
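
A rough sketch of what such an MBSizeQueue could look like, again assuming a psutil-style RSS check; the empty-queue escape hatch is one hypothetical way to let a single oversized unit keep flowing and avoid the deadlock described above:

```python
import asyncio

import psutil  # assumption: psutil (or an equivalent RSS lookup) is available

class MBSizeQueue(asyncio.Queue):
    """Sketch: same put() interface as asyncio.Queue, but put() also waits
    while the whole process is over the configured memory budget."""

    def __init__(self, max_mb=None, maxsize=0, poll_seconds=0.5):
        super().__init__(maxsize=maxsize)
        self._max_mb = max_mb          # from Remote.max_mb; None disables the check
        self._poll_seconds = poll_seconds
        self._process = psutil.Process()

    def _over_budget(self) -> bool:
        if self._max_mb is None:
            return False
        return self._process.memory_info().rss / (1024 * 1024) >= self._max_mb

    async def put(self, item):
        # Deadlock guard: if the queue is already empty, let the item through
        # even when over budget, so a single unit larger than max_mb can still
        # make progress through the pipeline.
        while self._over_budget() and not self.empty():
            await asyncio.sleep(self._poll_seconds)
        await super().put(item)
```

Whether the same check also belongs in the batcher could be decided separately; the queue wrapper alone already bounds what sits between stages.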

Actions #1

Updated by bmbouter over 2 years ago

  • Description updated (diff)
Actions #2

Updated by dalley over 2 years ago

Repos in pulp_rpm have had a single unit use more than 1.2G if I remember right, so if someone was syncing with 800 MB and we weren't careful to allow that unit to still flow through the pipeline we'd deadlock.....

This is true but:

  1. the metadata is very messed up - 13 million duplicate "files" are listed for that package.
  2. the PostgreSQL maximum insert size is 1 GB, so a single content unit exceeding that is a hard limitation regardless of anything else we do. Luckily, I think that would be much, much less frequent than an entire batch exceeding that limit, which I don't think we've ever seen happen either (but it is still a theoretical issue).
Actions #3

Updated by fao89 about 2 years ago

  • Description updated (diff)
  • Status changed from NEW to CLOSED - DUPLICATE
