Refactor #2316: Batch save() calls to ProgressReport - Pulp

Actions

Send by e-mail Copy link

Refactor #2316

closed

Batch save() calls to ProgressReport

Added by bmbouter over 8 years ago. Updated about 5 years ago.

Status:

CLOSED - CURRENTRELEASE

Priority:

Normal

Assignee:

bizhang

Category:

Sprint/Milestone:

3.0.0

Start date:

Due date:

% Done:

100%

Estimated time:

Platform Release:

Groomed:

Yes

Sprint Candidate:

Yes

Tags:

Sprint:

Sprint 18

Quarter:

Description

This is a performance improvement story. As a Plugin developer when using an ancestor of pulp.app.models.ProgressReport every call to save() causes a round-trip write to the database.

For example as the documentation says:

>>> with ProgressSpinner('Publishing Metadata'):
>>>     publish_metadata()

After discussion on the ticket the following functionality is needed:

Have a call to increment() or save() skipped if it has been saved within the last 0.5 seconds. This implementation needs to also be for increment() in addition to save() because increment() will likely not use save() anymore after Issue #2318 is implemented. I believe update() does not call save().
Only perform this behavior when we know the context manager is being used. This should be set by enter() as an instance attribute on the in-memory ProgressReport object.
Update the class docstrings to identify this behavior for plugin writers

Actions

Copy link

Updated by mhrivnak over 8 years ago

Sprint Candidate changed from No to Yes
Tags Pulp 3 added

The biggest problem will be with incrementing progress. We have lots of tasks already that do at least dozens, if not hundreds or more operations per second, and we definitely do not want to save the progress object that frequently. That would certainly impact performance, so I think we should prioritize this as part of the pulp 3 MVP. I see it less as an optimization and more of an effort to avoid creating a problem. Especially since it only takes a few lines of simple code , let's just do it.

As an example, even a yum repo sync can download 50+ rpms per second on a decent connection. A publish can go through several hundred rpms per second.

Here is a simple implementation from pulp 2:

https://github.com/pulp/nectar/blob/4fe7327c4ad1bb2f8c040b75d80306ee047de615/nectar/downloaders/threaded.py#L301-L303

Similar logic exists in the pulp 2 step system.

If we just put similar logic to that in the increment method, and ensure exit does one last update, that should take care of it.

Actions

Copy link

Updated by bmbouter over 8 years ago

Sprint Candidate changed from Yes to No
Tags deleted (~~Pulp 3~~)

A correct design is not that simple. We aren't guaranteed that users will use the increment() or context manager. If we skipping calls to save() and the plugin writer is using the model directly how can we be guaranteed that we aren't missing important/necessary calls to save().

I went to #django to ask about this problem because surely others have had this issue. The response 2 users gave to me was to not worry about it. They also cautioned me about ignoring save() calls. We don't have a design that provides correctness and we don't know that we need it yet. That is why I'm proposing we do not do this work for the Pulp3 MVP.

Actions

Copy link

Updated by mhrivnak over 8 years ago

I'll take a closer look and make a more specific proposal. I don't think we should allow an individual task to do hundreds of DB writes per second when one or two would suffice. It's hard to imagine that not impacting performance, especially with multiple tasks running concurrently.

Actions

Copy link

Updated by bmbouter over 8 years ago

Sounds good. Skipping the writes is easy enough, I think the crux of the issue is ensuring that a save() will occur later if some of the most recent saves (or just the very last one) was skipped. I don't know how to do that without a separate thread, or epoll callback loop, etc.

Maybe where we are going wrong is that we are trying to hide this from the plugin writer. Note they could just save less often and they can ensure correctness.

What about this as an idea, we could make it do this fancy behavior only if it is being used inside the context manager.

Actions

Copy link

Updated by mhrivnak over 8 years ago

To be clear, I do not want to hide this behavior at all. I do want to choose default behaviors that make the most common use cases easy, but we should be very clear about this behavior and possibly allow the plugin writer to influence it.

I think we can wrap this save call in a little bit of logic that only saves at most once per some interval of time:

https://github.com/pulp/pulp/blob/f44599980274ec4cf69594f84882ca90273041cf/app/pulp/app/models/progress.py#L208

As you point out, we could only do the batching in the case where the object is being used as a context manager. That's a reasonable approach that I would be happy with.

To allow more control, we could optionally have a boolean property of the object, "batch_increments" or something like that, which defaults to False. A user could set it to True if they want, and it would also be set to True in "__enter__". This is probably overkill for a first cut, but it's an option for the future in case people want to do incrementing without using the object as a context manager.

I also think it would be fine to just do the batching in the increment() method, make sure that if "done == total" a save always happens, and clearly document that the user is responsible for doing a final save on the object when they are done incrementing it.

Any of these approaches is fine with me, but I do think we need to solve this for pulp 3. Can we add the pulp 3 and sprint candidate options back to this refactor?

Actions

Copy link

Updated by bmbouter over 8 years ago

Description updated (diff)
Sprint Candidate changed from No to Yes
Tags Pulp 3 added

I think having the "skip if saved recently" behavior when running within the context manager is a good compromise. I added the sprint candidate and Pulp 3 flags again. I updated the story too.

Actions

Copy link

Updated by mhrivnak over 8 years ago

Groomed changed from No to Yes

Actions

Copy link

Updated by mhrivnak about 8 years ago

Sprint/Milestone set to 29

Actions

Copy link

Updated by bizhang about 8 years ago

Status changed from NEW to ASSIGNED
Assignee set to bizhang

Actions

Copy link

#10

Updated by mhrivnak about 8 years ago

Sprint/Milestone changed from 29 to 30

Actions

Copy link

#11

Updated by mhrivnak almost 8 years ago

Sprint/Milestone changed from 30 to 36

Actions

Copy link

#12

Updated by bizhang almost 8 years ago

Status changed from ASSIGNED to POST

Actions

Copy link

#13

Updated by mhrivnak almost 8 years ago

Sprint/Milestone changed from 36 to 37

Actions

Copy link

#14

Updated by bizhang almost 8 years ago

PR: https://github.com/pulp/pulp/pull/2986

I removed `increment() skip-save logic` from the checklist, since increment() still calls save() because https://pulp.plan.io/issues/2318 was resolved via documentation.