Project

Profile

Help

Refactor #2316

Batch save() calls to ProgressReport

Added by bmbouter about 3 years ago. Updated 6 months ago.

Status:
MODIFIED
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

100%

Platform Release:
Blocks Release:
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Complexity:
Verified:
No
Verification Required:
No
Sprint:
Sprint 18

Description

This is a performance improvement story. As a Plugin developer when using an ancestor of pulp.app.models.ProgressReport every call to save() causes a round-trip write to the database.

For example as the documentation says:

>>> with ProgressSpinner('Publishing Metadata'):
>>>     publish_metadata()

After discussion on the ticket the following functionality is needed:

  • Have a call to increment() or save() skipped if it has been saved within the last 0.5 seconds. This implementation needs to also be for increment() in addition to save() because increment() will likely not use save() anymore after Issue #2318 is implemented. I believe update() does not call save().
  • Only perform this behavior when we know the context manager is being used. This should be set by enter() as an instance attribute on the in-memory ProgressReport object.
  • Update the class docstrings to identify this behavior for plugin writers

Checklist

Associated revisions

Revision 54b0a6a1 View on GitHub
Added by werwty over 2 years ago

Batch save() calls to ProgressReport

closes #2316
https://pulp.plan.io/issues/2316

Revision 54b0a6a1 View on GitHub
Added by werwty over 2 years ago

Batch save() calls to ProgressReport

closes #2316
https://pulp.plan.io/issues/2316

Revision 54b0a6a1 View on GitHub
Added by werwty over 2 years ago

Batch save() calls to ProgressReport

closes #2316
https://pulp.plan.io/issues/2316

History

#1 Updated by mhrivnak about 3 years ago

  • Sprint Candidate changed from No to Yes
  • Tags Pulp 3 added

The biggest problem will be with incrementing progress. We have lots of tasks already that do at least dozens, if not hundreds or more operations per second, and we definitely do not want to save the progress object that frequently. That would certainly impact performance, so I think we should prioritize this as part of the pulp 3 MVP. I see it less as an optimization and more of an effort to avoid creating a problem. Especially since it only takes a few lines of simple code , let's just do it.

As an example, even a yum repo sync can download 50+ rpms per second on a decent connection. A publish can go through several hundred rpms per second.

Here is a simple implementation from pulp 2:

https://github.com/pulp/nectar/blob/4fe7327c4ad1bb2f8c040b75d80306ee047de615/nectar/downloaders/threaded.py#L301-L303

Similar logic exists in the pulp 2 step system.

If we just put similar logic to that in the increment method, and ensure exit does one last update, that should take care of it.

#2 Updated by bmbouter about 3 years ago

  • Sprint Candidate changed from Yes to No
  • Tags deleted (Pulp 3)

A correct design is not that simple. We aren't guaranteed that users will use the increment() or context manager. If we skipping calls to save() and the plugin writer is using the model directly how can we be guaranteed that we aren't missing important/necessary calls to save().

I went to #django to ask about this problem because surely others have had this issue. The response 2 users gave to me was to not worry about it. They also cautioned me about ignoring save() calls. We don't have a design that provides correctness and we don't know that we need it yet. That is why I'm proposing we do not do this work for the Pulp3 MVP.

#3 Updated by mhrivnak about 3 years ago

I'll take a closer look and make a more specific proposal. I don't think we should allow an individual task to do hundreds of DB writes per second when one or two would suffice. It's hard to imagine that not impacting performance, especially with multiple tasks running concurrently.

#4 Updated by bmbouter about 3 years ago

Sounds good. Skipping the writes is easy enough, I think the crux of the issue is ensuring that a save() will occur later if some of the most recent saves (or just the very last one) was skipped. I don't know how to do that without a separate thread, or epoll callback loop, etc.

Maybe where we are going wrong is that we are trying to hide this from the plugin writer. Note they could just save less often and they can ensure correctness.

What about this as an idea, we could make it do this fancy behavior only if it is being used inside the context manager.

#5 Updated by mhrivnak about 3 years ago

To be clear, I do not want to hide this behavior at all. I do want to choose default behaviors that make the most common use cases easy, but we should be very clear about this behavior and possibly allow the plugin writer to influence it.

I think we can wrap this save call in a little bit of logic that only saves at most once per some interval of time:

https://github.com/pulp/pulp/blob/f44599980274ec4cf69594f84882ca90273041cf/app/pulp/app/models/progress.py#L208

As you point out, we could only do the batching in the case where the object is being used as a context manager. That's a reasonable approach that I would be happy with.

To allow more control, we could optionally have a boolean property of the object, "batch_increments" or something like that, which defaults to False. A user could set it to True if they want, and it would also be set to True in "__enter__". This is probably overkill for a first cut, but it's an option for the future in case people want to do incrementing without using the object as a context manager.

I also think it would be fine to just do the batching in the increment() method, make sure that if "done == total" a save always happens, and clearly document that the user is responsible for doing a final save on the object when they are done incrementing it.

Any of these approaches is fine with me, but I do think we need to solve this for pulp 3. Can we add the pulp 3 and sprint candidate options back to this refactor?

#6 Updated by bmbouter about 3 years ago

  • Checklist item increment() skip-save logic added
  • Checklist item save() skip-save logic added
  • Checklist item docstring updates added
  • Description updated (diff)
  • Sprint Candidate changed from No to Yes
  • Tags Pulp 3 added

I think having the "skip if saved recently" behavior when running within the context manager is a good compromise. I added the sprint candidate and Pulp 3 flags again. I updated the story too.

#7 Updated by mhrivnak almost 3 years ago

  • Groomed changed from No to Yes

#8 Updated by mhrivnak almost 3 years ago

  • Sprint/Milestone set to 29

#9 Updated by bizhang almost 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bizhang

#10 Updated by mhrivnak almost 3 years ago

  • Sprint/Milestone changed from 29 to 30

#11 Updated by mhrivnak over 2 years ago

  • Sprint/Milestone changed from 30 to 36

#12 Updated by bizhang over 2 years ago

  • Status changed from ASSIGNED to POST

#13 Updated by mhrivnak over 2 years ago

  • Sprint/Milestone changed from 36 to 37

#14 Updated by bizhang over 2 years ago

PR: https://github.com/pulp/pulp/pull/2986

I removed `increment() skip-save logic` from the checklist, since increment() still calls save() because https://pulp.plan.io/issues/2318 was resolved via documentation.

#15 Updated by werwty over 2 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#16 Updated by bmbouter over 1 year ago

  • Sprint set to Sprint 18

#17 Updated by bmbouter over 1 year ago

  • Sprint/Milestone deleted (37)

#18 Updated by daviddavis 6 months ago

  • Sprint/Milestone set to 3.0

#19 Updated by bmbouter 6 months ago

  • Tags deleted (Pulp 3)

Please register to edit this issue

Also available in: Atom PDF