Story #5367

closed

As a user, I can have a sync option that retains the latest N packages of the same name

Added by paji@redhat.com over 4 years ago. Updated almost 4 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

Goal

Provide a way to specify that, for example, only the latest N versions of every package with the same name are retained in the repository.

Sync time option retain-old-count

RPM sync will take a sync-time option named retain-old-count with an integer value. This will cause only the latest N versions of each package to be retained. Two packages with the same name are considered the same package.

When used, the value needs to be validated as an integer greater than 0, and a validation error raised otherwise.
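
A minimal sketch of that validation, assuming the option is exposed as a field on the RPM sync serializer (the serializer and field names here are illustrative, not a confirmed API):

    from gettext import gettext as _

    from rest_framework import serializers


    class RpmRepositorySyncURLSerializer(serializers.Serializer):
        # Hypothetical field; DRF's min_value already rejects 0 and negative
        # values with a validation error, satisfying the "greater than 0" rule.
        retain_old_count = serializers.IntegerField(
            required=False,
            min_value=1,
            help_text=_("Keep only the latest N versions of each package name."),
        )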

Determining latest

Latest is determined by comparing the <epoch, version, release> triplet.
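
Purely to make that ordering concrete, the rpm Python bindings can compare two such triplets (the real implementation may well do the comparison in SQL instead):

    import rpm

    # Each triplet is (epoch, version, release) as strings.
    older = ("0", "1.2", "3.el8")
    newer = ("0", "1.10", "1.el8")

    # labelCompare() returns -1, 0, or 1 like a classic cmp(); in RPM terms
    # version 1.10 sorts after 1.2, so "older" really is older here.
    assert rpm.labelCompare(older, newer) == -1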

Incompatibility with mirror=True

The mirror=True option requires the retained packages to be an exact mirror of the remote, so mirror=True cannot be used when the user is also specifying retain-old-count.

A validation error needs to be raised if mirror=True and retain-old-count are specified together.
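
Continuing the illustrative serializer sketched above, the cross-option check could live in its validate() method (again an assumption about where these options end up, not a confirmed API):

    class RpmRepositorySyncURLSerializer(serializers.Serializer):
        # ... retain_old_count and the other sync options from the sketch above ...

        def validate(self, data):
            data = super().validate(data)
            # mirror=True must reproduce the remote exactly, so it cannot be
            # combined with a retention policy that drops older packages.
            if data.get("mirror") and "retain_old_count" in data:
                raise serializers.ValidationError(
                    _("retain-old-count cannot be used together with mirror=True.")
                )
            return data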

Pulp2 Equivalent

In Pulp2 it was called retain-old-count.

Implementation

The RpmDeclarativeVersion should implement a create method that mimics the one from core. The only difference in the RPM one is that instead of adding ContentUnassociation as the last stage, a new custom RPM stage should be implemented. Let's call that stage ContentUnassociationRetainN (as a working name).
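
A rough outline of that override, paraphrasing rather than copying core's DeclarativeVersion.create() (the exact stage wiring is an assumption and should be taken from core at implementation time):

    from pulpcore.plugin.stages import DeclarativeVersion


    class RpmDeclarativeVersion(DeclarativeVersion):
        """Same flow as core's create(); only the tail of the stages list differs."""

        def create(self):
            # Copy the body of core's DeclarativeVersion.create() and change only
            # the end of the pipeline, roughly:
            #
            #   stages = self.pipeline_stages(new_version)
            #   ... core's association stage(s) ...
            #   # core appends ContentUnassociation here; the RPM variant appends
            #   # ContentUnassociationRetainN(new_version, retain_old_count) instead.
            #
            # Everything else (pipeline creation, running the event loop) stays as
            # it is in core.
            raise NotImplementedError("sketch only")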

The ContentUnassociationRetainN stage works in place of the ContentUnassociation stage in core and runs with similar assumptions. Specifically, it receives queryset objects, not DeclarativeContent objects like earlier stages in the pipeline. You can see those querysets emitted from the stage before it.

The ContentUnassociationRetainN stage needs to further filter these unassociation querysets, removing any units that would otherwise be unassociated but should be kept because of their NEVRA. The content associated with the repository is already in place, so between the queryset of items marked for removal and the "known good" content outside of those querysets, one should be able to compute the previous N versions.

I think we can start with something inefficient but correct and improve it over time through profiling and PostgreSQL's EXPLAIN.
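
To make the retention computation concrete, here is a first-cut sketch written as a plain helper (so it does not commit to the exact Stage interface). It assumes the EVR ordering from the blocking story #5402 is available as an orderable evr column on Package, and the repo_content_pks argument is hypothetical:

    from django.db.models import F, Window
    from django.db.models.functions import RowNumber

    from pulp_rpm.app.models import Package


    def pks_beyond_retain_limit(repo_content_pks, retain_old_count):
        """Return PKs of packages that fall outside the newest N versions per name."""
        ranked = Package.objects.filter(pk__in=repo_content_pks).annotate(
            # Rank rows within each package name, newest EVR first (rank 1 = newest).
            rank=Window(
                expression=RowNumber(),
                partition_by=[F("name")],
                order_by=F("evr").desc(),
            )
        )
        # Django cannot filter directly on a window annotation, so the final
        # comparison happens in Python; inefficient but correct, as noted above.
        return [pkg.pk for pkg in ranked if pkg.rank > retain_old_count]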

Katello Related Issue

https://projects.theforeman.org/issues/16154


Related issues

Blocked by RPM Support - Story #5402: As a developer, Package content allows ORDER_BY in postgres based on EVR comparisons (CLOSED - CURRENTRELEASE, assignee: dalley)

Actions #1

Updated by ipanova@redhat.com over 4 years ago

  • Description updated (diff)
Actions #2

Updated by bmbouter over 4 years ago

  • Subject changed from Pulp 3 Limit rpm packages to sync. to As a user, I can have a sync option that retains the latest N packages of the same name
  • Description updated (diff)

A rewrite to bring a first-cut design we can use to iterate on before implementing.

Actions #3

Updated by bmbouter over 4 years ago

  • Tracker changed from Issue to Story
  • % Done set to 0

converting to story.

Actions #4

Updated by ipanova@redhat.com over 4 years ago

Thank you for writing this down. The general workflow is clear and straightforward; however, the caveat brought up during the meeting is exactly the calculation of the previous N versions. Any ideas on that?

Actions #5

Updated by bmbouter over 4 years ago

ipanova@redhat.com wrote:

Thank you for writing this down. The general workflow is clear and straightforward; however, the caveat brought up during the meeting is exactly the calculation of the previous N versions. Any ideas on that?

This was not the caveat question that I had heard at the meeting. It is a good one though. Here are some ideas. What do you think about these?

  1. slow option

Form a queryset that counts the number of packages per name, e.g. 'foo' v1, v1.2, v1.3 would yield a count of 3. Filter out any package names whose count is <= retain-old-count. The remaining names are the only ones that need additional consideration.

From there you could do the filtering in Python to determine which of the 'foo' packages need removal (see the sketch at the end of this comment). Doing it in Python is probably too slow though.

  2. database speedups

To speed it up, create a PostgreSQL trigger that pre-computes a version rank as an internal integer for each Package row as it is inserted. Then use a Django filter to rank the results in version-sorted order on this field. This would negate the need for Python-based filtering. Keep the first N (newest) and unassociate the remaining.

We have some triggers (different kinds but similar in concept) in pulp_ansible here for example: https://github.com/pulp/pulp_ansible/blob/master/pulp_ansible/app/migrations/0004_add_fulltext_search_indexes.py
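
To make idea 1 concrete, a minimal sketch of the counting step plus the Python-side trimming (the helper name and the repo_content_pks argument are hypothetical, and rpm.labelCompare is used purely to illustrate the EVR ordering):

    import functools

    import rpm
    from django.db.models import Count

    from pulp_rpm.app.models import Package


    def excess_package_pks(repo_content_pks, retain_old_count):
        """Slow but correct: PKs of packages beyond the newest N versions per name."""
        crowded_names = (
            Package.objects.filter(pk__in=repo_content_pks)
            .values("name")
            .annotate(versions=Count("pk"))
            .filter(versions__gt=retain_old_count)  # names within the limit need no work
            .values_list("name", flat=True)
        )

        by_name = {}
        for pkg in Package.objects.filter(pk__in=repo_content_pks, name__in=crowded_names):
            by_name.setdefault(pkg.name, []).append(pkg)

        def evr_cmp(a, b):
            # Compare (epoch, version, release) with RPM's own comparison rules.
            return rpm.labelCompare(
                (str(a.epoch or 0), a.version, a.release),
                (str(b.epoch or 0), b.version, b.release),
            )

        to_remove = []
        for packages in by_name.values():
            packages.sort(key=functools.cmp_to_key(evr_cmp), reverse=True)  # newest first
            to_remove.extend(p.pk for p in packages[retain_old_count:])
        return to_remove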

Actions #6

Updated by ttereshc about 4 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to Priority items (outside of planned milestones/releases)
Actions #7

Updated by ttereshc about 4 years ago

  • Blocked by Story #5950: Add database support for comparing EVRs added
Actions #8

Updated by ttereshc about 4 years ago

  • Blocked by deleted (Story #5950: Add database support for comparing EVRs)
Actions #9

Updated by ttereshc about 4 years ago

  • Blocked by Story #5402: As a developer, Package content allows ORDER_BY in postgres based on EVR comparisons added
Actions #10

Updated by ttereshc almost 4 years ago

  • Sprint/Milestone changed from Priority items (outside of planned milestones/releases) to Pulp 3.x RPM (Katello 4.1)
Actions #11

Updated by ggainey almost 4 years ago

Consider the implications for the things that depend on NEVRAs (e.g. errata, modulemd, etc). If the repo has only the newest RPMs, what happens to the 'higher level' objects that know about the older ones only?

Actions #12

Updated by ggainey almost 4 years ago

What happens to RepositoryVersions/Publications/Distributions that contain not-to-be-retained RPMs?

Thought experiment: is it not the case that the 'real' use case here is not 'last N RPMs', but rather 'last N RepositoryVersions'?

Actions #13

Updated by ttereshc almost 4 years ago

  • Priority changed from High to Normal
Actions #14

Updated by dalley almost 4 years ago

ggainey, I think nothing would "happen" to them, they would be dropped from the new repository version being created but not older repository versions or other separate repositories with the same packages. The only way to delete content is to run orphan cleanup, and content that is still present in a repository version somewhere isn't orphaned.

Re: your point on use cases, some clarification from Katello / product on what the ultimate UX goal of this feature is would be useful, just so that we keep the user experience at the top of our minds and not focus too hard on a particular idea or set of ideas of how to achieve it.

Anyway, my understanding (and correct me if I'm wrong) is that the overall intended "use case" here is to provide a middle ground between "keep everything" (additive sync and copy always adding new content, never removing it) and "keep nothing" (newly synced content always replacing old content, mirror mode). The two are kind of symbiotic for that purpose.

Goal of retain-latest-N-packages (I imagine):

  • Keep the amount of content present in a given repository (version) low
    • The more content is present in a repository version, the longer it takes to publish and copy
  • Give client systems the ability to downgrade individual packages from the repo (compared to mirror syncing)
  • Help orphan old content faster so that it can be cleaned up
    • Content can take up a lot of space
    • Deleting old repository versions doesn't help if the latest version is just a superset of everything from the entire life of the repo - it'll never be orphaned

Goal of retain-latest-N-repo-versions (I imagine):

  • Might speed up certain queries by keeping the # of repository versions low
  • Help orphan old content faster so that it can be cleaned up
    • Content can take up a lot of space
    • Deleting old content within a version doesn't help if every previous repository version is being kept - it'll never be orphaned
Actions #15

Updated by dalley almost 4 years ago

Here's some thoughts I have on this:

I'm not sure why we would make it a sync-time option as in the description. A lot of users will only use sync on particular repos, and then manage their other repos by copying or uploading content manually. This feature wouldn't really benefit those users much.

If we implemented this as a repository setting + validation step, like we do with advisory merge and duplicate detection, then it would apply to the repo universally, which seems a lot more consistent to me. I guess I'm making the assumption that applying it universally is a good thing, maybe some customers could want manual upload to be an exceptional case and not delete old content, idk.

I can only see one time when this strategy would present issues, which is when an upstream repo (let's say a RHEL repo, I forget if they actually do this or not) never deletes old content, so every version of every package is present in the repo, and a user wants to sync the repo (and especially immediate sync). ALL the content will still go through the entire pipeline and be saved (and maybe downloaded) just to be immediately thrown out of the repository version and orphaned after the retention check.

But if we consider that a problem, then we can't solve that problem by using our database because, obviously, that can only help us with saved content and once it's saved it's too late. The only place we can realistically solve that problem is in the metadata processing layer.

Actions #16

Updated by ttereshc almost 4 years ago

My understanding is that the request is to have retain-latest-N-packages in a repo.

+1 to repo setting + validation step. Special cases are not special enough ;) copy and upload ought to be considered as well.

One question: what to do with modular RPMs? They might be old and should be removed according to the suggested logic, but they are a part of a module. So probably we should exclude modular RPMs from that query, or we need to introduce some logic around that.
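
For illustration, if the retention query ends up looking like the earlier sketches, excluding modular packages could be a single extra filter (assuming the Package model exposes a boolean is_modular flag, which is an assumption here, and reusing the hypothetical repo_content_pks input):

    # Only non-modular packages are candidates for retention-based removal.
    candidates = Package.objects.filter(pk__in=repo_content_pks, is_modular=False)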

Actions #17

Updated by ipanova@redhat.com almost 4 years ago

dalley wrote:

Here's some thoughts I have on this:

I'm not sure why we would make it a sync-time option as in the description. A lot of users will only use sync on particular repos, and then manage their other repos by copying or uploading content manually. This feature wouldn't really benefit those users much.

If we implemented this as a repository setting + validation step, like we do with advisory merge and duplicate detection, then it would apply to the repo universally, which seems a lot more consistent to me. I guess I'm making the assumption that applying it universally is a good thing, maybe some customers could want manual upload to be an exceptional case and not delete old content, idk.

+1 repository setting + validation step.

I can only see one time when this strategy would present issues, which is when an upstream repo (let's say a RHEL repo, I forget if they actually do this or not) never deletes old content, so every version of every package is present in the repo, and a user wants to sync the repo (and especially immediate sync). ALL the content will still go through the entire pipeline and be saved (and maybe downloaded) just to be immediately thrown out of the repository version and orphaned after the retention check.

With this approach we are basically crossing out one of the benefits of having N package versions (3rd bullet of the goal of retain-latest-N-packages). Every action, whether sync or upload, will take longer, because more bits would need to be processed/downloaded. More space will be occupied.

If we are fine with this ^ trade-off then let's do it in the database.

But if we consider that a problem, then we can't solve that problem by using our database because, obviously, that can only help us with saved content and once it's saved it's too late. The only place we can realistically solve that problem is in the metadata processing layer.

I'm afraid solving this problem in the metadata layer will help only during sync. For upload and copy we'd probably still need to use the db.

Actions #18

Updated by ipanova@redhat.com almost 4 years ago

ttereshc wrote:

My understanding is that the request is to have retain-latest-N-packages in a repo.

+1 to repo setting + validation step. Special cases are not special enough ;) copy and upload ought to be considered as well.

One question: what to do with modular RPMs? They might be old and should be removed according to the suggested logic, but they are a part of a module. So probably we should exclude modular RPMs from that query, or we need to introduce some logic around that.

+1 to exclude modular rpms.

Imagine our retain old count is 2.

ModuleA: foo-1.1, ModuleB: foo-1.2, and foo-1.3 (a normal RPM)

ModuleA would lose its package.

Actions #19

Updated by dalley almost 4 years ago

I agree that modular RPMs need to be excluded. And based on discussion with the modularity folks, we shouldn't try to do automatic cleanup of older versions of modules.

Tanya, do you think that we need to implement both strategies? As in, implement the database check so that upload and copy are covered (also corner cases with sync), and also implement a sync optimization that only creates DeclarativeContent for the newest packages?

Actions #20

Updated by ipanova@redhat.com almost 4 years ago

I think as a start we could go with the database check and, in a separate story, possibly address the sync optimization.

Actions #21

Updated by ttereshc almost 4 years ago

I'm fine with all optimizations or early checks to be in a separate story. I'm not sure how much they are needed, to be honest.

Added by dalley almost 4 years ago

Revision 4d0247c8 | View on GitHub

Add a retention policy feature for purging older packages

closes: #5367 https://pulp.plan.io/issues/5367

Actions #22

Updated by dalley almost 4 years ago

  • Status changed from NEW to MODIFIED
  • % Done changed from 0 to 100
Actions #23

Updated by ttereshc almost 4 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Actions #24

Updated by ttereshc almost 4 years ago

  • Sprint/Milestone changed from Pulp 3.x RPM (Katello 4.1) to Pulp RPM 3.5.0
