Story #1684

closed

Retain Old Repodata on Re-publish

Added by bmbouter almost 9 years ago. Updated over 5 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Sprint/Milestone: -
Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release:
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

When publishing, Pulp does not keep old repodata files: when the metadata changes, Pulp generates either additive or entirely new content for the "repodata" directory. In both cases, with checksums enabled, any change to the metadata contents results in a completely new filename, since the checksum is the first portion of the filename.

Yum, as a client, caches repomd.xml for some period of time, and that file names the checksum-prefixed files in the repodata directory. On republish, metadata from older publishes should be left intact so it can still be found via a cached repomd.xml. Yum keeps old repodata for up to 6 hours by default, so Pulp should clean up metadata files older than 6 hours. Yum exposes this in yum.conf as metadata_expire, the time (in seconds) after which the metadata will expire.
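
For reference, a minimal yum.conf fragment showing this setting (6 hours expressed in seconds):

    [main]
    # cached repodata is considered fresh for 6 hours
    metadata_expire=21600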

This will be a new option on the Yum Distributor named metadata_retention_hours, specifying how long metadata should be retained, in hours. It may also be set to 0, which causes old repo metadata to be removed immediately on publish. The default should be 6 hours to align with the yum default.

Note that old metadata cannot be modified, because doing so would change its checksum and defeat the point of this feature; it should be left alone. Consequently, no attempt shall be made to update old metadata if packages are not found. In that case the yum client will receive a 404, which it can handle.
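
A minimal sketch of the cleanup step this proposal implies, purely illustrative (the function name, arguments, and directory handling are assumptions; this option was never implemented in pulp_rpm):

    import os
    import time

    def prune_old_repodata(repodata_dir, current_files, metadata_retention_hours=6):
        """Remove repodata files that are no longer referenced by the current
        repomd.xml and are older than the retention window."""
        cutoff = time.time() - metadata_retention_hours * 3600
        for name in os.listdir(repodata_dir):
            # never touch repomd.xml or files belonging to the current publish
            if name == "repomd.xml" or name in current_files:
                continue
            path = os.path.join(repodata_dir, name)
            # a retention of 0 hours removes old files immediately on publish
            if metadata_retention_hours == 0 or os.path.getmtime(path) < cutoff:
                os.remove(path)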


Related issues

Related to Pulp - Issue #5573: Publish won't create multiple checksummed copies of primary.xml, filelists.xml etc even when in fast-forward mode (CLOSED - CURRENTRELEASE)
#1

Updated by bmbouter almost 9 years ago

  • Description updated (diff)
#2

Updated by mhrivnak over 8 years ago

  • Sprint/Milestone set to 19
#3

Updated by ipanova@redhat.com over 8 years ago

I was getting some more info on how yum works in this case and I found out that:
- the old repodata files are stored not for 14 days but for 6 hours only (source: man yum.conf, option 'metadata_expire')
- I do understand that if there was some change in metadata and a re-publish occurred, repomd.xml will be different, but yum will still try to fetch the old files. In this case yum will complain; you can run 'yum clean expire-cache' and next time yum will fetch fresh metadata.

I don't understand why we cannot leave this kind of situation to be managed directly by yum, since all our repos are regular yum repos.

More information on this would be appreciated; maybe I am missing some points here.

#5

Updated by mhrivnak over 8 years ago

I think we need a clear reproducer before starting work. Would it look something like this?

  • publish a yum repo with pulp
  • yum update on a client aimed at that repo
  • add a newer rpm to the repo in pulp and re-publish
  • yum update on the client again, and yum breaks? how exactly?

Reproducing this and showing output from yum would be ideal.

#7

Updated by ipanova@redhat.com over 8 years ago

  • Private changed from No to Yes
#8

Updated by ipanova@redhat.com over 8 years ago

  • Private changed from Yes to No
#10

Updated by ipanova@redhat.com over 8 years ago

  • Sprint/Milestone deleted (19)

By common agreement with RCM we decided to put this issue back into the backlog for now.

#11

Updated by bmbouter over 8 years ago

FYI, I noticed today that createrepo provides an option called --retain-old-md, which retains a given number of previous metadata outputs. This would, in a way, provide Pulp with a similar capability.
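
For illustration only (exact syntax may differ between createrepo versions), keeping the two most recent generations of old metadata might look like:

    createrepo --update --retain-old-md 2 /path/to/repo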

#12

Updated by rmcgover almost 8 years ago

My createrepo_c also has --retain-old-md-by-age=AGE (but createrepo doesn't).
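
For illustration, assuming createrepo_c's age syntax (e.g. "6h" for six hours):

    createrepo_c --update --retain-old-md-by-age=6h /path/to/repo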

#13

Updated by bmbouter over 7 years ago

  • Subject changed from Retain old repodata on re-publish to Retain Old Repodata on Re-publish
  • Description updated (diff)

Rewritten and added checklist items.

#16

Updated by bmbouter over 7 years ago

I would like a recap of Pulp's behavior regarding repodata retention when using fast-forward, regular, and force-full publishes. I don't currently know what the behaviors are in these areas. It seems that what is written in this ticket is the preferred behavior for all of those publish types so maybe what it does currently doesn't matter.

#17

Updated by mhrivnak over 7 years ago

All publishes by upstream pulp_rpm discard old repo metadata files. There is currently no retention, regardless of whether a publish was "full" or "incremental".

#18

Updated by mhrivnak over 7 years ago

rmcgover, since the behavior currently in place for you retains all old metadata files, I wonder if we could simply preserve that behavior when you upgrade to a recent upstream Pulp release.

I believe the goal is: make sure repo metadata files that might be requested by a client don't disappear.

When using rsync to transfer a newly published repo to a remote destination, could you simply have that rsync operation preserve existing metadata files on the other end? That is the default behavior of rsync. This would allow old metadata files to accumulate on the remote side only. Each Pulp publication would still only have the most recently-published metadata files. That should fit the use case, and it can be done with current upstream Pulp.
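
For example (repository paths here are illustrative), omitting rsync's --delete option leaves previously published metadata in place on the destination:

    rsync -av /var/lib/pulp/published/yum/https/repos/myrepo/ remote:/srv/repos/myrepo/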

Thoughts on that?

#19

Updated by rmcgover over 7 years ago

In the typical case, the requirement is met by the behavior you describe, as our publishes by default won't be deleting old files from the remote. However, there are cases where old files will have to be deleted.

Here's a scenario:

  1. User runs "yum update <package>", it downloads repomd.xml and other metadata but not filelists.
  2. Release engineers remove an RPM from the repo. They need it to no longer be downloadable, so they trigger the yum distributor followed by rsync distributor with "delete" option. This deletes the repodata referenced by the repomd.xml fetched by the user at (1).
  3. User runs "yum provides ..." which tries and fails to download filelists.

It seems like relying on the rsync layer would give something that works most of the time, by accident, whereas I think this issue is hoping for a more robust solution that works by design.

#21

Updated by mhrivnak over 7 years ago

rmcgover, great, thanks for the feedback.

I'm taking from this discussion that you would like to prioritize this feature, but it is not a blocker for you. Is that correct?

As for the 3 steps you describe, can you explain how that should work in the ideal case, once we've implemented this RFE? It seems like if you want to normally keep (some) old metadata, but occasionally delete it, any ill effects of the deletion will happen regardless of whether this RFE is in place. So I'm not quite following what you have in mind.

#22

Updated by rmcgover over 7 years ago

Yes, I don't think we have to consider this a blocker.

Running through the same scenario again, assuming pulp_rpm had implemented the behavior currently described in this issue (the behavior change is in steps 2 and 3):

  1. User runs "yum update <package>", it downloads repomd.xml and other metadata but not filelists.
  2. Release engineers remove an RPM from the repo. They need it to no longer be downloadable, so they trigger the yum distributor followed by rsync distributor with "delete" option. Since the repodata downloaded by user at (1) isn't yet 6 hours old, it still exists in Pulp's published yum repo, so rsync doesn't delete it.
  3. User runs "yum provides ..." which is able to successfully download filelists.
#23

Updated by bmbouter over 7 years ago

  • Tags RCM added
#24

Updated by rmcgover over 7 years ago

mhrivnak wrote:

All publishes by upstream pulp_rpm discard old repo metadata files. There is currently no retention, regardless of whether a publish was "full" or "incremental".

I just did some testing on 2.13 and that seems to be the case for XML files but not sqlite files. If I publish with generate_sqlite=true, and don't include force_full=true, then the destination path will retain all sqlite files.

Unfortunately the current behavior seems like the worst of both worlds: our consumers suffer from attempts to download XML files that were removed, while we suffer wasted disk space and publish time from the unlimited retention of sqlite files.

Tested on pulp 63aa4d9ce5de374d01d8bfdcda3f84594de4e5af, pulp_rpm d5bded6b2db0640ab5d262368335940c971ba3c7.

#25

Updated by mhrivnak over 7 years ago

Comment https://pulp.plan.io/issues/1684#note-24 seems like a separate bug worth filing and fixing.

#26

Updated by rmcgover over 7 years ago

OK, I've filed separate issue https://pulp.plan.io/issues/2788 for the sqlite behavior. The expected behavior references back to this issue.

#27

Updated by bmbouter over 5 years ago

  • Status changed from NEW to CLOSED - WONTFIX

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

#28

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
#29

Updated by bmbouter over 5 years ago

  • Tags deleted (RCM)
#30

Updated by mihai.ibanescu@gmail.com about 5 years ago

  • Related to Issue #5573: Publish won't create multiple checksummed copies of primary.xml, filelists.xml etc even when in fast-forward mode added
