Issue #1618

--checksum-type is broken

Added by jluza over 5 years ago. Updated over 2 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee:
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release: 2.9.0
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 3
Quarter:

Description

Because RHEL 5 doesn't support the sha256 checksum type, we need Pulp to be able to generate repodata with a checksum type other than the default one.

This is not a trivial fix and involves several significant changes. For detailed discussion of these problems, see the notes below, but the following outlines tasks that probably need to be done for this issue:

Modifying the data models

There is currently a story (#1647) that tracks properly modeling content. This involves having a table that has a record for each and every file managed by Pulp. That's probably not something we want to bite off for this issue. It is probably best to just change the parent class of RPM, DRPM, etc. to contain several checksum fields, but maybe adding it to the platform somewhere would be easier.

In addition to the checksums, the current implementation contains XML snippets with the checksum and checksum type. These need to be turned to templates that are filled in with the appropriate checksum. This rendering should probably live as a method on the model(s).

Creating migrations

Both the new checksum fields (wherever they are) and the XML snippets need to be migrated/populated with the existing checksum types.

Create Task to Checksum Files

We need a way to generate these new checksums. As part of #1647, we'll want a task to "scrub" Pulp for corrupted files, and we might be able to lay the groundwork here. Perhaps not, but it's worth thinking about during implementation. In this particular instance the task wouldn't be dispatched (just called synchronously inside the publish task), but we'd be able to share code.

Ensure Publish Handles Edge Cases

With lazy syncing in the mix, we might not have the files available to checksum. We need to make sure we fail gracefully in cases where a file isn't available.


Related issues

Related to Pulp - Story #1647: Unify checksum management to the platform and add some features (CLOSED - WONTFIX)

Related to RPM Support - Story #1878: Support for choosing the checksum type in updateinfo (CLOSED - CURRENTRELEASE)

Has duplicate RPM Support - Issue #627: --checksum-type does not affect the checksum used in primary.xml (CLOSED - DUPLICATE)

Blocks RPM Support - Issue #1619: as user, I can export repo groups with different checksum than sha256 (CLOSED - CURRENTRELEASE)

Associated revisions

Revision 7094ec37 View on GitHub
Added by mhrivnak over 5 years ago

Adds a function to calculate checksums of multiple types at once.

Calculation of checksums was moved out of the "verification" module, because it is useful in many more cases than just in the process of verification. The only plugin using the moved code is pulp_rpm, and corresponding changes to that plugin will be in a separate PR.

re #1618 https://pulp.plan.io/issues/1618
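
As a rough illustration of what "checksums of multiple types at once" can look like, here is a minimal sketch that reads a file once and feeds every chunk to several hashers. The function name, signature, and chunk size are assumptions made for this example; they are not taken from the actual commit.

```python
import hashlib
import io


def calculate_checksums(file_object, type_names, chunk_size=8 * 1024 * 1024):
    """Return {type_name: hexdigest} for every requested checksum type.

    Hypothetical helper: the file is read a single time and each chunk is
    passed to all hashers, so adding a second type costs no extra disk I/O.
    """
    hashers = {name: hashlib.new(name) for name in type_names}
    for chunk in iter(lambda: file_object.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Usage with an in-memory stand-in for an RPM file.
sums = calculate_checksums(io.BytesIO(b"not really an rpm"), ["sha1", "sha256"])
```

Reading once matters here because, as the later benchmarks in this thread show, checksumming a large repository is dominated by disk reads.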

Revision d1491184 View on GitHub
Added by mhrivnak over 5 years ago

yum_distributor now uses configured checksum type for all metadata.

A lot of model-related code was moved into models.py from other places. In addition to being more object-oriented, it made that code accessible from multiple places instead of being isolated somewhere, such as in the upload code. Being able to use the code from multiple places was the primary reason for moving the code in this PR.

re #1618 https://pulp.plan.io/issues/1618

Revision ccf4c941 View on GitHub
Added by mhrivnak over 5 years ago

adding a new error code for missing unit file

re #1618

History

#1 Updated by jortel@redhat.com over 5 years ago

  • Priority changed from Normal to High
  • Platform Release set to 2.8.0
  • Triaged changed from No to Yes

Please verify that this is not fixed in latest 2.7.

#2 Updated by jcline@redhat.com over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to jcline@redhat.com

#3 Updated by jcline@redhat.com over 5 years ago

In both 2.7 and 2.8 the repomd.xml has the configured hash algorithm, but the primary.xml always has sha256.

#4 Updated by jcline@redhat.com over 5 years ago

Okay, so here is what I've discovered thus far:

  1. There is an RCM patch for this issue: https://github.com/release-engineering/pulp_rpm/commit/5ded67b4954395cb040af2f03065dc10a6ad0188
  2. The ``--checksum-type`` configuration flag for repositories only applies to the metadata files themselves. That is to say, repodata/repomd.xml uses that checksum type and all the repo metadata files are named <checksum-type>-<metadata-type>.xml.gz. The package checksums in those metadata files do not change based on the checksum type specified for a repository.
  3. When uploading a content unit to a repository, you can provide a checksum type to use when generating the package metadata. If one is not provided, the default appears to always be sha256 (so it doesn't honor the repository setting, which is somewhat surprising to me).

The workflow that led to this issue and patch is as follows:

  1. Upload all RPMs to one repository. When uploading, do not specify a checksum type.
  2. Copy RPMs into the desired repository and have the checksum type configured on the repository.
  3. Publish the repository.
  4. The metadata for that repository uses the checksum type configured, and only that checksum type.

The expectation (which is very reasonable) is that the repo metadata uses the checksum type throughout, not just for the metadata files themselves. I'm not certain how the ``--checksum-type`` repo flag should behave, but my understanding based on the docs (or rather, the single line of text) is that the checksum type specified should be used across the board. This, however, is almost certainly not possible with deferred downloading.

The options as I see it are:

  • Ensure the ``--checksum-type`` is honored for all metadata. This means we need to have every file at publish time, which in turn means if the checksum type specified doesn't match upstream's checksum type, you can't use deferred downloading with that repository. Of course, we won't know that until we download the metadata during the first sync. This is probably an edge case situation, though.
  • Keep things the way they are and see if we can work with RCM to find a different workflow that meets their needs.

#5 Updated by jcline@redhat.com over 5 years ago

  • Tracker changed from Issue to Story
  • Status changed from ASSIGNED to NEW
  • Assignee deleted (jcline@redhat.com)
  • Platform Release deleted (2.8.0)
  • Groomed set to No
  • Sprint Candidate set to No

#6 Updated by jcline@redhat.com over 5 years ago

  • Parent task set to #1647

#7 Updated by bmbouter over 5 years ago

  • Parent task deleted (#1647)

#8 Updated by bmbouter over 5 years ago

  • Parent task set to #1683

#9 Updated by bmbouter over 5 years ago

  • Related to Story #1647: Unify checksum management to the platform and add some features added

#10 Updated by mhrivnak over 5 years ago

  • Sprint/Milestone set to 19

#11 Updated by mhrivnak over 5 years ago

I'll improve this story.

#12 Updated by rbarlow over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to rbarlow
  • Platform Release set to 2.9.0

#13 Updated by rbarlow over 5 years ago

I've discussed this with jcline and we came up with the following plan to work around the issues with publishing lazy repositories:

  • We will not allow users to set --checksum-type on lazy repositories.
  • We will not allow users to copy files into a lazy repository if Pulp does not have the file at the moment of copy.

In both of the above scenarios, we will need to give the user a helpful error message if they attempt to perform these operations. By ensuring that we have the files on disk before publish time, we will be able to generate alternative checksums if requested.

A problem with this approach is that the publish currently uses pregenerated XML snippets that are stored in MongoDB that contain the checksums. There are a couple of options I can think of around this:

  • We can use the snippets when the user has not requested a different checksum. The plus side to this approach is that it's a less disruptive change to Pulp. The downside is that publishes will go slowly if the user requests a different checksum type than the unit's XML snippet uses.
  • We can alter the XML snippets to be a template that allows us to inject the checksum at publish time. The plus side here is that we should be able to operate at the same speed for all checksum types. The downsides would be that all publishes may go a little slower due to having to render templates, and that we will have to write a migration to update all the snippets to be a template.

Another element to consider is when the unit checksums should be calculated. We probably don't want to calculate them for every publish, so we'll want to add a new field on the units to store alternative checksums. I think there are only really two times that it makes sense to calculate alternative checksums if they don't already exist:

  • At publish time, which could be slow. Since we will be storing the checksums on the unit after calculating it, the publish would only be slow due to this effect the first time any given unit is being published with a checksum type that it has not been published with before. This means that repeat publishes with the same (or mostly the same) units should be faster.
  • When the unit is being added to a repository (upload, copy, sync). The downside to this is that copies could become slow, and they have historically been a quick operation in Pulp.

I think I lean towards the second option, but only really because it seems odd for a publish operation to modify units.

#14 Updated by mhrivnak over 5 years ago

Sorry I didn't get to this until now. As promised, I'll add the thoughts that Brian and I came up with in a brief brainstorm session. It's not a complete solution, but may be useful to consider.

It sounds like RCM wants the ability to specify a checksum type and have all checksums in the publish repo metadata use that type. That seems reasonable.

The only types currently in use are sha1 and sha256, and we don't know of any plans to use more.

One way to make this happen:

The XML snippets stored in the database become templates, which requires a migration. The publish operation would render those templates.

We would add two new fields to the model called "sha1" and "sha256" or similar, and put the corresponding checksum values in there. It would duplicate data potentially from the unit key, but is worth it so we can preserve the duplicate units already present; if we tried to identify rpms that are duplicates, and consolidate, that raises other problems we might not want to solve right now. For example, if we replace RPMs that have sha1 in the unit key with their sha256 counterparts, if the sha1 unit gets orphan-removed, the file would disappear and published repos that used to have the sha1 unit might end up with broken links.

A migration would just add those two new fields and populate them. For lazy content, that's a problem as you've already pointed out, because we only know one of the checksums.

For new RPMs, both checksums can be calculated at sync or upload time. But as you point out, the lazy workflow complicates it.

In any case, calculating checksums at publish time would not likely be received well by users, so finding another way would be best.

Adding restrictions at copy time, as you've already suggested, seems like a reasonable approach. One option is for each rpm repo to have a chosen checksum type, either by user choice or by default, and refuse to add units if they don't have the required checksum. Although that wouldn't help if a user changes the repo's checksum type afterward. As an additional or separate guard, pulp could refuse to publish a repo if there are units that lack the requested checksum type. For a user who runs into that scenario, the simple solution is to call the download-repo task, which would get the files and populate all the checksums.

When the lazy workflow gets an rpm, either through the download-repo task, or the download-deferred task, we would want to calculate the missing checksum. Hooking into that might be challenging.

It would be ideal to expand this checksum storage approach, or whatever we implement, to all units that have files. It may or may not make sense to try doing that now. Perhaps doing it just in pulp_rpm is a decent proving ground, and we can later expand it to all content.

I think those are all the thoughts we had. Hopefully some of that is helpful, and please let me know if it would be valuable to have more discussion on it.

#15 Updated by rbarlow over 5 years ago

  • Tracker changed from Story to Issue
  • Subject changed from as user, I can generate repodata with different checksum than sha256 to --checksum-type is broken
  • Severity set to 2. Medium
  • Triaged set to No

#16 Updated by rbarlow over 5 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (rbarlow)

I have more security issues to deal with and haven't made much progress on this anyway. Putting it down for now.

#17 Updated by jortel@redhat.com over 5 years ago

I propose the following as a blend of the preceding proposals.

  • Add sha1 and sha256 as attributes on the ContentUnit model object.
  • Alter the XML snippets to be a template and inject the checksum at publish time.
  • Calculate missing checksums needed for publishing - at publish time and store them on the unit.
  • No additional restrictions on "lazy" repositories.

I'm thinking that if pulp has imported RPM units with SHA256 and a user wants to publish as SHA-1, they (the user) should endure the extra overhead (time) during publish. The overhead is in direct support of a publish operation and so this is the correct place to bear the burden. Since the new checksum is stored on the unit, the additional overhead is only incurred once. We could also add a low priority background task that periodically calculates missing checksums. This could be enabled/controlled by a configuration setting.

Obviously, we'll need a migration to:

  • convert XML snippets to templates
  • populate the new sha1 and sha256 attributes using the metadata.

#18 Updated by mhrivnak over 5 years ago

  • Triaged changed from No to Yes

#19 Updated by jcline@redhat.com over 5 years ago

  • Blocks Issue #1619: as user, I can export repo groups with different checksum than sha256 added

#20 Updated by jcline@redhat.com over 5 years ago

  • Description updated (diff)
  • Status changed from NEW to ASSIGNED
  • Assignee set to jcline@redhat.com

#21 Updated by jcline@redhat.com over 5 years ago

  • Description updated (diff)

#22 Updated by jcline@redhat.com over 5 years ago

Alright, so here's what I've found that's taken a bit of the wind out of my sails. These XML snippets are quite large, and reference the checksum a lot. I haven't taken the time to fully understand the schema, but what jumped out at me is all the primary.xml files I looked at had something like

 <checksum pkgid="YES" type="sha1">733033d4ba6761c30fbd1086a70784f4fb317687</checksum>

and then everywhere else uses the checksum as the pkgid. So this means our templates will be very unwieldy and probably very slow to process.

Here's what I propose:

  • Use createrepo_c to generate the repository metadata if possible. I think this will also make doing things like DRPMs, fast incremental updates (it has a --update flag and if we know where the previous publish lives we can use that), etc much easier since we can just use the library. The current version in EPEL6 is 0.9 (the latest release is 0.10).
  • When it isn't possible (when a repository contains a lazy unit), publish using the existing snippets we have, unless the requested repo checksum doesn't match the upstream repo metadata type. In those cases we can fail and inform the user to either download the content, or change the checksum type to <type>

#23 Updated by mhrivnak over 5 years ago

I see that each rpm is referenced once in other.xml, and once in filelists.xml. Both references are by pkgid. Are you seeing more than those two references?

#24 Updated by rbarlow over 5 years ago

Hey Jeremy,

I too am not a fan of how we store XML in our database this way. I had been looking at the template approach too but I also found it to be unwieldy and the code was becoming even more hacky than it already is. IMO, the right approach is to go back to generating the XML as we did before. Using createrepo_c to do it seems fine to me, but I don't have any first hand experience with it to speak of.

I'd rather not have two ways we generate the XML, so I think eliminating the XML snippets in the database would be good. For lazy, I think we should just publish with the checksum we know in the database, rather than using the snippets.

Above, I had proposed disabling changing the checksum type for lazy fetching repos as a way to stop this from happening at publish time. I think that approach might be worth considering.

#25 Updated by jcline@redhat.com over 5 years ago

createrepo_c on a 10K RPM repository (no sqlite DBs):

[vagrant@dev os]$ time createrepo_c --no-database .
Directory walk started
Directory walk done - 10572 packages
Temporary output repo path: ./.repodata/
Pool started (with 5 workers)
Pool finished

real    0m29.418s
user    1m0.550s
sys     0m6.642s
[vagrant@dev os]$ 

Now, as you'd probably guess, since this is calculating the checksum of every RPM this is an I/O bound process (I apologize, I can't get Redmine to do a monospace font):

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  0   0 100   0   0   0|   0     0 | 184B  308B|   0     0 | 132   256 
  4   0  96   0   0   0|  24k    0 | 476B  924B|   0     0 | 300   302 
 43   6  29  22   0   0| 317M    0 | 316B  728B|   0     0 |  13k 6922 
 51   9  15  25   0   0| 423M  896k|  66B  178B|   0     0 |  17k   10k
 47   7   3  43   0   0| 466M    0 | 118B  194B|   0     0 |  12k 6984 
 61   7   0  32   0   0| 692M    0 |  66B  194B|   0     0 |  13k 4453 
 44   6  16  33   0   0| 511M    0 | 118B  194B|   0     0 |  10k 4764 
 38   5   7  49   0   0| 414M   48k|  66B  178B|   0     0 |  10k 5526 
 39   5   3  52   0   0| 423M    0 | 118B  178B|   0     0 |9643  5256 
 48   6   6  39   0   0| 409M    0 |  66B  194B|   0     0 |  13k 6411 
 48   7   4  40   0   0| 520M    0 | 232B  514B|   0     0 |  14k 6233 
 34   5   8  53   0   0| 427M    0 |  66B   66B|   0     0 |8231  4659 
 44   6   9  41   0   0| 521M   12k| 118B  194B|   0     0 |9607  4471 
 40   4  16  41   0   0| 405M    0 |  66B  178B|   0     0 |8870  4878 
 49   6  25  20   0   0| 439M    0 | 118B  178B|   0     0 |  11k 4805 
 77   8   2  13   0   0| 477M    0 |  66B  178B|   0     0 |  11k   12k
 69   7   8  16   0   0| 397M    0 | 118B  178B|   0     0 |9024  3738 
 86   8   1   6   0   0| 472M    0 |  66B  178B|   0     0 |  11k 1911 
 85   5   2   7   0   0| 277M   15M| 118B  178B|   0     0 |8208  2594 
 66   5   1  28   0   0| 242M    0 |  66B  178B|   0     0 |8484  3545 
 51   9   9  31   0   0| 338M    0 | 118B  210B|   0     0 |  20k 9698 
 51   8   5  36   0   0| 372M    0 |  66B  178B|   0     0 |  14k 6742 
 44   6   6  44   0   0| 397M    0 | 118B  178B|   0     0 |9864  5881 
 37   8   5  51   0   0| 318M   24k|  66B  178B|   0     0 |  14k 7869 
 46   8  12  34   0   0| 369M    0 | 118B  178B|   0     0 |  15k 8857 
 50   7  12  31   0   0| 416M    0 |  66B  178B|   0     0 |  15k 8502 
 54   9   7  30   0   0| 458M    0 | 118B  178B|   0     0 |  15k 8898 
 52   8  14  26   0   0| 418M    0 |  66B  178B|   0     0 |  16k 8789 
 49   8  15  27   0   0| 397M   12k| 118B  178B|   0     0 |  15k 9508 
 24   4  63  10   0   0| 320M    0 |  66B  178B|   0     0 |6623  3078 
 43   7   2  48   0   0| 476M    0 | 118B  178B|   0     0 |  11k 6478 
 11   2  81   6   0   0| 105M    0 | 330B  518B|   0     0 |3242  2024 
  0   0 100   0   0   0|   0     0 | 118B  486B|   0     0 | 135   251 

I've got an SSD, so your mileage will vary. However, we could probably optimize things by using their Python bindings to build the repodata without it touching any of the files (for the cases when we either already have the checksum from upstream or we are publishing a lazy repository when changing the checksum isn't allowed).

Now, the whole publish operation (on the same host) takes about 1 minute and 30 seconds:

[vagrant@dev lib]$ time pulp-admin rpm repo publish run --repo-id el7-copy
+----------------------------------------------------------------------+
                    Publishing Repository [el7-copy]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.

Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[==================================================] 100%
10572 of 10572 items
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
1133 of 1133 items
... completed

Publishing Comps file
[==================================================] 100%
86 of 86 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[\]
... completed

Writing Listings File
[-]
... completed

Task Succeeded

real    1m27.178s
user    0m1.000s
sys     0m0.130s
[vagrant@dev lib]$ 

My guess is that on slow hardware publishes will take quite a bit longer than they currently do. However, no matter what we do we cannot simply jam XML snippets in the database and spit them back out. We have to modify the XML depending on distributor settings.

Since the whole point of the snippets seems to be about avoiding parsing and generating XML, I think we should get rid of them. We can either store them, then parse and modify them before spitting them out (and we'll need to handle checksumming all the files and so on and so forth), or generate it in a C library written expressly for this purpose.

#26 Updated by mhrivnak over 5 years ago

Interesting findings. I tried to reproduce just for comparison, and to have another data point. My hardware is apparently slower, which seems to have produced a substantially different comparison. Also, I copied only the RPMs into a repo and published that. createrepo_c does not help us with errata, comps.xml, distribution, etc, so I factored those out of the publish scenario by leaving them out of the repo.

My dstat output looked like this when running createrepo_c:

 16  11  16  57   0   0| 144M    0 | 118B  126B|   0     0 |8676  4832 
 24  10   7  59   0   0| 170M    0 |  66B  126B|   0     0 |7462  4026 
 27   7   6  59   0   0| 183M    0 | 118B  126B|   0     0 |7097  3652 
 16  13   4  67   0   0| 134M  612k|  66B  126B|   0     0 |7952  5030 
 27  11  22  40   0   0| 168M    0 | 118B  126B|   0     0 |8274  4320 
 21  10  39  29   0   0| 150M    0 |  66B  134B|   0     0 |8223  4854 
 29  16   7  48   0   0| 145M    0 | 118B  126B|   0     0 |8612  4951

It's also an SSD, but disk I/O is roughly 3x slower than yours. Maybe it's that I ran in vagrant with NFS? Maybe your SSD is just faster? Or maybe mine is actually CPU-bound, doing the calculations? In any case...

I'm seeing publish times of about 90s for just the RPMs.

createrepo_c took 117s.

dstat from the pulp publish looked like this:

 25   1  74   0   0   0|2684k 2596k| 330B  534B|   0     0 |2053  1094 
 25   1  74   0   0   0|2788k    0 | 382B  338B|   0     0 |1531  1221 
 24   1  73   1   0   0|2544k   18M| 330B  534B|   0     0 |1860  1306 
 24   1  73   1   0   0|4016k 7828k| 382B  636B|   0     0 |3185  2231 
 25   1  74   0   0   0|1408k    0 | 132B  330B|   0     0 |1402  1166 
 24   1  74   1   0   0|3064k 3164k| 184B  228B|   0     0 |1509  1434 
 25   1  74   0   0   0|2816k    0 | 132B  228B|   0     0 |1403  1105 
 25   1  74   1   0   0|4408k    0 | 382B  440B|   0     0 |1636  1448 

Our two data points at least hint that slower hardware will have a bigger impact on the createrepo_c option. That fits what we know about the work load, that createrepo_c is bottlenecked on hardware, whereas the current publish model is presumably limited by python's ability to create model instances, loop over them, and write text to files, and/or mongo's ability to deliver the data stream.

Of course this isn't rendering templates yet. It would be interesting to do a quick proof of concept to see how template rendering affects performance.

But one other factor to consider carefully is that we're doing these tests on mostly-idle systems. If createrepo_c wants to use all of my disk IO or 250% CPU, no problem. But on a busy server, that could have a big impact on other processes, and createrepo_c might only get a fraction of the resources it wants. Consider multiple concurrent publishes, and how that might scale, not to mention other operations or API queries that need to hit the database, etc.

I suspect that the current model of keeping pre-calculated XML (although soon as templates), and rendering them at publish time will continue to be a very compelling solution due to the much lower impact on system resources. But it's valuable to have the comparison, and to continue evaluating the pros and cons of each option.

The potential to stop worrying about the XML entirely and let something else make it for us does sound compelling.

#27 Updated by jcline@redhat.com over 5 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (jcline@redhat.com)

I'm putting this back down since I need to start on 1769.

I wrote out a story (https://pulp.plan.io/issues/1877) that summarizes the problems I found with the RPM model as part of the investigation for this issue. I've also made a PR documenting some of the RPM model (https://github.com/pulp/pulp_rpm/pull/857).

#28 Updated by bmbouter over 5 years ago

  • Related to Story #1878: Support for choosing the checksum type in updateinfo added

#29 Updated by mhrivnak over 5 years ago

  • Sprint/Milestone changed from 19 to 20

#30 Updated by mhrivnak over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to mhrivnak

#31 Updated by mhrivnak over 5 years ago

I wired in template rendering at publish time, as a PoC. It renders a template for each entry made to primary.xml, filelists.xml, or other.xml, adding the checksum where appropriate. It's the real work being proposed, just not in a polished form.

Publishing a RHEL 7 repo with about 10,000 packages, the publishes that render templates take 9% longer than before they did template rendering. There isn't a decisive difference in performance between using django vs. jinja2 as the template engine.

In all cases, even when not rendering templates, the publish is CPU-bound.
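
For readers who want a feel for the per-entry rendering step, here is a small sketch. It swaps in the standard library's `string.Template` instead of django or jinja2 to stay dependency-free, and the snippets are hypothetical, heavily trimmed stand-ins for what is actually stored; none of the names below come from the PoC.

```python
from string import Template

# Hypothetical, trimmed-down templates for one package's metadata entries.
PRIMARY_TEMPLATE = Template(
    '<package type="rpm"><name>demo</name>'
    '<checksum pkgid="YES" type="$checksumtype">$checksum</checksum></package>'
)
OTHER_TEMPLATE = Template('<package pkgid="$checksum" name="demo"/>')


def render(template, checksums, checksum_type):
    """Fill one snippet with the unit's checksum of the requested type."""
    if checksum_type not in checksums:
        raise KeyError("no %s checksum stored for this unit" % checksum_type)
    return template.substitute(
        checksumtype=checksum_type, checksum=checksums[checksum_type]
    )


# Usage: the same unit rendered for a sha1 publish.
unit_checksums = {"sha1": "733033d4ba6761c30fbd1086a70784f4fb317687"}
primary_xml = render(PRIMARY_TEMPLATE, unit_checksums, "sha1")
other_xml = render(OTHER_TEMPLATE, unit_checksums, "sha1")
```

The work per entry is a couple of string substitutions, which is at least consistent with a modest slowdown rather than a dramatic one.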

#32 Updated by bmbouter over 5 years ago

I'm OK with the performance impact because (1) it provides a necessary fix and (2) horizontal scalability allows users to mostly reach their performance goals so I'm less concerned with the impact of any single Pulp operation's performance.

Long term, I think we should move to letting createrepo_c do all the RPM metadata maintenance and generation, but I won't suggest that now is the right time for that.

#34 Updated by mhrivnak over 5 years ago

  • Sprint/Milestone changed from 20 to 21

#35 Updated by bmbouter over 5 years ago

  • Has duplicate Issue #627: --checksum-type does not affect the checksum used in primary.xml added

#38 Updated by mhrivnak over 5 years ago

  • Status changed from POST to MODIFIED

#39 Updated by pthomas@redhat.com over 5 years ago

  • Status changed from MODIFIED to ASSIGNED

Seems like checksum types can be updated for the repos with download policy set to on_demand


[root@ibm-x3550m3-11 ~]# rpm -qa pulp-server
pulp-server-2.9.0-0.3.beta.el7.noarch
[root@ibm-x3550m3-11 ~]# 

[root@ibm-x3550m3-11 ~]# pulp-admin rpm repo create --repo-id rhel7-os --feed http://cdn.rcm-internal.redhat.com/content/dist/rhel/rhui/server/7/7Server/x86_64/os/ --download-policy on_demand 
Successfully created repository [rhel7-os]

[root@ibm-x3550m3-11 ~]# pulp-admin rpm repo sync run  --repo-id rhel7-os
+----------------------------------------------------------------------+
                  Synchronizing Repository [rhel7-os]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.

Downloading metadata...
[-]
... completed

Downloading repository content...
[|]
[==================================================] 100%
RPMs:       11051/11051 items
Delta RPMs: 0/0 items

... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[/]
... completed

Importing package groups/categories...
[\]
... completed

Cleaning duplicate packages...
[-]
... completed

Task Succeeded

Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[==================================================] 100%
11053 of 11053 items
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
1231 of 1231 items
... completed

Publishing Comps file
[==================================================] 100%
87 of 87 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Generating HTML files
... skipped

Publishing files to web
[\]
... completed

Writing Listings File
[-]
... completed

Task Succeeded

[root@ibm-x3550m3-11 ~]# pulp-admin rpm repo update --repo-id rhel7-os --checksum-type sha256
This command may be exited via ctrl+c without affecting the request.

[\]
Running...
Updating distributor: yum_distributor

Task Succeeded

[\]
Running...
Updating distributor: export_distributor

Task Succeeded

[root@ibm-x3550m3-11 ~]# pulp-admin rpm repo publish run  --repo-id rhel7-os
+----------------------------------------------------------------------+
                    Publishing Repository [rhel7-os]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.

Copying files
[\]
... completed

Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[/]
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
1231 of 1231 items
... completed

Publishing Comps file
[==================================================] 100%
87 of 87 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Generating HTML files
... skipped

Publishing files to web
[\]
... completed

Writing Listings File
[-]
... completed

Task Succeeded

[root@ibm-x3550m3-11 ~]# e

#40 Updated by mhrivnak over 5 years ago

  • Status changed from ASSIGNED to MODIFIED

That is expected behavior.

I assume you have in mind the scenario where the rpm hasn't been downloaded, and the user wants to publish with a checksum type pulp doesn't already have. In that case the publish will fail gracefully. Here is the documentation for the distributor setting which hopefully explains that clearly:

Checksum type to use for metadata generation. For any units where the checksum of this type is not already known, it will be computed on-the-fly and saved for future use. If any such units have not been downloaded, then checksum calculation is impossible, and the publish will fail gracefully.
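
The behavior described in that setting can be sketched roughly as follows. This is only an illustration under stated assumptions: the unit is modeled as a plain dict with hypothetical `checksums` and `storage_path` keys, and `MissingUnitFile` is a made-up exception standing in for whatever error code Pulp actually raises.

```python
import hashlib
import os


class MissingUnitFile(Exception):
    """Hypothetical error for a unit whose file has not been downloaded."""


def ensure_checksum(unit, checksum_type):
    """Return the unit's checksum of the given type, computing it if needed."""
    checksums = unit.setdefault("checksums", {})
    if checksum_type in checksums:
        return checksums[checksum_type]  # already known, no file access

    path = unit.get("storage_path")
    if not path or not os.path.isfile(path):
        # Lazy (on_demand) unit that was never fetched: nothing to checksum.
        raise MissingUnitFile(
            "cannot compute %s: file not downloaded" % checksum_type
        )

    hasher = hashlib.new(checksum_type)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            hasher.update(chunk)
    checksums[checksum_type] = hasher.hexdigest()  # saved for future publishes
    return checksums[checksum_type]
```

A publish step built on this would catch `MissingUnitFile` and stop with a clear message, instead of writing metadata that carries the wrong checksum type.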

#41 Updated by pthomas@redhat.com over 5 years ago

  • Status changed from MODIFIED to 6

Verified

#42 Updated by semyers over 5 years ago

  • Status changed from 6 to CLOSED - CURRENTRELEASE

#47 Updated by bmbouter over 3 years ago

  • Sprint set to Sprint 3

#48 Updated by bmbouter over 3 years ago

  • Sprint/Milestone deleted (21)

#49 Updated by bmbouter over 2 years ago

  • Tags Pulp 2 added
