https://pulp.plan.io/https://pulp.plan.io/favicon.ico2016-05-26T16:45:44ZPulpRPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=117562016-05-26T16:45:44Zmhrivnakmhrivnak@redhat.com
<ul></ul><p>Once triaged, this should go on the current sprint so we can at a minimum identify exactly where the time is being spent, and identify options for improvement.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=117572016-05-26T17:09:34Zjcline@redhat.comjcline@redhat.com
<ul><li><strong>File</strong> <a href="/attachments/251">Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/251/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats">Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats</a> added</li><li><strong>File</strong> <a href="/attachments/250">Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/250/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats">Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats</a> added</li><li><strong>File</strong> <a href="/attachments/253">el7-extras_publish_stats</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/253/el7-extras_publish_stats">el7-extras_publish_stats</a> added</li><li><strong>File</strong> <a href="/attachments/252">scl7_publish_stats</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/252/scl7_publish_stats">scl7_publish_stats</a> added</li></ul><p>From my brief analysis it's because of the work for <a href="https://pulp.plan.io/issues/1548" class="external">https://pulp.plan.io/issues/1548</a></p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=118162016-05-27T15:13:06Zjcline@redhat.comjcline@redhat.com
<ul></ul><p>This is probably a duplicate of <a href="https://pulp.plan.io/issues/1947" class="external">https://pulp.plan.io/issues/1947</a></p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=118172016-05-27T15:17:27Zbmbouterbmbouter@redhat.com
<ul><li><strong>Triaged</strong> changed from <i>No</i> to <i>Yes</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=119682016-06-01T13:05:07Zmhrivnakmhrivnak@redhat.com
<ul><li><strong>Has duplicate</strong> <i><a class="issue tracker-1 status-12 priority-6 priority-default closed" href="/issues/1947">Issue #1947</a>: Concurrent sync of repos is running slow</i> added</li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=119702016-06-01T13:44:54Zbmbouterbmbouter@redhat.com
<ul><li><strong>Sprint/Milestone</strong> set to <i>21</i></li></ul><p>Per Comment 1 I'm putting this on the current Sprint. If possible, the fix should be introduced in the 2.8.z release stream.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=120292016-06-03T13:12:25Zpthomas@redhat.com
<ul></ul><pre><code>I tried to re-sync a rhel6 repo which has --skip rpm, and it takes over an hour to publish errata. It is taking roughly 10s per erratum.
CPU usage hovers around 100%:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3799 apache 20 0 905m 95m 4136 R 98.5 2.0 11:54.38 python
4080 root 20 0 271m 22m 6460 S 1.9 0.5 0:11.51 pulp-admin
23551 mongodb 20 0 3229m 548m 237m S 1.6 11.3 6:55.11 mongod
3186 apache 20 0 1100m 59m 9112 S 1.2 1.2 0:08.21 httpd
Also worth noting that on el7 the same sync completed just fine
[root@mgmt8 ~]# time pulp-admin rpm repo sync run --repo-id rhel6
+----------------------------------------------------------------------+
Synchronizing Repository [rhel6]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.
Downloading metadata...
[\]
... completed
Downloading repository content...
[-]
[==================================================] 100%
RPMs: 0/0 items
Delta RPMs: 0/0 items
... completed
Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed
Importing errata...
[-]
... completed
Importing package groups/categories...
[-]
... completed
Cleaning duplicate packages...
[-]
... completed
Task Succeeded
Copying files
[-]
... completed
Initializing repo metadata
[-]
... completed
Publishing Distribution files
[-]
... completed
Publishing RPMs
[-]
... completed
Publishing Delta RPMs
... skipped
Publishing Errata
[==================================================] 100%
3325 of 3325 items
... completed
Publishing Comps file
[==================================================] 100%
212 of 212 items
... completed
Publishing Metadata.
[-]
... completed
Closing repo metadata
[-]
... completed
Generating sqlite files
... skipped
Publishing files to web
[-]
... completed
Writing Listings File
[-]
... completed
Task Succeeded
real 2m56.243s
user 0m7.112s
sys 0m0.276s
[root@mgmt8 ~]#
</code></pre> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=123082016-06-13T13:53:46Zmhrivnakmhrivnak@redhat.com
<ul><li><strong>Sprint/Milestone</strong> changed from <i>21</i> to <i>22</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=123692016-06-15T15:31:59Zsemyerssean.myers@redhat.com
<ul></ul><p>I think it might be worth going over the history of how we got here. The first change was when we started concatenating errata package lists, which was necessary due to an incorrect assumption that errata with the same id would not appear in multiple repos. You'd sync down el7 and its errata, and presumably those errata link to packages in the synced repo. Then you'd sync el6 and its errata, and the el6 package list would replace the el7 package list. Now those errata only link to el6 packages, so when the el7 repo is republished by pulp, it's linked to errata that now reference el6 packages, and it publishes el6 errata into an el7 repo. To fix this, we started concatenating package lists from multiple repos.</p>
<p>Here's the change, and the related bugzilla:<br>
<a href="https://github.com/pulp/pulp_rpm/pull/625" class="external">https://github.com/pulp/pulp_rpm/pull/625</a><br>
<a href="https://bugzilla.redhat.com/show_bug.cgi?id=1171278" class="external">https://bugzilla.redhat.com/show_bug.cgi?id=1171278</a></p>
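<p>As a rough illustration of the overwrite-versus-concatenate behaviour described above, here is a toy model. The dict layout, the function names, the erratum id and the package strings are all made up for the example; this is not the code from the linked pull request.</p>

```python
def sync_overwrite(erratum, repo_pkglist):
    """Old behaviour (toy model): the most recently synced repo's list wins."""
    erratum["pkglist"] = list(repo_pkglist)


def sync_concatenate(erratum, repo_pkglist):
    """Behaviour after the change (toy model): keep every repo's packages."""
    erratum["pkglist"].extend(repo_pkglist)


# Made-up package names standing in for one erratum seen in two repos.
el7 = ["bash-4.2.46-20.el7.x86_64"]
el6 = ["bash-4.1.2-40.el6.x86_64"]

overwritten = {"id": "RHSA-EXAMPLE", "pkglist": []}
sync_overwrite(overwritten, el7)
sync_overwrite(overwritten, el6)  # the el7 packages are lost here

concatenated = {"id": "RHSA-EXAMPLE", "pkglist": []}
sync_concatenate(concatenated, el7)
sync_concatenate(concatenated, el6)  # both repos' packages survive
```

<p>With overwriting, the erratum ends up naming only el6 packages; with concatenation it names both, which is why a published el7 repo can then list el6 packages unless they are filtered out at publish time.</p>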
<p>For a while, everything seemed to be working, but then reports started coming in about errata referencing packages that aren't in a published repo (because in our example above, the updateinfo package list is now naming el6 packages in a published el7 repo):</p>
<p><a href="https://pulp.plan.io/issues/1366" class="external">https://pulp.plan.io/issues/1366</a><br>
<a href="https://pulp.plan.io/issues/1548" class="external">https://pulp.plan.io/issues/1548</a></p>
<p>My solution to this, seen in 1548, is what most likely causes the slowness, which is to go through the concatenated errata package lists and filter out packages not in the repo being published.</p>
<p>Another errata-related issue, having to do with the syncing of errata metadata even when a repo has already been synced, isn't really related to this at first glance, but it might hold some valuable information to assist in solving this issue. Starting at comment 7, a discussion between ttereshc and me, with input from jluza, reveals some very useful details about what can and cannot be relied upon in the errata data:</p>
<p><a href="https://pulp.plan.io/issues/858#note-7" class="external">https://pulp.plan.io/issues/858#note-7</a></p>
<p>ttereshc's fix here was to modify package list short names to include the pulp repo_id, ensuring that even if the errata unit is shared among multiple repos, we at least have a way to make those packagelist names unique per-repo. This might even give us a way to link packagelists back to a repo by repo_id, but I'm not sure about the reliability of this approach, since it might be based on parsing a string to pull the repo_id out.</p>
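<p>A minimal sketch of pulling a repo_id back out of such a short name by suffix matching, assuming the repo_id is appended to the end of the name. The short name format, the "short_name" key and both function names are hypothetical and only serve the illustration.</p>

```python
def match_repo_id(short_name, repo_ids):
    """Return the longest repo_id that short_name ends with, or None."""
    candidates = [rid for rid in repo_ids if short_name.endswith(rid)]
    return max(candidates, key=len) if candidates else None


def pkglists_for_repo(pkglists, repo_id, repo_ids):
    """Select the package lists whose short name maps back to repo_id."""
    return [
        pl for pl in pkglists
        if match_repo_id(pl["short_name"], repo_ids) == repo_id
    ]


# Hypothetical data: "rhel7" is itself a suffix of "org-rhel7".
repo_ids = ["rhel7", "org-rhel7"]
pkglists = [
    {"short_name": "updates__org-rhel7", "packages": ["a"]},
    {"short_name": "updates__rhel7", "packages": ["b"]},
    {"short_name": "updates", "packages": ["c"]},  # synced before the change
]

selected = pkglists_for_repo(pkglists, "rhel7", repo_ids)
```

<p>Taking the longest match matters when one repo_id is a suffix of another. A short name that matches no repo_id returns None, so such a package list would still need some other way to be filtered.</p>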
<p>I have an idea about doing this reliably: iterate over repo_ids and package list names, take the package list whose short name .endswith() the longest matching repo_id, and pull out the package list for the repo being published. I can elaborate on this if needed, but even if the logic works, it would only work on repos synced since ttereshc's change made it into pulp, so we likely still need my slow solution from <a class="issue tracker-1 status-11 priority-7 priority-high2 closed" title="Issue: published errata contain packages not in repo (CLOSED - CURRENTRELEASE)" href="https://pulp.plan.io/issues/1548">#1548</a> to stick around as a fallback when a package list can't be associated with the repo being published.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=123702016-06-15T15:35:06Zsemyerssean.myers@redhat.com
<ul></ul><p>Related to my previous comment: <a class="issue tracker-1 status-11 priority-7 priority-high2 closed" title="Issue: As a user, I would like to receive updated errata metadata (CLOSED - CURRENTRELEASE)" href="https://pulp.plan.io/issues/858">#858</a> has a target release of 2.8.5, which is being released imminently.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=124652016-06-20T13:54:19Zsemyerssean.myers@redhat.com
<ul></ul><p>I also forgot to mention that <a class="issue tracker-4 status-9 priority-6 priority-default closed" title="Refactor: Refactor errata to be related to repositories (CLOSED - WONTFIX)" href="https://pulp.plan.io/issues/1989">#1989</a> is one way that I think we can solve this.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=124662016-06-20T14:22:26Zbmbouterbmbouter@redhat.com
<ul><li><strong>Status</strong> changed from <i>NEW</i> to <i>ASSIGNED</i></li><li><strong>Assignee</strong> set to <i>bmbouter</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=124672016-06-20T14:28:34Zmhrivnakmhrivnak@redhat.com
<ul></ul><p>Assuming the performance problem is in querying the database, here are some quick thoughts that may or may not be helpful, depending on what is revealed by further investigation:</p>
<p>The problem boils down to: for each RPM listed in an errata package list, pulp needs to know if it is in the repository, and if not, leaves it out of the published xml.</p>
<p>One option is to do one big query of the DB to get all of the NEVRA that are in the repo, and store them in a python set. That could be very quickly queried while iterating through the errata. Sync already does something similar to do dependency resolution, so the memory impact should be tolerable.</p>
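<p>The set-based lookup could look roughly like the sketch below. The NEVRA tuple, the filter_pkglist helper and the sample version strings are assumptions made for illustration, not Pulp's actual models or data; in a real publish the set would be filled from a single database query for the repo.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for the fields that identify an RPM; not Pulp's model.
NEVRA = namedtuple("NEVRA", ["name", "epoch", "version", "release", "arch"])


def filter_pkglist(errata_packages, repo_nevra):
    """Keep only the erratum packages whose NEVRA is present in the repo."""
    return [pkg for pkg in errata_packages if pkg in repo_nevra]


# Built once per publish; membership tests on a set are O(1) on average.
repo_nevra = {
    NEVRA("bash", "0", "4.2.46", "20.el7", "x86_64"),
    NEVRA("openssl", "1", "1.0.1e", "51.el7", "x86_64"),
}

# A package list concatenated from an el6 and an el7 repo (made-up versions).
errata_packages = [
    NEVRA("bash", "0", "4.1.2", "40.el6", "x86_64"),
    NEVRA("bash", "0", "4.2.46", "20.el7", "x86_64"),
]

published = filter_pkglist(errata_packages, repo_nevra)
```

<p>Each membership test avoids a round trip to the database, which is where the saving would come from if the per-package queries turn out to be the bottleneck.</p>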
<p>Something less heavy-handed would be to do a first pass through the errata to collect all the NEVRA referenced by them, then do one big query for just those nevra, and then do a second pass iterating the errata with the knowledge of which RPMs are in the repo. This would have a lower memory footprint.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125512016-06-22T21:44:52Zbmbouterbmbouter@redhat.com
<ul></ul><p>I tested the new, quicker implementation to ensure it was omitting NEVRA from other repos using:</p>
<pre><code># create two repos
pulp-admin rpm repo create --repo-id rhel6 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/6/6Server/x86_64/os/ --download-policy on_demand
pulp-admin rpm repo create --repo-id rhel7 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/7/7Server/x86_64/os/ --download-policy on_demand
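# sync and publish both repos
pulp-admin rpm repo sync run --repo-id rhel6
pulp-admin rpm repo publish run --repo-id rhel6
pulp-admin rpm repo sync run --repo-id rhel7
pulp-admin rpm repo publish run --repo-id rhel7 # <----- this is where I checked the filtering to ensure that rhel6 rpms were filtered out.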
# For example RHSA-2014:1293 and RHSA-2016:1217 both refer to RPMs in EL6 and EL7.
</code></pre> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125722016-06-24T14:10:55Zbmbouterbmbouter@redhat.com
<ul></ul><p>For the following repo with 213 RPMs and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation it takes < 2 seconds.</p>
<p>/content/dist/rhel/server/7/7Server/x86_64/extras/os/</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125742016-06-24T14:14:34Zsemyerssean.myers@redhat.com
<ul></ul><p>bmbouter wrote:</p>
<blockquote>
<p>For the following repo with 213 RPMs and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation it takes < 2 seconds.</p>
<p>/content/dist/rhel/server/7/7Server/x86_64/extras/os/</p>
</blockquote>
<p>If you're able to do before-and-after profiling and have a way to get memory numbers, I think those numbers might look good, too.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125832016-06-24T16:50:56Zbmbouterbmbouter@redhat.com
<ul></ul><p>The new design stores all of a repo's RPM NEVRA as a list of named tuples in memory. For the 213 RPMs, sys.getsizeof() reports 1936 bytes; this figure covers only the list object itself, not the named tuples it references. This variable is populated when the PublishErrataStep is initialized and garbage collected after the publish is complete.</p>
<pre><code>>>> sys.getsizeof(repo_nevra)
Out[7]: 1936
>>> len(repo_nevra)
Out[8]: 213
</code></pre> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125842016-06-24T16:53:20Zbmbouterbmbouter@redhat.com
<ul><li><strong>Status</strong> changed from <i>ASSIGNED</i> to <i>POST</i></li></ul><p>PR available at <a href="https://github.com/pulp/pulp_rpm/pull/916" class="external">https://github.com/pulp/pulp_rpm/pull/916</a></p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125872016-06-24T20:08:33Zbmbouterbmbouter@redhat.com
<ul><li><strong>Status</strong> changed from <i>POST</i> to <i>MODIFIED</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset <a class="changeset" title="Erratum publish reads repo nevra into memory The Errata publish performance became a problem rec..." href="https://pulp.plan.io/projects/pulp_rpm/repository/9/revisions/500341f6d8694dacfbc06e62549232923abd77cc">500341f6d8694dacfbc06e62549232923abd77cc</a>.</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=125882016-06-24T20:09:29Zbmbouterbmbouter@redhat.com
<ul><li><strong>Platform Release</strong> set to <i>2.8.6</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=127802016-07-04T02:27:41Zpthomas@redhat.com
<ul><li><strong>Status</strong> changed from <i>MODIFIED</i> to <i>6</i></li></ul><p>Verified</p>
<p>1. Updated from 2.8.5 to the nightly to make sure that republish is not taking long</p>
<p>2. Verified <a class="issue tracker-3 status-9 priority-6 priority-default closed" title="Story: As a developer, I have an example repository with an importer (CLOSED - WONTFIX)" href="https://pulp.plan.io/issues/16">#16</a></p>
<p>3. Reverified <a href="https://pulp.plan.io/issues/1548" class="external">https://pulp.plan.io/issues/1548</a></p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=129752016-07-12T20:01:42Zsemyerssean.myers@redhat.com
<ul><li><strong>Status</strong> changed from <i>6</i> to <i>5</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=129812016-07-12T20:03:28Zsemyerssean.myers@redhat.com
<ul><li><strong>Status</strong> changed from <i>5</i> to <i>6</i></li></ul><p>I accidentally moved this from VERIFIED back to ON_QA. Sorry for the ticket noise!</p> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=130352016-07-18T19:21:44Zsemyerssean.myers@redhat.com
<ul><li><strong>Status</strong> changed from <i>6</i> to <i>CLOSED - CURRENTRELEASE</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=130562016-07-19T18:59:26Zsemyerssean.myers@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-8 priority-6 priority-default closed" href="/issues/2083">Task #2083</a>: Issues common to 2.9.1 and 2.8 stream</i> added</li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=253202018-03-08T18:54:49Zbmbouterbmbouter@redhat.com
<ul><li><strong>Sprint</strong> set to <i>Sprint 4</i></li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=253402018-03-08T18:57:53Zbmbouterbmbouter@redhat.com
<ul><li><strong>Sprint/Milestone</strong> deleted (<del><i>22</i></del>)</li></ul> RPM Support - Issue #1949: re-publish takes longer than expectedhttps://pulp.plan.io/issues/1949?journal_id=388872019-04-15T20:29:51Zbmbouterbmbouter@redhat.com
<ul><li><strong>Tags</strong> <i>Pulp 2</i> added</li></ul>