In a pulp_rpm PR[0], errata were converted to append packages to existing errata from different repos, rather than overwrite the existing errata unit. That was a needed improvement, and related PRs can also be seen in the bugzilla for that[1]. On the publish side, though, this means that we're now listing every unit pulp knows about in a repo's errata, when published, instead of only the units associated with that repo. It is simple enough to apply this filter when publishing a repo, but since errata package lists consist of package names (and thus only their NEVRA), in order to associate a package from an errata with a given pulp repo, we must first determine that unit that corresponds with that package, and then find see if a repo/unit association exists for that unit in the repo being published. If so, include it in updateinfo, if not, don't.
mhrivnak wrote:
Probably we need to filter packages out of the package list at publish time.
Here's a diff where I give that a shot using some minimally-invasive surgery:
diff --git a/plugins/pulp_rpm/plugins/distributors/yum/metadata/updateinfo.py b/plugins/pulp_rpm/plugins/distributors/yum/metadata/updateinfo.py
index 532c9fa..fdce5f6 100644
--- a/plugins/pulp_rpm/plugins/distributors/yum/metadata/updateinfo.py
+++ b/plugins/pulp_rpm/plugins/distributors/yum/metadata/updateinfo.py
@@ -2,6 +2,7 @@ import os
from xml.etree import ElementTree
from pulp.plugins.util.metadata_writer import XmlFileContext
+from pulp.server.db.model.criteria import UnitAssociationCriteria
from pulp_rpm.plugins.distributors.yum.metadata.metadata import REPO_DATA_DIR_NAME
from pulp_rpm.yum_plugin import util
@@ -13,9 +14,10 @@ UPDATE_INFO_XML_FILE_NAME = 'updateinfo.xml.gz'
class UpdateinfoXMLFileContext(XmlFileContext):
- def __init__(self, working_dir, checksum_type=None):
+ def __init__(self, working_dir, checksum_type=None, conduit=None):
metadata_file_path = os.path.join(working_dir, REPO_DATA_DIR_NAME,
UPDATE_INFO_XML_FILE_NAME)
+ self.conduit = conduit
super(UpdateinfoXMLFileContext, self).__init__(
metadata_file_path, 'updates', checksum_type=checksum_type)
@@ -95,13 +97,22 @@ class UpdateinfoXMLFileContext(XmlFileContext):
name_element.text = pkglist['name']
for package in pkglist['packages']:
-
package_attributes = {'name': package['name'],
'version': package['version'],
'release': package['release'],
'epoch': package['epoch'] or '0',
'arch': package['arch'],
'src': package.get('src', '') or ''}
+
+ # If an rpm with this unit key isn't associated with this repo,
+ # don't include in this repository's updateinfo XML
+ package_key = package_attributes.copy()
+ del(package_key['src'])
+ criteria = UnitAssociationCriteria(type_ids=['rpm'], unit_filters=package_key)
+
+ if self.conduit is not None and not self.conduit.get_units(criteria):
+ continue
+
package_element = ElementTree.SubElement(collection_element, 'package',
package_attributes)
diff --git a/plugins/pulp_rpm/plugins/distributors/yum/publish.py b/plugins/pulp_rpm/plugins/distributors/yum/publish.py
index 2df7cbb..a9e7ada 100644
--- a/plugins/pulp_rpm/plugins/distributors/yum/publish.py
+++ b/plugins/pulp_rpm/plugins/distributors/yum/publish.py
@@ -562,7 +562,8 @@ class PublishErrataStep(platform_steps.UnitModelPluginStep):
one that is built into the UpdateinfoXMLFileContext
"""
checksum_type = self.parent.get_checksum_type()
- self.context = UpdateinfoXMLFileContext(self.get_working_dir(), checksum_type)
+ self.context = UpdateinfoXMLFileContext(self.get_working_dir(), checksum_type,
+ self.get_conduit())
self.context.initialize()
# set the self.process_unit method to the corresponding method on the
# UpdateInfoXMLFileContext as there is no other processing to be done for each unit.
...and it totally works! The only problem is that it makes publishing errata go extremely slowly when I test it out. My test used the rhel6 and rhel7 main repos, publishing only rhel7. It takes an almost insignificant amount of time to publish the repo normally. With the diff above applied, it took my poor dev VM all night to finish the job. I didn't measure the times precisely, but the ratio appears to be "very bad", certainly well over the "unacceptably bad" line.
So there are a few different options here. The first, of course, is tolerate extremely slow errata publishes. I hate this option.
Another option is to use a different mechanism that can do the same filtering as in my diff, but waaaaaay faster. There are several[2] bits[3] of code[4] that do things like this on the import side that can be used as a reference for creating something a little more task-specific that gets the job done more quickly for a publish. I'm skeptical of this option, because this is a relational problem in a nonrelational system, and I suspect that there's little that we can do on a per-unit basis to make this faster. Nevertheless, it's probably worth trying a few different methods to see if one manages to get the job done in a timely manner.
The option that I think is most extreme is to do something similar to what I've done when faced with a similar problem, which is to use advanced mongodb features to make changes in one collection based on the contents of another[5]. I think this option might actually be worse than just waiting a few hours for the publish to happen, since it will involve writing more mongo-specific JS, which comes with a couple tons of baggage.
The remaining option might be the best one, which is one I haven't yet thought of. I welcome comments on this issue, and in the meantime will put this down for a little while so I can come back to it fresh and start trying the available options.
[0]: https://github.com/pulp/pulp_rpm/pull/625
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1171278
[2]: https://github.com/pulp/pulp_rpm/blob/43970f1f51302fba1db23483285acbcb6a885714/plugins/pulp_rpm/plugins/importers/yum/existing.py#L64-L89
[3]: https://github.com/pulp/pulp_rpm/blob/43970f1f51302fba1db23483285acbcb6a885714/plugins/pulp_rpm/plugins/importers/yum/associate.py#L90-L125
[4]: https://github.com/pulp/pulp/blob/ec59a2a794b1d59b9f88986952eea73e27be6803/server/pulp/plugins/conduits/mixins.py#L234-L255
[5]: https://github.com/pulp/pulp_rpm/blob/43970f1f51302fba1db23483285acbcb6a885714/plugins/pulp_rpm/plugins/importers/yum/purge.py#L293-L496
published updateinfo only contains units in repo
Errata units in Pulp contain all units in all repos that are linked to errata with the same id, which was resulting in published errata referencing packages that weren't actually available in the published repo. This limits packages in published errata updateinfo XML to only the packages that are contained in the published repo.
fixes #1366 https://pulp.plan.io/issues/1366
fixes #1548 https://pulp.plan.io/issues/1548