As a user, I can have better memory performance on Publish by using SAX instead of etree for comps and updateinfo XML production
Etree is currently used for the XML production of comps and updateinfo files at publish time. Etree has great features, but it is slow and uses a lot of memory. A user, jluza, has run into several memory issues caused by the use of etree. In a fork of Pulp he maintains, replacing the etree usage with SAX resolved the memory issues and reduced the publish time.
Updated by mhrivnak over 6 years ago
It would be helpful to have more detail on this. What were the circumstances when memory problems were seen? How much memory was used by an individual worker? Which version of pulp exhibited this issue?
Looking at pulp 2.8, it appears that where pulp_rpm does need to parse XML during publish, it does so with SAX. See pulp.plugins.util.metadata_writer, which is used by modules in pulp_rpm.plugins.distributors.yum.metadata.
Updated by jluza over 6 years ago
Here's the story. I was working on a tool that was supposed to compare metadata files in a repodata directory. For huge repositories, repodata files can be megabytes, even hundreds of megabytes, in size (unpacked). At first I tried to use the etree library, but my laptop ran out of memory quite quickly, since you have to open two copies of repodata to compare them. So I rewrote my tool to use SAX parsing and stored nodes lazily, so the tool read them only when they were needed. That helped for reading, but the diff result also had to be stored somewhere. So I dug further into the SAX library and found the SAX XML generator. That's how I discovered that the SAX generator is much faster than etree.
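To illustrate the reading side, here is a minimal sketch of SAX parsing with Python's standard xml.sax module. It is not the actual comparison tool; the PackageNameHandler class and the sample document are hypothetical and only show that the handler keeps the values it cares about instead of building a whole tree in memory.

```python
import io
import xml.sax


class PackageNameHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self._in_name:
            self._buf.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buf))
            self._in_name = False


# Tiny stand-in for a repodata file (made-up content).
sample = (
    b"<metadata>"
    b"<package><name>foo</name></package>"
    b"<package><name>bar</name></package>"
    b"</metadata>"
)

handler = PackageNameHandler()
xml.sax.parse(io.BytesIO(sample), handler)
names = handler.names
```

Only the collected names stay in memory; the parser discards each element once its events have been delivered.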
Then I had the idea that we could rewrite the pulp metadata generator for updateinfo to use SAX instead of etree, because at that time we were trying to improve publish performance in every possible way. After I did that, the comparison showed a significant speed improvement.
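For reference, a minimal sketch of what SAX-style generation can look like, using XMLGenerator from the standard library. This is not the patch itself: the write_updateinfo function, the dictionary keys, and the erratum ID below are assumptions made up for illustration, and the real updateinfo format has many more elements.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_updateinfo(stream, errata):
    """Hypothetical writer: stream an <updates> document one erratum at a time."""
    gen = XMLGenerator(stream, encoding="utf-8")
    gen.startDocument()
    gen.startElement("updates", {})
    for erratum in errata:  # may be a lazy iterator, e.g. a database cursor
        gen.startElement("update", {"type": erratum["type"]})
        gen.startElement("id", {})
        gen.characters(erratum["id"])
        gen.endElement("id")
        gen.startElement("title", {})
        gen.characters(erratum["title"])  # special characters are escaped
        gen.endElement("title")
        gen.endElement("update")
    gen.endElement("updates")
    gen.endDocument()


# Made-up erratum, for illustration only.
out = io.StringIO()
write_updateinfo(
    out,
    [{"id": "EXAMPLE-0000:0001", "type": "bugfix", "title": "Fix <foo> & bar"}],
)
xml_text = out.getvalue()
```

Each element is written to the stream as soon as it is produced, so memory use does not grow with the number of errata, whereas an etree-based writer holds the whole document until it is serialized.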
I think we've never encountered memory issues due to metadata publishing, or if we have, only very occasionally. That doesn't mean it can't happen, and I think it's better to save memory for more useful things than the greedy etree library, not to mention the better performance you get.
and there's still etree used. Of course, primary, files and other metadata are composed by sticking individual pieces from the db together, so I don't think we could make any perf improvement there. But for comps and updateinfo, I think there's room for improvement.
I will provide a patch for the comps and updateinfo SAX generator. It basically replaces the whole package.py and updateinfo.py files and provides a saxwriter library for them. It should be easy to test.