Story #1716

As a user, I can have better memory performance on Publish by using SAX instead of etree for comps and updateinfo XML production

Added by bmbouter about 3 years ago. Updated about 1 month ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
% Done:

100%

Platform Release:
2.9.0
Blocks Release:
Backwards Incompatible:
No
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:
Sprint 2

Description

Etree is currently used for the XML production of comps and updateinfo files at publish time. Etree has great features, but it is slow and uses a lot of memory. A user, jluza, has run into several memory issues that were caused by the use of etree. In a fork of Pulp he maintains, replacing the etree usage with SAX resolved the memory issues and reduced the publish time.


Related issues

Related to RPM Support - Issue #1821: 2 GB is not enough RAM to publish the RHEL 5 repository (CLOSED - WONTFIX)

Associated revisions

Revision a43eb31a View on GitHub
Added by ttereshc almost 3 years ago

Add SAX writer to generate XML without etree

re #1716
https://pulp.plan.io/issues/1716

Revision 65f22d16 View on GitHub
Added by ttereshc almost 3 years ago

Generate updateinfo.xml and comps.xml using SAX instead of etree

closes #1716
https://pulp.plan.io/issues/1716

History

#1 Updated by mhrivnak about 3 years ago

It would be helpful to have more detail on this. What were the circumstances when memory problems were seen? How much memory was used by an individual worker? Which version of pulp exhibited this issue?

Looking in pulp 2.8, it appears that where pulp_rpm does need to parse XML during publish, it does so with SAX. See pulp.plugins.util.metadata_writer, which is used by modules in pulp_rpm.plugins.distributors.yum.metadata.

#2 Updated by jluza about 3 years ago

Here's the story. I was working on a tool that was supposed to compare metadata files in the repodata directory. For huge repositories, repodata files can be megabytes, even hundreds of megabytes (unpacked). At first I tried to use the etree library, but my laptop ran out of memory quite quickly, especially since you have to open two copies of repodata to compare them. So I rewrote my tool to use SAX parsing and stored nodes lazily, so the tool read them only when they were needed. That helped for reading, but the diff result also had to be stored somewhere. So I dug further into the SAX library and found the SAX XML generator. That's how I discovered that the SAX generator is much faster than etree.
Then I had the idea that we could rewrite the pulp metadata generator for updateinfo to use SAX instead of etree, because at that time we were trying to improve publish performance in every possible way. After I did that, the comparison showed a significant speed improvement.
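
The SAX reading approach described above can be sketched roughly as follows. This is only an illustration using Python's standard `xml.sax` module, not the tool mentioned in this comment; the `UpdateIdCollector` class, the `collect_update_ids` helper and the `<id>` element layout are assumptions made for the example.

```python
import xml.sax


class UpdateIdCollector(xml.sax.ContentHandler):
    """Hypothetical handler: keeps only the text of <id> elements.

    No tree is built; everything else in the document is discarded
    as soon as the parser has reported it.
    """

    def __init__(self):
        super().__init__()
        self.ids = []
        self._in_id = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "id":
            self._in_id = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_id:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "id":
            self.ids.append("".join(self._chunks))
            self._in_id = False


def collect_update_ids(xml_bytes):
    """Parse the given XML bytes and return the text of every <id> element."""
    handler = UpdateIdCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.ids
```

With this style of handler, memory use depends on what the handler chooses to keep rather than on the size of the whole document.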

I think we've never encountered memory issues due to metadata publishing, or if we have, it was only very occasionally. That doesn't mean it can't happen, and I think it's better to save memory for more useful things than the greedy etree library, not to mention that you also get better performance.

I checked
https://github.com/pulp/pulp_rpm/blob/master/plugins/pulp_rpm/plugins/distributors/yum/metadata/updateinfo.py
and
https://github.com/pulp/pulp_rpm/blob/master/plugins/pulp_rpm/plugins/distributors/yum/metadata/package.py

and etree is still used there. Of course primary, files and other metadata are composed by sticking individual pieces from the db together, so I don't think we could make any perf improvement there. But for comps and updateinfo, I think there's room for improvement.

I will provide you with a patch for the comps and updateinfo SAX generator. It basically replaces the whole package.py and updateinfo.py files and provides a saxwriter library for that. It should be easy to test.
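
For readers unfamiliar with SAX-style generation, here is a minimal sketch of the general idea using the standard library's `xml.sax.saxutils.XMLGenerator`. It is not the saxwriter library from the patch; the `write_updateinfo` function, the dictionary keys and the simplified element layout are assumptions for illustration only.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def write_updateinfo(stream, errata):
    """Hypothetical helper: stream a simplified updateinfo document.

    Each element is written to `stream` as soon as it is produced,
    so no in-memory tree of the whole document is ever built.
    """
    gen = XMLGenerator(stream, encoding="utf-8")
    no_attrs = AttributesImpl({})

    gen.startDocument()
    gen.startElement("updates", no_attrs)
    for erratum in errata:
        gen.startElement("update", AttributesImpl({"type": erratum["type"]}))

        gen.startElement("id", no_attrs)
        gen.characters(erratum["id"])
        gen.endElement("id")

        gen.startElement("title", no_attrs)
        gen.characters(erratum["title"])  # special characters are escaped
        gen.endElement("title")

        gen.endElement("update")
    gen.endElement("updates")
    gen.endDocument()


if __name__ == "__main__":
    buf = io.StringIO()
    write_updateinfo(buf, [{"id": "EXAMPLE-1", "type": "security", "title": "a & b"}])
    print(buf.getvalue())
```

Because `errata` can be any iterable, including a generator over database results, only one erratum needs to be held in memory at a time.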

#3 Updated by mhrivnak about 3 years ago

  • Related to Issue #1821: 2 GB is not enough RAM to publish the RHEL 5 repository added

#6 Updated by mhrivnak about 3 years ago

  • Sprint/Milestone set to 20
  • Groomed changed from No to Yes

#7 Updated by ttereshc about 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to ttereshc

#9 Updated by ttereshc almost 3 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#10 Updated by ttereshc almost 3 years ago

  • Platform Release set to 2.9.0

#12 Updated by semyers almost 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

#13 Updated by bmbouter about 1 year ago

  • Sprint set to Sprint 2

#14 Updated by bmbouter about 1 year ago

  • Sprint/Milestone deleted (20)

#15 Updated by bmbouter about 1 month ago

  • Tags Pulp 2 added
