Story #1716

closed

As a user, I can have better memory performance on Publish by using SAX instead of etree for comps and updateinfo XML production

Added by bmbouter about 8 years ago. Updated about 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Sprint/Milestone: -
Start date:
Due date:
% Done: 100%
Estimated time:
Platform Release: 2.9.0
Groomed: Yes
Sprint Candidate: Yes
Tags: Pulp 2
Sprint: Sprint 2
Quarter:

Description

Etree is currently used to produce the comps and updateinfo XML files at publish time. Etree offers rich features, but it is slow and uses a lot of memory. A user, jluza, has run into several memory issues that were caused by the use of etree. In a fork of Pulp he maintains, replacing the etree usage with SAX resolved the memory issues and reduced the publish time.


Related issues

Related to RPM Support - Issue #1821: 2 GB is not enough RAM to publish the RHEL 5 repository (CLOSED - WONTFIX)
Actions #1

Updated by mhrivnak about 8 years ago

It would be helpful to have more detail on this. What were the circumstances when memory problems were seen? How much memory was used by an individual worker? Which version of pulp exhibited this issue?

Looking in Pulp 2.8, it appears that where pulp_rpm does need to parse XML during publish, it does so with SAX. See pulp.plugins.util.metadata_writer, which is used by modules in pulp_rpm.plugins.distributors.yum.metadata.

Actions #2

Updated by jluza about 8 years ago

Here's the story. I was working on a tool that was supposed to compare metadata files in the repodata directory. For huge repositories, repodata files can be megabytes, even hundreds of megabytes (unpacked). At first I tried to use the etree library, but my laptop ran out of memory quite quickly, considering that you have to open two copies of repodata to compare them. So I rewrote my tool to use SAX parsing and stored nodes lazily, so the tool read them only when they were needed. That helped for reading; however, the diff result also had to be stored somewhere. So I dug deeper into the SAX library and found the SAX XML generator. That's how I discovered that the SAX generator is much faster than etree.
Then I had the idea that we could rewrite the Pulp metadata generator for updateinfo to use SAX instead of etree, because at that time we were trying to improve publish performance in every possible way. After I did that, the comparison results showed a significant speed improvement.
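
For illustration, the SAX reading approach described above can be sketched with Python's standard xml.sax module. This is a minimal sketch, not the comparison tool itself: the UpdateIdCollector class, the collect_ids helper, and the sample element names are hypothetical. The handler keeps only the text of each id element, so memory use does not grow with the size of the whole document the way a full etree does.

```python
import io
import xml.sax


class UpdateIdCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <id> element."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self._in_id = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "id":
            self._in_id = True
            self._chunks = []

    def characters(self, content):
        # characters() may be called several times for one text node.
        if self._in_id:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "id":
            self.ids.append("".join(self._chunks))
            self._in_id = False


def collect_ids(source):
    """Parse a file name or file object as a stream and return the ids."""
    handler = UpdateIdCollector()
    xml.sax.parse(source, handler)
    return handler.ids


# Made-up sample input, standing in for a large repodata file.
sample = io.BytesIO(
    b"<updates><update><id>A-1</id></update>"
    b"<update><id>A-2</id></update></updates>"
)
ids = collect_ids(sample)
```

Here ids ends up as a short list of strings; no element tree is ever held in memory.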

I think we've never encountered memory issues due to metadata publishing, or if we have, it was only very occasionally. That doesn't mean it can't happen, and I think it's better to save memory for more useful things than the greedy etree library, not to mention that you will also get better performance.

I checked
https://github.com/pulp/pulp_rpm/blob/master/plugins/pulp_rpm/plugins/distributors/yum/metadata/updateinfo.py
and
https://github.com/pulp/pulp_rpm/blob/master/plugins/pulp_rpm/plugins/distributors/yum/metadata/package.py

and etree is still used there. Of course, primary, files and other metadata are composed by sticking individual pieces from the db together, so I don't think we could make any perf improvement there. But for comps and updateinfo, I think there's room for improvement.

I will provide you with a patch for the comps and updateinfo SAX generator. It basically replaces the whole package.py and updateinfo.py files and provides a saxwriter library for that. It should be easy to test.
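
For readers unfamiliar with the technique, here is a minimal sketch of SAX-style XML generation using XMLGenerator from Python's standard xml.sax.saxutils module. It is not the patch mentioned above and not Pulp's saxwriter library: the write_updates function, the dictionary keys, and the sample id are assumptions made up for illustration, and the element layout is only loosely modeled on updateinfo. Each element is written to the stream as soon as it is produced, so no complete tree is built in memory.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_updates(stream, updates):
    """Write an updateinfo-like document to `stream`, one element at a time.

    `updates` can be any iterable (including a generator), so records can be
    pulled from a database lazily. Hypothetical helper, not Pulp code.
    """
    gen = XMLGenerator(stream, encoding="utf-8")
    gen.startDocument()
    gen.startElement("updates", {})
    for update in updates:
        gen.startElement("update", {"type": update["type"]})
        gen.startElement("id", {})
        gen.characters(update["id"])  # text is escaped automatically
        gen.endElement("id")
        gen.endElement("update")
    gen.endElement("updates")
    gen.endDocument()


def sample_updates():
    # Made-up record standing in for an erratum unit.
    yield {"id": "EXAMPLE-0001", "type": "security"}


buf = io.StringIO()
write_updates(buf, sample_updates())
xml_text = buf.getvalue()
```

In a real publish, the stream would be an open file rather than an in-memory buffer, so the output never has to fit in memory at once.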

Actions #3

Updated by mhrivnak about 8 years ago

  • Related to Issue #1821: 2 GB is not enough RAM to publish the RHEL 5 repository added
Actions #6

Updated by mhrivnak almost 8 years ago

  • Sprint/Milestone set to 20
  • Groomed changed from No to Yes
Actions #7

Updated by ttereshc almost 8 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to ttereshc

Added by ttereshc almost 8 years ago

Revision a43eb31a | View on GitHub

Add SAX writer to generate XML without etree

re #1716 https://pulp.plan.io/issues/1716

Added by ttereshc almost 8 years ago

Revision 65f22d16 | View on GitHub

Generate updateinfo.xml and comps.xml using SAX instead of etree

closes #1716 https://pulp.plan.io/issues/1716

Actions #9

Updated by ttereshc almost 8 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #10

Updated by ttereshc almost 8 years ago

  • Platform Release set to 2.9.0
Actions #12

Updated by semyers almost 8 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Actions #13

Updated by bmbouter about 6 years ago

  • Sprint set to Sprint 2
Actions #14

Updated by bmbouter about 6 years ago

  • Sprint/Milestone deleted (20)
Actions #15

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
