Story #6134: [EPIC] Pulp import/export
As a user, I can export into a series of files of a particular size
This is to support the use case where Katello wants to export a series of files of a given size that can be recombined and imported later.
This would allow Katello to avoid the 3x disk usage problem whereby three copies of an artifact exist: the original one, the exported one, and the one in the split archives.
Updated by ggainey over 3 years ago
IRC conversation that spurred this issue:
<partha> jsherrill: jturel ggainey so one of the stories I was looking at regarding import/export was the ability to break stuff into smaller batch sizes . Can ya repository content span more than a dv iso ?
<partha> in size
<partha> I guess a content view version can right ?
<ggainey> pulp-export drops a single .tar.gz - prev discussion said splitting would be handled by the caller
<ggainey> a single repo-version can certainly be bigger than 4Gb - consider "export all of RHEL7", for example
<jsherrill> ggainey: i thought we talked about pulp doing that because that was the only way to avoid 3x the disc space
<jsherrill> i know originally we said pulp wouldn't, but then i think that dawned on us
<ggainey> jsherrill: pulp is avoiding the 3x disk space problem, by writing directly to a tar.gz instead of to a filesystem-tree and then tar.gz-ing at the end
<jsherrill> ggainey: right but the entire solution needs to avoid the 3x disk space issue
<ggainey> jsherrill: how can it? katello has its own data to export, on top of/in addition to what pulp owns
<jsherrill> ggainey: we just have a simple metadata file, that could easily ship beside the exported data, or appended to a tar file potentially
<ggainey> but you can't "append to the tarfile" until that file exists.
<jsherrill> right, it just depends on how the files are split and if the format supports appending, but if not like i said we can just ship it beside the file(s)
<jsherrill> i swear we talked about all this ;)
<ggainey> ...
<ggainey> we did - and the outcome (that I heard, at least) was "pulp will produce a .tar.gz and it will be the caller's (ie katello's) responsibility to handle splitting it up however it wants"
<ggainey> which is what I've been saying/documenting ever since
<jsherrill> that was, until we realized we couldn't solve the 3x disk space problem that way
<ggainey> which is why, for example, the API docs have no mention of ever accepting a "chunk/split size"
<jsherrill> i heard it more as a 'future feature'
<jsherrill> i know its not there now
<jsherrill> i didn't have any expectation of it being there now
<ggainey> so, just for transparency's sake, today has been a full day already of "oh but I meant <something else>" about things that I have been documenting and explaining and emailing about for months now
<ggainey> sigh
<jsherrill> i'm not sure how we avoid the 3x disk space issue without pulp doing it?
<jsherrill> which was the point right?
<jsherrill> the user doesn't care if some sub-system avoids the 3x disk space
<jsherrill> if the overall solution doesn't
<ggainey> the "three times" thing today, as it has been explained to me, is "in pulp, in filesystem-export-tree, in resulting .tar.gz"
<jsherrill> i'm sorry if that was explained incorrectly then :(
<ggainey> one can then use, say, the os 'split' command to cut that tar.gz up into chunks to move to some medium
<jsherrill> but i mean its possible i'm misremembering this conversation
<ggainey> this is why I basically beg people to write things down/add to docs/answer email threads
<ggainey> it's fine
<jsherrill> i think this was during a meeting?
<jsherrill> when we had this realizatipn
<jsherrill> realization
<jsherrill> sorry that it did not come sooner :)
<ggainey> if it's not in the doc, then I am not currently planning to do it
<jsherrill> during docs/email review
<ggainey> so - is this an immediate requirement?
<jsherrill> i think we have to discuss it?
<ggainey> well, I think the discussion would be about time, resources, and priority - but it has to come *after* we have a definition of the actual use-case
<ggainey> esp since that's where the confucion is happening, not on the scheduling end
<ggainey> confusion, even
<daviddavis> ggainey jsherrill I know we talked about it. it was my understanding that it was a future feature and not something we had to solve right away. who can we ask about whether it needs to be soon?
<jsherrill> daviddavis: and that may have been
<jsherrill> we didn't do a good job of establishing what 'future feature' meant
<jsherrill> which future ;)
<daviddavis> agreed
<partha> I mean if we have a 20GB cv dump and have to split it into 5 on disk ourselves wouldn't we need 2x the space ?
<jsherrill> 3x includes the original content
<jsherrill> iirc
<partha> good point
<daviddavis> is it not possible to split a file in place?
<ggainey> daviddavis: split, when used against a file, needs the space for the original and the parts, until it's done (iirc)
<jsherrill> it looks like it might be possible with dd
<jsherrill> https://superuser.com/questions/177823/are-there-any-tools-in-linux-for-splitting-a-file-in-place
<jsherrill> its really ugly though ;)
<partha> was looking at the same lol
<ggainey> anyway - we can do this, I think, for pulpexport by opening the tarfile 'w|', and streaming it to the split() cmd directly
<jsherrill> its 2x+size of one chunk
<jsherrill> which is a lot better than 3x
<jsherrill> ggainey: would that avoid 3x?
<ggainey> jsherrill: yes, because the tarfile never ends up on disk at all
<ggainey> of course, you're now keeping one 'chunk' in *memory* at a time...
<daviddavis> I filed https://pulp.plan.io/issues/6736
<ggainey> (assuming this works, it prob does but would have to see it working)
<jsherrill> lets keep the conversation going, its possible we can do it on our side given that option with dd
<daviddavis> is that less effort though?
<ggainey> jsherrill: not and prevent 3x - you don't have the file until export is done
<ggainey> oh wait, I see
<ggainey> you have 2x (in pulp, in file) and then you end up with 2x-plus-chunk
<ggainey> I get it
<ggainey> sorry, slow today
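The streaming idea from the conversation (open the tarfile in 'w|' mode and never let the full archive land on disk) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code that was merged: ChunkedWriter, export_chunked, and the ".0000"-style suffix are hypothetical names, and the splitting is done in pure Python instead of piping to the OS split command. The only library feature relied on is that tarfile's stream modes write sequentially to whatever fileobj they are given.

```python
import tarfile


class ChunkedWriter:
    """Hypothetical write-only file object that starts a new file
    every chunk_size bytes, so no single full-size archive exists."""

    def __init__(self, base_path, chunk_size):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.base_path = base_path
        self.chunk_size = chunk_size
        self.paths = []          # chunk files created so far, in order
        self._current = None     # currently open chunk file
        self._written = 0        # bytes written to the current chunk

    def _roll(self):
        # Close the finished chunk and open the next numbered one.
        if self._current is not None:
            self._current.close()
        path = f"{self.base_path}.{len(self.paths):04d}"
        self._current = open(path, "wb")
        self.paths.append(path)
        self._written = 0

    def write(self, data):
        view = memoryview(data)
        total = len(view)
        while len(view):
            if self._current is None or self._written == self.chunk_size:
                self._roll()
            room = self.chunk_size - self._written
            piece = view[:room]
            self._current.write(piece)
            self._written += len(piece)
            view = view[len(piece):]
        return total

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def export_chunked(source_dir, base_path, chunk_size):
    """Stream source_dir into a gzip'd tar, split into chunk files."""
    writer = ChunkedWriter(base_path, chunk_size)
    try:
        # "w|gz" is tarfile's non-seekable stream mode: it only calls
        # write() on the object it is handed.
        with tarfile.open(fileobj=writer, mode="w|gz") as tar:
            tar.add(source_dir, arcname="export")
    finally:
        writer.close()
    return writer.paths
```

With this shape, disk usage stays at the original content plus the chunks themselves, and the chunks can be recombined by simple concatenation in order (for example with cat) before import. In this sketch only tarfile's internal buffer is held in memory, not a whole chunk.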
Added by ggainey over 3 years ago
Teach exporter to understand, validate, and respect chunk_size= parameter.
Since we can now have multiple output files, replaced the filename/sha256 columns with output_file_info, a JSONField holding a dictionary of filename: hash pairs.
closes #6736
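To make the two pieces of that change concrete, here is a minimal sketch of validating a chunk_size= value and building a filename: hash dictionary of the kind output_file_info is described as holding. The helper names, the accepted unit suffixes, and the use of 1024-based units are assumptions made for illustration; the actual exporter may parse and store these differently.

```python
import hashlib
import re

# Assumed unit table (1024-based); the real exporter's units may differ.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_chunk_size(value):
    """Hypothetical validator: turn '512KB' or '4GB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?B)\s*", str(value).upper())
    if not match:
        raise ValueError(f"invalid chunk_size: {value!r}")
    size = int(match.group(1)) * _UNITS[match.group(2)]
    if size <= 0:
        raise ValueError("chunk_size must be greater than zero")
    return size


def build_output_file_info(paths, block_size=1024 * 1024):
    """Return {filename: sha256 hex digest} for each output file."""
    info = {}
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Hash in blocks so large chunks are never fully in memory.
            for block in iter(lambda: f.read(block_size), b""):
                digest.update(block)
        info[path] = digest.hexdigest()
    return info
```

A dictionary keyed by filename fits the multi-file case naturally: a single-file export is just a dictionary with one entry, and an importer can verify each chunk before recombining them.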