Project

Profile

Help

Story #6134

[EPIC] Importers/Exporters

Added by daviddavis about 2 months ago. Updated 1 day ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

25%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
No
Sprint Candidate:
No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:

Description

An epic for the next batch of importer/exporter stories for Katello.

After an import, the destination should have a repo version that is exactly the same as the exported repo version.

Collaboration on the design is happening here: https://hackmd.io/@ggainey/HyfXU_648


Subtasks

Story #6135: As a user, I can export a set of repository versions to a file, and have that file saved to disk.ASSIGNEDggainey

Actions
Story #6136: As a user, I will receive only the new artifacts if I export using a PulpExporter that has exported beforeNEW

Actions
Story #6137: As a user, I can import an export and have its contents be added to existing repositories to produce new repository versions.ASSIGNEDdaviddavis

Actions
Story #6138: As a user, I can upload an incremental export and specify a repository version to base it onCLOSED - WONTFIX

Actions
Story #6328: As a user, I can create/read/update/delete PulpExportersMODIFIEDdaviddavis

Actions
Story #6329: As a user, I can create/read/update/delete PulpImportersPOSTdaviddavis

Actions
Task #6364: Add docs for performing imports/exportsNEW

Actions
Story #6436: As a plugin writer, I can write custom resources to customize what content data gets exportedNEW

Actions

Related issues

Related to Pulp - Story #5096: [epic] As a user, I can export the content of a RepositoryVersion from one Pulp3 system and import on an air gapped Pulp3 system CLOSED - DUPLICATE Actions

History

#1 Updated by daviddavis about 2 months ago

  • Subject changed from Importers/Exporters to [EPIC] Importers/Exporters
  • Description updated (diff)

#2 Updated by daviddavis about 2 months ago

  • Description updated (diff)

#3 Updated by ggainey about 2 months ago

Notes from initial design meeting 2020-02-14:

Pulp3 exporters explanation

  • only for filesystem
    • exports to disk somewhere
  • talked about rsync exporter
    • same but uses rsync to elsewhere
  • not for re-importing, for external consumption
  • file repo
    • can't export version - list-of-content, but no metadata
    • current exporters only know how to export publications (which is what adds metadata)
    • master/detail (pulpcore/plugin) used for exporters
  • maybe use publish first?
    • publish == create-a-publication
    • publish code doesn't exist in pulpcore? - code is in plugins
  • not all plugins differentiate between version/publication
  • FileSystemVersion vs ...PublicationExporter
    • PubExp includes metadata we don't need
    • FS doesn't have data-from-db only from filesystem
  • Therefore - need a third-kind of Exporter?
    • RepositoryVersionExporter

two approaches

  • Master baseclass to handle grunt filesystem work
    • Plugins extend to expose/provide the API
    • this is The Pulp3 Way
    • could have most of the 'heavy lifting' happening in Master, even with API controlled/exposed by Detail
  • expose API at pulpcore level

django import/export

  • just handles models
  • however, export needs two pieces
    • dataset in db for all content-items in repo-version
    • artifacts
    • example:
      • RPM - errata from DB, and the RPMs themselves from filesystem
  • need not just content-units but also relationships
    • relationship between content and artifact
    • cross-content-unit relationships
    • certain content-types have relationships to other content types
    • can we rely on uuids-as-keys for db export/import work?
    • or, do we need to export by 'natural key' (eg, NEVRA or NSVCA or errata-name etc)

general notes

  • will prob want to apply to existing FileSystemExporter (once we know what we're doing)
  • diff-exports - export diff-metadata or full-metadata?
    • prob full - puts onus of set-theory on importer, and gives enough info to make that possible
  • 3 questions to be emailed to list:
    • master/detail vs core
    • natural key or uuid?
    • incremental export
      • dump 'all' db-metadata? (importer does set-theory to handle added/updated/removed)
      • dump just the differences? (exporter does set theory)
      • always dump just the incremental artifacts

AIs:

  • ggainey to add notes to epic
  • ddavis to send note to pulp-dev w/questions and pointer

#4 Updated by ggainey about 1 month ago

Notes from design discussion 2020-02-21

Attendees

  • ggainey
  • daviddavis

Use django import/export as basis

Ownership and workflow

  • core starts from repo-version, which 'knows about' all the artifacts - so core can be responsible for packaging up the physical on-disk entities
  • plugins will need ModelResources to define how to export/import the database metadata that matches a repo-version
  • who owns RAR-ing resulting exported filefile? katello? us? P^3I?
    • katello/caller would own this part of the process (and the re-creating at import as well)
  • what if plugin can't handle export-import?
    • core needs to be able to call a specific per-repo-version-type method to export/import
    • on error/exception/NotImplemented, return error to the caller
  • need to think about pre-export-sanity-checks (eg, disk space)

What does the API look like?

  • Possible /import /export endpoints
    • just a repository-version
    • a list of repository-versions
    • latest for a repository * what about "everything* for a repo (all versions/distributions/publications)
      • is this a real use case?
  • probable first cut is "you specify a specific repo-version" (export) and "specify a repository" (import)
  • need to know/talk about 'natural' keys for things 
    • if downstream can be relied on to never create content-artifacts 'on its own' , can we rely on uuid-to-natural-key being "the same" between up and downstream?

Artifact transfers

  • how to insure a given file/artifact only gets transferred once in the presence of multi-version-apis?
  • start with single, but make sure we don't architect-out multi-version/content-once approach next
  • exporting distributions - relative-path in pulp - needs to be exported
  • publications - export? or publish downstream? *pulpcore doesn't know about publishing

edge cases

  • There will be several/many - ponder on workflow/complicated plugins and start thinking about general answers

Notes from design discussion 2020-02-24

Attendees

  • ggainey
  • daviddavis

Django import/export discussion

  • how does it handle complicated FK relationships, esp at import-time?
  • how is import-order defined?
  • Need to look at real-world cases and prototype

Design doc draft

#5 Updated by ggainey about 1 month ago

Notes from design discussion 2020-02-27

daviddavis tried out import/export

  • (JSON example)[https://gist.github.com/daviddavis/f35ec8f0225585e4f137cf4e3aad9cc2]
  • foreign-key: exports the FKID
  • many-to-many: string of comma-separated-list of FKIDs (ie, "associated-foos": "foo-id-1,foo-id-2,foo-id-3")
  • as long as FKs are uuids, and not an internal-to-0this-db-only entity, should work for us

katello export use-cases

  • katello needs us to export multiple-repo-versions in order to export All The Things in a Content View at once
  • downstream not-just-like upstream 
    • repos have diff names, for example (pulp-uuids all in same content-view being imported) 
  • how do we handle mapping export-repo-version to appropriate destination-repo?
    • import-api must allow a destination-repo to be specified, for all repo-versions in the import-file
    • requires a way to have/generate a mapping for each repo-version in the export
    • export-side shouldn't have to 'know' this in advance - so must be handled on import-side
  • katello's usecase has enough info to fill in this info
  • for non-katello pulp user, must have a way for the user to define this mapping and hand it to the import side
  • regardless, import/ sides needs two params: tarfile-path, mapping

import-side discussion

  • dryrun could generate a mapping of all repo-versions being imported into repos with the same names as they had on the export-side.
    • User can manipulate/update/change that mapping to their hearts' content, then use it to do the 'real' import
  • import needs a dry run that accepts a mapping as a test vehicle and spits out what it would do with that mapping,
    • for sanity- and error-checking prior to actual-import errors (bad json, dest-repo doesn't exist, whatever)

triple-disk-problem

  • artifacts/content can be taking up disk space three times
    • in pulp
    • in temp-export-dir
    • in tarfile
  • user needing three times current pulp-usage to do exports is Painful
  • how do we avoid this?
    • linked streams?
    • holding entire tarfile in memory at once is a nonstarter

naming discussion

  • RE this question:

Currently, Pulp has the concept of Exporters (filesystem, rsync, etc) which are implemented as Master/Detail. This was done to accommodate the fact that some plugins will need to export publications while others might export repository versions. Do we divorce the concept of import/export and Exporters? Or bring the Exporters inline with import/export by looking into having core handle Exporters?

  • Exporter is a different , pulp-to-end-user functionality, import/export is pulp-to-pulp. Is there a better pairing than import/export to avoid this name clash? What if we rename current-exporter to Publisher?

  • email sent to ongoing importer/exporter discussion on pulp-dev@

#6 Updated by daviddavis about 1 month ago

  • Related to Story #5096: [epic] As a user, I can export the content of a RepositoryVersion from one Pulp3 system and import on an air gapped Pulp3 system added

#7 Updated by ggainey about 1 month ago

Notes from 2020-02-28 design doc

attendees: ggainey, daviddavis,bmbouters, dkliban

  • procedural issue - need to extend more invites - ggainey and daviddavis strongly concur
  • how are users going to use the functionality
  • what's the workflow? 
  • how do we support rollback (upstream had five versions, exported one, now want to rollback) 
  • what about just exporting each of the versions- diffs
  • need to import version in the right order
    • if pulp controls multi-version-export, can control multi-version-import - can make sure versions are imported in correct order
  • close 5096 (but make sure we've captured all the knowledge there)
  • testing of django-import-export
    • csv and json both supported
    • FK/manytomany kind of simple
  • what's the easiest impact on the pluginwriter
    • can we dynamically create modelresources if needed?
  • how many steps does the user take?
  • exporter can't know anything about the downstream
  • two types of export
    • repos that pulp can "just sync"
    • trying to keep a replica of a pulp instance
    • some repo-types have no at-rest state
  • import use-cases
    • repo doesn't exist
    • repo exists but is the wrong type
    • repo exists and is the right type
    • what if there's a malicious repo on the importer?
      • this has to be fixed by ACLs/RBAC/authorization
  • do we require the user to create an ExportEntity?
    • would need to if we care about history
    • "export everything since the last export"
    • need to be able to CRUDL these
    • s that new code? or do we somehow get this 'for free'?
  • if we have Exporter object, then this and current-exporters are the same kind-of thing
    • PulpExporter and PulpImporter
  • dry-run is really important, and probably moreso on the Importer
  • track sha256 of export-file and at import-time
  • Export model relies on a History model
  • export-history actually matters to all Exporters
    • need to make this stuff happen for all Exporters
  • lets make sure we can get contributions from Brno
    • have at least one meeting in AM EST
  • ggainey to massage gdoc with output by Monday

#8 Updated by ggainey about 1 month ago

2020-03-02

attendees: ggainey davis bmbouters dkliban ttereshc ipanova

  • ContentModel vs Just-a-Model
  • create an exporter, then call/invoke an exporter
  • can thus get incrementals for free
  • can override and ask "do the last thing over again"
  • restoring last-exported-version - can get complicated?
  • need to add labels (what did this mean? gg)

one tarball per repo?

  • no - doesn't handle the content-deduplicate issue or multi-file-issue for katello

publications/distributions: do we really need to do this?

  • p3 creates publications, and then creates distributions that points to that publication
  • leave for 'later' and a 'real' use case
  • secure environments - what about when you sign the metadata and the downstream can't do the signing
  • some plugins don't have publications (see live-api plugins)

incrementals question:

  • if we can import into any repo, do we support additive? or mirror? how does this interact with "import into this base-version?"
  • export/import wants to leave the 'downstream' as an 'exact copy' of whatever was exported
  • can we 'rely on' downstream version-numbers matching the upstream-version-numbers?
  • katello doesn't care
  • how can we guarantee content is the same, in the presence of incrementals?
  • at the end of the day, user has to know?

Publishing design-doc for comments

  • write the design
  • export as PDF
  • attach to epic
  • write subtasks referring directly to pdf/hackmd
  • wiki page in redmine?

2020-03-03

attendees: ggainey davis bmbouters dkliban ttereshc ipanova

importers

  • what do they look like?
  • how they decide "this is an incremental"
    • does it even need to care?
    • has full-db-metadata always
  • needs a mapping of my-repo to upstream-repo
  • can't find repo?
    • create or error?
    • what about the empty-downstream-case?
    • dry-run to catch errors
    • is there any missing data?
      • how are we going to handle errors?
      • what about plugin-extra-fields? (eg subrepo)
    • can we add this later?
      • sounds like a phase-2 or -3 thing
      • get help from community to add this
    • validation first
      • if anything is wrong, fail the entire operation immediately
      • will need some bad-export-tests
    • validation/import needs to happen
      • needs to lock all affected repos at start?
      • validate-and-lock on repo at a time?
      • is an import an atomic operation or not?
      • what happens if you reimport something you already imported?
        • use stage-api to make it possible to re-import
      • what are some unfixable errors?
        • artifact from a prev export was deleted from downstream - continue, or fail-the-repo?
          • switchable mode? - current default is safety-first, option to report-and-continue
          • question is do we create a repo-version if we have an artifact failure
          • so switch on "create a repo-version on a failure, or not" (like sync)
            • switch is on inside-repo problem, not on entire-version
  • how do we define import-order
    • pre-import-per-row hooks?
    • specify the models in an ordered list?
    • we may need to look at how the code works

API and http verbs

  • POST - create
  • PATCH - update
  • GET
  • add verb to API for "do the export"?
  • POST to pulp_exporters//export - Does The Thing
    • returns a task
    • task-created resource of  HREF for specific instance of an-export pulp_exporters//export/
    • create-an-export vs have-an-export vs have an instance-of-an-export
  • Export needs to include sha256 of created-archive

#9 Updated by ggainey 30 days ago

  • Description updated (diff)

#10 Updated by ggainey 28 days ago

Q&A with katello team 2020-03-06

attendees:  ttereshc ipanova ggainey bmbouters dkliban daviddavis croberts jsherril jturel

  • dryrun discussion
    • maybe don't need 'immediately' - but soon
    • should always do the dry-run equivalent on import first ?
    • RE fail-on-error?
      • can't find repo? - fatal
      • can't find artifact? - warn and continue?
      • task will lock all involved repos at start-time
      • can we 'rollback' in case of failure? (no, not really)
        • will import be idempotent?
          • yes should be
        • recoverable ("we have the data but something transient went wrong")
        • non-recoverable - missing data in export?
        • discussion about where we might be able to catch this/validate export/import
        • prob needs more pulp3-dev-discussion, katello seems to be ok with "do your best and report any errors"
  • how do we let the user delete the file(s) associated with an Export?
  • how do we deliver the file to the user? (assume user does not have shell-access to pulp server )* how do we cut up the file?
    • thought that was katello's thing? - no, alas
    • are we owning 'iso generation'? - no, but do own 'max file size' problem
  • Pulp3 needs to own the ability to specify image-size and respond appropriately
    • to avoid triple-storage, pulp has to solve this problem
    • we need to find a dataformat that solves the split/recombine problem
    • initial tech-prev release does not need this (as long as we can add )
  • history-via-export - actually wants to be per-repo?
    • katello may keep this info so we dont' have to
    • maybe first phase impl is 'immutable exporters' (consensus is 'yes')
  • timeline/roadmap
    • pulpcore 3.3 is end of March - want for end-of-May-release for katello
    • katello: prob unable to start integrating before end of April
    • katello use-of requirement would be July
    • can we get katello involved earlier? - jturel says yes
    • so 'tech preview' for 3.3
    • add 'split into multiple files' in April(ish)?

Please register to edit this issue

Also available in: Atom PDF