Story #985

Story #1883: As a user, I can sync and publish all package types

As a user, I can sync all packages from pypi (complete mirror)

Added by ashbyj@imsweb.com over 4 years ago. Updated 2 months ago.

Status: NEW
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone:
Start date:
Due date:
% Done: 0%
Platform Release:
Blocks Release:
Target Release - Python:
Backwards Incompatible: No
Groomed: No
Sprint Candidate: No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint:

Description

To sync all packages from PyPI, bandersnatch[0] (the PyPI mirror tool) will be a good reference.

We will need 2 workflows.

initial sync / force-full sync

Roughly, the workflow is the same as a sync with a whitelist of project names, except that it requires an additional call to the simple index to retrieve a list of all projects and their URLs.
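
As a rough illustration of that extra step, the sketch below extracts project names and URLs from the HTML of a simple index page. It assumes the index is a flat page of anchor tags, one per project, and that the base URL is https://pypi.org/simple/. The names `SimpleIndexParser` and `list_all_projects` are hypothetical and not part of pulp_python; fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleIndexParser(HTMLParser):
    """Collect (project name, absolute URL) pairs from a simple index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.projects = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Anchors without an href are skipped.
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.projects.append((name, urljoin(self.base_url, self._href)))
            self._href = None


def list_all_projects(index_html, base_url="https://pypi.org/simple/"):
    """Return the full project list; base_url is an assumed default."""
    parser = SimpleIndexParser(base_url)
    parser.feed(index_html)
    parser.close()
    return parser.projects
```

The resulting list can then play the role of the whitelist, so the rest of the sync would not need to know that it came from the index rather than from the user.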

incremental sync

The XML-RPC PyPI API has a call `changelog_since_serial(since_serial)` which will return all of the projects that have been updated since the last sync. Once we have this, we essentially have our whitelist and sync can proceed as it does in the other cases.
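
A minimal sketch of how the changelog could be turned into a whitelist follows. It assumes each changelog entry is a sequence whose first element is the project name and whose last element is the serial, and that `client` is any object exposing `changelog_since_serial` (for example an `xmlrpc.client.ServerProxy`). The helper name `changed_projects` is hypothetical.

```python
def changed_projects(client, since_serial):
    """Return (sorted project names, newest serial seen) since since_serial.

    Assumption: each event looks like (name, version, timestamp, action, serial).
    """
    events = client.changelog_since_serial(since_serial)
    names = set()
    latest = since_serial
    for event in events:
        name, serial = event[0], event[-1]
        names.add(name)  # a project may appear in several events
        latest = max(latest, serial)
    return sorted(names), latest


# Possible usage against a live server (not executed here):
#   import xmlrpc.client
#   client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
#   whitelist, new_serial = changed_projects(client, last_synced_serial)
```

Returning the newest serial alongside the names means the caller can persist it only after the sync succeeds, so a failed sync would be retried from the old serial.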

This does present a problem, though. The repository would need a "latest_serial" field or something similar. Currently, this could be stored in repository.notes['latest_serial'], but if possible, I would prefer to avoid using the notes field like this. An alternative would require a significant change to pulpcore: typed repositories.

[0]: https://pypi.org/project/bandersnatch/

History

#1 Updated by ashbyj@imsweb.com over 4 years ago

Warehouse looks great. How stable is it? I see they have a list_packages() query that hopefully can grab a list of packages in one query, but pulp_python may need some refactoring to download multiple packages in one shot instead of looping and downloading each package as a separate request. The mirroring support in Warehouse looks helpful as well.

https://warehouse.readthedocs.org/api-reference/xml-rpc/#package-querying

#2 Updated by rbarlow over 4 years ago

  • Tracker changed from Issue to Story
  • Category deleted (pulp-admin)
  • Groomed set to No
  • Sprint Candidate set to No

ashbyj@imsweb.com wrote:

Warehouse looks great. How stable is it? I see they have a list_packages() query that hopefully can grab a list of packages in one query, but pulp_python may need some refactoring to download multiple packages in one shot instead of looping and downloading each package as a separate request. The mirroring support in Warehouse looks helpful as well.

Yeah it really does look nice. I'm not sure how stable it is yet, other than that PyPI is not using it yet and they are still developing it. I have considered starting a branch to test the Python plugin against the current deployment, but I'd like to know the answer to your question before going too far with that ☺

As for the refactor - I think that's a good idea!

#3 Updated by amacdona@redhat.com over 3 years ago

  • Parent task set to #1883

#4 Updated by amacdona@redhat.com over 1 year ago

  • Subject changed from As a user, I can sync all packages from pypi to As a user, I can sync all packages from pypi (complete mirror)
  • Tags Pulp 3 added

#5 Updated by amacdona@redhat.com over 1 year ago

  • Description updated (diff)

From the original post: Basically, I'd like to be able to set up an internal pypi mirror. From our list discussion:

From: pulp-list On Behalf Of Randy Barlow
Sent: Wednesday, May 13, 2015 9:00 AM
To: pulp-list
Subject: Re: [Pulp-list] Sync all packages from PyPi with pulp_python plugin


On 05/13/2015 08:21 AM, Ashby, Jason (IMS) wrote:
> I’m looking to set up a pypi mirror with pulp.  I’m currently
> using Bandersnatch for this, but it’d be nice to drop it and use
> pulp instead.  Per the docs*, I see that you can sync specific
> packages from pypi, e.g.
> 
> pulp-admin python repo create --repo-id pypi --feed 
> https://pypi.python.org/ --package-names numpy,scipy
> 
> but I can’t seem to sync ALL packages.  I tried leaving off the 
> --package-names option, but a sync downloads 0 packages.   Should
> I submit an issue/feature request at 
> https://pulp.plan.io/projects/pulp_python/issues?

Hi Jason!

The problem is that PyPI does not have one single manifest file for
the available package versions, but rather one manifest per package
name. Due to this, in order to sync all packages from PyPI it would be
necessary to make around 45-50,000 web requests just to find out what
would need to be downloaded, and then of course we would need to
perform the actual package downloads.

That said, we are working on a plan to have Pulp be able to lazy fetch
packages as they are requested. This plan will take a long time to
implement (so don't expect it in any of our close releases) but I
think it will solve this problem in a performant way.

Another possible solution may be Warehouse[0]. I've been talking to
the PyPA developers about this problem, and they are aware that it
needs to be solved. They may fix it there, in which case we can get
the Python importer to be aware of all the packages.

I have also considered just doing the 50k requests anyway. I suspect
that PyPI won't like if we do that, but it is technically possible as
well.

I say go ahead and file an RFE. I'll think some more about how we
might be able to get it working. Thanks for the note, and I hope you
enjoy the plugin otherwise!

[0] https://warehouse.python.org/

#6 Updated by amacdona@redhat.com over 1 year ago

Many of us have expressed concern about whether this story is polite to PyPI. Some notes from PyCon:

  1. PyPI makes good use of caching.
  2. There are a lot of mirrors (bandersnatch) that sync regularly, so PyPI can handle it. In theory, Pulp could actually reduce the load on PyPI, especially after we work in the lazy feature.
  3. Using changelog_since_serial will allow us to download new metadata only for projects that have changed.

#7 Updated by bizhang over 1 year ago

  • Sprint/Milestone set to 3.0 GA

#8 Updated by bmbouter 6 months ago

  • Tags deleted (Pulp 3)

#9 Updated by CodeHeeler 3 months ago

  • Sprint set to Sprint 56

#10 Updated by rchan 2 months ago

  • Sprint changed from Sprint 56 to Sprint 57

#11 Updated by rchan 2 months ago

  • Sprint deleted (Sprint 57)
