Story #985 (closed)

Parent: Story #1883: As a user, I can sync and publish all package types

As a user, I can sync all packages from pypi (complete mirror)

Added by ashbyj@imsweb.com almost 9 years ago. Updated over 3 years ago.

Status: MODIFIED
Priority: High
Assignee: -
Sprint/Milestone:
Start date:
Due date:
% Done: 100%
Estimated time:
Platform Release: 3.0.0
Target Release - Python:
Groomed: No
Sprint Candidate: No
Tags:
Sprint:
Quarter:

Description

To sync all packages from PyPI, bandersnatch[0] (the PyPI mirror tool) will be a good reference.

We will need 2 workflows.

initial sync / force-full sync

Roughly, the workflow is the same as a sync with a whitelist of project names, except this will require an additional call to the simple index to retrieve a list of all projects and their URLs.
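
A rough sketch of that discovery step (not pulp_python code), assuming the plain-HTML simple index at https://pypi.org/simple/, which lists one anchor tag per project; the class and function names below are illustrative:

```python
# Rough sketch of the full-sync discovery step: scrape every project name
# and URL from the simple index. Names here are illustrative only.
from html.parser import HTMLParser
import urllib.request


class SimpleIndexParser(HTMLParser):
    """Collect (project name, href) pairs from the simple index page."""

    def __init__(self):
        super().__init__()
        self.projects = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.projects.append((data.strip(), self._href))


def list_all_projects(index_url="https://pypi.org/simple/"):
    with urllib.request.urlopen(index_url) as response:
        parser = SimpleIndexParser()
        parser.feed(response.read().decode("utf-8"))
    return parser.projects
```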

incremental sync

The XML-RPC PyPI API has a call `changelog_since_serial(since_serial)` which will return all of the projects that have been updated since the last sync. Once we have this, we essentially have our whitelist and sync can proceed as it does in the other cases.

This does present a problem, though: the repository would need a "latest_serial" or something similar. Currently, this could be stored in repository.notes['latest_serial'], but if possible, I would prefer to avoid using the notes field like this. An alternative would require a significant change to pulpcore: typed repositories.
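
A minimal sketch of the incremental-sync discovery step, assuming PyPI's XML-RPC endpoint at https://pypi.org/pypi and that last_serial is read from wherever the repository stores it (the function name is illustrative, not pulp_python code):

```python
# Rough sketch only: turn "changed since serial X" into a whitelist of
# project names plus the new serial to store for the next run.
import xmlrpc.client


def changed_projects_since(last_serial):
    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
    # changelog_since_serial returns (name, version, timestamp, action, serial) tuples.
    entries = client.changelog_since_serial(last_serial)
    names = {entry[0] for entry in entries}
    new_serial = max((entry[4] for entry in entries), default=last_serial)
    return names, new_serial
```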

[0]: https://pypi.org/project/bandersnatch/


Related issues

Related to Python Support - Refactor #6930: Use Bandersnatch to perform package metadata fetching and filtering (MODIFIED, gerrod)

Actions #1

Updated by ashbyj@imsweb.com almost 9 years ago

Warehouse looks great. How stable is it? I see they have a list_packages() query that hopefully can grab a list of packages in one query, but pulp_python may need some refactoring to download multiple packages in one shot instead of looping and downloading each package as a separate request. The mirroring support in Warehouse looks helpful as well.

https://warehouse.readthedocs.org/api-reference/xml-rpc/#package-querying
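
For reference, the query mentioned above can be exercised in a few lines against the XML-RPC endpoint (assuming the current endpoint at https://pypi.org/pypi; the call has since been deprecated and may be throttled or disabled, so this is only an illustration, not plugin code):

```python
# Single request for every project name on the index (deprecated XML-RPC call).
import xmlrpc.client

client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
all_names = client.list_packages()
print(len(all_names), "projects")
```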

Actions #2

Updated by rbarlow almost 9 years ago

  • Tracker changed from Issue to Story
  • Category deleted (21)
  • Groomed set to No
  • Sprint Candidate set to No

ashbyj@imsweb.com wrote:

> Warehouse looks great. How stable is it? I see they have a list_packages() query that hopefully can grab a list of packages in one query, but pulp_python may need some refactoring to download multiple packages in one shot instead of looping and downloading each package as a separate request. The mirroring support in Warehouse looks helpful as well.

Yeah it really does look nice. I'm not sure how stable it is yet, other than that PyPI is not using it yet and they are still developing it. I have considered starting a branch to test the Python plugin against the current deployment, but I'd like to know the answer to your question before going too far with that ☺

As for the refactor - I think that's a good idea!

Actions #3

Updated by amacdona@redhat.com almost 8 years ago

  • Parent issue set to #1883
Actions #4

Updated by amacdona@redhat.com almost 6 years ago

  • Subject changed from As a user, I can sync all packages from pypi to As a user, I can sync all packages from pypi (complete mirror)
  • Tags Pulp 3 added
Actions #5

Updated by amacdona@redhat.com almost 6 years ago

  • Description updated (diff)

From the original post: Basically, I'd like to be able to set up an internal pypi mirror. From our list discussion:

From: pulp-list On Behalf Of Randy Barlow
Sent: Wednesday, May 13, 2015 9:00 AM
To: pulp-list
Subject: Re: [Pulp-list] Sync all packages from PyPi with pulp_python plugin

On 05/13/2015 08:21 AM, Ashby, Jason (IMS) wrote:
> I’m looking to set up a pypi mirror with pulp.  I’m currently
> using Bandersnatch for this, but it’d be nice to drop it and use
> pulp instead.  Per the docs*, I see that you can sync specific
> packages from pypi, e.g.
> 
> pulp-admin python repo create --repo-id pypi --feed https://pypi.python.org/ --package-names numpy,scipy
> 
> but I can’t seem to sync ALL packages.  I tried leaving off the 
> --package-names option, but a sync downloads 0 packages.   Should
> I submit an issue/feature request at 
> https://pulp.plan.io/projects/pulp_python/issues?

Hi Jason!

The problem is that PyPI does not have one single manifest file for
the available package versions, but rather one manifest per package
name. Due to this, in order to sync all packages from PyPI it would be
necessary to make around 45-50,000 web requests just to find out what
would need to be downloaded, and then of course we would need to
perform the actual package downloads.

That said, we are working on a plan to have Pulp be able to lazy fetch
packages as they are requested. This plan will take a long time to
implement (so don't expect it in any of our close releases) but I
think it will solve this problem in a performant way.

Another possible solution may be Warehouse[0]. I've been talking to
the PyPA developers about this problem, and they are aware that it
needs to be solved. They may fix it there, in which case we can get
the Python importer to be aware of all the packages.

I have also considered just doing the 50k requests anyway. I suspect
that PyPI won't like if we do that, but it is technically possible as
well.

I say go ahead and file an RFE. I'll think some more about how we
might be able to get it working. Thanks for the note, and I hope you
enjoy the plugin otherwise!

[0] https://warehouse.python.org/
Actions #6

Updated by amacdona@redhat.com almost 6 years ago

Many of us have expressed concern about this story being polite to PyPI. Some notes from PyCon:

  1. PyPI makes good use of caching.
  2. There are a lot of mirrors (bandersnatch) that regularly sync, so they can handle it. In theory, Pulp could actually reduce the load on PyPI, especially after we work in the lazy feature.
  3. Using changelog_since_serial will allow us to only download new metadata for projects that have changed (see the sketch after this list).
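
A rough sketch of that last point, assuming the per-project JSON API at https://pypi.org/pypi/<name>/json (the helper name is illustrative, not pulp_python code):

```python
# Fetch fresh metadata only for the projects that changed, e.g. the names
# returned by changelog_since_serial; everything else is left untouched.
import json
import urllib.request


def fetch_changed_metadata(changed_names):
    for name in changed_names:
        url = "https://pypi.org/pypi/{}/json".format(name)
        with urllib.request.urlopen(url) as response:
            yield name, json.load(response)
```
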
Actions #7

Updated by bizhang almost 6 years ago

  • Sprint/Milestone set to 3.0 GA
Actions #8

Updated by bmbouter about 5 years ago

  • Tags deleted (Pulp 3)
Actions #9

Updated by CodeHeeler almost 5 years ago

  • Sprint set to Sprint 56
Actions #10

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 56 to Sprint 57
Actions #11

Updated by rchan over 4 years ago

  • Sprint deleted (Sprint 57)
Actions #12

Updated by dalley almost 4 years ago

  • Priority changed from Normal to High
Actions #13

Updated by dalley almost 4 years ago

A little bit of additional context:

  • The XML-RPC APIs mentioned above are considered "deprecated" and not recommended for use, but plenty of people, including bandersnatch, still use them
  • If possible, it would be great if we could utilize bandersnatch as a library, but I haven't evaluated this at all

Upstream issue tracking the JSON APIs intended to replace XML-RPC: https://github.com/pypa/warehouse/issues/284

Actions #14

Updated by dalley almost 4 years ago

  • Related to Refactor #6930: Use Bandersnatch to perform package metadata fetching and filtering added

Added by gerrod over 3 years ago

Revision 5270947a

Pulp now uses Bandersnatch to perform metadata syncing

Sync uses Bandersnatch to perform metadata fetching and filtering, enabling Pulp to sync all of PyPI.

closes: #6930 https://pulp.plan.io/issues/6930
closes: #6875 https://pulp.plan.io/issues/6875
closes: #985 https://pulp.plan.io/issues/985

Actions #15

Updated by gerrod over 3 years ago

  • Status changed from NEW to MODIFIED
  • % Done changed from 0 to 100
Actions #16

Updated by dalley over 3 years ago

  • Platform Release set to 3.0.0
