Story #985
Story #1883 (closed): As a user, I can sync and publish all package types
As a user, I can sync all packages from pypi (complete mirror)
100%
Description
To sync all packages from PyPI, bandersnatch[0] (the PyPI mirror tool) will be a good reference.
We will need 2 workflows.
initial sync / force-full sync
Roughly, the workflow is the same as a sync with a whitelist of project names, except that it requires an additional call to the simple index to retrieve a list of all projects and their URLs.
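The extra call to the simple index could look roughly like the sketch below. It assumes the index is an HTML page (for example the one served at https://pypi.org/simple/) with one anchor per project. `SimpleIndexParser` and `list_projects` are hypothetical names for illustration, not existing pulp_python code, and the HTTP fetch itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleIndexParser(HTMLParser):
    """Collect (project name, absolute URL) pairs from a simple index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.projects = []
        self._href = None  # href of the anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text inside an anchor is taken to be the project name.
        if self._href is not None and data.strip():
            url = urljoin(self.base_url, self._href)
            self.projects.append((data.strip(), url))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def list_projects(index_html, base_url):
    """Return every project listed on the index page as (name, url)."""
    parser = SimpleIndexParser(base_url)
    parser.feed(index_html)
    parser.close()
    return parser.projects
```

The resulting list plays the role of the whitelist, so the rest of the sync could proceed as it does for an explicit list of project names.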
incremental sync
The XML-RPC PyPI API has a call `changelog_since_serial(since_serial)` which will return all of the projects that have been updated since the last sync. Once we have this, we essentially have our whitelist and sync can proceed as it does in the other cases.
This does present a problem, though: the repository would need a "latest_serial" field or something similar. Currently, this could be stored in repository.notes['latest_serial'], but if possible, I would prefer to avoid using the notes field like this. An alternative would require a significant change to pulpcore: typed repositories.
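As a rough sketch of the incremental workflow: the code below assumes each changelog entry has the shape (name, version, timestamp, action, serial) and that the XML-RPC endpoint is https://pypi.org/pypi. `summarize_changelog` and `fetch_changes` are hypothetical helper names, not part of pulp_python.

```python
import xmlrpc.client


def summarize_changelog(entries, last_serial):
    """Return (changed project names, newest serial seen).

    Each entry is assumed to be (name, version, timestamp, action, serial).
    """
    changed = set()
    newest = last_serial
    for name, _version, _timestamp, _action, serial in entries:
        changed.add(name)
        newest = max(newest, serial)
    return changed, newest


def fetch_changes(last_serial, url="https://pypi.org/pypi"):
    """Query the XML-RPC API for changes since last_serial (network call)."""
    client = xmlrpc.client.ServerProxy(url)
    entries = client.changelog_since_serial(last_serial)
    return summarize_changelog(entries, last_serial)
```

The set of names is the whitelist for the next sync, and the returned serial is the value that would need to be persisted as "latest_serial" once the sync succeeds.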
Updated by ashbyj@imsweb.com over 9 years ago
Warehouse looks great. How stable is it? I see they have a list_packages() query that hopefully can grab a list of packages in one query, but pulp_python may need some refactoring to download multiple packages in one shot instead of looping and downloading each package as a separate request. The mirroring support in Warehouse looks helpful as well.
https://warehouse.readthedocs.org/api-reference/xml-rpc/#package-querying
Updated by rbarlow over 9 years ago
- Tracker changed from Issue to Story
- Category deleted (21)
- Groomed set to No
- Sprint Candidate set to No
ashbyj@imsweb.com wrote:
Warehouse looks great. How stable is it? I see they have a list_packages() query that hopefully can grab a list of packages in one query, but pulp_python may need some refactoring to download multiple packages in one shot instead of looping and downloading each package as a separate request. The mirroring support in Warehouse looks helpful as well.
Yeah it really does look nice. I'm not sure how stable it is yet, other than that PyPI is not using it yet and they are still developing it. I have considered starting a branch to test the Python plugin against the current deployment, but I'd like to know the answer to your question before going too far with that ☺
As for the refactor - I think that's a good idea!
Updated by amacdona@redhat.com over 6 years ago
- Subject changed from As a user, I can sync all packages from pypi to As a user, I can sync all packages from pypi (complete mirror)
- Tags Pulp 3 added
Updated by amacdona@redhat.com over 6 years ago
- Description updated (diff)
From the original post: Basically, I'd like to be able to set up an internal pypi mirror. From our list discussion:
From: pulp-list On Behalf Of Randy Barlow
Sent: Wednesday, May 13, 2015 9:00 AM
To: pulp-list
Subject: Re: [Pulp-list] Sync all packages from PyPi with pulp_python plugin
On 05/13/2015 08:21 AM, Ashby, Jason (IMS) wrote:
> I’m looking to set up a pypi mirror with pulp. I’m currently
> using Bandersnatch for this, but it’d be nice to drop it and use
> pulp instead. Per the docs*, I see that you can sync specific
> packages from pypi, e.g.
>
> pulp-admin python repo create --repo-id pypi --feed
> https://pypi.python.org/ --package-names numpy,scipy
>
> but I can’t seem to sync ALL packages. I tried leaving off the
> --package-names option, but a sync downloads 0 packages. Should
> I submit an issue/feature request at
> https://pulp.plan.io/projects/pulp_python/issues?
Hi Jason!
The problem is that PyPI does not have one single manifest file for
the available package versions, but rather one manifest per package
name. Due to this, in order to sync all packages from PyPI it would be
necessary to make around 45-50,000 web requests just to find out what
would need to be downloaded, and then of course we would need to
perform the actual package downloads.
That said, we are working on a plan to have Pulp be able to lazy fetch
packages as they are requested. This plan will take a long time to
implement (so don't expect it in any of our close releases) but I
think it will solve this problem in a performant way.
Another possible solution may be Warehouse[0]. I've been talking to
the PyPA developers about this problem, and they are aware that it
needs to be solved. They may fix it there, in which case we can get
the Python importer to be aware of all the packages.
I have also considered just doing the 50k requests anyway. I suspect
that PyPI won't like if we do that, but it is technically possible as
well.
I say go ahead and file an RFE. I'll think some more about how we
might be able to get it working. Thanks for the note, and I hope you
enjoy the plugin otherwise!
[0] https://warehouse.python.org/
Updated by amacdona@redhat.com over 6 years ago
Many of us have expressed concern about whether this story is polite to PyPI. Some notes from PyCon:
- PyPI makes good use of caching
- There are a lot of mirrors (bandersnatch) that sync regularly, so PyPI can handle it. In theory, Pulp could actually reduce the load on PyPI, especially after we work in the lazy feature.
- Using changelog_since_serial will allow us to download new metadata only for projects that have changed
Updated by dalley over 4 years ago
A little bit of additional context:
- The XML-RPC APIs mentioned above are considered "deprecated" and not recommended for use, but plenty of people, including bandersnatch, still use them
- If possible, it would be great if we could utilize bandersnatch as a library, but I haven't evaluated this at all
Upstream issue tracking the JSON APIs that would replace XML-RPC: https://github.com/pypa/warehouse/issues/284
Updated by dalley over 4 years ago
- Related to Refactor #6930: Use Bandersnatch to perform package metadata fetching and filtering added
Added by gerrod over 4 years ago
Updated by gerrod over 4 years ago
- Status changed from NEW to MODIFIED
- % Done changed from 0 to 100
Applied in changeset 5270947abc578d13c942f5cc64bf27556c212ebc.
Pulp now uses Bandersnatch to perform metadata syncing
Sync uses Bandersnatch to perform metadata fetching and filtering, enabling Pulp to sync all of PyPI.
closes: #6930 closes: #6875 closes: #985 https://pulp.plan.io/issues/6930 https://pulp.plan.io/issues/6875 https://pulp.plan.io/issues/985