Story #985

Updated by amacdona@redhat.com almost 6 years ago

Basically, I'd like to be able to set up an internal pypi mirror. From our list discussion:

<pre>
 From: pulp-list On Behalf Of Randy Barlow
 Sent: Wednesday, May 13, 2015 9:00 AM
 To: pulp-list
 Subject: Re: [Pulp-list] Sync all packages from PyPi with pulp_python plugin

 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA512

 On 05/13/2015 08:21 AM, Ashby, Jason (IMS) wrote:
 > I’m looking to set up a pypi mirror with pulp. I’m currently
 > using Bandersnatch for this, but it’d be nice to drop it and use
 > pulp instead. Per the docs*, I see that you can sync specific
 > packages from pypi, e.g.
 >
 > pulp-admin python repo create --repo-id pypi --feed
 > https://pypi.python.org/ --package-names numpy,scipy
 >
 > but I can’t seem to sync ALL packages. I tried leaving off the
 > --package-names option, but a sync downloads 0 packages. Should
 > I submit an issue/feature request at
 > https://pulp.plan.io/projects/pulp_python/issues?

 Hi Jason!

 The problem is that PyPI does not have one single manifest file for
 available package versions, but rather one manifest per package
 name. Due to this, in order to sync all packages from PyPI it would be
 necessary to make around 45-50,000 web requests just to find out what
 would need to be downloaded, and then of course we would need to
 perform the actual package downloads.

 That said, we are working on a plan to have Pulp be able to lazy fetch
 packages as they are requested. This plan will take a long time to
 implement (so don't expect it in any of our close releases) but I
 think it will solve this problem in a performant way.

 Another possible solution may be Warehouse[0]. I've been talking to
 the PyPA developers about this problem, and they are aware that it
 needs to be solved. They may fix it there, in which case we can get
 the Python importer to be aware of all the packages.

 I have also considered just doing the 50k requests anyway. I suspect
 that PyPI won't like if we do that, but it is technically possible as
 well.

 I say go ahead and file an RFE. I'll think some more about how we
 might be able to get it working. Thanks for the note, and I hope you
 enjoy the plugin otherwise!

 [0] https://warehouse.python.org/
</pre>

To sync all packages from PyPI, bandersnatch[0] (the PyPI mirror tool) will be a good reference.

We will need 2 workflows.

h3. initial sync / force-full sync

Roughly, the workflow is the same as sync with a whitelist of project names, except this will require an additional call to the simple index to retrieve a list of all projects and their URLs.

h3. incremental sync

The XML-RPC PyPI API has a call `changelog_since_serial(since_serial)` which will return all of the projects that have been updated since the last sync. Once we have this, we essentially have our whitelist and sync can proceed as it does in the other cases.

This does present a problem though. The repository would need a "latest_serial" or something similar. Currently, this could be stored in repository.notes['latest_serial'], but if possible, I would prefer to avoid using the notes field like this. An alternative would require a significant change to pulpcore: typed repositories.

[0]: https://pypi.org/project/bandersnatch/
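A rough sketch of the incremental workflow built on `changelog_since_serial(since_serial)`. This is not pulp_python code. The function names, the `sync_projects` callback, and the plain `notes` dict standing in for `repository.notes` are all hypothetical. It also assumes each changelog entry is a `(name, version, timestamp, action, serial)` sequence and that the XML-RPC endpoint lives at `https://pypi.org/pypi`.

```python
import xmlrpc.client

# Assumption: PyPI's XML-RPC endpoint; not stated in this story.
PYPI_XMLRPC_URL = "https://pypi.org/pypi"


def make_client(url=PYPI_XMLRPC_URL):
    """Build an XML-RPC proxy. No request is sent until a method is called."""
    return xmlrpc.client.ServerProxy(url)


def projects_changed_since(client, since_serial):
    """Return (sorted project names, newest serial seen) for changes after since_serial."""
    # Assumed entry shape: (name, version, timestamp, action, serial).
    entries = client.changelog_since_serial(since_serial)
    names = sorted({entry[0] for entry in entries})
    # With no changes, keep the serial we already had.
    latest_serial = max((entry[4] for entry in entries), default=since_serial)
    return names, latest_serial


def incremental_sync(client, notes, sync_projects):
    """Sync only the projects changed since notes['latest_serial'], then advance it.

    `notes` stands in for repository.notes; `sync_projects` is a hypothetical
    callback that runs the existing whitelist sync for a list of project names.
    """
    since = notes.get("latest_serial")
    if since is None:
        raise ValueError("no latest_serial recorded; run an initial sync first")
    names, latest_serial = projects_changed_since(client, since)
    if names:
        sync_projects(names)
    # Record the serial only after the sync call returned without raising.
    notes["latest_serial"] = latest_serial
    return names
```

A real run would pass `make_client()` as `client`. Taking the client as a parameter lets the changelog handling be exercised with a stub instead of PyPI itself.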
