Story #236: Don't re-download rpms if they exist on disk - RPM Support

If a content unit exists on disk but is not in the DB, Pulp re-downloads the content at sync time. Pulp should instead recognize that an rpm plugin unit (rpm, srpm, drpm, distribution) is on disk and use it to create the unit in the database. The result would be that content which exists on disk is not re-downloaded.

 +++ This bug was initially created as a clone of "Bugzilla Bug #1110923":https://bugzilla.redhat.com/show_bug.cgi?id=1110923 +++ 

 Description of problem: 

 We (Katello) have gotten a lot of requests from users trying out katello and redeploying it multiple times that pulp seems to re-download the rpms when syncing the same repository after resetting the DB.    Ideally pulp would simply check to see if it's on the filesystem before downloading. 

 Version-Release number of selected component (if applicable): 
 2.4 

 How reproducible: 
 Always 

 Steps to Reproduce: 
 1.    Sync a large repo 
 2.    Watch it take a while 
 3.    Clear your mongo db 
 4.    rerun pulp-manage-db 
 5.    Sync the same repo 
 6.    Watch it take the same amount of time (all the files are on the file system, so it should take the same amount of time) 
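
 The requested behavior can be sketched roughly as follows. This is only an illustration of the idea, not Pulp's code: the function names, the flat directory layout, and the use of SHA-256 are all assumptions made for the example, and Pulp's real on-disk layout under /var/lib/pulp/content is not a documented interface.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large RPMs are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(content_root, relative_path, expected_sha256):
    """Return True unless a file with the expected checksum is already on disk.

    content_root and relative_path are hypothetical: they stand in for wherever
    the unit would be stored, which this sketch does not try to model.
    """
    candidate = os.path.join(content_root, relative_path)
    if not os.path.isfile(candidate):
        return True
    # The whole file has to be read to confirm it is the expected unit.
    return file_sha256(candidate) != expected_sha256


# Small demonstration with a throwaway directory instead of real RPMs.
with tempfile.TemporaryDirectory() as content_root:
    payload = b"not a real rpm"
    with open(os.path.join(content_root, "demo.rpm"), "wb") as handle:
        handle.write(payload)
    good = hashlib.sha256(payload).hexdigest()
    demo_results = (
        needs_download(content_root, "demo.rpm", good),      # present, checksum matches
        needs_download(content_root, "demo.rpm", "0" * 64),  # present, checksum differs
        needs_download(content_root, "missing.rpm", good),   # not on disk
    )

print(demo_results)  # (False, True, True)
```

 Note that the existence check alone is cheap, but confirming that a file really is the expected unit means reading it in full, which is part of the cost debated later in this thread.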


 --- Additional comment from tcameron@redhat.com at 06/25/2014 18:59:33 --- 

 This is actually a really big problem. In my testing, I rebuild my Satellite servers *very* frequently. Having to re-download 20GiB of content every time really, really slows things down. If pulp was smart enough to see that an RPM is already on disk and not download it again, that would make a huge difference. 


 --- Additional comment from tomckay@redhat.com at 06/25/2014 19:00:52 --- 

 +1 
 Some tool/trickery to, after resetting mongo, have it re-examine the files in its folders. Or perhaps a pulp-admin command to have it download content into a fake org that katello:reset knows not to reset if it exists. I'm sure there's something clever that can be done. 


 --- Additional comment from mhrivnak@redhat.com at 06/25/2014 20:42:30 --- 

 Pulp currently offers these options: 

 Backup: Get pulp to the state you want to reproduce, backup the whole thing, and then restore it later. So in the case of a demo, you'd create repositories, sync them, make a backup, then do your demo. After the demo, just restore the backup to roll-back. You could even use VM or lvm snapshots to quickly roll-back without making a complete copy of the data. That strategy could be applied to all of sat6. 

 Local Sync: Have a local copy of the repositories you want to import, and sync from them. You could use a local feed instead of the remote feed. Or, you could sync the local feed into a separate repository, then do a CDN sync like normal, in which case pulp won't re-download those units. 

 Alternate Content Sources: This is similar to a local sync, but does not require doing a full sync. It only requires defining an alternate content source, as described here: http://pulp-user-guide.readthedocs.org/en/latest/content-sources.html 

 Child Node: Keep a pulp VM that is a parent node with the repos you want, make your new pulp instance a child node, and do a node sync. 

 Are there use cases that are not satisfied by one of these options? My favorite option is the VM snapshot. Anyone doing lots of katello/satellite deployments will benefit greatly from using one VM image and possibly snapshots. 

 The biggest problem for pulp with implementing the proposed feature is that it makes a large assumption about how and where we store files. Long-term, we definitely want more flexibility in that area, so we can do things like store files in S3 or other blob storage. Adding a new feature that makes assumptions about where files are stored will make it more difficult for us to add other features in the future. 

 In other words, the way pulp stores files is not a public API. Users should not expect to even have access to, let alone manipulate, the storage medium pulp is using. (Neither would we let users manipulate our database directly.) If we support letting users pre-populate the filesystem and auto-discovery of those files, we will get bug reports and support requests when they find all kinds of ways to abuse that. 

 So to re-ask, are there use cases that are not satisfied by one of the above options? 


 --- Additional comment from jsherril@redhat.com at 06/25/2014 21:10:07 --- 

 I'll address all the alternatives: 

 * Backup:    With katello/satellite 6 it isn't just the pulp state to worry about. There's candlepin and katello dbs.    Certs are generated for all these services and when re-deploying to multiple environments it is not practical to regenerate all these certs (and redeploy them to the correct places). 

 * Local Sync:    Requires 2x the file size, not ideal.    Katello/the user would also need to change the feed url for however many repos are being synced to the local source and then back to the remote source.    The CDN structure would have to be mirrored exactly as well.    Creating a cdn mirror with selected repos isn't necessarily trivial.    For these cases, don't think about a single repo, think about 20 or 30 different repos all with complex directory structures that are dictated by the cdn. 

 * Alternative Content Sources: 
 Similar issues to Local Sync 


 * Child Node:  
 For katello/Satellite 6 there is no workflow having the katello/Satellite server be a child of an upstream node and even if there was you'd need a full 2nd satellite up and running to get this functionality.    This isn't a supportable option. 


 I do agree that implementing this requires an assumption about how and where we store files. However unless pulp is planning on dropping support completely for local file storage, or you plan to implement these alternative storage mechanisms with no additional code (which i'm guessing isn't possible) I'm not sure that it's a good reason to not support it.    Satellite 5 had this feature, previous pulp versions had this feature and it's a feature that many users/developers of Satellite expect.   

 "Adding a new feature that makes assumptions about where files are stored will make it more difficult for us to add other features in the future." 

 I would disagree with that, as if you were storing things differently, i would expect things to be handled differently and this 'feature' may not apply. 


 --- Additional comment from mhrivnak@redhat.com at 06/25/2014 21:57:22 --- 

 Can you elaborate on the use cases? I need to know more about why you think this would be valuable. It seems to only come up in the context of katello, but we have not seen a desire for this among direct users of pulp. 

 For the demo use case, using snapshots of an entire sat6 VM to roll-back to a previous state still seems ideal. What other use cases are we trying to address? 

 To respond... 

 Backups: I don't understand what you're getting at with the certificates and environments. Can you elaborate on the use case you have in mind? 

 Local Sync: File storage is cheap. How much are we talking about? As for the structure, if you create separate repos that only sync the local feeds, then create the remote repos like normal, the CDN structure will be preserved perfectly. 

 Alternate Content Sources: File storage is cheap. 

 Disconnected/Export: This is another feature I forgot to mention that is literally designed for this use case. 

 As for the coding implications, I maintain that adding complexity and assumptions to the way we currently store files will make it more difficult to expand that functionality later. Giving users the expectation that they can directly modify the files on disk, and assume pulp will just deal with it, is also a risky path to go down. 


 --- Additional comment from jsherril@redhat.com at 06/26/2014 01:35:56 --- 

 Sure, granted most of these use cases are probably of little use to normal every day users.    I'm going to answer this from the perspective of a Satellite developer/user. 


 a) as a developer of Satellite 6 i reset my entire environment due to switching branches to older commits, developer errors that may leave our own db in a bad state, etc.    This involves resetting 3 different databases and normally uploading an entitlement manifest, and syncing 20+ Gigs of data. 

 b) as a developer of Satellite 6 I spin up and down new vms running a satellite very frequently.    Each VM may have a different IP address and hostname.    Every time I spin up a new instance i have to upload a manifest and sync 20+ Gigs of data across multiple repositories. 

 c) as a consultant setting up proof of concepts for a customer, I may need to install Satellite 6 and configure it to be useful very quickly.    Waiting hours for content to sync (possibly over a slow link) impedes this POC. 

 In all cases simply re-using a file already on the file system solves the issue.    Assuming that for a) & b) an nfs mount of /var/lib/pulp/content solves the issue.    For c) it is very very common for a consultant, SA, etc.. to carry around a usb hard drive of /var/lib/pulp/content and simply copy/scp it to the satellite before syncing.    This can speed up deployments by hours. 


 In the case of B or C, a vm backup/snapshot simply would not work because: 

 1) it could be on baremetal  
 2) on a deployed satellite there are things in the db that are customer specific, hostname is referenced, a customer manifest is imported, an organization has been created (which the actual customer may not care about).    The default satellite 6 install and syncing of cdn content would involve all of these things. 
 3) If doing a vm backup/snapshot, many of the SSL certs that had been generated as part of the backup would no longer be valid as the hostname would be different on the new machines. 

 I'm sorry but file storage is not cheap in all cases.    My laptop has a 256 Gig SSD, i use it primarily for development and to spend 40 Gigs (versus 20) just to duplicate everything in order to sync RHEL 6 means that i will run out of space that much more frequently.    I already catch myself running out of space from time to time.   

 I'm assuming most of the pulp developers do not need to sync 20+ Gigs of data in order to develop features, fix bugs, etc daily using a source of content. 

 In all of the above scenarios it seems to me like you're asking everyone else (katello developers, SAs, Consultants, GSS, probably over 100 people) to do much more complicated things to achieve something similar but less desirable (and in many ways a lot more inconvenient) than what they have had in the past. 

 "Giving users the expectation that they can directly modify the files on disk, and assume pulp will just deal with it, is also a risky path to go down."  

 No one is talking about modifying, only pre-seeding. 

 I'd like to re-iterate that this feature was present in previous pulp versions and Satellite 5 and i would have liked to see this level of scrutiny applied when it was removed :) 


 --- Additional comment from taw@redhat.com at 06/26/2014 02:31:21 --- 

 Bumped priority to HIGH. If this were Sat 6.1, I would bump it to URGENT. 

 This is a major impediment to SA and consultant engagements. And simply if the customer does something stupid... but not too stupid. 


 --- Additional comment from rbarlow@redhat.com at 06/26/2014 15:11:13 --- 

 I don't believe that Pulp being able to recover its lost database will have any significant performance improvement over a local sync. Moreover, disk space is extremely cheap and readily available. Further responses inline: 

 (In reply to Justin Sherrill from comment #6) 
 > Sure, granted most of these use cases are probably of little use to normal 
 > every day users.    I'm going to answer this from the perspective of a 
 > Satellite developer/user. 

 I think that this being specific to developers strongly weakens the need for it, especially since the only benefit is less disk space. Multi TB hard drives are affordable, and we've spent more money on this discussion than several 10s of TBs would cost. 

 > a) as a developer of Satellite 6 i reset my entire environment due to 
 > switching branches to older commits, developer errors that may leave our own 
 > db in a bad state, etc.    This involves resetting 3 different databases and 
 > normally uploading an entitlement manifest, and syncing 20+ Gigs of data. 

 If we do what this bug requests, it will still take a long time to sync the data. If you pay attention to Pulp's logs, the time spent during local sync is entirely on database queries, not on retrieving the content from disk. Plus, we would have to read every file found on disk to see if it is the expected file. It really won't be faster. 

 > b) as a developer of Satellite 6 I spin up and down new vms running a 
 > satellite very frequently.    Each VM may have a different IP address and 
 > hostname.    Every time I spin up a new instance i have to upload a manifest 
 > and sync 20+ Gigs of data across multiple repositories. 

 Local syncs will be just as fast as this proposed change. There is not a time benefit here. 

 > c) as a consultant setting up proof of concepts for a customer, I may need 
 > to install Satellite 6 and configure it to be useful very quickly.    Waiting 
 > hours for content to sync (possibly over a slow link) impedes this POC. 

 Sync from a local copy. Your proposal is already to maintain the /var/lib/pulp/content. It's just as easy to sync /var/lib/pulp/published/path/to/repo. That can be done today as is. No change or unusual features required from Pulp. 

 > In all cases simply re-using a file already on the file system solves the 
 > issue.    Assuming that for a) & b) an nfs mount of /var/lib/pulp/content 
 > solves the issue.    For c) it is very very common for a consultant, SA, etc.. 
 > to carry around a usb hard drive of /var/lib/pulp/content and simply 
 > copy/scp it to the satellite before syncing.    This can speed up deployments 
 > by hours. 

 This USB harddrive case further illustrates the case that hard drive space is cheap. Furthermore, a local sync will perform the same since the time is spent on queries. 

 > In the case of B or C, a vm backup/snapshot simply would not work because: 
 >  
 > 1) it could be on baremetal  

 LVM! 

 > I'm sorry but file storage is not cheap in all cases.    My laptop has a 256 
 > Gig SSD, i use it primarily for development and to spend 40 Gigs (versus 20) 
 > just to duplicate everything in order to sync RHEL 6 means that i will run 
 > out of space that much more frequently.    I already catch myself running out 
 > of space from time to time.   

 We do have access to servers with massive storage. That's how I develop. 

 > I'm assuming most of the pulp developers do not need to sync 20+ Gigs of 
 > data in order to develop features, fix bugs, etc daily using a source of 
 > content. 

 I actually do this all the time. I have a VM that hosts a mirror of the CDN and I regularly sync against it. Again, the time spent is on DB queries. Bringing the content into /var/lib/pulp/content is not the majority of the time if you sync from a local source. 

 > In all of the above scenarios it seems to me like you're asking everyone 
 > else (katello developers, SAs, Consultants, GSS, probably over 100 people) 
 > to do much more complicated things to achieve something similar but less 
 > desirable (and in many ways a lot more inconvenient) than what they have had 
 > in the past. 

 I think the best proposal is to sync against the /var/lib/pulp/published from the last install. http://localhost/pulp/your/repo/here can be the feed, and you will get the same benefits. 


 --- Additional comment from tcameron@redhat.com at 06/26/2014 15:47:47 --- 

 As someone who is actively trying to learn and sell Satellite 6, the pushback on this is pretty frustrating. All of us in the field are trying to make this *easier* to learn and to use.  

 I have a local sync of the CDN. It took me a WEEK to copy it, saturating my internet connection for the entire bloody week. No one else really wants to do that. And if I use the local copy, as I understand it, I am stuck with using that local copy forever. I am not aware of a way to sync from a local CDN copy but then change Sat6 to sync from the actual CDN. How does that help if I set up a Sat6 box at a customer site? If there is a way to create a local repo, use it, and THEN start using the CDN, that goes a long way, but I don't think that's an option. I'd love to be wrong. 

 Oh, yeah, and: 

 [root@lady3jane ~]# du -hs /var/www/html/content/dist/rhel/server/6/ 
 416G 	 /var/www/html/content/dist/rhel/server/6/ 

 Are you seriously proposing I carry around 400+GiB of content all the time? I have not even started syncing 7 yet! How much is that going to be? 

 Sat5 did this. So the fact that Sat6 does not is a regression in my mind. 

 If I'm in the field at a customer site, and I want to stand up a Sat6 box in one day, I CAN NOT DO SO TODAY. Many of our customers have an Internet link which is only a few megs/second. Syncing 20GiB of content takes many days. If I have a dump of /var/lib/pulp, and can copy it from my USB drive over to the customer's server, that changes to hours. 


 --- Additional comment from rbarlow@redhat.com at 06/26/2014 17:22:47 --- 

 Hi Thomas, 

 (In reply to Thomas Cameron from comment #9) 
 > As someone who is actively trying to learn and sell Satellite 6, the 
 > pushback on this is pretty frustrating. All of us in the field are trying to 
 > make this *easier* to learn and to use.  

 We aren't trying to frustrate anyone. I've been arguing that there is already a way to do what you want without modifying Pulp. I'll try to explain more clearly below. 

 > I have a local sync of the CDN. It took me a WEEK to copy it, saturating my 
 > internet connection for the entire bloody week. No one else really wants to 
 > do that. And if I use the local copy, as I understand it, I am stuck with 
 > using that local copy forever. I am not aware of a way to sync from a local 
 > CDN copy but then change Sat6 to sync from the actual CDN. How does that 
 > help if I set up a Sat6 box at a customer site? If there is a way to create 
 > a local repo, use it, and THEN start using the CDN, that goes a long way, 
 > but I don't think that's an option. I'd love to be wrong. 

 Pulp allows you to change the feed on a repo. So you could create the repo with the local feed, sync to get it started, then change it to the CDN. 
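
 A minimal sketch of that sequence, for anyone who wants to script it. It only builds and prints the command lines rather than running them. The repo id and both feed URLs are hypothetical placeholders, and the pulp-admin subcommands and flags shown (rpm repo create / sync run / update with --repo-id and --feed) are written as recalled from the Pulp 2.x rpm extensions, so they should be checked against the installed version before use.

```python
def seed_then_switch_commands(repo_id, local_feed, cdn_feed):
    """Build the three pulp-admin invocations: create with a local feed,
    run the initial sync, then point the repo at the real CDN feed."""
    return [
        ["pulp-admin", "rpm", "repo", "create", "--repo-id", repo_id, "--feed", local_feed],
        ["pulp-admin", "rpm", "repo", "sync", "run", "--repo-id", repo_id],
        ["pulp-admin", "rpm", "repo", "update", "--repo-id", repo_id, "--feed", cdn_feed],
    ]


# Hypothetical values, for illustration only.
commands = seed_then_switch_commands(
    "rhel6-server",
    "file:///mnt/usb/rhel6-server/",
    "https://cdn.example.com/content/dist/rhel/server/6/",
)

for command in commands:
    print(" ".join(command))
```

 Each list could be handed to subprocess.check_call once the flags have been verified; later syncs would then pull only new content from the CDN feed.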

 > Oh, yeah, and: 
 >  
 > [root@lady3jane ~]# du -hs /var/www/html/content/dist/rhel/server/6/ 
 > 416G 	 /var/www/html/content/dist/rhel/server/6/ 
 >  
 > Are you seriously proposing I carry around 400+GiB of content all the time? 

 As I understand it, you are arguing for the ability to be able to carry around 400GB. If you don't do that, getting it from the CDN will be your only option? Setting the feed to localhost for the initial sync will simply get the DB up to speed, exactly as this RFE is requesting. It just does it a different way. 

 > Sat5 did this. So the fact that Sat6 does not is a regression in my mind. 

 I don't think this is a particularly strong argument. Just because Pulp doesn't do something that a past release did, it isn't automatically a regression. There needs to be a use case to support features, and use cases change all the time. The specific implementation of the feature is what changed. We've provided a very reasonable way to accomplish the exact same thing. All you have to do is sync the local feed to get it started. This could be easily automated. 

 > If I'm in the field at a customer site, and I want to stand up a Sat6 box in 
 > one day, I CAN NOT DO SO TODAY. Many of our customers have an Internet link 
 > which is only a few megs/second. Syncing 20GiB of content takes many days. 
 > If I have a dump of /var/lib/pulp, and can copy it from my USB drive over to 
 > the customer's server, that changes to hours. 

 The proposal in this RFE will have no impact on the speed of import in comparison to my proposal to sync the local feed. 

 I think it's important to reiterate Michael's argument that our disk format is not a published or externally available API, and therefore is subject to change. The Pulp team does not want to guarantee anything about its internal implementations so that we are free to make improvements as we see fit over time. This RFE would make our internal operations (essentially, our private code) become public, which will restrict us from future improvements in storage. We've had requests for improving our storage options that we would have a much harder time implementing if we did what this RFE suggests. 

 I challenge you all to try the local sync option. I think you will find it is easy to automate, and will accomplish exactly what you are looking for feature wise. It just isn't the specific implementation you had in mind, so please be aware of your biases around that and be open to the idea that this solves the problems you are concerned about without us having to make our internal operations publicly supported APIs. Solving the use case is the important part, not the specific implementation chosen. 

 I've thought about this some more and I've realized that the local sync option will also not increase your disk space usage any more than this RFE would. There is no time saving or disk saving that this RFE offers over the local sync option. Moreover, the local sync already works. 

 The Pulp team is just as busy as everyone else is. We would like to focus on new features and bug fixes so that Satellite can be even awesomer than it already is. Adding this as a new feature will cost us unnecessary time, and will lead to future bugs that we'll have to fix as well. It will also make us have to maintain a particular format, which will also cost us. And all of this just for a specific implementation of a problem that is already easily solvable another way. 

 What we have now is free and it works and it solves the problem well. 


 --- Additional comment from jsherril@redhat.com at 06/26/2014 17:45:46 --- 

 "I think it's important to reiterate Michael's argument that our disk format is not a published or externally available API, and therefore is subject to change." 

 I really don't see how this has anything to do with this discussion.    Unless pulp starts storing files in randomized locations this point is moot.    I would not assume    /var/lib/pulp/content to stay the same from pulp 2.3 to 2.4 to 2.5 to 99.0.    I'm not sure what this has to do with anything?    If i want to re-use content in 2.Y and the format has changed I'll re-sync on my 2.Y install. 

 Satellite 5 changed its disk format and this caused little problem.    There was no desire to retain the old format due to this 'feature' and so I don't understand why that would be a hindrance. 


 --- Additional comment from mhrivnak@redhat.com at 06/26/2014 18:20:34 --- 

 I'd like to focus this discussion on paths forward. We (the pulp team) take these internal requirements very seriously, and we want to deliver the features that will provide the most value to our users. Sat6 developers, QE, consultants, sales, etc. are of course VIPs (very important pulp-users) and always have our ear. We have certainly not ruled out the proposed behavior, but before implementing a new feature for internal use only, we think we have alternatives that deserve serious consideration. 

 To be clear, we are committed to making it easy and fast to get content into pulp. 

 Given the use cases that Justin outlined (A-C), pulp's upcoming "alternate content sources" feature is a great fit. It is easy to use and will likely accomplish the goals that have been outlined. Using this feature would also not require any changes to satellite. 

 We would like an opportunity to do a demo of how it works, at which point we can talk through any concerns or limitations. Even if we need to invest additional time in this feature to make it work as well as possible for the cited use cases, that investment will pay off long-term for our users. 

 Unless there are any objections, I would like to move forward with that demo, and treat this RFE as something along the lines of "Make it easy to quickly load content into sat6", rather than a request for a very specific implementation of that behavior. If we cannot satisfy that behavior with an existing feature, then we will seriously consider implementing the suggested in-place discovery of content. 


 --- Additional comment from tcameron@redhat.com at 06/26/2014 18:24:06 --- 

 I'm certainly game for that. If there's a way to NOT have to download ~ 20GiB every time I want to build a Satellite server, that'd be great. 


 --- Additional comment from ejacobs@redhat.com at 07/17/2014 14:06:15 --- 

 I'll try to rephrase some of the commentary here: 

 * From a business perspective, we have a need for various folks to be able to obtain a small set of content and quickly get it into a Satellite. As a user story-type example: 

 As an administrator of Satellite 6 
 I should be able to obtain a dump of content for a version of RHEL (ex: EL6-x86_64) 
 That already has all the DB information that the various tools (Pulp, etc) require to consume this content 
 And import this into the Satellite quickly 

 Essentially, this is the fastest possible way to sync content -- the files would be copied from this dump and the database(s) would be imported. Done! Satellite5 ships content DVDs which consultants frequently use for import. This is slow, but it can't get much faster unless the database information was already prepared and simply inserted. 

 The other thing is that, by preparing dumps for an entire version-architecture, I can carry less stuff around. Carrying EL6-x86_64 dumps should enable me to import 6Server only, or 6.2 only, or any combination of releases of that version. 

 * From a developer perspective, I want the ability to create dumps of various states, and quickly recover those states. The situation for developers is that the RPM files themselves are likely already on the disk somewhere. As long as the developer can move these files to a fixed location that the tool expects, the tool could simply move the files back to wherever they need to go (even if the file location structure changes) and import the database state. 

 I hope this adds some more good information. 


 --- Additional comment from rjerrido@redhat.com at 07/17/2014 15:59:43 --- 

 Expanding on the admin use cases in comment #14 

 There are a couple of use cases that the administrator is looking for: 

 * Accelerated / Disconnected import of content.  

 Today customers, consultants, SAs, et al, leverage Channel Content ISOs that are used to populate Satellites. This allows import of content much more quickly by negating the process of pulling these bits over the wire. However, there is still a decent amount of time spent crunching on the metadata and copying the RPMs from the import location into /var/satellite. (Same for content exported via rhn-satellite-exporter). So there are two major tasks: 

 ** Putting the data in an exportable format 
 ** Ingesting data on the satellite itself.  

 We can do the former with katello-disconnected today and the latter with changing the CDN URL to either file:/// or a local webserver. We at least have parity with Sat 5 here. However, even with a local CDN mirror or diskdump, it takes the better portion of an hour or two before one is operational. If we can get that to near-zero, that'll be swell.  

 * Also, there is a self-heal use case that we deal with in Satellite 5. If an RPM below /var/satellite is deleted, either explicitly, inadvertently, or due to corruption, it is redownloaded on the next invocation of 'satellite-sync'. We've had instances where we've had packages mistakenly signed with the beta key that landed in GA channels, and the resolution was to delete said RPMs from /var/satellite, and invoke satellite-sync to replace them. What is pulp's behavior in this aspect? (RPMs removed from /var/lib/pulp without pulp's knowledge) 


 --- Additional comment from rjerrido@redhat.com at 07/17/2014 16:11:16 --- 

 We should have an internal discussion regarding the future of the Satellite 6 equivalent of Channel Content ISOs. Currently, we are recommending that customers who have a disconnected Satellite set up a dedicated synchronization host and run katello-disconnected. This is a fairly significant change in behavior for them.  


 Also, for internal resources, will the suggested method for us to prepare for POCs be to somehow mirror (either via our own katello-disconnected host or rsync from an internal CDN mirror) the repos that we need? More succinctly, if we (Red Hat) aren't planning to deliver content ISOs, are we planning to publish and document a process for internal resources to synchronize selected portions of the CDN? I'd rather not have to have a slew of SAs setting up katello-disconnected systems for this task.    (Honestly, I'd rather not have customers doing this either)