Story #7832
[EPIC] As a user, I have Alternate Content Sources
Status: closed
100% done
Description
Background
In Pulp2 there was a feature called "Alternate Content Sources", supported for pulp_rpm only. Here's how it worked:
- User configured an alternate content source, which lists many "paths", each representing a repo available from the remote source. For example, the Alternate Content Source (ACS) could be a local copy of the CDN within an AWS region, or a locally mounted copy of a portion of the CDN.
- Then a user refreshes the alternate content source. This indexes the binary data that is available in the remote source, and avoids every sync operation having to parse all paths on the alternate content source at runtime.
- When sync occurs, it considers the data available from the alternate content source, and if an exact content unit is available via the ACS, the download occurs from the ACS first.
Here's an example of a Pulp2 local ACS config:
[pulp-content-source]
enabled: 1
priority: 0
expires: 3d
name: Pulp Content Source
type: yum
base_url: http://192.168.1.11/pub/content/
paths: beta/rhel/server/7/x86_64/satellite/6/os/
       eus/rhel/server/7/7.3/x86_64/os/
       dist/rhel/client/6/6.2/x86_64/kickstart/
       dist/rhel/client/6/6.2/x86_64/kickstart/Client/
       dist/rhel/client/6/6.2/x86_64/os/
       dist/rhel/client/6/6.8/i386/kickstart/
       dist/rhel/client/6/6.8/i386/kickstart/Client/
A few Pulp2 details
ACS in Pulp2 had...
- a priority, but in practice this was not meaningfully used
- certificates used when fetching content (optional). Fetching content in an AWS region required a cert.
- a headers option to specify headers attached to the requests going to the ACS
Use Cases
CDN connection is low-bandwidth and/or high-latency
As a user, I have a low-bandwidth and/or high-latency connection to the authoritative source of content, e.g. the CDN. I also have a local copy (either on local disk or on the local network), but it's not authoritative; it could be old. There should be a way to fetch the metadata from the authoritative source, and the content from the "near" source whenever it's identical.
Quickly setting up a Pulp server
As a user setting up a new Pulp server, if I already have a local disk or local network copy of content that should go into that Pulp server, I should be able to use it. The CDN or remote source is still the authoritative one, but if the binary data is the same, I shouldn't need to bring it in over the WAN.
Putting a Pulp server in the cloud
As a user deploying a Pulp server to the cloud, e.g. Amazon AWS, the CDN should still be authoritative, but there is usually a "nearly up to date" copy of that content also available in the AWS region. Using it is usually faster, and also cheaper, since in-region network access does not cost what WAN access does. I want to use ACS to allow the CDN to be authoritative, but use the regional copy for binary data whenever possible.
Connection to the CDN is fast, but the authoritative source is slow
As a user, I could have fast access to the CDN, but it may not be the authoritative source for my Pulp server. Particularly in cases where I have multiple Pulp servers and this one (edge) is syncing from another Pulp server (central). In that case, the link between the authoritative, central Pulp server and this edge Pulp server is slow, but the connection between this edge Pulp server and the CDN is fast. I want to use ACS to allow the authoritative Pulp server to be authoritative for content, but receive the binary data from the CDN whenever possible.
Pulp3 Alternate Content Source Plan
Plugin writers will have a new pulpcore.plugin.models.AlternateContentSource MasterModel which will define the following fields:
- name - a string, the name. A required field.
- enabled - an optional boolean, defaults to True.
- paths - a list of string paths. Each must validate as a string path and must not begin with a slash. An optional field. If unspecified, only the base_path will be used when the AlternateContentSource is refreshed.
- remote - a ForeignKey to a Remote. A required field, as the remote defines how the ACS can sync.
Plugin writers will subclass this.
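For illustration, here is a minimal sketch of what such a subclass might look like; the base class and its fields are as proposed above, while the TYPE attribute and Meta options are assumptions borrowed from how other pulpcore detail models are declared:

from pulpcore.plugin.models import AlternateContentSource  # proposed MasterModel


class RpmAlternateContentSource(AlternateContentSource):
    """An RPM-flavored ACS; name, enabled, paths, and the remote FK
    come from the proposed base model."""

    TYPE = "rpm"  # assumed: detail models in pulpcore declare a TYPE

    class Meta:
        default_related_name = "%(app_label)s_%(model_name)s"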
Pulp3 Alternate Content Source Usage
Create an Alternate Content Source
- First create a remote representing the remote source, e.g. an RpmRemote or a pulp_ansible CollectionRemote.
- Then use that remote in an alternate content source by doing: POST /pulp/api/v3/acs/rpm/rpm/ remote=/pulp/api/v3/remotes/.../.../, which could yield a /pulp/api/v3/acs/rpm/rpm/:uuid/. A hedged sketch of these two calls follows below.
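Such a sketch, using Python and the requests library (host, port, credentials, and serializer field names are placeholders; the plugin would define the real ones):

import requests

BASE = "http://localhost:24817"  # placeholder Pulp API address
AUTH = ("admin", "password")     # placeholder credentials

# 1. Create a remote representing the remote source.
remote = requests.post(
    f"{BASE}/pulp/api/v3/remotes/rpm/rpm/",
    json={"name": "acs-remote", "url": "http://192.168.1.11/pub/content/"},
    auth=AUTH,
).json()

# 2. Create the alternate content source pointing at that remote.
acs = requests.post(
    f"{BASE}/pulp/api/v3/acs/rpm/rpm/",
    json={
        "name": "local-mirror",
        "remote": remote["pulp_href"],
        "paths": ["eus/rhel/server/7/7.3/x86_64/os/"],  # relative, no leading slash
    },
    auth=AUTH,
).json()
print(acs["pulp_href"])  # e.g. /pulp/api/v3/acs/rpm/rpm/:uuid/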
Refresh an Alternate Content Source
Then perform a "refresh" of the alternate content source by calling POST /pulp/api/v3/acs/rpm/rpm/:uuid/refresh/. The action endpoint refresh is used here because it's not actually syncing down content. It's like an on-demand sync in the sense that when called it indexes the remote metadata and creates remote artifacts.
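Continuing the sketch above, a refresh call and a simple wait loop might look like this (the response shape and task states are assumptions about the eventual API):

import time

import requests

BASE = "http://localhost:24817"  # placeholder Pulp API address
AUTH = ("admin", "password")     # placeholder credentials
ACS_HREF = "/pulp/api/v3/acs/rpm/rpm/:uuid/"  # href from the create step

resp = requests.post(f"{BASE}{ACS_HREF}refresh/", auth=AUTH).json()
task_href = resp.get("task") or resp.get("task_group")  # response shape TBD
while True:
    task = requests.get(f"{BASE}{task_href}", auth=AUTH).json()
    if task.get("state") in {"completed", "failed", "canceled"}:
        break
    time.sleep(1)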
Use an AlternateContentSource
At feature launch, each ACS is assumed to be global, so, e.g., every RPM sync will check the content known to RPM-typed ACSes during sync and prefer it over the content from the authoritative source for binary data.
Implementation
An ideal implementation would "prefer and use" the alternate content source data transparently in the downloader itself. Since an AlternateContentSource can be backed by either Http or File sources, it likely should be implemented in BaseDownloader itself.
So BaseDownloader, for each download it attempts, should check with the database to determine if an AlternateContentSource for that type exists, and if so, find its RemoteArtifact and use that when downloading Artifact data.
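A minimal sketch of that check, not pulpcore's actual code; the reverse relation from a Remote to its ACS ("acs" below) is an assumption:

from pulpcore.plugin.models import RemoteArtifact


def pick_download_url(authoritative_url, expected_digests):
    """Prefer an ACS-provided copy of the artifact when a checksum matches."""
    sha256 = expected_digests.get("sha256")
    if sha256:
        acs_copy = (
            RemoteArtifact.objects
            .filter(sha256=sha256, remote__acs__isnull=False)  # assumed relation
            .first()
        )
        if acs_copy is not None:
            return acs_copy.url  # download binary data from the ACS
    return authoritative_url     # fall back to the authoritative source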
Updated by ttereshc about 4 years ago
Thanks a lot for the background and use cases, it helps to efficiently refresh how this feature worked in Pulp 2.
I think it's a good idea to use remotes for ACS and create RemoteArtifacts as a refresh operation. A few questions to get my head around the proposal:
- What will be on the ACS model apart from presumably a FK to a Remote?
- Was the expires configuration used in any way? I can imagine that it could be valuable for a regular task which refreshes outdated ACSs.
- Was it useful to have one base url and a list of paths in Pulp 2 from the user point of view? Or was it more of a downside that to refresh one path, I would refresh all paths for its specific ACS? My concern is: if the common workflow is to configure and/or refresh ACSs per base_url, then with the current proposal in Pulp 3 it will be a lot of work for a user to do so.
- Similar concern when I want to remove an ACS. I'm switching off my local server and I want to make sure it's no longer configured as an ACS. If we have one Remote per ACS, so in Pulp 2 terms base_url + a path, how can a user identify all ACSs for the server they want to shut down?
- What happens when I configure Remotes and ACSs for the first time? I haven't run sync yet, so I have no content in Pulp which Remotes refer to. If we need to create RemoteArtifacts, then we'll need to create Content and ContentArtifacts for all the content possible.
- RemoteArtifact creation will cover Content with Artifacts. What happens with other content? Is it always synced from the authoritative source, since all the logic about which source to choose happens in the downloader for artifacts only?
- "It's like an on-demand sync in the sense that when called it indexes the remote metadata and creates remote artifacts." What is implied by the indexing of remote metadata? I read the proposal as: the refresh action creates RemoteArtifacts and that's it.
Updated by bmbouter about 4 years ago
ttereshc wrote:
Thanks a lot for the background and use cases, it helps to efficiently refresh how this feature worked in Pulp 2.
I think it's a good idea to use remotes for ACS and create RemoteArtifacts as a refresh operation. A few questions to get my head around the proposal:
- What will be on the ACS model apart from presumably a FK to a Remote?
- Yes a FK to a Remote.
- Also the list of base path fragments (expected to be appended to the remote.url).
- a name
- Was the expires configuration used in any way? I can imagine that it could be valuable for a regular task which refreshes outdated ACSs.
No it was not, and the plan was for us to not implement it.
- Was it useful to have one base url and a list of paths in Pulp 2 from the user point of view? Or was it more of a downside that to refresh one path, I would refresh all for its specific ACS?
My concern is if the common workflow is to configure and/or refresh ACSs per base_url, then with the current proposal in Pulp 3 it will be a lot of work for a user to do so.
Agreed. It was useful. I'm modifying the ticket to have a list of "base path fragments".
- Similar concern when I want to remove an ACS. I'm switching off my local server and I want to make sure it's no longer configured as an ACS. If we have one Remote per ACS, so in Pulp 2 terms base_url + a path, how can a user identify all ACSs for the server they want to shut down?
Yes let's use one ACS with multiple paths.
- What happens when I configure Remotes and ACSs for the first time? I haven't run sync yet, so I have no content in Pulp which Remotes refer to. If we need to create RemoteArtifacts, then we'll need to create Content and ContentArtifacts for all the content possible.
I was hoping to have the ACS refresh only create RemoteArtifacts without content, and @dkliban suggested the downloaders could match the RemoteArtifact by the checksum of the thing being downloaded. The issue with this approach is that RemoteArtifact requires ContentArtifact, which in turn requires Content. I need to think this part over some more. What do you think?
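To illustrate the constraint, a sketch of what refresh would be forced to create (the model and field names are pulpcore's; the placeholder-row approach is hypothetical):

from pulpcore.plugin.models import Content, ContentArtifact, RemoteArtifact


def create_ra_for_indexed_file(remote, url, sha256, relative_path):
    # RemoteArtifact.content_artifact is a required FK, and
    # ContentArtifact.content is a required FK, so refresh cannot create
    # a "bare" RemoteArtifact without placeholder rows for the chain.
    content = Content.objects.create()  # hypothetical placeholder unit
    ca = ContentArtifact.objects.create(content=content, relative_path=relative_path)
    return RemoteArtifact.objects.create(
        content_artifact=ca, remote=remote, url=url, sha256=sha256
    )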
- RemoteArtifact creation will cover Content with Artifacts. What happens with other content? Is it always synced from the authoritative source, since all the logic about which source to choose happens in the downloader for artifacts only?
Yes I think this can only apply to artifacts, because that is the only "binary data" there is; all the other data I think falls into the metadata category, at which point the ACS should not be more authoritative than the remote url. What do you think?
It's like an on-demand sync in the sense that when called it indexes the remote metadata and creates remote artifacts.
What is implied by the indexing of remote metadata? I read the proposal as the refresh action creates RemoteArtifacts and that's it.
That's the idea: the refresh only generates RemoteArtifacts, and then when subsequent syncs occur these RemoteArtifacts are "preferred".
Updated by ipanova@redhat.com about 4 years ago
bmbouter wrote:
ttereshc wrote:
Thanks a lot for the background and use cases, it helps to efficiently refresh how this feature worked in Pulp 2.
I think it's a good idea to use remotes for ACS and create RemoteArtifacts as a refresh operation. A few questions to get my head around the proposal:
- What will be on the ACS model apart from presumably a FK to a Remote?
- Yes a FK to a Remote.
- Also the list of base path fragments (expected to be appended to the remote.url).
- a name
And headers. These are needed for the case when RHUI is set as the ACS content provider: https://pulp.plan.io/issues/1282#note-15
- Was the expires configuration used in any way? I can imagine that it could be valuable for a regular task which refreshes outdated ACSs. No it was not, and the plan was for us to not implement it.
- Was it useful to have one base url and a list of paths in Pulp 2 from the user point of view? Or was it more of a downside that to refresh one path, I would refresh all for its specific ACS?
My concern is if the common workflow is to configure and/or refresh ACSs per base_url, then with the current proposal in Pulp 3 it will be a lot of work for a user to do so. Agreed. It was useful. I'm modifying the ticket to have a list of "base path fragments".
Will the refresh command deal with outdated content as well? We need a way to stay up-to-date with the available content and purge content which is no longer available.
- Similar concern when I want to remove an ACS. I'm switching off my local server and I want to make sure it's no longer configured as an ACS. If we have one Remote per ACS, so in Pulp 2 terms base_url + a path, how can a user identify all ACSs for the server they want to shut down? Yes let's use one ACS with multiple paths.
- What happens when I configure Remotes and ACSs for the first time? I haven't run sync yet, so I have no content in Pulp which Remotes refer to. If we need to create RemoteArtifacts, then we'll need to create Content and ContentArtifacts for all the content possible. I was hoping to have the ACS refresh only create RemoteArtifacts without content, and @dkliban suggested the downloaders could match the RemoteArtifact by the checksum of the thing being downloaded. The issue with this approach is that RemoteArtifact requires ContentArtifact, which in turn requires Content. I need to think this part over some more. What do you think?
I think one of the primary use cases of setting up an ACS is to quickly populate a Pulp instance with the content available locally, so I would imagine that in most cases a user will have set up an ACS and not yet synced repos. I think we will need to find a path to create not only RAs but also Content and CAs.
- RemoteArtifact creation will cover Content with Artifacts. What happens with other content? Is it always synced from the authoritative source, since all the logic about which source to choose happens in the downloader for artifacts only? Yes I think this can only apply to artifacts, because that is the only "binary data" there is; all the other data I think falls into the metadata category, at which point the ACS should not be more authoritative than the remote url. What do you think?
It's like an on-demand sync in the sense that when called it indexes the remote metadata and creates remote artifacts.
What is implied by the indexing of remote metadata? I read the proposal as the refresh action creates RemoteArtifacts and that's it. That's the idea: the refresh only generates RemoteArtifacts, and then when subsequent syncs occur these RemoteArtifacts are "preferred".
I think we should also add logic for what happens when:
- An ACS is deleted --> remove its RAs. See the hypothetical sketch after this list.
- A remote that is used in an ACS was deleted; make sure we don't blow up with a 500.
- A remote that is used in an ACS was updated (for example the remote.url). The next refresh should create new RAs but also remove outdated ones.
- Will we support updates to an existing ACS, for example adding/removing base_path fragments? In that case, will refresh be automatically triggered or will it require a manual refresh?
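As referenced in the first item above, a hypothetical sketch of the deletion cleanup; the signal wiring and the assumption that an ACS's RAs can be found via its remote are both illustrative:

from django.db.models.signals import post_delete
from django.dispatch import receiver

from pulpcore.plugin.models import AlternateContentSource, RemoteArtifact


@receiver(post_delete, sender=AlternateContentSource)
def cleanup_acs_remote_artifacts(sender, instance, **kwargs):
    # Remove the RemoteArtifacts this ACS contributed via its remote.
    RemoteArtifact.objects.filter(remote_id=instance.remote_id).delete()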
Updated by jsherril@redhat.com over 3 years ago
I didn't see the main content updated for 'base path fragments'. Is that intentional? Was that updated elsewhere?
Updated by bmbouter over 3 years ago
- Subject changed from As a user, I have Alternate Content Sources to [EPIC] As a user, I have Alternate Content Sources
Updated by ipanova@redhat.com over 3 years ago
- Sprint changed from Sprint 101 to Sprint 102
Updated by rchan about 3 years ago
- Sprint changed from Sprint 105 to Sprint 106
Updated by rchan about 3 years ago
- Sprint changed from Sprint 106 to Sprint 107
Updated by rchan about 3 years ago
- Sprint changed from Sprint 107 to Sprint 108
Updated by rchan about 3 years ago
- Sprint changed from Sprint 108 to Sprint 109
Updated by ppicka about 3 years ago
- Status changed from NEW to CLOSED - CURRENTRELEASE