Issue #2618
closed
"blob" files are delivered with incorrect content headers
Description
Related to https://pulp.plan.io/issues/1781.
Let's say you create, populate and publish a docker repository. That done, "blob" files will be available at paths of the form /pulp/docker/v2/{repo_id}/blobs/{blob_sum}. As a concrete example, one repository I worked with made the following URLs available:
https://rhel-6-8-pulp-2-12/pulp/docker/v2/8f12187a-8f95-488c-8f1b-c627f404f809/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
https://rhel-6-8-pulp-2-12/pulp/docker/v2/8f12187a-8f95-488c-8f1b-c627f404f809/blobs/sha256:ffc8a12d3678ba8f82b54c3a9ca8260f56ce4be47748743658d89d8f39e80a04
These "blob" files are gzip-encoded binary files. When a client requests a blob, they expect to receive a gzip-encoded file. A client can verify that they've received a valid file by calculating the checksum of the downloaded file and asserting that it matches the checksum embedded in the file name:
$ wget --server-response --no-check-certificate 'https://rhel-7-3-pulp-2-12/pulp/docker/v2/ff380cd9-f931-4bc7-9198-eec900f19610/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
--2017-03-01 18:18:46-- https://rhel-7-3-pulp-2-12/pulp/docker/v2/ff380cd9-f931-4bc7-9198-eec900f19610/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
Resolving rhel-7-3-pulp-2-12... 192.168.100.177
Connecting to rhel-7-3-pulp-2-12|192.168.100.177|:443... connected.
WARNING: cannot verify rhel-7-3-pulp-2-12's certificate, issued by ‘CN=PulpCA,OU=Development,O=Pulp,L=Raleigh,ST=North Carolina,C=US’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 01 Mar 2017 18:18:21 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.1e-fips mod_wsgi/3.4 Python/2.7.5
Last-Modified: Wed, 01 Mar 2017 18:12:54 GMT
ETag: "20-549af42ef90f3"
Accept-Ranges: bytes
Content-Length: 32
Docker-Distribution-API-Version: registry/2.0
Keep-Alive: timeout=5, max=10000
Connection: Keep-Alive
Length: 32
Saving to: ‘sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4’
sha256:a3ed95caeb02ffe68cdd9fd844 100%[===========================================================>] 32 --.-KB/s in 0s
2017-03-01 18:18:46 (4.23 MB/s) - ‘sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4’ saved [32/32]
$ ls -1
sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
$ file 'sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4: gzip compressed data
$ sha256sum 'sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
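The same check can be scripted. The following is a minimal sketch, not part of Pulp: the helper name `blob_matches_digest` is hypothetical, and it assumes the digest has the `algorithm:hexdigest` form seen in the blob URLs above.

```python
import hashlib


def blob_matches_digest(data: bytes, digest: str) -> bool:
    """Return True if `data` hashes to `digest` (e.g. 'sha256:a3ed...').

    Hypothetical helper for illustration; not a Pulp or Docker API.
    """
    algorithm, _, expected = digest.partition(':')
    # hashlib.new() raises ValueError for an unknown algorithm name.
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected.lower()
```

For a downloaded file whose name is its digest, as in the transcript above, `blob_matches_digest(open(name, 'rb').read(), name)` should return True.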
So far, so good. The trouble is with how RHEL 6 handles these requests. Some background information:
- The 'Content-Type' header states the type of the file as requested by the client application. In this case, a client requested a gzip archive. As a result, the Content-Type: application/x-gzip header should be set, or the 'Content-Type' header should be omitted entirely.
- The 'Content-Encoding' header states which additional encoding, if any, has been applied on top of what the client requested. In this case, any encoding supported by both wget and the server may be applied, but given that the file is already gzip-encoded, it doesn't make sense to further encode the file, and the 'Content-Encoding' header should be omitted.
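These two expectations are easy to express as a check on a response's headers. The sketch below is illustrative only: `find_header_problems` is a hypothetical name, and the rules it encodes are just the two bullet points above, not anything Pulp or Docker ships.

```python
def find_header_problems(headers):
    """Return a list of complaints about a blob response's headers.

    `headers` maps header names to values. Hypothetical helper that encodes
    the two expectations described above; it is not part of Pulp.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    problems = []

    content_type = normalised.get('content-type')
    if content_type is not None:
        # Drop parameters such as "; charset=UTF-8" before comparing.
        media_type = content_type.split(';')[0].strip().lower()
        if media_type != 'application/x-gzip':
            problems.append('unexpected Content-Type: ' + content_type)

    content_encoding = normalised.get('content-encoding')
    if content_encoding is not None:
        problems.append('unexpected Content-Encoding: ' + content_encoding)

    return problems
```

A response carrying neither header yields an empty list, while one carrying `Content-Type: text/plain; charset=UTF-8` and `Content-Encoding: x-gzip` yields two complaints.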
Here's what RHEL 6 actually does:
$ wget --server-response --no-check-certificate 'https://rhel-6-8-pulp-2-12/pulp/docker/v2/d8d948e9-e87d-4fa9-be83-f62ba91210b8/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
--2017-03-01 18:24:32-- https://rhel-6-8-pulp-2-12/pulp/docker/v2/d8d948e9-e87d-4fa9-be83-f62ba91210b8/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
Resolving rhel-6-8-pulp-2-12... 192.168.100.79
Connecting to rhel-6-8-pulp-2-12|192.168.100.79|:443... connected.
WARNING: cannot verify rhel-6-8-pulp-2-12's certificate, issued by ‘CN=PulpCA,OU=Development,O=Pulp,L=Raleigh,ST=North Carolina,C=US’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 01 Mar 2017 18:24:31 GMT
Server: Apache/2.2.15 (Red Hat)
Last-Modified: Wed, 01 Mar 2017 18:13:11 GMT
ETag: "9ff30-20-549af43f32d1f"
Accept-Ranges: bytes
Content-Length: 32
Docker-Distribution-API-Version: registry/2.0
Connection: close
Content-Type: text/plain; charset=UTF-8
Content-Encoding: x-gzip
Length: 32 [text/plain]
Saving to: ‘sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4’
sha256:a3ed95caeb02ffe68cdd9fd844 100%[===========================================================>] 32 --.-KB/s in 0s
2017-03-01 18:24:32 (3.40 MB/s) - ‘sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4’ saved [32/32]
$ ls -1
sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
$ file 'sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4: gzip compressed data
$ sha256sum 'sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4 sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
$ gunzip --to-stdout 'sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4' | sha256sum
5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef -
Notice how the 'Content-Encoding' header is set? That's wrong. In this case, wget ignores the header and doesn't gunzip the file before saving it to disk. Other, more compliant libraries do act on the header, so I've gunzipped and checksummed the file to show what they will produce. One such example is Python's "requests", which honours the 'Content-Encoding' header. For example, check out this simple script:
#!/usr/bin/env python3
import requests


def main():
    url = (
        'https://rhel-6-8-pulp-2-12/pulp/docker/v2/d8d948e9-e87d-4fa9-be83-f6'
        '2ba91210b8/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb164'
        '22d00e8a7c22955b46d4'
    )
    with open('blob-decoded', 'wb') as handle:
        # verify=False skips certificate checks, like wget's
        # --no-check-certificate above.
        handle.write(requests.get(url, verify=False).content)


if __name__ == '__main__':
    exit(main())
The blob-decoded file written to disk is gunzipped, as suggested by the 'Content-Encoding' header:
$ ls -1
get_decoded.py
$ ./get_decoded.py
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
$ ls -1
blob-decoded
get_decoded.py
$ file blob-decoded
blob-decoded: data
$ sha256sum blob-decoded
5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef blob-decoded
Notice the checksum of the file? It's the same as in the case where the file is fetched with wget and then manually gunzipped.
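The effect can be reproduced without a server. The sketch below uses only the standard library and made-up payload bytes (not the real blob): it names a blob by the SHA-256 of its gzip-encoded form, then shows that a client which decodes the body, as one honouring `Content-Encoding: x-gzip` would, ends up with bytes whose digest no longer matches.

```python
import gzip
import hashlib

# Made-up stand-in for a layer's contents; the real blob differs.
layer = b'example layer contents\n'

# The stored blob is the gzip-encoded form, and its digest names it.
blob = gzip.compress(layer)
blob_digest = hashlib.sha256(blob).hexdigest()

# A client that saves the body untouched (as wget did) gets `blob`.
saved_as_is = blob
# A client that honours Content-Encoding gunzips the body first.
saved_decoded = gzip.decompress(blob)

print(hashlib.sha256(saved_as_is).hexdigest() == blob_digest)    # True
print(hashlib.sha256(saved_decoded).hexdigest() == blob_digest)  # False
```

The second comparison is the failure a digest-verifying client would hit when the server labels an already-gzipped blob with a 'Content-Encoding' header.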
The long and short of it is that RHEL 6 is incorrectly adding a Content-Encoding: x-gzip header to docker blobs. No other platforms do this. This is true for the current Pulp 2.12 nightlies. Here are the packages installed on my current RHEL 6 test system:
[root@rhel-6-8-pulp-2-12 ~]# rpm -qa | grep -i httpd
httpd-tools-2.2.15-56.el6_8.3.x86_64
httpd-2.2.15-56.el6_8.3.x86_64
[root@rhel-6-8-pulp-2-12 ~]# rpm -qa | grep -i pulp | sort
mod_wsgi-3.4-2.pulp.el6.x86_64
pulp-admin-client-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
pulp-docker-admin-extensions-2.3.1-0.1.alpha.git.5.052c506.el6.noarch
pulp-docker-plugins-2.3.1-0.1.alpha.git.5.052c506.el6.noarch
pulp-puppet-admin-extensions-2.12.2-0.1.alpha.git.2.f338f5d.el6.noarch
pulp-puppet-plugins-2.12.2-0.1.alpha.git.2.f338f5d.el6.noarch
pulp-python-admin-extensions-2.0.1-0.1.alpha.git.6.8c46f3f.el6.noarch
pulp-python-plugins-2.0.1-0.1.alpha.git.6.8c46f3f.el6.noarch
pulp-rpm-admin-extensions-2.12.2-0.1.alpha.git.19.da51b5f.el6.noarch
pulp-rpm-plugins-2.12.2-0.1.alpha.git.19.da51b5f.el6.noarch
pulp-selinux-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
pulp-server-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
python-isodate-0.5.0-4.pulp.el6.noarch
python-kombu-3.0.33-6.pulp.el6.noarch
python-pulp-bindings-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
python-pulp-client-lib-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
python-pulp-common-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
python-pulp-docker-common-2.3.1-0.1.alpha.git.5.052c506.el6.noarch
python-pulp-oid_validation-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
python-pulp-puppet-common-2.12.2-0.1.alpha.git.2.f338f5d.el6.noarch
python-pulp-python-common-2.0.1-0.1.alpha.git.6.8c46f3f.el6.noarch
python-pulp-repoauth-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch
python-pulp-rpm-common-2.12.2-0.1.alpha.git.19.da51b5f.el6.noarch
python-pulp-streamer-2.12.2-0.1.alpha.git.17.b101ff0.el6.noarch