Issue #1781

Files ending in .gz are delivered with incorrect content headers

Added by rmcgover almost 6 years ago. Updated almost 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 1. Low
Version: Master
Platform Release: 2.12.1
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 14
Quarter:

Description

Files ending in .gz served up by pulp.server.content.web.views are served with an incorrect content type and encoding.

This affects .xml.gz files generated by yum-distributor. For example:

$ curl -v http://192.168.121.51/pulp/repos/f04a1d5b-32d8-4a60-a580-cc158ab25004/repodata/5f90923d6f2e4b3d0f27c13e927821b9a7001b349f99222752c37496a40531d8-updateinfo.xml.gz > /dev/null
*   Trying 192.168.121.51...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 192.168.121.51 (192.168.121.51) port 80 (#0)
> GET /pulp/repos/f04a1d5b-32d8-4a60-a580-cc158ab25004/repodata/5f90923d6f2e4b3d0f27c13e927821b9a7001b349f99222752c37496a40531d8-updateinfo.xml.gz HTTP/1.1
> User-Agent: curl/7.40.0
> Host: 192.168.121.51
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Thu, 17 Mar 2016 03:24:50 GMT
< Server: Apache/2.4.18 (Fedora) OpenSSL/1.0.2g-fips mod_wsgi/4.4.8 Python/2.7.10
< Last-Modified: Thu, 17 Mar 2016 00:57:43 GMT
< ETag: "278-52e341e355108"
< Content-Length: 632
< Content-Type: text/xml
< 

The server incorrectly claims that the response content is Content-Type: text/xml, with no Content-Encoding.
It should be: Content-Type: text/xml, Content-Encoding: gzip.

It looks like this can't be made to work correctly with mod_xsendfile, since that module discards any Content-Encoding header.
If that won't be fixed then some other options could be:

  • set Content-Type: application/x-gzip, if mimetypes.guess_type guessed an encoding of gzip
  • set Content-Type: application/octet-stream, if mimetypes.guess_type guessed any encoding other than None
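
As a rough illustration, the two fallback options above could be combined into one helper. This is only a sketch: the function name `content_type_for` and the final `application/octet-stream` fallback for unknown types are assumptions, not code taken from pulp.server.content.web.views.

```python
import mimetypes


def content_type_for(path):
    """Pick a Content-Type that is safe when Content-Encoding cannot be sent.

    Hypothetical helper; not part of Pulp.
    """
    guessed_type, encoding = mimetypes.guess_type(path)
    if encoding == "gzip":
        # Option 1: describe the gzip container itself.
        return "application/x-gzip"
    if encoding is not None:
        # Option 2: any other encoding (e.g. bzip2) is served as opaque bytes.
        return "application/octet-stream"
    # Assumption: unknown types also fall back to opaque bytes.
    return guessed_type or "application/octet-stream"
```

With either branch, the client receives the file exactly as stored, and the header no longer claims the body is plain XML.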

Related issues

Has duplicate Docker Support - Issue #1868: Pulp on RHEL 6 serves wrong files (CLOSED - DUPLICATE)
Has duplicate Pulp - Issue #2471: Repo download fails for drpm (CLOSED - DUPLICATE)

Associated revisions

Revision b7ed20b6
Added by semyers almost 5 years ago

Don't guess encoding for xsendfile, only content type

mod_xsendfile intentionally drops the Content-Encoding header[0] when responding to a request, but mimetypes.guess_type returns a tuple of ('Content-Type', 'Content-Encoding'). We were already ignoring the encoding component of this tuple, so if mimetypes.guess_type does guess that a non-None encoding value should be returned, all x_send would use was the type component, resulting in the incorrect type being used in the response.

For example: By default, mimetypes.guess_type returns ('text/xml', 'gzip') for a file name like 'metadata.xml.gz', so xsendfile's response incorrectly responds with the 'text/xml' Content-Type header. This change makes mimetypes return ('application/gzip', None) instead, hopefully resulting in a happy client.

[0]: https://tn123.org/mod_xsendfile -- search for "Content-Encoding"

closes #1781 https://pulp.plan.io/issues/1781

History

#1 Updated by mhrivnak almost 6 years ago

The best values here are Content-Type of application/x-gzip, and no Content-Encoding.

mod_mime_magic on RHEL and related distributions ships with a default configuration that (in some opinions, mine included) incorrectly assumes that all gzipped files should have their Content-Encoding header set to x-gzip. Unfortunately, that has led to misconceptions about the correct use of these headers. This bit of RFC 7231 nicely summarizes why it's incorrect:

   If the media type includes an inherent encoding, such as a data
   format that is always compressed, then that encoding would not be
   restated in Content-Encoding even if it happens to be the same
   algorithm as one of the content codings.  Such a content coding would
   only be listed if, for some bizarre reason, it is applied a second
   time to form the representation.

https://tools.ietf.org/html/rfc7231#section-3.1.2.2

#2 Updated by rmcgover almost 6 years ago

The implication seems that my suggestion of Content-Type: text/xml, Content-Encoding: gzip for an .xml.gz file is incorrect, but I can't agree...

I think this part of the RFC is trying to say that it would be incorrect to specify the encoding in Content-Encoding if the encoding is already intrinsic to the Content-Type. For example, for a file compressed by gzip, it would be wrong to respond with Content-Type: application/x-gzip, Content-Encoding: gzip. Unless the file were compressed by gzip twice ("for some bizarre reason" in the RFC's words).

This isn't a problem for reporting Content-type: text/xml, Content-Encoding: gzip, since no inherent encoding from text/xml is being restated in Content-Encoding.

The point is that, if the client applies the decoding corresponding to the server's reported encoding, then the resulting content should be of the type reported in Content-Type. This is true when reporting Content-Type: text/xml, Content-Encoding: gzip, for an .xml.gz file. It would allow browsers to directly open and view the files.

#3 Updated by mhrivnak almost 6 years ago

  • Triaged changed from No to Yes

#4 Updated by jcline@redhat.com almost 6 years ago

  • Triaged changed from Yes to No

I agree with rmcgover. We're pretty clearly violating the RFC and the most correct value would be Content-Type: text/xml, Content-Encoding: gzip. I was not aware that mod_xsendfile tosses out Content-Encoding (although we're not even setting the Content-Encoding in pulp.server.content.web.views).

As an aside, when I originally did the content type detection, I used the Python standard library's `mimetypes`, but I don't think it inspects anything other than the file suffix. I did this because I didn't really want to introduce a new dependency, but it might be worth doing so.

#5 Updated by jcline@redhat.com almost 6 years ago

  • Triaged changed from No to Yes

#6 Updated by Anonymous almost 6 years ago

  • Parent task set to #1683

#7 Updated by mhrivnak almost 6 years ago

Here is a helpful question to consider when trying to understand the difference between Content-Type and Content-Encoding: when would it be appropriate for a response to have the Content-Type header set to "application/x-gzip"?

The answer usually boils down to: when you want to use HTTP to retrieve a file that happens to be gzip compressed, but you do not want the HTTP layer to automatically decompress it or treat it differently than any other file.

Or to think of it another way, in the case where the Content-Type header is "application/x-gzip", that is by definition a "media type includes an inherent encoding, such as a data format that is always compressed", as described in the RFC.

In the case of yum metadata files, an application reasonably wants to receive the original gzipped file, put it somewhere, and later think about opening it and unzipping it. That is certainly pulp's expectation. A file might be retrieved via http, ftp, or from a local filesystem; I want each of those mechanisms to just give me the file unmodified.

Many HTTP libraries, such as python-requests, will automatically unzip the message body when Content-Encoding is set to gzip. That illustrates the different expected use cases. As mentioned in my first comment here, there are some servers on the internet putting misleading Content-Encoding headers on responses. Pulp has encountered this problem, and even has a work-around to ignore the content-encoding for any file that ends in .gz, to prevent python-requests from auto-unzipping it: https://github.com/pulp/nectar/blob/python-nectar-1.5.1-1/nectar/downloaders/threaded.py#L321

Putting "gzip" in either the Content-Type or Content-Encoding header depends on the use case. Do you want the zipped blob? Or do you want the http layer to be aware of the encoding and remove it on the client side, so you can immediately start reading unzipped bytes? For pulp's purposes, I think we want to be serving zipped blobs that clients will write to disk, likely cache for a period of time, and then open for reading as they see fit.
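
The difference between the two header choices can be illustrated with a small simulation. This is a sketch only: `body_as_seen_by_client` is a hypothetical stand-in for an HTTP client that transparently removes a gzip content coding (as described for python-requests above), and it does not use any real HTTP library.

```python
import gzip


def body_as_seen_by_client(headers, raw_body):
    """Mimic a client that auto-decodes a gzip Content-Encoding.

    Hypothetical helper for illustration; not a real library API.
    """
    coding = headers.get("Content-Encoding", "").lower()
    if coding in ("gzip", "x-gzip"):
        # The HTTP layer strips the coding before the caller sees the bytes.
        return gzip.decompress(raw_body)
    # No content coding: the caller gets the stored file unmodified.
    return raw_body


xml = b"<updates/>"
blob = gzip.compress(xml)  # stands in for updateinfo.xml.gz on disk

# Served as a gzip file: the client writes the zipped blob to disk.
as_file = body_as_seen_by_client({"Content-Type": "application/x-gzip"}, blob)

# Served as gzip-encoded XML: the client ends up with the unzipped XML.
as_decoded = body_as_seen_by_client(
    {"Content-Type": "text/xml", "Content-Encoding": "gzip"}, blob
)
```

In the first case the saved file still matches its .gz name and any checksum recorded for it; in the second case it does not.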

#8 Updated by semyers almost 6 years ago

jcline@redhat.com wrote:

We're pretty clearly violating the RFC and the most correct value would be Content-Type: text/xml, Content-Encoding: gzip. I was not aware that mod_xsendfile tosses out Content-Encoding (although we're not even setting the Content-Encoding in pulp.server.content.web.views).

This would be correct if we wanted to deliver an XML file, which I don't think is what we want. We want to deliver whatever was requested, which in this case is a gzip file. What's inside that gzip file is a separate concern, to be handled by the client after receiving the requested file.

While that's my opinion, it seemed like a good idea to ask "How does fedora do it?"

Here's how:

curl -v http://mirrors.rit.edu/fedora/fedora/linux/releases/23/Everything/x86_64/os/repodata/0fa09bb5f82e4a04890b91255f4b34360e38ede964fe8328f7377e36f06bad27-primary.xml.gz >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 129.21.171.72...
* Connected to mirrors.rit.edu (129.21.171.72) port 80 (#0)
> GET /fedora/fedora/linux/releases/23/Everything/x86_64/os/repodata/0fa09bb5f82e4a04890b91255f4b34360e38ede964fe8328f7377e36f06bad27-primary.xml.gz HTTP/1.1
> Host: mirrors.rit.edu
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/x-gzip
< Accept-Ranges: bytes
< ETag: "146301521"
< Last-Modified: Sat, 31 Oct 2015 17:30:17 GMT
< Content-Length: 12106437
< Date: Mon, 04 Apr 2016 22:04:18 GMT
< Server: lighttpd/1.4.39

If we did want to mess around with Content-Encoding: gzip, the filenames should have the appropriate extension for the encoded content type. So, if we really do want to send Content-Type: text/xml, Content-Encoding: gzip, all the metadata XML files should end in .xml, not .gz.

#9 Updated by semyers over 5 years ago

In my opinion, this is a bug.

To fix this, when responding to a request for a gzipped file, like the yum metadata, pulp should include "Content-Type: application/x-gzip" for gzipped files, and not use the "Content-Encoding" header.

#12 Updated by dkliban@redhat.com over 5 years ago

  • Has duplicate Issue #1868: Pulp on RHEL 6 serves wrong files added

#13 Updated by semyers over 5 years ago

I'm not entirely sure that #1868 is a dupe of this (but I'm pretty sure...). Anyone tackling this issue should take a look in there, though, due to the excellent amount of additional detail recorded there.

#15 Updated by pcreech about 5 years ago

Just adding a found workaround:

For the vhost that serves the *.gz content, adding this directive to the vhost entry stops MimeMagic from setting these headers:

MimeMagicFile /dev/null

This allows the server (in my case, pulp_docker) to serve up files without having them auto-decompressed in transit.

#17 Updated by pthomas@redhat.com about 5 years ago

  • Has duplicate Issue #2471: Repo download fails for drpm added

#18 Updated by ipanova@redhat.com about 5 years ago

  • Sprint/Milestone set to 31

#19 Updated by jortel@redhat.com about 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to jortel@redhat.com

#20 Updated by mhrivnak about 5 years ago

  • Sprint/Milestone changed from 31 to 32

#21 Updated by jortel@redhat.com about 5 years ago

While investigating this on EL7 using pulp 2.8.3, I found some interesting behavior.

Created a YUM repository in pulp using the https://repos.fedorapeople.org/repos/pulp/pulp/fixtures/rpm/ repository named "zoo"

Sync and publish the "zoo" repository.

Created an ISO repository in pulp named "files".

Then, used curl to GET the published "updateinfo.xml.gz" and noted the Content-Type: text/xml.

[jortel@el7u ~]$ curl -v GET --insecure https://el7u.redhat.com/pulp/repos/repos/pulp/pulp/fixtures/rpm/repodata/da5a83a4af0e670b1a0a582743555f20cad88da4071f61f20b5c6ab4e3b16df8-updateinfo.xml.gz
* Could not resolve host: GET; Name or service not known
* Closing connection 0
curl: (6) Could not resolve host: GET; Name or service not known
* About to connect() to el7u.redhat.com port 443 (#1)
*   Trying 192.168.122.113...
* Connected to el7u.redhat.com (192.168.122.113) port 443 (#1)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*     subject: CN=el7u.redhat.com,O=Default Company Ltd,L=Default City,C=XX
*     start date: Sep 22 17:57:57 2016 GMT
*     expire date: Sep 22 17:57:57 2017 GMT
*     common name: el7u.redhat.com
*     issuer: CN=el7u.redhat.com,O=Default Company Ltd,L=Default City,C=XX
> GET /pulp/repos/repos/pulp/pulp/fixtures/rpm/repodata/da5a83a4af0e670b1a0a582743555f20cad88da4071f61f20b5c6ab4e3b16df8-updateinfo.xml.gz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: el7u.redhat.com
> Accept: */*
> 
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* skipping SSL peer certificate verification
< HTTP/1.1 200 OK
< Date: Wed, 18 Jan 2017 14:48:30 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.1e-fips mod_fcgid/2.3.9 mod_wsgi/3.4 Python/2.7.5
< Last-Modified: Mon, 16 Jan 2017 21:14:55 GMT
< ETag: "281-5463cacda5029"
< Content-Length: 641
< Content-Type: text/xml

I uploaded the (previously downloaded from "zoo") updateinfo.xml.gz into the "files" repo.

Then, used curl to GET the uploaded "updateinfo.xml.gz" and noted the Content-Type: application/x-gzip.

[jortel@el7u ~]$ curl -v GET --insecure https://el7u.redhat.com/pulp/isos/iso/da5a83a4af0e670b1a0a582743555f20cad88da4071f61f20b5c6ab4e3b16df8-updateinfo.xml.gz
* Could not resolve host: GET; Name or service not known
* Closing connection 0
curl: (6) Could not resolve host: GET; Name or service not known
* About to connect() to el7u.redhat.com port 443 (#1)
*   Trying 192.168.122.113...
* Connected to el7u.redhat.com (192.168.122.113) port 443 (#1)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*     subject: CN=el7u.redhat.com,O=Default Company Ltd,L=Default City,C=XX
*     start date: Sep 22 17:57:57 2016 GMT
*     expire date: Sep 22 17:57:57 2017 GMT
*     common name: el7u.redhat.com
*     issuer: CN=el7u.redhat.com,O=Default Company Ltd,L=Default City,C=XX
> GET /pulp/isos/iso/da5a83a4af0e670b1a0a582743555f20cad88da4071f61f20b5c6ab4e3b16df8-updateinfo.xml.gz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: el7u.redhat.com
> Accept: */*
> 
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
< HTTP/1.1 200 OK
< Date: Wed, 18 Jan 2017 14:45:39 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.1e-fips mod_fcgid/2.3.9 mod_wsgi/3.4 Python/2.7.5
< Last-Modified: Wed, 18 Jan 2017 14:41:51 GMT
< ETag: "281-5465f6ad18d92"
< Accept-Ranges: bytes
< Content-Length: 641
< Content-Type: application/x-gzip
< 

I tried the workaround from comment 15, adding:

MimeMagicFile /dev/null

everywhere I could think of:

  • At the end of httpd.conf (global)
  • In ssl.conf, inside <VirtualHost _default_:443>
  • At the top of pulp_rpm.conf (global)

Note, "MimeMagicFile NEVER_EVER_USE" already exists globally in pulp_docker.conf

Also, tried adding (global) to the top of pulp_rpm.conf:

AddType application/x-gzip .gz 

NOTHING helped.

#22 Updated by jortel@redhat.com about 5 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (jortel@redhat.com)

I'm stumped. Setting back to NEW to let someone else take a crack at it.

#23 Updated by semyers almost 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to semyers

#24 Updated by semyers almost 5 years ago

After completely denuding httpd of any modules that work with content-type, I can confirm that this is all related to xsendfile throwing away the guessed Content-Encoding:
https://github.com/pulp/pulp/blob/master/server/pulp/server/content/web/views.py#L79

I created three files: foo, foo.xml, foo.xml.gz, and then I ran mimetypes.guess_type on them. foo reported (None, None). foo.xml reported ('text/xml', None), foo.xml.gz reported ('text/xml', 'gzip'). For fun, I also ran it against foo.gzip, which reported (None, 'gzip'), which seems especially annoying.

guess_type ultimately uses posixpath.splitext to split the extension off, so it will never see '.xml.gz' as a file's extension. If it does see '.gz' at the end of a file, it quickly shorts out, having at least set Content-Encoding to 'gzip', which we then ignore. It does splitext the file again to try to find the content type, which is how we end up with ('text/xml', 'gzip'). The encoding is decided by the contents of the mimetypes.encodings_map dictionary.
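
The lookup order described above can be reproduced by hand with the standard library. This is a simplified sketch of the steps, not the actual body of guess_type:

```python
import mimetypes
import posixpath

name = "foo.xml.gz"

# splitext only ever peels off the last suffix.
base, ext = posixpath.splitext(name)  # ('foo.xml', '.gz')

# '.gz' is looked up in encodings_map first ...
encoding = mimetypes.encodings_map.get(ext)  # 'gzip'

# ... and the remaining name is split again to find the type suffix.
inner_base, inner_ext = posixpath.splitext(base)  # ('foo', '.xml')
```

The type is then looked up from '.xml', which is why the '.gz' suffix never contributes to the Content-Type.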

Easy solution, low risk:
The mimetypes module is part of the standard library. You can mutate it all you want, but you probably don't want to, since other libs might depend on it behaving a certain way. Looking at the source, you can see that all of the module-level functions delegate to a _db object, which is itself just an instance of mimetypes.MimeTypes. So, for the content serving views, since xsendfile throws away Content-Encoding, we should use our own instance of mimetypes.MimeTypes with an empty encodings_map, which will cause anything.whatever.gz to receive content-type application/gzip.

I'll implement and try to test this tomorrow, and also look into application/gzip vs. application/x-gzip. /etc/mime.types on my system has application/gzip, so I expect I won't have to do anything with or about the Content-Type value, and can instead just let the local system do what it thinks is right via mimetypes.MimeTypes().guess_type, after clearing its encodings map.
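
A minimal sketch of that approach is shown below. The helper name `build_xsendfile_mimetypes` is hypothetical, and the explicit `add_type` call is an assumption added so the result does not depend on the local /etc/mime.types; the change that was actually merged may differ.

```python
import mimetypes


def build_xsendfile_mimetypes():
    """Return a private MimeTypes instance that never reports an encoding.

    Hypothetical helper name; the module-level mimetypes database is untouched.
    """
    db = mimetypes.MimeTypes()
    # With no known encodings, '.gz' is treated as an ordinary type suffix.
    db.encodings_map = {}
    # Assumption: register '.gz' explicitly so the result does not depend on
    # the system's mime.types file.
    db.add_type("application/gzip", ".gz")
    return db


xsendfile_types = build_xsendfile_mimetypes()
content_type, encoding = xsendfile_types.guess_type("metadata.xml.gz")
```

Because only the private instance is changed, other code that calls mimetypes.guess_type directly still sees the default encoding guesses.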

#25 Updated by semyers almost 5 years ago

  • Status changed from ASSIGNED to POST

#26 Updated by semyers almost 5 years ago

  • Status changed from POST to MODIFIED

#27 Updated by semyers almost 5 years ago

  • Platform Release set to 2.12.1

#29 Updated by bizhang almost 5 years ago

  • Status changed from MODIFIED to 5

#30 Updated by bizhang almost 5 years ago

  • Version changed from Master to 2.12.1

#31 Updated by bizhang almost 5 years ago

  • Version changed from 2.12.1 to Master

#32 Updated by pthomas@redhat.com almost 5 years ago

verified


curl -v GET --insecure https://localhost/pulp/repos/pulp/pulp/fixtures/rpm-signed/repodata/59d63638410b0c25d1fc000053358a10424a92a2c11c99298e2d7e7e4d5bff8d-updateinfo.xml.gz
* Could not resolve host: GET; Name or service not known
* Closing connection 0
curl: (6) Could not resolve host: GET; Name or service not known
* About to connect() to localhost port 443 (#1)
*   Trying ::1...
* Connected to localhost (::1) port 443 (#1)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*     subject: CN=mgmt2,OU=Development,O=Pulp,ST=North Carolina,C=US
*     start date: Feb 21 20:23:50 2017 GMT
*     expire date: Feb 21 20:23:50 2018 GMT
*     common name: mgmt2
*     issuer: CN=PulpCA,OU=Development,O=Pulp,L=Raleigh,ST=North Carolina,C=US
> GET /pulp/repos/pulp/pulp/fixtures/rpm-signed/repodata/59d63638410b0c25d1fc000053358a10424a92a2c11c99298e2d7e7e4d5bff8d-updateinfo.xml.gz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost
> Accept: */*
> 
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* skipping SSL peer certificate verification
< HTTP/1.1 200 OK
< Date: Thu, 23 Feb 2017 14:25:58 GMT
< Server: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.1e-fips mod_wsgi/3.4 Python/2.7.5
< Last-Modified: Wed, 22 Feb 2017 20:36:35 GMT
< ETag: "282-5492473e84116"
< Content-Length: 642
< Content-Type: application/gzip
< 
[binary gzip response body omitted]
* Connection #1 to host localhost left intact

#33 Updated by bizhang almost 5 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#34 Updated by bmbouter almost 4 years ago

  • Sprint set to Sprint 16

#35 Updated by bmbouter almost 4 years ago

  • Sprint changed from Sprint 16 to Sprint 14

#36 Updated by bmbouter almost 4 years ago

  • Sprint/Milestone deleted (32)

#37 Updated by bmbouter almost 3 years ago

  • Tags Pulp 2 added
