Range requests for on demand content return the full file (Kickstarts fail for on_demand repos)
When fetching an rpm from the content app and specifying the Range header, like so:
curl -k http://foreman-nuc2.usersys.redhat.com/pulp/content/Demo/Library/custom/CentOS7/main/Packages/s/sg3_utils-1.37-19.el7.x86_64.rpm -H "Range: bytes=1384-44339" > foo.rpm
if the package is being lazily downloaded, the entire file is returned and not JUST the range:
$ ls -l foo.rpm -h
-rw-rw-r--. 1 jlsherri jlsherri 646K Jun 4 09:03 foo.rpm
Once the package is downloaded, it behaves as you'd expect:
$ curl -k http://foreman-nuc2.usersys.redhat.com/pulp/content/Demo/Library/custom/CentOS7/main/Packages/s/sg3_utils-1.37-19.el7.x86_64.rpm -H "Range: bytes=1384-44339" > foo2.rpm
$ ls -l foo2.rpm
-rw-rw-r--. 1 jlsherri jlsherri 42956 Jun 4 09:04 foo2.rpm
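The two results above can be compared mechanically. The following is a minimal sketch, assuming a hypothetical helper named `check_range_response` (it is not part of Pulp or curl). It takes the status code and body length a client observed and lists how they differ from what an inclusive `Range: bytes=start-end` request should produce.

```python
from typing import List


def expected_range_length(start: int, end: int) -> int:
    """Byte ranges are inclusive on both ends, so the length is end - start + 1."""
    return end - start + 1


def check_range_response(status: int, body_length: int, start: int, end: int) -> List[str]:
    """Return a list of problems; an empty list means the range was honoured.

    Hypothetical helper for illustration only.
    """
    problems = []
    if status != 206:
        problems.append(f"expected status 206, got {status}")
    wanted = expected_range_length(start, end)
    if body_length != wanted:
        problems.append(f"expected {wanted} bytes, got {body_length}")
    return problems
```

For `bytes=1384-44339` the expected length is 42956 bytes, which matches the size of foo2.rpm. The report does not show the status code returned in the lazy case, so any status passed to the helper for that case would be an assumption.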
On el7 (at least, and probably other releases), this causes yum/anaconda to close the connection as soon as it has received the amount of data it requested. The content app is still writing the rest of the file at that point, which leads to this error:
[2021-06-04 11:57:02 +0000]  [ERROR] Error handling request
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/aiohttp/web_protocol.py", line 422, in _handle_request
    resp = await self._request_handler(request)
  File "/usr/lib64/python3.6/site-packages/aiohttp/web_app.py", line 499, in _handle
    resp = await handler(request)
  File "/usr/lib/python3.6/site-packages/pulpcore/content/handler.py", line 138, in stream_content
    return await self._match_and_stream(path, request)
  File "/usr/lib/python3.6/site-packages/pulpcore/content/handler.py", line 387, in _match_and_stream
    request, StreamResponse(headers=headers), ca
  File "/usr/lib/python3.6/site-packages/pulpcore/content/handler.py", line 501, in _stream_content_artifact
    response = await self._stream_remote_artifact(request, response, remote_artifact)
  File "/usr/lib/python3.6/site-packages/pulpcore/content/handler.py", line 651, in _stream_remote_artifact
    download_result = await downloader.run()
  File "/usr/lib/python3.6/site-packages/pulpcore/download/base.py", line 227, in run
    return await self._run(extra_data=extra_data)
  File "/usr/lib/python3.6/site-packages/pulp_rpm/app/downloaders.py", line 90, in _run
    to_return = await self._handle_response(response)
  File "/usr/lib/python3.6/site-packages/pulpcore/download/http.py", line 189, in _handle_response
    await self.handle_data(chunk)
  File "/usr/lib/python3.6/site-packages/pulpcore/content/handler.py", line 636, in handle_data
    await response.write(data)
  File "/usr/lib64/python3.6/site-packages/aiohttp/web_response.py", line 470, in write
    await self._payload_writer.write(data)
  File "/usr/lib64/python3.6/site-packages/aiohttp/http_writer.py", line 107, in write
    self._write(chunk)
  File "/usr/lib64/python3.6/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport
[04/Jun/2021:11:57:02 +0000] "GET /pulp/content/Demo/Library/custom/CentOS7/main/Packages/s/sg3_utils-1.37-19.el7.x86_64.rpm HTTP/1.1" 500 0 "-" "urlgrabber/3.10 yum/3.4.3"
Because anaconda receives the entire rpm instead of just the range it requested (the rpm header), it retries the request, and Pulp keeps trying to return the entire file.
Updated by dalley over 2 years ago
I'm not able to reproduce this on latest master; I will try with 3.11.
In : Artifact.objects.all()
Out: <QuerySet [<Artifact: pk=864fb941-43c8-4ff6-b747-0c8e755881c4>]>
In : exit
## A completely different file from ^^ one, this one has never been downloaded before
(pulp) [vagrant@pulp3-source-centos7 pulpcore]$ curl -k http://pulp3-source-centos7.localhost.example.com/pulp/content/fixture/duck-0.8-1.noarch.rpm -H "Range: bytes=1-200" > duck-0.8-1.noarch.rpm
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   200  100   200    0     0    464      0 --:--:-- --:--:-- --:--:--   464
(pulp) [vagrant@pulp3-source-centos7 pulpcore]$ ls -al
total 240
... snip ...
-rw-rw-r--. 1 vagrant vagrant 200 Jun 4 16:16 duck-0.8-1.noarch.rpm
-rw-rw-r--. 1 vagrant vagrant 200 Jun 4 16:09 fox-1.1-2.noarch.rpm
... snip ...
(pulp) [vagrant@pulp3-source-centos7 pulpcore]$ python manage.py shell_plus
... snip ...
In : Artifact.objects.all()
Out: <QuerySet [<Artifact: pk=864fb941-43c8-4ff6-b747-0c8e755881c4>, <Artifact: pk=47e95aca-8bf2-4268-8a74-2238df24eeb7>]>
Both files were streamed, and both returned the 200 bytes that curl requested.
Updated by bmbouter about 2 years ago
So what's the recommended behavior here?
- Have it stream and save all of the file even though the client asked for a portion of it (and serve just that to the client)?
- Have it fetch what the client is asking for every time (basically ignoring policy=immediate)?
Updated by email@example.com about 2 years ago
I think it only fetches the ones it thinks it may need to install. I agree that option 1) is probably preferred, but I don't think that option 2) is that terrible (assuming the header request will use the on-disk rpm if available; that was a little unclear).
Most likely the header will be requested and then the rpm later on.
Updated by bmbouter about 2 years ago
I'm going to pursue option 1 as it will result in fewer requests to external servers over time.
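To make option 1 concrete, here is a minimal sketch of the idea: every upstream chunk is kept for saving, while only the bytes inside the requested range are passed on to the client. The names `slice_stream` and `stream_and_save` are hypothetical and not taken from pulpcore; the real handler is asynchronous and writes to an aiohttp response, which this sketch leaves out.

```python
from typing import Iterable, Iterator, Tuple


def slice_stream(
    chunks: Iterable[bytes], start: int, end: int
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (full_chunk, client_part) pairs for an inclusive byte range.

    full_chunk is the whole upstream chunk (to be saved);
    client_part holds only the bytes that fall inside [start, end].
    """
    offset = 0  # absolute position of the first byte of the current chunk
    for chunk in chunks:
        chunk_end = offset + len(chunk)  # exclusive end of this chunk
        lo = max(start, offset)
        hi = min(end + 1, chunk_end)
        client_part = chunk[lo - offset:hi - offset] if lo < hi else b""
        yield chunk, client_part
        offset = chunk_end


def stream_and_save(chunks: Iterable[bytes], start: int, end: int) -> Tuple[bytes, bytes]:
    """Return (saved, sent): the complete file and the part the client receives."""
    saved = bytearray()
    sent = bytearray()
    for full_chunk, client_part in slice_stream(chunks, start, end):
        saved += full_chunk   # stand-in for writing the artifact to storage
        sent += client_part   # stand-in for writing to the client response
    return bytes(saved), bytes(sent)
```

With this shape the client never receives more than it asked for, so it has no reason to close the connection early, and the full file is still available for later requests.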
Also I think we need to get the response headers right, so I'm going to mimic what is responded by an official centos mirror for example:
$ curl -i https://packages.oit.ncsu.edu/centos/7/os/x86_64/Packages/sg3_utils-1.37-19.el7.x86_64.rpm -H "Range: bytes=1384-44339"
HTTP/1.1 206 Partial Content
Date: Thu, 08 Jul 2021 14:35:58 GMT
Server: Apache
Last-Modified: Fri, 03 Apr 2020 21:08:05 GMT
ETag: "a16b8-5a269504aa76b"
Accept-Ranges: bytes
Content-Length: 42956
Content-Range: bytes 1384-44339/661176
Content-Type: application/x-rpm

Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
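The range-related headers in the mirror's response can be derived from the request header and the total file size. The following is a rough sketch of that derivation, assuming hypothetical helpers `parse_simple_range` and `partial_content_headers` (neither is pulpcore API). It only handles a single `bytes=start-end` or `bytes=start-` range; suffix ranges and multi-range requests are deliberately left out.

```python
import re
from typing import Dict, Optional, Tuple

_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def parse_simple_range(header: str, size: int) -> Optional[Tuple[int, int]]:
    """Parse 'bytes=start-end' or 'bytes=start-' into inclusive (start, end).

    Returns None when the header is not in that form or the range
    cannot be satisfied for a file of the given size.
    """
    match = _RANGE_RE.match(header.strip())
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else size - 1
    end = min(end, size - 1)  # clamp to the last byte of the file
    if start > end:
        return None
    return start, end


def partial_content_headers(header: str, size: int) -> Optional[Dict[str, str]]:
    """Build the range-related headers for a 206 response, or None."""
    parsed = parse_simple_range(header, size)
    if parsed is None:
        return None
    start, end = parsed
    return {
        "Accept-Ranges": "bytes",
        "Content-Length": str(end - start + 1),
        "Content-Range": f"bytes {start}-{end}/{size}",
    }
```

Called with `"bytes=1384-44339"` and a size of 661176, this produces the same `Content-Length` and `Content-Range` values as the mirror response above.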