Story #8088

closed

As a user, I can configure Pulp to virus scan files prior to saving or serving them

Added by ByteSore over 3 years ago. Updated over 2 years ago.

Status:
CLOSED - DUPLICATE
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

Ticket moved to GitHub: pulp/pulpcore#1957 (https://github.com/pulp/pulpcore/issues/1957)


Motivation

The business policy here is that all downloaded files should be scanned for viruses. The on_demand policy is being used with a PyPI remote because syncing all of PyPI just to use a few files is not practical.

User Experience

  1. The user sets up a virus scanner like clamav in daemon mode.
  2. Administrator configures the setting SECURITY_SCAN_SHELL (see the example configuration after this list).
  3. User restarts Pulp.
  4. All content already saved in Pulp is not scanned. Any new content brought into Pulp through any policy type, e.g. immediate, on_demand, or streamed, is first scanned before being saved or handed to a client.
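
As a sketch, an administrator's configuration might look like the following. SECURITY_SCAN_SHELL is the setting proposed by this story (it does not exist in Pulp today), and the {file} placeholder is an assumption; how the path gets into the command is an open question in the TBD section below.

```python
# /etc/pulp/settings.py -- hypothetical example for this story.
# The configured command must exit 0 when the scanned file is clean.
# "{file}" is an assumed placeholder for the path to scan (see TBD below).
SECURITY_SCAN_SHELL = "clamdscan --fdpass --no-summary {file}"
```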

Implementation

on_demand and streamed policies

The pulpcore_content app handles the policy=on_demand and policy=streamed modes. Currently those modes "stream" bits to clients, meaning they begin serving data before they have received all of it. When scanning is enabled those modes can no longer do this; they have to "store, scan, and forward" the data to clients, because security scanners can only scan entire files, not a stream of data. So the first change to prototype is a store-and-forward mode for when this setting is in place.
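
As a minimal sketch of that store-scan-forward flow in the aiohttp-based content app, assuming a downloader whose run() returns a result with a local file path, plus the proposed SECURITY_SCAN_SHELL setting with an assumed {file} placeholder:

```python
import asyncio

from aiohttp import web
from django.conf import settings


async def store_scan_and_forward(downloader):
    """Hypothetical flow: download the whole file, scan it, then serve it."""
    # "Store": fetch the complete file; nothing is streamed to the client
    # while the download is in progress.
    result = await downloader.run()

    # "Scan": hand the whole file to the admin-configured scanner command.
    cmd = settings.SECURITY_SCAN_SHELL.format(file=result.path)
    proc = await asyncio.create_subprocess_shell(cmd)
    returncode = await proc.wait()

    # "Forward": only serve (and save) the file if the scanner exited 0.
    if returncode != 0:
        # Which HTTP error code to use is an open question (see TBD below).
        raise web.HTTPForbidden(reason="file failed virus scan")
    return web.FileResponse(result.path)
```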

immediate policy

To handle the immediate policy, a new "Artifact" stage should be created. The Artifact stages live here.

The pipeline is a series of stages, and this stage should only be used when the SECURITY_SCAN_SHELL setting is set. The default stage construction happens here.
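
A hedged sketch of what such a stage could look like, built on the Stages API's Stage base class; VirusScanStage is hypothetical, and where the downloaded temporary file lives is an assumption:

```python
import asyncio

from django.conf import settings
from pulpcore.plugin.stages import Stage


class VirusScanStage(Stage):
    """Hypothetical pipeline stage: scan each downloaded artifact and only
    pass clean content downstream to be saved."""

    async def run(self):
        async for d_content in self.items():
            for d_artifact in d_content.d_artifacts:
                # Assumed: the downloader stage left the file at this path.
                path = d_artifact.artifact.file.name
                cmd = settings.SECURITY_SCAN_SHELL.format(file=path)
                proc = await asyncio.create_subprocess_shell(cmd)
                if await proc.wait() != 0:
                    # Non-zero exit code: throw the file away by failing
                    # the task instead of saving the Artifact.
                    raise RuntimeError(f"{path} failed virus scan")
            await self.put(d_content)
```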

Submitting data for scanning

In all policy types data should be submitted for scanning by calling the SECURITY_SCAN_SHELL command with the path to the temporary file added into that command somehow. A return code of 0 from the AV scanner shell subprocess call will tell Pulp to accept the file and that it's safe. Any other return code will tell Pulp to throw away the file and send the client an error code of some kind.
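
As a sketch of that contract, assuming a {file} placeholder answers the "where does the path go" question in the TBD section below (scan_file is a hypothetical helper):

```python
import shlex
import subprocess


def scan_file(scan_shell_template: str, path: str) -> bool:
    """Substitute the file path into the configured command and run it.

    Returns True when the scanner exits 0 (the file is accepted as safe);
    any other exit code means Pulp should discard the file.
    """
    cmd = scan_shell_template.format(file=shlex.quote(path))
    result = subprocess.run(cmd, shell=True)
    return result.returncode == 0
```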

Notes

  • The on_demand and streamed policies have to submit files to the scanner one-by-one. This is unavoidable.
  • Having the clamav scanner run in daemon mode should resolve the "startup time" problem.
  • The shell command needs to be callable by the Pulp user on whichever host the pulpcore_content process is running, as well as on any pulpcore reserved worker.

TBD

  • What HTTP error code to submit to clients when the AV scanner says the file is not safe?
  • The file saved may have a filename like /path/to/tmp/dirs/e0e9fe76-a80a-47c7-8255-1ec4c8f8c7da. This doesn't have a filetype. Does it need a filetype?
  • How will the subprocess call know where in the command to put the path to the file to scan?

Files

irclog.txt (2.14 KB), IRC log, ByteSore, 01/13/2021 03:34 PM
Actions #1

Updated by bmbouter over 3 years ago

Extending the content app to make a call during the workflow could be pretty straightforward. This would cause the entire file to be downloaded before any part of it could be served; is that ok?

So the workflow for the content app would be:

  1. User requests the file to be downloaded from a repo with, e.g. policy="on_demand"
  2. pulpcore-content downloads that file (not streaming any bit to the user yet)
  3. Calls out to the virus scanner knowing the file path
  4. Reads an "ok" to proceed based on the return code maybe?
  5. Serves and saves the file per the on_demand policy.

For steps 3 and 4, I imagined there could be a system-wide config with a script that pulp would call, which an admin would configure. The path to the file would be the first positional argument to the script. Then for step 4, an exit code of 0 from the script would tell pulp to proceed, and anything else to not.
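
For illustration, an admin-provided script following that contract might look like this (delegating to clamdscan is just an example choice):

```python
#!/usr/bin/env python3
"""Hypothetical admin-configured scan script: Pulp passes the file path as
the first positional argument and reads the exit code (0 = proceed)."""
import subprocess
import sys

path = sys.argv[1]
# Delegate to clamdscan, whose exit code is 0 for clean and 1 for infected.
result = subprocess.run(["clamdscan", "--fdpass", "--no-summary", path])
sys.exit(result.returncode)
```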

@ByteSore what do you think about all this?

Actions #2

Updated by ByteSore over 3 years ago

Sounds like a plan.

Actions #3

Updated by fao89 over 3 years ago

  • Tracker changed from Issue to Story
  • % Done set to 0
  • Severity deleted (2. Medium)
  • OS deleted (RHEL 7)
  • Triaged deleted (No)
Actions #4

Updated by ByteSore over 3 years ago

I was thinking...
Since the scanner takes a while to spin up, scan all the files, and do its thing, maybe it's a good idea to check if there are multiple files being downloaded (i.e. a package with all its dependencies).
Park them all in a folder, or store all the file paths in a text file which can be used by the scanner.
If you look at clamav, it has an option to scan files listed in a text file: --file-list=FILE
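
A rough sketch of that batching idea (scan_batch is hypothetical; clamscan's --file-list option is real):

```python
import subprocess
import tempfile


def scan_batch(paths):
    """Hypothetically scan a batch of downloaded files in one scanner
    invocation instead of spawning one process per file."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as file_list:
        file_list.write("\n".join(paths) + "\n")
        file_list.flush()
        # clamscan exits 0 only when every listed file is clean.
        result = subprocess.run(["clamscan", f"--file-list={file_list.name}"])
        return result.returncode == 0
```
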
Actions #5

Updated by bmbouter about 3 years ago

  • Subject changed from implementing virus scan on on_demand repository to As a user, I can configure Pulp to virus scan files prior to saving or serving them
  • Description updated (diff)
Actions #6

Updated by mdellweg about 3 years ago

Should we implement scanning files on upload too? ...to close all holes.

Actions #7

Updated by vriesdemichael almost 3 years ago

I've been looking into this issue and came to a few additional conclusions

  1. Having to wait for a virus scan result majorly disrupts the asynchronous nature of the streamed and on_demand policies. I am still unsure how to deal with heavy load on the virus scanner, especially with the streamed policy. The on_demand policy at least triggers the virus scan, which means that at some point in the future you can access the artifacts without the synchronous virus scanner call.
  2. It would make sense to store the scan result in the database to prevent an infected file from being scanned over and over every time it is requested (see the model sketch after this list).
  3. A rescan mechanism would be prudent once the basic functionality is implemented, to prevent access to infected files which were not yet covered by earlier antivirus definitions. It could be a TTL for a scan result or some batch action executed periodically.
  4. An HTTP status code 409 (Conflict) makes sense, with a message to indicate that the nature of the conflict is an infected artifact.
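
A hedged sketch of such a model, assuming pulpcore's Artifact model is importable as shown (ScanResult and the TTL helper are hypothetical):

```python
from django.db import models
from django.utils import timezone

from pulpcore.app.models import Artifact  # assumed import path


class ScanResult(models.Model):
    """Hypothetical record of the latest scan verdict for an artifact, so
    an infected file is not rescanned on every request and stale verdicts
    can be picked up by a periodic, TTL-based rescan."""

    artifact = models.OneToOneField(Artifact, on_delete=models.CASCADE)
    clean = models.BooleanField()
    scanned_at = models.DateTimeField(default=timezone.now)

    def is_stale(self, ttl):
        """True once the verdict is older than the rescan TTL (a timedelta)."""
        return timezone.now() - self.scanned_at > ttl
```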

In my attempts to implement the functionality so far, it feels like this amounts to creating two new policies, streamed+scanner and on-demand+scanner. I'll add a PR soon.

Some more food for thought: Should we keep a content artifact for infected content or remove it from the repository?

Actions #8

Updated by bmbouter almost 3 years ago

vriesdemichael wrote:

I've been looking into this issue and came to a few additional conclusions

  1. Having to wait for a virus scan result majorly disrupts the asynchronous nature of the streamed and on_demand policies. I am still unsure how to deal with heavy load on the virus scanner, especially with the streamed policy. The on_demand policy at least triggers the virus scan, which means that at some point in the future you can access the artifacts without the synchronous virus scanner call.

Agreed. I think we'll have to see in practice how long we can delay clients and prevent timeouts. Some probably have built-in retry behaviors, and others can support receiving an HTTP 429 with maybe a Retry-After header. It's hard though, because what I've learned is that each client is different and some just won't work. As an alternative, although it would not be as great for users, perhaps systems with virus scanning should not offer on_demand or streamed at all.
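
For illustration, the 429 idea could be as simple as the following in the aiohttp-based content app (the 30-second value is an arbitrary assumption):

```python
from aiohttp import web

# Hypothetical response while a scan is still in progress: ask well-behaved
# clients to back off and retry instead of timing out.
response = web.Response(status=429, headers={"Retry-After": "30"})
```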

  2. It would make sense to store the scan result in the database to prevent an infected file from being scanned over and over every time it is requested.

That could be helpful. One of my concerns is storing 'infected' artifacts at all, because pulpcore and plugin code directly accesses Artifacts, so if they are there they will be "used". For this reason I was thinking it would be best not to save Artifacts that fail virus scanning at all.

  3. A rescan mechanism would be prudent once the basic functionality is implemented, to prevent access to infected files which were not yet covered by earlier antivirus definitions. It could be a TTL for a scan result or some batch action executed periodically.

Agreed. We have added administrator commands for these kinds of things here. In other cases we've added API endpoints so that users can trigger repo-level or system-wide checks, e.g. the repair command here. The former is good if administrators are the actor, the latter if users are the actor. Getting them to run periodically can be wrapped around either approach.

  4. An HTTP status code 409 (Conflict) makes sense, with a message to indicate that the nature of the conflict is an infected artifact.

Sounds good to me.

In my attempts to implement the functionality so far, it feels like this amounts to creating two new policies, streamed+scanner and on-demand+scanner. I'll add a PR soon.

What did you think about disallowing these altogether? I'm ok with keeping them; I'm just worried about clients timing out while virus scanning occurs.

I'd like to help answer any questions you have and see the PR when it comes in to help it get through the process.

Some more food for thought: Should we keep a content artifact for infected content or remove it from the repository?

I responded to this top-to-bottom, but similar to what I was pitching earlier, I think we actually shouldn't save the Artifact at all. A non-saved Artifact also cannot be associated with Content, and therefore can't be associated with a repository.

Actions #9

Updated by wibbit over 2 years ago

Some more food for thought: Should we keep a content artifact for infected content or remove it from the repository?

I responded to this top-to-bottom, but similar to what I was pitching earlier, I think we actually shouldn't save the Artifact at all. A non-saved Artifact also cannot be associated with Content, and therefore can't be associated with a repository.

Just a thought on this.

If the data is not stored anywhere, it will be quite hard to investigate issues after the fact (specifically false positives). If we can't store the data in a quarantined directory, then we should at least keep a record of the source URL, as well as the hash of the downloaded artifact, somewhere so that it can be retrieved and compared.

Of course, clamav supports the --move switch to quarantine infected files; would this be supported in any way?

Actions #10

Updated by wibbit over 2 years ago

wibbit wrote:

Some more food for thought: Should we keep a content artifact for infected content or remove it from the repository?

I responded to this top-to-bottom, but similar to what I was pitching earlier, I think we actually shouldn't save the Artifact at all. A non-saved Artifact also cannot be associated with Content, and therefore can't be associated with a repository.

Just a thought on this.

If the data is not stored anywhere, it will be quite hard to investigate issues after the fact (specifically false positives). If we can't store the data in a quarantined directory, then we should at least keep a record of the source URL, as well as the hash of the downloaded artifact, somewhere so that it can be retrieved and compared.

Of course, clamav supports the --move switch to quarantine infected files; would this be supported in any way?

Just replying to myself: it seems to make the most sense to leave any quarantining of data to the scanning program, as it can also move the data to a location that pulp has no access to.

This is certainly something that I can see us implementing in due course.

Actions #11

Updated by pulpbot over 2 years ago

  • Description updated (diff)
  • Status changed from NEW to CLOSED - DUPLICATE
