Issue #1839
closedsn.dat is succeptible to race conditions
Description
In my case, /var/lib/pulp is mounted over NFS by two nodes. Both nodes accept connections over HTTP.
I have a situation where /var/lib/pulp/sn.dat is some 9000 bytes long, with the number being repeated:
33434526334358977334345263343592093343452633435897733434526334360199733434526334358977334345263343592093343452633435897733434526334360240942333434526334358977334345263343592093343452633435897733434526334360199733434526334358977334345263343592093343452633435897733434526334360240985657334345263343589773343452633435920933434526334358977334345263343601997334345263343589773343452633435920933434526334358977334345263343602409423334345263343589773343452633435920933434526
I believe the documentation should flag that file as necessary to be on local storage. There is already a configuration option to change the location, so maybe it's in the docs already and I haven't seen it.
Updated by bmbouter over 8 years ago
This issue was not triaged today because it needs more information.
What is the problem with having a 9000 byte sn.dat file with repeating numbers in it?
Why is that file not appropriate to live on shared storage?
As an aside, I'm not sure what the sn.dat file does so that would be helpful too.
Updated by mihai.ibanescu@gmail.com over 8 years ago
As far as I can tell from reading the code, sn.dat is used for generating the X509 certificates returned by the login action using the API.
The code attempts to "atomically" increment the integer in it, which is then passed to openssl as the serial number for the certificate.
The locking is only functional for blocking other processes on the same host; however, if two nodes mount the same /var/lib/pulp over NFS, then you may end up with a corrupted file, which is what I observed. At that point, openssl will throw a nastygram at you, embedding a string "unable to write random state" in the middle of the cert.
Updated by mihai.ibanescu@gmail.com over 8 years ago
You may consider storing the serial number in mongo, since you can atomically increment and retrieve a field in Mongo, like a sequence in an RDBMS. I think findAndUpdate would be the call.
Updated by rbarlow over 8 years ago
- Subject changed from Incrementing /var/lib/pulp/sn.dat is not NFS-safe to /login/ is not NFS-safe
- Description updated (diff)
Updated by mhrivnak over 8 years ago
- Severity changed from 2. Medium to 1. Low
- Triaged changed from No to Yes
- Tags deleted (
Documentation)
Updated by mihai.ibanescu@gmail.com over 8 years ago
Actually, even with the file on local storage, I believe there is still a race condition in how the file is locked before being written:
cat /var/lib/pulp.sn
X11 forwarding request failed on channel 0
11111529215046068806981115292150460688125751111152921504606880698111529215046068818351111115292150460688069811152921504606881257511111529215046068806981115292150460688322761111152921504606880698111529215046068812575111115292150460688069811152921504606881835111111529215046068806981115292150460688125751111152921504606880698111529215046068847758
The expected value should be 1111152921504606880698 plus several thousands.
Notice how that value got appended multiple times.
Looking at server/pulp/server/managers/auth/cert/cert_generator.py - RLock seems to be thread safe, but isn't that code running under apache, as different processes?
Updated by mhrivnak over 8 years ago
You're right. By default, the REST API runs in three independent processes per machine. So I can see how there might be a race condition.
Updated by mihai.ibanescu@gmail.com over 8 years ago
We are absolutely hitting this race condition now, hence why I want to figure out a solution.
Is there a reason NOT to use purely random serial numbers? That would practically solve the race condition (by eliminating the need for a serial file to begin with).
I see that codepath being used in two circumstances:
- by the login call via server/pulp/server/webservices/views/root_actions.py - the serial number itself does not get persisted anywhere, as far as I can tell
- by the consumer registration via server/pulp/server/managers/consumer/cud.py - again, I don't see the serial number stored
So a random value would work fine if you ignore the very low probability for a collision. http://www.ietf.org/rfc/rfc5280.txt only requires it to be a positive integer, and it MUST be unique, but I would interpret that only if you actually tracked to whom you issued a certificate, which in the case of pulp I don't think that's important.
Finally, instead of 100% random, here is what Microsoft does:
https://technet.microsoft.com/en-us/library/cc784789(v=WS.10).aspx
So one could pack the struct timeval from gettimeofday (in python that would be accessible via datetime.datetime.utcnow()) as 4+4=8 bytes, and add another 11 bytes of randomness for a total of 19 bytes.
Updated by bizhang about 8 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bizhang
Updated by bizhang about 8 years ago
- Status changed from ASSIGNED to POST
Updated by mhrivnak almost 8 years ago
- Priority changed from Normal to High
- Severity changed from 1. Low to 3. High
I'm bumping the priority and severity. Yesterday a user hit this problem, which prevented consumer creation from working. The POST to the api/v2/consumers/ endpoint would hang for 5 minutes and then time out. It took two of us 3 hours of investigation to track down the cause as this bug.
Updated by bmbouter almost 8 years ago
- Subject changed from /login/ is not NFS-safe to sn.dat is succeptable to race conditions
As mhrivnak points out, the sn.dat is also used in the creation of certificates for consumers. That means there are at least two root causes, one for cert generation for /login/ and another for cert generation for consumer certificates. I'm not sure if we want to track these as one or two issues, but right now I'm assuming it's one. As such, I've retitled the bug to indicate that sn.dat is not race condition free generally.
Also, I don't think the shared filesystem is the issue. The system that mhrivnak and I investigated was a single server with sn.dat on local storage. Given that evidence, I believe the issue is that httpd is handling concurrent requests which is the root of the race condition.
Updated by mihai.ibanescu@gmail.com almost 8 years ago
Thank you for changing the summary. As you can see from comment #6, we were seeing this problem even when we switched to local storage.
We ended up patching Pulp to generate random numbers for the serial, and it works fine. We also stopped using the login call in our application that uses the REST API :-)
Updated by mihai.ibanescu@gmail.com almost 8 years ago
I think at this point it's worth asking: why would anyone want to use the login call at all?
The reason I was using it: you won't ferry the user's credentials for every single call.
In our setup we have a load balancer in front of the 3 pulp nodes. Load balancers can also be used for terminating the SSL connection and theoretically saving some CPU on the nodes. They typically will expose the client's cert as a custom HTTP header that they inject.
Because Pulp (to my knowledge and after reading the code) does not know how to look for the cert information in a header, this means the load balancer will not be able to terminate the SSL connection.
So we reverted back to basic auth (although we haven't gone through configuring the pulp nodes to speak plain HTTP and terminate the SSL connection on the balancer - we still had clients that were using the login call). We could do this, because we're behind a firewall. In a public internet scenario you probably do want the login call, but then you can't do anything about SSL termination.
Added by bizhang almost 8 years ago
Added by bizhang almost 8 years ago
Revision 6bb42d90 | View on GitHub
Remove sn.dat for serial number storage
Updated by bizhang almost 8 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|6bb42d9029fa3d255cb47948c84a0058e948e833.
Updated by mihai.ibanescu@gmail.com over 7 years ago
- Subject changed from sn.dat is succeptable to race conditions to sn.dat is succeptible to race conditions
Updated by pcreech over 7 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Updated by bmbouter almost 7 years ago
- Sprint changed from Sprint 16 to Sprint 14
Updated by bmbouter over 4 years ago
- Category deleted (
14)
We are removing the 'API' category per open floor discussion June 16, 2020.
Remove sn.dat for serial number storage
closes #1839 https://pulp.plan.io/issues/1839