Project

Profile

Help

Issue #1839

closed

sn.dat is succeptible to race conditions

Added by mihai.ibanescu@gmail.com over 8 years ago. Updated over 4 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
2.7.1
Platform Release:
2.13.0
OS:
RHEL 7
Triaged:
Yes
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 14
Quarter:

Description

In my case, /var/lib/pulp is mounted over NFS by two nodes. Both nodes accept connections over HTTP.

I have a situation where /var/lib/pulp/sn.dat is some 9000 bytes long, with the number being repeated:

33434526334358977334345263343592093343452633435897733434526334360199733434526334358977334345263343592093343452633435897733434526334360240942333434526334358977334345263343592093343452633435897733434526334360199733434526334358977334345263343592093343452633435897733434526334360240985657334345263343589773343452633435920933434526334358977334345263343601997334345263343589773343452633435920933434526334358977334345263343602409423334345263343589773343452633435920933434526

I believe the documentation should flag that file as necessary to be on local storage. There is already a configuration option to change the location, so maybe it's in the docs already and I haven't seen it.

Actions #1

Updated by bmbouter over 8 years ago

This issue was not triaged today because it needs more information.

What is the problem with having a 9000 byte sn.dat file with repeating numbers in it?
Why is that file not appropriate to live on shared storage?

As an aside, I'm not sure what the sn.dat file does so that would be helpful too.

Actions #2

Updated by mihai.ibanescu@gmail.com over 8 years ago

As far as I can tell from reading the code, sn.dat is used for generating the X509 certificates returned by the login action using the API.

The code attempts to "atomically" increment the integer in it, which is then passed to openssl as the serial number for the certificate.

The locking is only functional for blocking other processes on the same host; however, if two nodes mount the same /var/lib/pulp over NFS, then you may end up with a corrupted file, which is what I observed. At that point, openssl will throw a nastygram at you, embedding a string "unable to write random state" in the middle of the cert.

Actions #3

Updated by mihai.ibanescu@gmail.com over 8 years ago

You may consider storing the serial number in mongo, since you can atomically increment and retrieve a field in Mongo, like a sequence in an RDBMS. I think findAndUpdate would be the call.

Actions #4

Updated by rbarlow over 8 years ago

  • Subject changed from Incrementing /var/lib/pulp/sn.dat is not NFS-safe to /login/ is not NFS-safe
  • Description updated (diff)
Actions #5

Updated by mhrivnak over 8 years ago

  • Severity changed from 2. Medium to 1. Low
  • Triaged changed from No to Yes
  • Tags deleted (Documentation)
Actions #6

Updated by mihai.ibanescu@gmail.com over 8 years ago

Actually, even with the file on local storage, I believe there is still a race condition in how the file is locked before being written:

cat /var/lib/pulp.sn
X11 forwarding request failed on channel 0
11111529215046068806981115292150460688125751111152921504606880698111529215046068818351111115292150460688069811152921504606881257511111529215046068806981115292150460688322761111152921504606880698111529215046068812575111115292150460688069811152921504606881835111111529215046068806981115292150460688125751111152921504606880698111529215046068847758

The expected value should be 1111152921504606880698 plus several thousands.
Notice how that value got appended multiple times.

Looking at server/pulp/server/managers/auth/cert/cert_generator.py - RLock seems to be thread safe, but isn't that code running under apache, as different processes?

Actions #7

Updated by mhrivnak over 8 years ago

You're right. By default, the REST API runs in three independent processes per machine. So I can see how there might be a race condition.

Actions #8

Updated by mihai.ibanescu@gmail.com over 8 years ago

We are absolutely hitting this race condition now, hence why I want to figure out a solution.

Is there a reason NOT to use purely random serial numbers? That would practically solve the race condition (by eliminating the need for a serial file to begin with).

I see that codepath being used in two circumstances:

  • by the login call via server/pulp/server/webservices/views/root_actions.py - the serial number itself does not get persisted anywhere, as far as I can tell
  • by the consumer registration via server/pulp/server/managers/consumer/cud.py - again, I don't see the serial number stored

So a random value would work fine if you ignore the very low probability for a collision. http://www.ietf.org/rfc/rfc5280.txt only requires it to be a positive integer, and it MUST be unique, but I would interpret that only if you actually tracked to whom you issued a certificate, which in the case of pulp I don't think that's important.

Finally, instead of 100% random, here is what Microsoft does:

https://technet.microsoft.com/en-us/library/cc784789(v=WS.10).aspx

So one could pack the struct timeval from gettimeofday (in python that would be accessible via datetime.datetime.utcnow()) as 4+4=8 bytes, and add another 11 bytes of randomness for a total of 19 bytes.

Actions #9

Updated by bmbouter over 8 years ago

  • Sprint Candidate changed from No to Yes
Actions #10

Updated by amacdona@redhat.com over 8 years ago

  • Groomed changed from No to Yes
Actions #11

Updated by mhrivnak about 8 years ago

  • Sprint/Milestone set to 29
Actions #12

Updated by bizhang about 8 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bizhang
Actions #14

Updated by mhrivnak almost 8 years ago

  • Sprint/Milestone changed from 29 to 32
Actions #15

Updated by mhrivnak almost 8 years ago

  • Priority changed from Normal to High
  • Severity changed from 1. Low to 3. High

I'm bumping the priority and severity. Yesterday a user hit this problem, which prevented consumer creation from working. The POST to the api/v2/consumers/ endpoint would hang for 5 minutes and then time out. It took two of us 3 hours of investigation to track down the cause as this bug.

Actions #16

Updated by bmbouter almost 8 years ago

  • Subject changed from /login/ is not NFS-safe to sn.dat is succeptable to race conditions

As mhrivnak points out, the sn.dat is also used in the creation of certificates for consumers. That means there are at least two root causes, one for cert generation for /login/ and another for cert generation for consumer certificates. I'm not sure if we want to track these as one or two issues, but right now I'm assuming it's one. As such, I've retitled the bug to indicate that sn.dat is not race condition free generally.

Also, I don't think the shared filesystem is the issue. The system that mhrivnak and I investigated was a single server with sn.dat on local storage. Given that evidence, I believe the issue is that httpd is handling concurrent requests which is the root of the race condition.

Actions #17

Updated by mihai.ibanescu@gmail.com almost 8 years ago

Thank you for changing the summary. As you can see from comment #6, we were seeing this problem even when we switched to local storage.

We ended up patching Pulp to generate random numbers for the serial, and it works fine. We also stopped using the login call in our application that uses the REST API :-)

Actions #18

Updated by mihai.ibanescu@gmail.com almost 8 years ago

I think at this point it's worth asking: why would anyone want to use the login call at all?

The reason I was using it: you won't ferry the user's credentials for every single call.

In our setup we have a load balancer in front of the 3 pulp nodes. Load balancers can also be used for terminating the SSL connection and theoretically saving some CPU on the nodes. They typically will expose the client's cert as a custom HTTP header that they inject.

Because Pulp (to my knowledge and after reading the code) does not know how to look for the cert information in a header, this means the load balancer will not be able to terminate the SSL connection.

So we reverted back to basic auth (although we haven't gone through configuring the pulp nodes to speak plain HTTP and terminate the SSL connection on the balancer - we still had clients that were using the login call). We could do this, because we're behind a firewall. In a public internet scenario you probably do want the login call, but then you can't do anything about SSL termination.

Added by bizhang almost 8 years ago

Revision 6bb42d90 | View on GitHub

Remove sn.dat for serial number storage

closes #1839 https://pulp.plan.io/issues/1839

Added by bizhang almost 8 years ago

Revision 6bb42d90 | View on GitHub

Remove sn.dat for serial number storage

closes #1839 https://pulp.plan.io/issues/1839

Actions #20

Updated by bizhang almost 8 years ago

  • Status changed from POST to MODIFIED
Actions #22

Updated by semyers almost 8 years ago

  • Platform Release set to 2.13.0
Actions #23

Updated by pcreech over 7 years ago

  • Status changed from MODIFIED to 5
Actions #24

Updated by mihai.ibanescu@gmail.com over 7 years ago

  • Subject changed from sn.dat is succeptable to race conditions to sn.dat is succeptible to race conditions
Actions #25

Updated by pcreech over 7 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #26

Updated by bmbouter almost 7 years ago

  • Sprint set to Sprint 16
Actions #27

Updated by bmbouter almost 7 years ago

  • Sprint changed from Sprint 16 to Sprint 14
Actions #28

Updated by bmbouter almost 7 years ago

  • Sprint/Milestone deleted (32)
Actions #29

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
Actions #30

Updated by bmbouter over 4 years ago

  • Category deleted (14)

We are removing the 'API' category per open floor discussion June 16, 2020.

Also available in: Atom PDF