Content app can have unusable/closed db connections
We've seen django.db.utils.InterfaceError: connection already closed multiple times while using the content app. See the related issues.
A current workaround is to reset the db connection in multiple places in the code. This "solution" is likely unreliable unless we reset the db connection before every db request.
This problem needs investigation to understand why Django doesn't take care of the db connections itself, and how to solve it properly.
#6 Updated by daviddavis about 2 months ago
Off the top of my head I see two possible solutions.
_reset_db_connection currently only resets the connection if it's unusable or obsolete. I imagine the connection becomes unusable or obsolete after this function is called but before the connection is used, so one solution might be to always reset the connection in _reset_db_connection. I have no idea what impact this will have (e.g. will it hurt performance?).
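To illustrate why the conditional reset can miss the failure window, here is a toy model of the check Django performs. This is a sketch, not Django or pulpcore code: the class name and fields are invented, and a real connection wrapper pings the server rather than reading a flag. The point is that `close_if_unusable_or_obsolete` only acts on connections already known to be broken or past their max age, so a connection that dies after the check slips through; the "always reset" option would call `close()` unconditionally instead.

```python
import time


class FakeConnection:
    """Toy stand-in for a DB connection wrapper (illustration only)."""

    def __init__(self, max_age=300):
        self.max_age = max_age            # analogous to Django's CONN_MAX_AGE
        self.opened_at = time.monotonic()
        self.usable = True                # a real wrapper would ping the server
        self.closed = False

    def close(self):
        self.closed = True

    def close_if_unusable_or_obsolete(self):
        # Only closes when the connection is broken or past its max age;
        # a connection that dies *after* this check is not caught.
        expired = time.monotonic() - self.opened_at > self.max_age
        if not self.usable or expired:
            self.close()
```

A healthy, young connection survives the check untouched, which is exactly the case where a later drop goes unnoticed until the next query fails.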
A second possible solution is to wrap each db query in a function that attempts to execute the query and retries it if it fails.
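The retry idea could look something like the decorator below. This is a minimal sketch under assumptions: `ConnectionClosedError` stands in for `django.db.utils.InterfaceError`, and `reset_connection` stands in for something like `Handler._reset_db_connection`; neither name is pulpcore's actual API.

```python
import functools


class ConnectionClosedError(Exception):
    """Stand-in for django.db.utils.InterfaceError in this sketch."""


def retry_on_closed_connection(reset_connection, retries=1):
    """Wrap a db query: on a closed-connection error, reset and retry."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionClosedError:
                    if attempt == retries:
                        raise
                    # Drop the dead connection; the next query reconnects.
                    reset_connection()
        return wrapper
    return decorator
```

One open question with this approach is idempotency: blindly retrying is safe for reads, but writes would need more care.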
Let me briefly recap a few thoughts I had in #9515 (and later in a discussion with the Katello Platform team):
The most trivial reproducer I could come up with is a restart of the PostgreSQL database and then trying to access the content index at
/pulp/content. Under normal circumstances, this will list content, but when the DB connection is broken, it yields a 500 error.
This issue is currently marked as "medium", but I would propose to raise it to at least High (maybe even Urgent).
Connection drops between the content app and the DB can happen for a multitude of reasons:
- Network issues when the DB is externally hosted
- Firewalls disliking long-running connections when the DB is externally hosted
- Any kind of maintenance done to the DB (update, config change, you name it)
All of them can "just happen" and all of them result in a broken content app, where the user might not be directly aware of the correlation (as they usually expect the app to "just reconnect").
The issue seems only to affect the content app.
Workers "just die" when the connection to the DB drops, and then systemd restarts them (we deploy them with
Restart=always, and so do you).
The API seems to recover from the disconnect just fine without restarts (or any log messages that I've seen).
reset DB connection
For me, it was sufficient to add a call to Handler._reset_db_connection in the get_status_blocking method, which gets called as part of the heartbeat. But given the origin of the issue, I am sure this is just papering over the real problem.
die like workers
There is probably a way to make the process die when the DB connection breaks, like it happens to the workers, instead of hanging there broken. This would allow systemd to restart it.
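The "die like workers" idea could be sketched as a heartbeat check that exits with a non-zero status when the connection is unusable, so that systemd (with Restart=always, as above) brings the content app back up with a fresh connection. The function and parameter names here are assumptions for illustration; `connection_is_usable` stands in for something like Django's `connection.is_usable()`.

```python
import sys


def heartbeat(connection_is_usable, exit=sys.exit):
    """Exit the process when the DB connection is broken.

    A non-zero exit status lets systemd (Restart=always) restart the
    unit, which reconnects to the database on startup.
    """
    if not connection_is_usable():
        exit(1)  # systemd restarts the unit on failure
```

The trade-off versus reconnecting in place is brief downtime on every DB blip, but it guarantees the process never lingers in a broken state.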
#14 Updated by email@example.com 2 days ago
In pulpcore 3.15 and 3.16, it is possible to check for this error in two places: the authentication middleware, which runs for every request, and the heartbeat code.
Pulpcore 3.14 did not have this middleware. However, I propose adding a middleware just for checking the db connection. We would also want to check the db connection status in the heartbeat.
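A connection-checking middleware might be shaped like the sketch below. This is written in the generic "callable wrapping get_response" middleware style; the class name is invented, and `reset_db_connection` is injected as a parameter purely so the sketch stays self-contained (in a real Django or aiohttp middleware it would call the framework's connection handling directly, e.g. Handler._reset_db_connection).

```python
class DbConnectionCheckMiddleware:
    """Hypothetical middleware: reset a stale DB connection before each
    request is handled (sketch only, not pulpcore's actual API)."""

    def __init__(self, get_response, reset_db_connection):
        self.get_response = get_response
        self.reset_db_connection = reset_db_connection

    def __call__(self, request):
        # Runs on every request, so a dropped connection is replaced
        # before any handler issues a query.
        self.reset_db_connection()
        return self.get_response(request)
```

Because it runs per request, this covers exactly the gap described above: a connection that dies between requests is reset before it can raise InterfaceError inside a handler.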