Issue #9276
closed
Content app can have unusable/closed db connections in pulpcore 3.15/3.16
Description
We've seen django.db.utils.InterfaceError: connection already closed multiple times while using the content app. See related issues.
A current workaround is to reset the db connection in multiple places in the code. This "solution" is likely not reliable if we are not resetting db connection before every db request.
This problem needs investigation to understand why Django doesn't take care of the db connections itself, and how to solve it properly.
Related issues
Updated by ttereshc over 3 years ago
So far the problem is seen with the content app only. I wonder if the fact that it's a standalone django script (with a django.setup() call) can be related.
Updated by ttereshc over 3 years ago
- Related to Issue #9275: Content app db connection can be closed while matching a distribution added
Updated by ttereshc over 3 years ago
- Related to Issue #8672: Registry handler loses database connection added
Updated by ttereshc over 3 years ago
- Related to Issue #6045: Pulp content app looses database connection added
Updated by daviddavis over 3 years ago
Off the top of my head I see two possible solutions.
First, _reset_db_connection currently only resets the connection if it's unusable or obsolete. I imagine that this connection is becoming unusable or obsolete after this function is called but before it's used, so one solution might be to always reset the connection in _reset_db_connection. I have no idea what impacts this will have (e.g. will it impact performance?).
Second possible solution is to wrap each db query in a function or something that will attempt to execute the query and then retry it if it fails.
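A minimal sketch of that second idea, not pulpcore code. The names retry_on_closed_connection and ConnectionClosedError are hypothetical; in the content app the caught exceptions would presumably be Django's InterfaceError and OperationalError, and the reset callback would be something like Handler._reset_db_connection.

```python
import functools


class ConnectionClosedError(Exception):
    """Hypothetical stand-in for django.db.utils.InterfaceError / OperationalError."""


def retry_on_closed_connection(reset, errors=(ConnectionClosedError,), attempts=2):
    """Wrap a query function: on a connection error, reset and try again.

    `reset` is any callable that discards the stale connection so the next
    query opens a fresh one (an assumption about how the reset behaves).
    """
    def decorator(query):
        @functools.wraps(query)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return query(*args, **kwargs)
                except errors:
                    if attempt == attempts:
                        raise  # out of retries: surface the original error
                    reset()
        return wrapper
    return decorator
```

The retry is only safe for queries that can be repeated without side effects, which is one reason wrapping every db call this way may not be a general fix.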
Updated by daviddavis over 3 years ago
We discussed this at the pulpcore team meeting and we agreed that we need a better understanding of the issue (including a reproducer).
One thing that would help to reproduce the problem is to remove calls to _reset_db_connection.
Updated by ipanova@redhat.com over 3 years ago
- Sprint/Milestone changed from 3.16.0 to 3.17.0
Updated by adam.winberg@smhi.se over 3 years ago
We encounter this every time we reboot the postgres server holding the pulp database. We have to restart the content service to get it working again.
Using python3-pulpcore-3.14.3-1.el8.noarch python3-pulp-rpm-3.14.0-1.el8.noarch
Updated by evgeni over 3 years ago
Still an issue in 3.16.0.
I had opened #9515 with some more details, before finding this one and can't close as duplicate :(
Updated by mdellweg over 3 years ago
- Has duplicate Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 added
Updated by evgeni over 3 years ago
Let me briefly recap a few thoughts I had in #9515 (and later in a discussion with the Katello Platform team):
Reproducer
The most trivial reproducer I could come up with is a restart of the PostgreSQL database and then trying to access the content index at /pulp/content. Under normal circumstances, this will list content, but when the DB connection is broken, it yields a 500 error.
Severity
This issue is currently marked as "medium", but I would propose to raise it to at least High (maybe even Urgent).
Connection drops between the content app and the DB can happen for a multitude of reasons:
- Network issues when the DB is externally hosted
- Firewalls disliking long-running connections when the DB is externally hosted
- Any kind of maintenance done to the DB (update, config change, you name it)
All of them can "just happen" and all of them result in a broken content app, where the user might not be directly aware of the correlation (as they usually expect the app to "just reconnect").
Workers? API?
The issue seems only to affect the content app.
Workers "just die" when the connection to the DB drops, and then systemd restarts them (we deploy them with Restart=always, and so do you).
API seems to recover from the disconnect just fine without restarts (or any log messages, that I've seen).
Workarounds
reset DB connection
For me, it was sufficient to add a call to Handler._reset_db_connection in the get_status_blocking method, which gets called as part of the heartbeat. But given the origin of the issue, I am sure this is just papering over the real issue.
die like workers
There is probably a way to make the process die, like it happens to the workers, instead of hanging there, broken. This would allow systemd to restart it.
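A rough sketch of what "die like workers" could look like, under stated assumptions: run_or_die and ConnectionClosedError are hypothetical names, and check_db stands for whatever periodic database access the content app already performs (for example the heartbeat). Whether exiting from inside the content app's event loop is acceptable is not settled in this thread.

```python
import sys


class ConnectionClosedError(Exception):
    """Hypothetical stand-in for django.db.utils.InterfaceError / OperationalError."""


def run_or_die(check_db, errors=(ConnectionClosedError,), exit_code=1):
    """Call check_db(); exit the process if the database is unreachable.

    Exiting leaves recovery to the service manager, e.g. a systemd unit
    configured with Restart=always.
    """
    try:
        check_db()
    except errors as exc:
        print(f"database connection lost: {exc}", file=sys.stderr)
        sys.exit(exit_code)  # raises SystemExit with the given code
```

The trade-off is that in-flight requests are dropped on exit, whereas a reconnect keeps the process serving.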
Updated by ttereshc over 3 years ago
- Priority changed from Normal to High
- Sprint set to Sprint 107
- Tags Katello added
Updated by dkliban@redhat.com over 3 years ago
In pulpcore 3.15 and 3.16 it is possible to check for this error in 2 places: the authenticate middleware which is run for every request and in the heartbeat code.
In pulpcore 3.14 we did not have this middleware. However, I propose adding a middleware just for checking the db connection. We would also want to check the db connection status in the heartbeat.
Updated by dkliban@redhat.com about 3 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
Updated by pulpbot about 3 years ago
- Status changed from ASSIGNED to POST
Updated by ttereshc about 3 years ago
- Has duplicate deleted (Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7)
Updated by ttereshc about 3 years ago
- Related to Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 added
Updated by ttereshc about 3 years ago
- Subject changed from Content app can have unusable/closed db connections to Content app can have unusable/closed db connections in pulpcore 3.15/3.16
Updated by rchan about 3 years ago
- Sprint changed from Sprint 108 to Sprint 109
Added by dkliban@redhat.com about 3 years ago
Updated by dkliban@redhat.com about 3 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|3faa649ddb0737c23d1e309a8c38ecb41804cebe.
Updated by dkliban@redhat.com about 3 years ago
- Copied to Backport #9598: Backport #9276 to 3.16: Content app can have unusable/closed db connections in pulpcore 3.15/3.16 added
Updated by pulpbot about 3 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Handles closed db connections
When the authentication middleware was added in pulpcore 3.15, it became the first place in the content app that made an attempt to use the database. As a result, it is a convenient place to handle InterfaceError and OperationalError which are raised when the database connection has been closed. When this occurs, Handler._reset_db_connection() is called to clean up the database connection and the middleware tries to use the database again.
If the database connection is closed later in the handling of the request by the content app, the user will still get a 500 error. However, the next request will be handled properly.
This patch also adds a call to Handler._reset_db_connection() inside the heartbeat method.
fixes: #9276 https://pulp.plan.io/issues/9276
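The approach described in the commit message can be sketched roughly as follows. This is an illustration, not the code from the changeset: make_db_retry_middleware, use_db and ConnectionClosedError are hypothetical names, the exception is a stand-in for Django's InterfaceError and OperationalError, and reset stands in for Handler._reset_db_connection(). The (request, handler) signature follows the shape of an aiohttp middleware, but the sketch uses only the standard library.

```python
import asyncio


class ConnectionClosedError(Exception):
    """Hypothetical stand-in for django.db.utils.InterfaceError / OperationalError."""


def make_db_retry_middleware(use_db, reset, errors=(ConnectionClosedError,)):
    """Build a middleware that touches the database once before the handler.

    If the first database access fails with a connection error, the
    connection is reset and the access is tried once more. A second
    failure propagates to the caller and would surface as a 500 response.
    """
    async def middleware(request, handler):
        try:
            use_db(request)
        except errors:
            reset()
            use_db(request)
        return await handler(request)
    return middleware
```

As the commit message notes, this only covers a connection that is already closed when the request arrives; a connection that drops later, inside the handler, is not retried here.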