Issue #9276

closed

Content app can have unusable/closed db connections in pulpcore 3.15/3.16

Added by ttereshc over 2 years ago. Updated over 2 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Category: -
Severity: 2. Medium
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Katello
Sprint: Sprint 109

Description

We have repeatedly seen django.db.utils.InterfaceError: connection already closed while using the content app. See the related issues.

The current workaround is to reset the db connection in multiple places in the code. This "solution" is unlikely to be reliable unless we reset the db connection before every db request.

This problem needs investigation to understand why Django doesn't take care of the db connections itself, and how to solve it properly.


Related issues

Related to Pulp - Issue #9275: Content app db connection can be closed while matching a distribution (CLOSED - CURRENTRELEASE)
Related to Container Support - Issue #8672: Registry handler loses database connection (CLOSED - CURRENTRELEASE, ipanova@redhat.com)
Related to Pulp - Issue #6045: Pulp content app looses database connection (CLOSED - CURRENTRELEASE, daviddavis)
Related to Pulp - Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 (CLOSED - CURRENTRELEASE, dkliban@redhat.com)
Copied to Pulp - Backport #9598: Backport #9276 to 3.16: Content app can have unusable/closed db connections in pulpcore 3.15/3.16 (CLOSED - CURRENTRELEASE, dkliban@redhat.com)

Actions #1

Updated by ttereshc over 2 years ago

So far the problem has been seen with the content app only. I wonder whether the fact that it's a standalone Django script (with a django.setup() call) is related.

Actions #2

Updated by ttereshc over 2 years ago

  • Related to Issue #9275: Content app db connection can be closed while matching a distribution added
Actions #3

Updated by ttereshc over 2 years ago

  • Related to Issue #8672: Registry handler loses database connection added
Actions #4

Updated by ttereshc over 2 years ago

  • Related to Issue #6045: Pulp content app looses database connection added
Actions #5

Updated by fao89 over 2 years ago

  • Triaged changed from No to Yes
Actions #6

Updated by daviddavis over 2 years ago

Off the top of my head I see two possible solutions.

First, _reset_db_connection currently only resets the connection if it's unusable or obsolete. I imagine that the connection becomes unusable or obsolete after this function is called but before the connection is used, so one solution might be to always reset the connection in _reset_db_connection. I have no idea what impact this would have (e.g. will it hurt performance?).

The second possible solution is to wrap each db query in a function (or something similar) that attempts to execute the query and retries it if it fails.
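
To make the second option concrete, here is a minimal sketch of such a retry wrapper. It is not pulpcore code: ConnectionClosedError is a stand-in for the Django exceptions named in this issue, FakeConnection is a hypothetical test double, and the reset callable only plays the role that _reset_db_connection has in the discussion above.

```python
import functools


class ConnectionClosedError(Exception):
    """Stand-in for django.db.utils.InterfaceError / OperationalError (assumption)."""


def retry_on_closed_connection(reset, exceptions=(ConnectionClosedError,), retries=1):
    """Retry a db-touching callable after resetting the connection.

    `reset` is a zero-argument callable that discards the broken connection.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise  # out of retries: surface the original error
                    reset()
        return wrapper
    return decorator


class FakeConnection:
    """Hypothetical connection that starts out closed, as after a DB restart."""

    def __init__(self):
        self.closed = True
        self.resets = 0

    def reset(self):
        self.closed = False
        self.resets += 1

    def query(self):
        if self.closed:
            raise ConnectionClosedError("connection already closed")
        return "rows"


conn = FakeConnection()


@retry_on_closed_connection(conn.reset)
def fetch():
    return conn.query()


result = fetch()  # first attempt fails, the connection is reset, the retry succeeds
```

The obvious cost of this approach is that every query site has to be wrapped, which is the same coverage problem the current workaround has.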

Actions #7

Updated by daviddavis over 2 years ago

We discussed this at the pulpcore team meeting and we agreed that we need a better understanding of the issue (including a reproducer).

One thing that would help to reproduce the problem is to remove calls to _reset_db_connection.

Actions #8

Updated by ipanova@redhat.com over 2 years ago

  • Sprint/Milestone changed from 3.16.0 to 3.17.0
Actions #9

Updated by adam.winberg@smhi.se over 2 years ago

We encounter this every time we reboot the postgres server holding the pulp database. We have to restart the content service to get it working again.

Using python3-pulpcore-3.14.3-1.el8.noarch and python3-pulp-rpm-3.14.0-1.el8.noarch.

Actions #10

Updated by evgeni over 2 years ago

Still an issue in 3.16.0.

I had opened #9515 with some more details before finding this one, and I can't close it as a duplicate :(

Actions #11

Updated by mdellweg over 2 years ago

  • Has duplicate Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 added
Actions #12

Updated by evgeni over 2 years ago

Let me briefly recap a few thoughts I had in #9515 (and later in a discussion with the Katello Platform team):

Reproducer

The most trivial reproducer I could come up with is a restart of the PostgreSQL database and then trying to access the content index at /pulp/content. Under normal circumstances, this will list content, but when the DB connection is broken, it yields a 500 error.

Severity

This issue is currently marked as "medium", but I would propose to raise it to at least High (maybe even Urgent).

Connection drops between the content app and the DB can happen for a multitude of reasons:

  • Network issues when the DB is externally hosted
  • Firewalls disliking long-running connections when the DB is externally hosted
  • Any kind of maintenance done to the DB (update, config change, you name it)

All of them can "just happen" and all of them result in a broken content app, where the user might not be directly aware of the correlation (as they usually expect the app to "just reconnect").

Workers? API?

The issue seems only to affect the content app.

Workers "just die" when the connection to the DB drops, and then systemd restarts them (we deploy them with Restart=always, and so do you).

API seems to recover from the disconnect just fine without restarts (or any log messages, that I've seen).

Workarounds

reset DB connection

For me, it was sufficient to add a call to Handler._reset_db_connection in the get_status_blocking method, which gets called as part of the heartbeat. But given the origin of the issue, I am sure this is just papering over the real problem.

die like workers

There is probably a way to make the process die, like it happens to the workers, instead of hanging there, broken. This would allow systemd to restart it.
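
A rough sketch of that idea, under stated assumptions: InterfaceError below is a stand-in for django.db.utils.InterfaceError, write_heartbeat is a hypothetical callable that touches the database, and exit_fn is injectable only so the behaviour can be demonstrated without killing the interpreter. How the real content app would need to shut down its server process is not covered here.

```python
import sys


class InterfaceError(Exception):
    """Stand-in for django.db.utils.InterfaceError (assumption for this sketch)."""


def heartbeat_or_exit(write_heartbeat, exit_fn=sys.exit):
    """Run one heartbeat; terminate the process if the db connection is gone."""
    try:
        write_heartbeat()
    except InterfaceError as exc:
        print(f"database connection lost, exiting: {exc}", file=sys.stderr)
        exit_fn(1)  # non-zero status so a supervisor sees a failure


def broken_heartbeat():
    # Simulates what the heartbeat sees after the database was restarted.
    raise InterfaceError("connection already closed")


exit_codes = []
heartbeat_or_exit(broken_heartbeat, exit_fn=exit_codes.append)
```

With the default exit_fn the process would end and whatever supervises it (systemd with Restart=always in the worker case described above) would start a fresh one with a fresh connection.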

Actions #13

Updated by ttereshc over 2 years ago

  • Priority changed from Normal to High
  • Sprint set to Sprint 107
  • Tags Katello added
Actions #14

Updated by dkliban@redhat.com over 2 years ago

In pulpcore 3.15 and 3.16 it is possible to check for this error in two places: the authenticate middleware, which is run for every request, and the heartbeat code.

In pulpcore 3.14 we did not have this middleware. However, I propose adding a middleware just for checking the db connection. We would also want to check the db connection status in the heartbeat.

Actions #15

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 107 to Sprint 108
Actions #17

Updated by dkliban@redhat.com over 2 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
Actions #18

Updated by pulpbot over 2 years ago

  • Status changed from ASSIGNED to POST
Actions #19

Updated by ttereshc over 2 years ago

  • Has duplicate deleted (Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7)
Actions #20

Updated by ttereshc over 2 years ago

  • Related to Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 added
Actions #22

Updated by ttereshc over 2 years ago

  • Subject changed from Content app can have unusable/closed db connections to Content app can have unusable/closed db connections in pulpcore 3.15/3.16
Actions #23

Updated by rchan over 2 years ago

  • Sprint changed from Sprint 108 to Sprint 109

Added by dkliban@redhat.com over 2 years ago

Revision 3faa649d

Handles closed db connections

When the authentication middleware was added in pulpcore 3.15, it became the first place in the content app that made an attempt to use the database. As a result, it is a convenient place to handle InterfaceError and OperationalError, which are raised when the database connection has been closed. When this occurs, Handler._reset_db_connection() is called to clean up the database connection and the middleware tries to use the database again.

If the database connection is closed later in the handling of the request by the content app, the user will still get a 500 error. However, the next request will be handled properly.

This patch also adds a call to Handler._reset_db_connection() inside the heartbeat method.

fixes: #9276 https://pulp.plan.io/issues/9276
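
The pattern described in the commit message can be sketched roughly as follows. This is not the actual patch: the exception classes are stand-ins for the Django ones, authenticate and reset_db_connection are hypothetical callables passed in for illustration, and the request is a plain dict. Only the (request, handler) shape is borrowed from aiohttp-style middlewares.

```python
import asyncio


class InterfaceError(Exception):
    """Stand-in for django.db.utils.InterfaceError (assumption)."""


class OperationalError(Exception):
    """Stand-in for django.db.utils.OperationalError (assumption)."""


def make_db_guard(authenticate, reset_db_connection):
    """Build a middleware-shaped coroutine taking (request, handler)."""

    async def middleware(request, handler):
        try:
            request["user"] = authenticate(request)
        except (InterfaceError, OperationalError):
            # The first db access of the request failed: reset and retry once.
            reset_db_connection()
            request["user"] = authenticate(request)
        return await handler(request)

    return middleware


state = {"closed": True, "resets": 0}


def authenticate(request):
    # Hypothetical db lookup that fails while the connection is closed.
    if state["closed"]:
        raise InterfaceError("connection already closed")
    return "anonymous"


def reset_db_connection():
    state["closed"] = False
    state["resets"] += 1


async def handler(request):
    return f"200 for {request['user']}"


guard = make_db_guard(authenticate, reset_db_connection)
result = asyncio.run(guard({}, handler))
```

As the commit message notes, a guard like this only protects the first database access of a request. A connection that drops later, inside the handler, still produces an error for that request.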

Actions #24

Updated by dkliban@redhat.com over 2 years ago

  • Status changed from POST to MODIFIED
Actions #25

Updated by dkliban@redhat.com over 2 years ago

  • Copied to Backport #9598: Backport #9276 to 3.16: Content app can have unusable/closed db connections in pulpcore 3.15/3.16 added
Actions #26

Updated by pulpbot over 2 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
