Issue #9276

Content app can have unusable/closed db connections

Added by ttereshc 2 months ago. Updated 1 day ago.

Status: NEW
Priority: High
Assignee: -
Category: -
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Katello
Sprint: Sprint 108
Quarter:

Description

We've seen django.db.utils.InterfaceError: connection already closed multiple times while using the content app. See the related issues.

The current workaround is to reset the db connection in multiple places in the code. This "solution" is unlikely to be reliable unless we reset the db connection before every db request.

This problem needs investigation to understand why Django doesn't take care of the db connections itself, and how to solve it properly.


Related issues

Related to Pulp - Issue #9275: Content app db connection can be closed while matching a distribution (CLOSED - CURRENTRELEASE)
Related to Container Support - Issue #8672: Registry handler loses database connection (CLOSED - CURRENTRELEASE)
Related to Pulp - Issue #6045: Pulp content app looses database connection (CLOSED - CURRENTRELEASE)
Has duplicate Pulp - Issue #9515: content app doesn't survive PostgreSQL disconnect (CLOSED - DUPLICATE)

History

#1 Updated by ttereshc 2 months ago

So far the problem has been seen with the content app only. I wonder if the fact that it's a standalone Django script (with a django.setup() call) could be related.

#2 Updated by ttereshc 2 months ago

  • Related to Issue #9275: Content app db connection can be closed while matching a distribution added

#3 Updated by ttereshc 2 months ago

  • Related to Issue #8672: Registry handler loses database connection added

#4 Updated by ttereshc 2 months ago

  • Related to Issue #6045: Pulp content app looses database connection added

#5 Updated by fao89 about 2 months ago

  • Triaged changed from No to Yes

#6 Updated by daviddavis about 2 months ago

Off the top of my head I see two possible solutions.

First, _reset_db_connection currently only resets the connection if it's unusable or obsolete. I imagine that the connection is becoming unusable or obsolete after this function is called but before the connection is used, so one solution might be to always reset the connection in _reset_db_connection. I have no idea what impact this would have (e.g. on performance).

A second possible solution is to wrap each db query in a function (or similar) that attempts to execute the query and retries it if it fails.
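The retry idea could look roughly like the sketch below. It is deliberately framework-free so that it runs on its own: ConnectionClosedError stands in for django.db.utils.InterfaceError, and reset is a hypothetical callback standing in for whatever discards the broken connection. None of these names come from pulpcore.

```python
import functools


class ConnectionClosedError(Exception):
    """Stand-in for django.db.utils.InterfaceError (assumption for this sketch)."""


def retry_on_closed_connection(reset, retries=1, exceptions=(ConnectionClosedError,)):
    """Retry the wrapped query after calling reset() when the connection is closed.

    reset is a hypothetical callback that discards the broken connection.
    """
    def decorator(query):
        @functools.wraps(query)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return query(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise  # out of retries: surface the original error
                    reset()
        return wrapper
    return decorator


class FakeConnection:
    """Toy connection that starts out closed, as after a PostgreSQL restart."""

    def __init__(self):
        self.closed = True
        self.resets = 0

    def reset(self):
        self.closed = False
        self.resets += 1

    def fetch(self):
        if self.closed:
            raise ConnectionClosedError("connection already closed")
        return ["distribution-a"]


conn = FakeConnection()


@retry_on_closed_connection(conn.reset)
def list_distributions():
    return conn.fetch()
```

Retrying like this is only safe for queries that can be repeated without side effects, so it would need care anywhere the wrapped function writes to the database.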

#7 Updated by daviddavis about 2 months ago

We discussed this at the pulpcore team meeting and we agreed that we need a better understanding of the issue (including a reproducer).

One thing that would help to reproduce the problem is to remove calls to _reset_db_connection.

#8 Updated by ipanova@redhat.com 29 days ago

  • Sprint/Milestone changed from 3.16.0 to 3.17.0

#9 Updated by adam.winberg@smhi.se 9 days ago

We encounter this every time we reboot the postgres server holding the pulp database. We have to restart the content service to get it working again.

Using python3-pulpcore-3.14.3-1.el8.noarch and python3-pulp-rpm-3.14.0-1.el8.noarch.

#10 Updated by evgeni 3 days ago

Still an issue in 3.16.0.

I opened #9515 with some more details before finding this one, and I can't close it as a duplicate :(

#11 Updated by mdellweg 3 days ago

  • Has duplicate Issue #9515: content app doesn't survive PostgreSQL disconnect added

#12 Updated by evgeni 2 days ago

Let me briefly recap a few thoughts I had in #9515 (and later in a discussion with the Katello Platform team):

Reproducer

The most trivial reproducer I could come up with is to restart the PostgreSQL database and then try to access the content index at /pulp/content. Under normal circumstances this lists content, but when the DB connection is broken it yields a 500 error.

Severity

This issue is currently marked as "medium", but I would propose to raise it to at least High (maybe even Urgent).

Connection drops between the content app and the DB can happen for a multitude of reasons:

  • Network issues when the DB is externally hosted
  • Firewalls disliking long-running connections when the DB is externally hosted
  • Any kind of maintenance done to the DB (update, config change, you name it)

All of them can "just happen" and all of them result in a broken content app, where the user might not be directly aware of the correlation (as they usually expect the app to "just reconnect").

Workers? API?

The issue seems only to affect the content app.

Workers "just die" when the connection to the DB drops, and then systemd restarts them (we deploy them with Restart=always, and so do you).

The API seems to recover from the disconnect just fine without restarts (or any log messages that I've seen).

Workarounds

reset DB connection

For me, it was sufficient to add a call to Handler._reset_db_connection in the get_status_blocking method, which gets called as part of the heartbeat. But given the origin of the issue, I am sure this is just papering over the real problem.
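A minimal sketch of that workaround, with the database replaced by a toy object so it runs standalone. FakeConnection, reset_if_unusable (standing in for Handler._reset_db_connection) and save_status are hypothetical names, not pulpcore code.

```python
import asyncio


class FakeConnection:
    """Toy connection; usable=False simulates a dropped DB connection."""

    def __init__(self):
        self.usable = True
        self.reconnects = 0

    def reset_if_unusable(self):
        # Stand-in for Handler._reset_db_connection (assumption).
        if not self.usable:
            self.usable = True
            self.reconnects += 1

    def save_status(self):
        if not self.usable:
            raise RuntimeError("connection already closed")


async def heartbeat(conn, beats, interval=0.0):
    """Reset the connection if needed before each heartbeat write."""
    for _ in range(beats):
        conn.reset_if_unusable()
        conn.save_status()
        await asyncio.sleep(interval)


conn = FakeConnection()
conn.usable = False  # simulate a PostgreSQL restart
asyncio.run(heartbeat(conn, beats=2))
```

The sketch also shows why this only papers over the problem: the connection is repaired at the next heartbeat at the earliest, so a request arriving between the drop and that heartbeat can still hit a closed connection.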

die like workers

There is probably a way to make the process die, as happens with the workers, instead of hanging there, broken. This would allow systemd to restart it.
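One way this could look, as a sketch only: catch the connection error at a single outer point and exit with a non-zero status. ConnectionClosedError stands in for django.db.utils.InterfaceError, and run_or_exit is a hypothetical helper, not something that exists in pulpcore.

```python
import sys


class ConnectionClosedError(Exception):
    """Stand-in for django.db.utils.InterfaceError (assumption for this sketch)."""


def run_or_exit(step, exit_code=1):
    """Run step(); if the DB connection is closed, terminate the process."""
    try:
        return step()
    except ConnectionClosedError as exc:
        print(f"database connection lost, exiting: {exc}", file=sys.stderr)
        sys.exit(exit_code)  # raises SystemExit; the service manager restarts us


def broken_step():
    raise ConnectionClosedError("connection already closed")
```

This only helps if the content app's unit is deployed with a restart policy such as Restart=always, as the workers are.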

#13 Updated by ttereshc 2 days ago

  • Priority changed from Normal to High
  • Sprint set to Sprint 107
  • Tags Katello added

#14 Updated by dkliban@redhat.com 2 days ago

In pulpcore 3.15 and 3.16 it is possible to check for this error in two places: in the authenticate middleware, which runs for every request, and in the heartbeat code.

In pulpcore 3.14 we did not have this middleware. However, I propose adding a middleware just for checking the db connection. We would also want to check the db connection status in the heartbeat.
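A dedicated middleware of that kind might look like the sketch below. It follows the (request, handler) shape of an aiohttp middleware but does not import aiohttp, so it runs standalone; in a real aiohttp app the inner function would additionally be decorated with aiohttp.web.middleware. make_db_check_middleware and ensure_connection are hypothetical names for illustration.

```python
import asyncio


def make_db_check_middleware(ensure_connection):
    """Build a middleware that calls ensure_connection() before every request.

    ensure_connection is a hypothetical callable that verifies the DB
    connection and resets it if it is unusable.
    """
    async def db_check_middleware(request, handler):
        ensure_connection()
        return await handler(request)

    return db_check_middleware


# Toy usage: the "request" is just a path and the handler echoes it back.
calls = []


async def handler(request):
    return f"200 {request}"


middleware = make_db_check_middleware(lambda: calls.append("checked"))
result = asyncio.run(middleware("/pulp/content/", handler))
```

If the real check performs blocking database I/O, it would probably need to run outside the event loop (for example in a thread executor) rather than being called directly as it is here.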

#15 Updated by rchan 1 day ago

  • Sprint changed from Sprint 107 to Sprint 108
