
Issue #9276

closed

Content app can have unusable/closed db connections in pulpcore 3.15/3.16

Added by ttereshc about 3 years ago. Updated almost 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee: dkliban@redhat.com
Category:
Sprint/Milestone: 3.17.0
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed:
Sprint Candidate:
Tags: Katello
Sprint: Sprint 109
Quarter:

Description

We have seen django.db.utils.InterfaceError: connection already closed multiple times while using the content app. See related issues.

The current workaround is to reset the db connection in multiple places in the code. This "solution" is unlikely to be reliable unless we reset the db connection before every db request.

This problem needs investigation to understand why Django doesn't take care of the db connections itself, and how to solve it properly.

Related issues


Updated by ttereshc about 3 years ago

So far the problem has been seen with the content app only. I wonder whether the fact that it's a standalone Django script (with a django.setup() call) is related.


Updated by ttereshc about 3 years ago

Related to Issue #9275: Content app db connection can be closed while matching a distribution added


Updated by ttereshc about 3 years ago

Related to Issue #8672: Registry handler loses database connection added


Updated by ttereshc about 3 years ago

Related to Issue #6045: Pulp content app looses database connection added


Updated by fao89 about 3 years ago

Triaged changed from No to Yes


Updated by daviddavis about 3 years ago

Off the top of my head I see two possible solutions.

First, _reset_db_connection currently only resets the connection if it's unusable or obsolete. I imagine that this connection is becoming unusable or obsolete after this function is called but before it's used, so one solution might be to always reset the connection in _reset_db_connection. I have no idea what impacts this will have (e.g. will it impact performance?).

A second possible solution is to wrap each db query in a function that attempts to execute the query and retries it if it fails.
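A rough sketch of the retry idea, for illustration only. Everything here is an assumption rather than pulpcore code: ConnectionClosedError stands in for the InterfaceError/OperationalError that Django raises, FakeConnection is a toy connection, and retry_on_closed_connection and match_distribution are hypothetical names. The reset callable plays the role of _reset_db_connection.

```python
import functools


class ConnectionClosedError(Exception):
    """Stand-in for django.db InterfaceError / OperationalError (assumption)."""


def retry_on_closed_connection(reset, retries=1):
    """Retry the wrapped call after calling reset() when the connection is closed.

    `reset` is a hypothetical zero-argument callable that re-establishes the
    connection, playing the role of _reset_db_connection from this thread.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionClosedError:
                    if attempt == retries:
                        raise  # out of retries: let the caller see the error
                    reset()
        return wrapper
    return decorator


class FakeConnection:
    """Toy connection that can be closed behind the application's back."""

    def __init__(self):
        self.closed = False
        self.resets = 0

    def reset(self):
        self.closed = False
        self.resets += 1

    def query(self, sql):
        if self.closed:
            raise ConnectionClosedError("connection already closed")
        return f"result of {sql}"


conn = FakeConnection()


@retry_on_closed_connection(conn.reset)
def match_distribution(path):
    # Hypothetical query; stands in for whatever the content app asks the db.
    return conn.query(f"SELECT distribution WHERE base_path = '{path}'")
```

With this shape, a query that hits a closed connection triggers one reset and one retry; if the retry fails too, the original error still reaches the caller, so a database that is really down is not hidden.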


Updated by daviddavis about 3 years ago

We discussed this at the pulpcore team meeting and we agreed that we need a better understanding of the issue (including a reproducer).

One thing that would help to reproduce the problem is to remove calls to _reset_db_connection.


Updated by ipanova@redhat.com about 3 years ago

Sprint/Milestone changed from 3.16.0 to 3.17.0


Updated by adam.winberg@smhi.se about 3 years ago

We encounter this every time we reboot the postgres server holding the pulp database. We have to restart the content service to get it working again.

Using python3-pulpcore-3.14.3-1.el8.noarch and python3-pulp-rpm-3.14.0-1.el8.noarch.


#10

Updated by evgeni about 3 years ago

Still an issue in 3.16.0.

I had opened #9515 with some more details before finding this one, and I can't close it as a duplicate :(


#11

Updated by mdellweg about 3 years ago

Has duplicate Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 added


#12

Updated by evgeni about 3 years ago

Let me briefly recap a few thoughts I had in #9515 (and later in a discussion with the Katello Platform team):

Reproducer

The most trivial reproducer I could come up with is a restart of the PostgreSQL database and then trying to access the content index at /pulp/content. Under normal circumstances, this will list content, but when the DB connection is broken, it yields a 500 error.
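To make the check repeatable, a small probe along these lines could be run before and after restarting PostgreSQL. This is only a sketch: content_index_status is a hypothetical helper, and the base URL you pass in depends on your deployment. Only the /pulp/content path comes from the description above.

```python
import urllib.error
import urllib.request


def content_index_status(base_url, timeout=10):
    """Return the HTTP status code of the content index at /pulp/content."""
    url = base_url.rstrip("/") + "/pulp/content"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return exc.code
```

Calling content_index_status with the address of the content app should return 200 under normal circumstances and 500 once the DB connection is broken.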

Severity

This issue is currently marked as "medium", but I would propose to raise it to at least High (maybe even Urgent).

Connection drops between the content app and the DB can happen for a multitude of reasons:

Network issues when the DB is externally hosted
Firewalls disliking long-running connections when the DB is externally hosted
Any kind of maintenance done to the DB (update, config change, you name it)

All of them can "just happen" and all of them result in a broken content app, where the user might not be directly aware of the correlation (as they usually expect the app to "just reconnect").

Workers? API?

The issue seems only to affect the content app.

Workers "just die" when the connection to the DB drops, and then systemd restarts them (we deploy them with Restart=always, and so do you).

The API seems to recover from the disconnect just fine without restarts (or any log messages that I've seen).

Workarounds

reset DB connection

For me, it was sufficient to add a call to Handler._reset_db_connection in the get_status_blocking method, which is called as part of the heartbeat. But given the origin of the issue, I am sure this is just papering over the real issue.

die like workers

There is probably a way to make the process die, like it happens to the workers, instead of hanging there, broken. This would allow systemd to restart it.
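One way this might look, as a sketch under assumptions: heartbeat_or_die and write_heartbeat are hypothetical names, and ConnectionClosedError stands in for the InterfaceError/OperationalError that Django raises. In the real content app the exit would also have to propagate out of the event loop, which this sketch does not cover.

```python
import sys


class ConnectionClosedError(Exception):
    """Stand-in for django.db InterfaceError / OperationalError (assumption)."""


def heartbeat_or_die(write_heartbeat, exit_code=1):
    """Run one heartbeat; exit the process if the database is unreachable.

    `write_heartbeat` is a hypothetical callable that touches the database.
    With Restart=always in the systemd unit, exiting hands recovery to systemd.
    """
    try:
        write_heartbeat()
    except ConnectionClosedError as exc:
        print(f"database connection lost, exiting: {exc}", file=sys.stderr)
        sys.exit(exit_code)
```

The non-zero exit code is a deliberate choice: it lets systemd record the stop as a failure instead of a clean shutdown.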


#13

Updated by ttereshc about 3 years ago

Priority changed from Normal to High
Sprint set to Sprint 107
Tags Katello added


#14

Updated by dkliban@redhat.com about 3 years ago

In pulpcore 3.15 and 3.16 it is possible to check for this error in two places: in the authentication middleware, which runs for every request, and in the heartbeat code.

In pulpcore 3.14 we did not have this middleware. However, I propose adding a middleware just for checking the db connection. We would also want to check the db connection status in the heartbeat.
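A dedicated middleware could be shaped roughly like this. It is a sketch, not the proposed patch: the (request, handler) signature follows the shape of an aiohttp middleware, but aiohttp is not imported here, and make_db_check_middleware, check_db and reset_db are hypothetical names. ConnectionClosedError stands in for the InterfaceError/OperationalError that Django raises, and reset_db plays the role of Handler._reset_db_connection.

```python
import asyncio


class ConnectionClosedError(Exception):
    """Stand-in for django.db InterfaceError / OperationalError (assumption)."""


def make_db_check_middleware(check_db, reset_db):
    """Build a middleware that verifies the db connection before each request.

    `check_db` is a hypothetical callable that performs a cheap query;
    `reset_db` is a hypothetical callable that resets the connection.
    """
    async def middleware(request, handler):
        try:
            check_db()
        except ConnectionClosedError:
            # Connection was closed behind our back: reset it and check once
            # more. A second failure propagates and the request fails.
            reset_db()
            check_db()
        return await handler(request)

    return middleware
```

A check like this only covers the start of a request. If the connection drops later while the request is being handled, that request still fails, and the next one recovers.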


#15

Updated by rchan about 3 years ago

Sprint changed from Sprint 107 to Sprint 108


#17

Updated by dkliban@redhat.com about 3 years ago

Status changed from NEW to ASSIGNED
Assignee set to dkliban@redhat.com


#18

Updated by pulpbot about 3 years ago

Status changed from ASSIGNED to POST

PR: https://github.com/pulp/pulpcore/pull/1698


#19

Updated by ttereshc about 3 years ago

Has duplicate deleted (Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7)


#20

Updated by ttereshc about 3 years ago

Related to Issue #9515: content app doesn't survive PostgreSQL disconnect in pulpcore 3.14.7 added


#22

Updated by ttereshc about 3 years ago

Subject changed from Content app can have unusable/closed db connections to Content app can have unusable/closed db connections in pulpcore 3.15/3.16


#23

Updated by rchan about 3 years ago

Sprint changed from Sprint 108 to Sprint 109

Added by dkliban@redhat.com about 3 years ago

Revision 3faa649d

Handles closed db connections

When the authentication middleware was added in pulpcore 3.15, it became the first place in the content app that made an attempt to use the database. As a result, it is a convenient place to handle InterfaceError and OperationalError, which are raised when the database connection has been closed. When this occurs, Handler._reset_db_connection() is called to clean up the database connection, and the middleware tries to use the database again.

If the database connection is closed later in the handling of the request by the content app, the user will still get a 500 error. However, the next request will be handled properly.

This patch also adds a call to Handler._reset_db_connection() inside the heartbeat method.

fixes: #9276 https://pulp.plan.io/issues/9276


#24

Updated by dkliban@redhat.com about 3 years ago

Status changed from POST to MODIFIED

Applied in changeset pulpcore|3faa649ddb0737c23d1e309a8c38ecb41804cebe.


#25

Updated by dkliban@redhat.com almost 3 years ago

Copied to Backport #9598: Backport #9276 to 3.16: Content app can have unusable/closed db connections in pulpcore 3.15/3.16 added


#26

Updated by pulpbot almost 3 years ago

Status changed from MODIFIED to CLOSED - CURRENTRELEASE
