Project

Profile

Help

Issue #7073

import test fails with psycopg2.extensions.TransactionRollbackError: deadlock detected

Added by dkliban@redhat.com about 1 month ago. Updated about 1 month ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
No
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Sprint 76

Description

The test_import test is failing intermittently on Travis:

=================================== FAILURES ===================================

________________________ PulpImportTestCase.test_import ________________________

self = 

    def test_import(self):

        """Test an import."""

        importer = self._create_importer()

>       task_group = self._perform_import(importer)

pulpcore/tests/functional/api/using_plugin/test_pulpimport.py:194: 

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

pulpcore/tests/functional/api/using_plugin/test_pulpimport.py:167: in _perform_import

    task_group = monitor_task_group(task_group_href)

pulpcore/tests/functional/utils.py:54: in monitor_task_group

    exit()

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Use exit() or Ctrl-D (i.e. EOF) to exit, code = None

    def __call__(self, code=None):

        # Shells like IDLE catch the SystemExit, but listen when their

        # stdin wrapper is closed.

        try:

            sys.stdin.close()

        except:

            pass

>       raise SystemExit(code)

E       SystemExit: None

../../../virtualenv/python3.7.6/lib/python3.7/site.py:403: SystemExit

----------------------------- Captured stdout call -----------------------------

The task group failed: {'all_tasks_dispatched': True,

 'canceled': 0,

 'completed': 1,

 'description': 'Import of '

                '/tmp/0cf747ab-d5b3-454a-88f0-7aedf5fad6f3/export-5a84c6c2-4905-4400-b2ea-f48e8c603e8b-20200701_1100.tar.gz',

 'failed': 1,

 'group_progress_reports': [],

 'pulp_href': '/pulp/api/v3/task-groups/a8fc525b-eb7d-442d-a732-694648485198/',

 'running': 0,

 'skipped': 0,

 'waiting': 0}

The following logs appear.

pulp: rq.worker:ERROR: django.db.utils.OperationalError: deadlock detected

DETAIL:  Process 4137 waits for ShareLock on transaction 6954; blocked by process 4128.

Process 4128 waits for ShareLock on transaction 6951; blocked by process 4137.

HINT:  See server log for query details.

CONTEXT:  while updating tuple (0,29) in relation "core_content"

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

psycopg2.extensions.TransactionRollbackError: deadlock detected

DETAIL:  Process 4137 waits for ShareLock on transaction 6954; blocked by process 4128.

Process 4128 waits for ShareLock on transaction 6951; blocked by process 4137.

HINT:  See server log for query details.

CONTEXT:  while updating tuple (0,29) in relation "core_content"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 883, in perform_job

    rv = job.perform()

  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 657, in perform

    self._result = self._execute()

  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 663, in _execute

    return self.func(*self.args, **self.kwargs)

  File "/usr/local/lib/python3.7/site-packages/pulpcore/app/tasks/importer.py", line 126, in import_repository_version

    c_result = _import_file(content_path, ContentResource)

  File "/usr/local/lib/python3.7/site-packages/pulpcore/app/tasks/importer.py", line 47, in _import_file

    return resource.import_data(data, raise_errors=True)

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 627, in import_data

    return self.import_data_inner(dataset, dry_run, raise_errors, using_transactions, collect_failed_rows, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 673, in import_data_inner

    raise row_result.errors[-1].error

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 569, in import_row

    self.save_instance(instance, using_transactions, dry_run)

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 352, in save_instance

    instance.save()

  File "/usr/local/lib/python3.7/site-packages/pulpcore/app/models/base.py", line 110, in save

    return super().save(*args, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 741, in save

    force_update=force_update, update_fields=update_fields)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 779, in save_base

    force_update, using, update_fields,

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 851, in _save_table

    forced_update)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 900, in _do_update

    return filtered._update(values) > 0

  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 760, in _update

    return query.get_compiler(self.db).execute_sql(CURSOR)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1469, in execute_sql

    cursor = super().execute_sql(result_type)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql

    cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 67, in execute

    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers

    return executor(sql, params, many, context)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__

    raise dj_exc_value.with_traceback(traceback) from exc_value

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

django.db.utils.OperationalError: deadlock detected

DETAIL:  Process 4137 waits for ShareLock on transaction 6954; blocked by process 4128.

Process 4128 waits for ShareLock on transaction 6951; blocked by process 4137.

HINT:  See server log for query details.

CONTEXT:  while updating tuple (0,29) in relation "core_content"

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

psycopg2.extensions.TransactionRollbackError: deadlock detected

DETAIL:  Process 4137 waits for ShareLock on transaction 6954; blocked by process 4128.

Process 4128 waits for ShareLock on transaction 6951; blocked by process 4137.

HINT:  See server log for query details.

CONTEXT:  while updating tuple (0,29) in relation "core_content"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 883, in perform_job

    rv = job.perform()

  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 657, in perform

    self._result = self._execute()

  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 663, in _execute

    return self.func(*self.args, **self.kwargs)

  File "/usr/local/lib/python3.7/site-packages/pulpcore/app/tasks/importer.py", line 126, in import_repository_version

    c_result = _import_file(content_path, ContentResource)

  File "/usr/local/lib/python3.7/site-packages/pulpcore/app/tasks/importer.py", line 47, in _import_file

    return resource.import_data(data, raise_errors=True)

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 627, in import_data

    return self.import_data_inner(dataset, dry_run, raise_errors, using_transactions, collect_failed_rows, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 673, in import_data_inner

    raise row_result.errors[-1].error

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 569, in import_row

    self.save_instance(instance, using_transactions, dry_run)

  File "/usr/local/lib/python3.7/site-packages/import_export/resources.py", line 352, in save_instance

    instance.save()

  File "/usr/local/lib/python3.7/site-packages/pulpcore/app/models/base.py", line 110, in save

    return super().save(*args, **kwargs)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 741, in save

    force_update=force_update, update_fields=update_fields)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 779, in save_base

    force_update, using, update_fields,

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 851, in _save_table

    forced_update)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 900, in _do_update

    return filtered._update(values) > 0

  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 760, in _update

    return query.get_compiler(self.db).execute_sql(CURSOR)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1469, in execute_sql

    cursor = super().execute_sql(result_type)

  File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql

    cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 67, in execute

    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers

    return executor(sql, params, many, context)

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

  File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__

    raise dj_exc_value.with_traceback(traceback) from exc_value

  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute

    return self.cursor.execute(sql, params)

django.db.utils.OperationalError: deadlock detected

DETAIL:  Process 4137 waits for ShareLock on transaction 6954; blocked by process 4128.

Process 4128 waits for ShareLock on transaction 6951; blocked by process 4137.

HINT:  See server log for query details.

CONTEXT:  while updating tuple (0,29) in relation "core_content"

There is a theory behind such a deadlock in this stack overflow post: https://stackoverflow.com/a/42731706

Associated revisions

Revision 94cf0081 View on GitHub
Added by ggainey about 1 month ago

Prevents deadlock by locking core_content across update.

THIS IS A TEMPORARY FIX to prevent test_pulpimport failures from blocking unrelated commits. See the related issue for other approaches.

Success can be tested in a dev-env with the following bash cmd:

for i in {1..10}; do (pytest -v -r sx --color=yes --pyargs
pulpcore.tests.functional.api.using_plugin.test_pulpimport > $i.out &) done

Without this fix, each run sees 1-4 tests fail due to deadlock. With the fix, repeated executions have shown no failures.

fixes #7073

History

#1 Updated by dkliban@redhat.com about 1 month ago

  • Description updated (diff)

#2 Updated by daviddavis about 1 month ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#3 Updated by daviddavis about 1 month ago

  • Sprint set to Sprint 76

#4 Updated by pulpbot about 1 month ago

  • Status changed from ASSIGNED to POST

#5 Updated by ggainey about 1 month ago

I forced a deadlock by running test_pulpimport in parallel :

for i in {1..10}; do (pytest -v -r sx --color=yes --pyargs pulpcore.tests.functional.api.using_plugin.test_pulpimport > $i.out &); done

One test failed of the ten. If I can make it happen reliably, I'll be able to test daviddavis' proposed PR and see how it affects the problem.

#6 Updated by ggainey about 1 month ago

Transaction-isolation had no impact, lots of deadlocks.

The code in https://github.com/pulp/pulpcore/pull/745 does NOT deadlock - most likely because that change no longer explicitly import/exports core_content, which is the problem here. Investigation continues.

#7 Updated by ggainey about 1 month ago

  • Status changed from POST to ASSIGNED
  • Assignee changed from daviddavis to ggainey

Moved back to assigned since the linked PR didn't fix the prob

#8 Updated by ggainey about 1 month ago

The following diff fixes the deadlock by locking the entire table before the offending _import_file - but it is definitely a (horribly-nonperformant) "big hammer" approach. This could be generalized and moved inside _import_file() - but locking every table we touch at import would lose us most of the bandwidth we're getting by parallelizing repo-import in the first place. This ijust a step in the right direction:

diff --git a/pulpcore/app/tasks/importer.py b/pulpcore/app/tasks/importer.py
index a8c822b05..aa711af53 100644
--- a/pulpcore/app/tasks/importer.py
+++ b/pulpcore/app/tasks/importer.py
@@ -123,7 +123,14 @@ def import_repository_version(destination_repo_pk, source_repo_pk, tar_path):
 
         # Untyped Content
         content_path = os.path.join(rv_path, CONTENT_FILE)
-        c_result = _import_file(content_path, ContentResource)
+        from django.db import (
+            connection,
+            transaction,
+        )
+        with transaction.atomic():
+            cursor = connection.cursor()
+            cursor.execute('LOCK TABLE %s IN %s MODE' % (Content._meta.db_table, "EXCLUSIVE"))
+            c_result = _import_file(content_path, ContentResource)
         content = Content.objects.filter(pk__in=[r.object_id for r in c_result.rows])
 
         # Content Artifacts

#9 Updated by ggainey about 1 month ago

Another option is to insure ordering on Content at export-time, by guaranteeing order at the Model level:

(master) ~/github/Pulp3/pulpcore $ git diff
diff --git a/pulpcore/app/models/content.py b/pulpcore/app/models/content.py
index 3f8d59890..e7eecfe24 100644
--- a/pulpcore/app/models/content.py
+++ b/pulpcore/app/models/content.py
@@ -260,6 +260,7 @@ class Content(MasterModel, QueryMixin):
     class Meta:
         verbose_name_plural = "content"
         unique_together = ()
+        ordering = ('pulp_id',)
 
     @classmethod
     def natural_key_fields(cls):

diff --git a/pulpcore/plugin/repo_version_utils.py b/pulpcore/plugin/repo_version_utils.py
index 70d7eb39c..e9052eaaa 100644
--- a/pulpcore/plugin/repo_version_utils.py
+++ b/pulpcore/plugin/repo_version_utils.py
@@ -93,6 +93,7 @@ def validate_duplicate_content(version):
         ).count()
         unique_new_content_total = (
             type_obj.objects.filter(pk__in=version.content.filter(pulp_type=pulp_type))
+            .order_by(*repo_key_fields)
             .distinct(*repo_key_fields)
             .count()
         )

That works to prevent the deadlock as well.

(Note the change to repo_version_utils is required, because distinct(keys) must match the ordering on the field. Since we've added a default ordering, which we're overriding, we must explicitly state the order_by() to avoid the dreaded error

ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

So - we now have three options:

  1. Explicitly lock the table before operating on it.
  2. Have an explicit default-order, to avoid the arbitrary-order-on-select postgres behavior that is the root of this issue.
  3. Stop working with that table

That last just prevents this specific deadlock. Of the three, having an explicit default-ordering for every Model is the safest thing to do for the health of Pulp3 over time. It comes, of course, with a performance price. The deadlocks it will prevent are "multiple threads accessing multiple intersecting Model-instances in the same Model-class, while attempting to update the Models". Without explicitly ordering such requests, this is just a deadly embrace waiting to happen. And the more multi-user, and busier, a Pulp instance is, the more likely we are to eventually start seeing this become a problem.

#10 Updated by pulpbot about 1 month ago

  • Status changed from ASSIGNED to POST

#11 Updated by ggainey about 1 month ago

  • Status changed from POST to MODIFIED

#12 Updated by fao89 about 1 month ago

  • Sprint/Milestone set to 3.5.0

#13 Updated by pulpbot about 1 month ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Please register to edit this issue

Also available in: Atom PDF