Project

Profile

Help

Issue #4060

closed

QueryExistingArtifacts stage does not prevent duplicates within a stream

Added by amacdona@redhat.com about 6 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Sprint 45
Quarter:

Description

Note: This problem occurs because content_2 (below) either is passed by the QueryExisting stage twice. This problem will occur in the save stage. Either way, 2 or more dupes are in the Queue for the save stage at the same time. Whether the dupes are saved in separate batches or together in 1 batch, we will fail with an integrity error.

I've encountered this while implementing Docker sync, so there isn't a simple reproducer yet. This issue occurs when a sync has multiple metadata files that refer to the same content, which has the same artifact.

metadata_1.artifacts = [content_1, content_2]
metadata_2.artifacts = [content_2, content_3]

content_2 and its artifact are able to pass through the pipeline twice resulting in a duplicate key.

django.db.utils.IntegrityError: duplicate key value violates unique constraint "pulp_app_artifact_sha256_key"

The actual error occurs in the ArtifactSaver stage during a bulk save, but AFAICT this stage needs to rely on unique units to allow bulk_save.

Full traceback:

Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/worker.py", line 793, in perform_job
    rv = job.perform()
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 599, in perform
    self._result = self._execute()
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/vagrant/devel/pulp_docker/pulp_docker/app/tasks/synchronizing.py", line 44, in synchronize
    DeclarativeVersion(first_stage, repository).create()
  File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/declarative_version.py", line 125, in create
    loop.run_until_complete(pipeline)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/api.py", line 128, in create_pipeline
    await asyncio.gather(*futures)
  File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/artifact_stages.py", line 323, in __call__
    Artifact.objects.bulk_create(artifacts_to_save)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/query.py", line 465, in bulk_create
    ids = self._batched_insert(objs_without_pk, fields, batch_size)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/query.py", line 1149, in _batched_insert
    inserted_id = self._insert(item, fields=fields, using=self.db, return_id=True)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/query.py", line 1136, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/sql/compiler.py", line 1289, in execute_sql
    cursor.execute(sql, params)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 100, in execute
    return super().execute(sql, params)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 68, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pulp_app_artifact_sha256_key"
DETAIL:  Key (sha256)=(8ddc19f16526912237dd8af81971d5e4dd0587907234be2b83e249518d5b673f) already exists.

Related issues

Related to Pulp - Issue #4085: ContentUnitSaver stage is vulnerable to race conditions.CLOSED - CURRENTRELEASEdkliban@redhat.comActions
Related to Container Support - Refactor #4177: Update sync to use ArtifactSaver StageCLOSED - CURRENTRELEASEipanova@redhat.com

Actions
Has duplicate Pulp - Issue #4086: ArtifactSaver stage is vulnerable to race conditions.CLOSED - DUPLICATEActions
Actions #1

Updated by CodeHeeler about 6 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 43
Actions #2

Updated by dkliban@redhat.com about 6 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
Actions #3

Updated by rchan about 6 years ago

  • Sprint changed from Sprint 43 to Sprint 44

Added by dkliban@redhat.com about 6 years ago

Revision ac191479 | View on GitHub

Problem: ArtifactSaver stage doesn't handle IntegrityErrors

Solution: Use a QueryAndSaveArtifacts stage instead of ArtifactSaver

This patch introduces a new stage that first looks for existing artifact before trying to save any artifacts.

re: #4060 https://pulp.plan.io/issues/4060

Added by dkliban@redhat.com about 6 years ago

Revision ac191479 | View on GitHub

Problem: ArtifactSaver stage doesn't handle IntegrityErrors

Solution: Use a QueryAndSaveArtifacts stage instead of ArtifactSaver

This patch introduces a new stage that first looks for existing artifact before trying to save any artifacts.

re: #4060 https://pulp.plan.io/issues/4060

Added by dkliban@redhat.com about 6 years ago

Revision ac191479 | View on GitHub

Problem: ArtifactSaver stage doesn't handle IntegrityErrors

Solution: Use a QueryAndSaveArtifacts stage instead of ArtifactSaver

This patch introduces a new stage that first looks for existing artifact before trying to save any artifacts.

re: #4060 https://pulp.plan.io/issues/4060

Added by dkliban@redhat.com about 6 years ago

Revision ac191479 | View on GitHub

Problem: ArtifactSaver stage doesn't handle IntegrityErrors

Solution: Use a QueryAndSaveArtifacts stage instead of ArtifactSaver

This patch introduces a new stage that first looks for existing artifact before trying to save any artifacts.

re: #4060 https://pulp.plan.io/issues/4060

Actions #4

Updated by amacdona@redhat.com about 6 years ago

  • Subject changed from QueryExistingArtifacts stage does not prevent duplicates of new Artifacts to QueryExistingArtifacts stage does not prevent duplicates within a stream
  • Description updated (diff)
Actions #5

Updated by amacdona@redhat.com about 6 years ago

  • Related to Issue #4085: ContentUnitSaver stage is vulnerable to race conditions. added
Actions #6

Updated by jortel@redhat.com about 6 years ago

There are several models that are created using bulk_create() where duplicates should not result in failure to create the batch. The use cases for these models is as follows:

1. The batch contains duplicates.
2. Race condition whereby separate workers are attempting to create the same thing.

In both cases, the resulting IntegretyError is logged and/or reported to the user and the entire operation fails. The desired result is for the bulk_create to ignore the failure and create the batch.

Due to the race condition in (2), the proposed solution is to fall back to using Model.save() in a nested transaction committing the entire batch in a single overall transaction. When model.save() raises an IntegrityError, it would be fetched using the natural key and yielded instead. The bulk_create() method can be overridden in a Manager subclass for provide this behavior. This manager may be assigned to affected models.

Basic algorithm (transactions omitted for clarity but they will be critical).

1. return super().bulk_create().
2. catch IntegrityError
3. foreach model in the batch: save() and yield.
4. catch IntegrityError
5. get() the model from the DB and yield instead.

For a general solution, it would help if the pulp models would include a property or method that returned a models.Q object used to query the DB (step:5). Perhaps: getq() or just a property named q(). This would also simplify the query stages.

Identified Models:

  • Content
  • Artifact
  • ContentArtifact
  • RemoteArtifact

There is code in the pipeline that relies on bulk_create() yielding the same model instances as passed in. This needs to be accounted for. Also, I believe there is work related to flattening the Content tables that has the same expectation.

Actions #7

Updated by amacdona@redhat.com about 6 years ago

The pattern above preserves the queue order, so that won't be a problem for the docker use case.

Just to outline the problem, I have a use case (docker sync) that needs the queue order to stay the same. The reason for this is that Docker has many-to-many relationships between content units. Currently, the implementation is to put this relationship in DeclarativeContent.extra_data, and the many-to-many is saved after the stage that saves the units. However, once Docker can use the new implementation of this issue, the many-to-many's can be handled in the QueryAndSave "post_save" hook. Even in this case, it is necessary to preserve the order.

In case that's not clear, here's an example.
2 Content units, A and B, that are related.
In a previous stage, A is created (but not saved), and creates B as a side effect. A.extra_data stores that it will eventually have a relationship with B. The output order for this stage is (B, A).
In the relationship stage (or this stage), B must come through first to guarantee that B has been saved before we try to create a relationship (which will happen immediately after saving A).

Note: The only core stage that is not "Q order safe" is the ArtifactDownload stage. (However in Docker, this is still safe because the "fast lane" preserves order of units in the fast lane, and that's enough in our case.)

Actions #8

Updated by bmbouter about 6 years ago

@jortel this plan sounds pretty good. The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced. Our usage of data attached to models as assigned attributes is not a good practice, but I remember we were doing it in a few places in the stages API to handle bulk_create with foreign keys between the models. Do we do this currently and is this done on the single content branch dalley is working on?

Also +1 to making a property that returns a Q object for that instance. Can we call it q_obj so it would be like: Artifact.q_obj Also can we add it into Artifact, Content, RemoteArtifact, and ContentArtifact using a mixin class that defines q_obj or $otheridea?

Actions #9

Updated by jortel@redhat.com about 6 years ago

bmbouter wrote:

@jortel this plan sounds pretty good.

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced. Our usage of data attached to models as assigned attributes is not a good practice, but I remember we were doing it in a few places in the stages API to handle bulk_create with foreign keys between the models. Do we do this currently and is this done on the single content branch dalley is working on?

Yes, I noted this in my comment as well.

Also +1 to making a property that returns a Q object for that instance. Can we call it q_obj so it would be like: Artifact.q_obj

A model.Q is an object. What is the purpose for the _obj suffix?

Also can we add it into Artifact, Content, RemoteArtifact, and ContentArtifact using a mixin class that defines q_obj or $otheridea?

Actions #10

Updated by bmbouter about 6 years ago

wrote:

bmbouter wrote:

@jortel this plan sounds pretty good.

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced. Our usage of data attached to models as assigned attributes is not a good practice, but I remember we were doing it in a few places in the stages API to handle bulk_create with foreign keys between the models. Do we do this currently and is this done on the single content branch dalley is working on?

Yes, I noted this in my comment as well.

Oh right I do see that mentioned there. So what's the plan for those code areas? If we could identify them we could see if another implementation of that is feasible. In fact that could be done first to undo the reliance on that need as it's own thing instead of at the same time as this piece of work.

Also +1 to making a property that returns a Q object for that instance. Can we call it q_obj so it would be like: Artifact.q_obj

A model.Q is an object. What is the purpose for the _obj suffix?

Django's documentation writes the Model.Q as "the Q object" https://docs.djangoproject.com/en/2.1/topics/db/queries/#complex-lookups-with-q-objects so this piggybacks off that name. My concern with just q or Q is that since the stages also use queues in lots of places I was hoping to avoid something that could be interpreted as related to a Queue object from asyncio.

Also can we add it into Artifact, Content, RemoteArtifact, and ContentArtifact using a mixin class that defines q_obj or $otheridea?

Actions #11

Updated by amacdona@redhat.com about 6 years ago

bmbouter wrote:

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced.

I'm don't think this is a problem.

In rough psuedocode:

dc = DeclarativeContent()
If dc.content is not unique:
    dc.content = SomeContent.objects.get(pk=dc.content.pk)

Therefore, the dc data is preserved. If 2 dupes go through a single pipeline, each is contained by it's own dc.

Actions #12

Updated by bmbouter about 6 years ago

wrote:

bmbouter wrote:

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced.

I'm don't think this is a problem.

I'm not explaining myself well. I don't mean the extra_data attribute on the DeclarativeContent unit, I mean an extra attribute assigned to the model itself. Something like:

content = dc.content
content.new_attribute = 'i need this later'
for content in MyContent.bulk_create(new_attribute)
    print(content.new_attribute)  # Does this raise an AttributeError?

In rough psuedocode:

[...]

Therefore, the dc data is preserved. If 2 dupes go through a single pipeline, each is contained by it's own dc.

Actions #13

Updated by bmbouter about 6 years ago

During IRC convo we confirmed that currently in core, and in @dalley's upcoming single-content branch PR there is no reliance on this 'attribute attachment' implementation so I think core is good w/ the change.

I did have an idea to share, which is that instead of replacing the in-memory object we could take the db attributes we just fetched and assign them to the in-memory one and continue to use that.

I'll point out that instance attachment approach works with bulk_create so if we're replacing bulk_create on some of our models with our implementation, it would be good if we could not deviate from django's behavior.

Actions #14

Updated by daviddavis about 6 years ago

What's the benefit to overriding bulk_create instead of just defining a new method?

Actions #15

Updated by daviddavis about 6 years ago

I reproduced this with pulp_file. Here were the steps:

mkdir test && cd test
echo "test" > test1
echo "test" > test2
for file in `ls test*`; do echo $file,`sha256sum $file | awk '{ print $1 }'`,`stat -L -c '%s' $file`; done > PULP_MANIFEST
http :8000/pulp/api/v3/repositories/ name=test
http :8000/pulp/api/v3/remotes/file/ name=test url="file://`pwd`/PULP_MANIFEST"
http :8000/pulp/api/v3/remotes/file/1/sync/ repository=/pulp/api/v3/repositories/1/
Actions #16

Updated by rchan about 6 years ago

  • Sprint changed from Sprint 44 to Sprint 45
Actions #17

Updated by dkliban@redhat.com about 6 years ago

daviddavis wrote:

What's the benefit to overriding bulk_create instead of just defining a new method?

Since there is always a race condition when multiple workers are syncing content, it seems like the plugin writer would always want to prevent an IntegrityError from failing the sync. Providing only one method for bulk creating would prevent the plugin writer from using the wrong method that doesn't account for the IntegrityError.

Added by dkliban@redhat.com about 6 years ago

Revision 676a3740 | View on GitHub

Problem: bulk_create can fail with IntegrityError

Solution: add a manager that provides a bulk_get_or_create() method

This patch introduces a BulkCreateManager that provides the bulk_get_or_create method. This method handles IntegrityErrors encountered during bulk_create() by inserting each object into the database serially. When an IntegrityError or ValueError is encountered during serial saving of objects, the object being saved is replaced with an instance from the database.

This patch introduces a mixin used by Artifact, Content, ContentArtifact, and RemoteArtifact models. The mixin provides the q() method which returns a Q object that can be used to retreive the database instance of the model.

closes #4060 https://pulp.plan.io/issues/4060

Added by dkliban@redhat.com about 6 years ago

Revision 676a3740 | View on GitHub

Problem: bulk_create can fail with IntegrityError

Solution: add a manager that provides a bulk_get_or_create() method

This patch introduces a BulkCreateManager that provides the bulk_get_or_create method. This method handles IntegrityErrors encountered during bulk_create() by inserting each object into the database serially. When an IntegrityError or ValueError is encountered during serial saving of objects, the object being saved is replaced with an instance from the database.

This patch introduces a mixin used by Artifact, Content, ContentArtifact, and RemoteArtifact models. The mixin provides the q() method which returns a Q object that can be used to retreive the database instance of the model.

closes #4060 https://pulp.plan.io/issues/4060

Actions #18

Updated by dkliban@redhat.com about 6 years ago

  • Status changed from ASSIGNED to MODIFIED
Actions #19

Updated by ttereshc about 6 years ago

  • Tags Pulp 3 added
Actions #21

Updated by amacdona@redhat.com about 6 years ago

  • Related to Refactor #4177: Update sync to use ArtifactSaver Stage added
Actions #22

Updated by amacdona@redhat.com about 6 years ago

  • Has duplicate Issue #4086: ArtifactSaver stage is vulnerable to race conditions. added
Actions #23

Updated by daviddavis over 5 years ago

  • Sprint/Milestone set to 3.0.0
Actions #24

Updated by bmbouter over 5 years ago

  • Tags deleted (Pulp 3)
Actions #25

Updated by bmbouter about 5 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF