Project

Profile

Help

Issue #4060

QueryExistingArtifacts stage does not prevent duplicates within a stream

Added by amacdona@redhat.com about 1 year ago. Updated 6 months ago.

Status:
MODIFIED
Priority:
Normal
Category:
-
Sprint/Milestone:
Start date:
Due date:
Severity:
2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible:
No
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:
Sprint 45

Description

Note: This problem occurs because content_2 (below) either is passed by the QueryExisting stage twice. This problem will occur in the save stage. Either way, 2 or more dupes are in the Queue for the save stage at the same time. Whether the dupes are saved in separate batches or together in 1 batch, we will fail with an integrity error.

I've encountered this while implementing Docker sync, so there isn't a simple reproducer yet. This issue occurs when a sync has multiple metadata files that refer to the same content, which has the same artifact.

metadata_1.artifacts = [content_1, content_2]
metadata_2.artifacts = [content_2, content_3]

content_2 and its artifact are able to pass through the pipeline twice resulting in a duplicate key.

django.db.utils.IntegrityError: duplicate key value violates unique constraint "pulp_app_artifact_sha256_key" 

The actual error occurs in the ArtifactSaver stage during a bulk save, but AFAICT this stage needs to rely on unique units to allow bulk_save.

Full traceback:

Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/worker.py", line 793, in perform_job
    rv = job.perform()
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 599, in perform
    self._result = self._execute()
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/vagrant/devel/pulp_docker/pulp_docker/app/tasks/synchronizing.py", line 44, in synchronize
    DeclarativeVersion(first_stage, repository).create()
  File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/declarative_version.py", line 125, in create
    loop.run_until_complete(pipeline)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/api.py", line 128, in create_pipeline
    await asyncio.gather(*futures)
  File "/home/vagrant/devel/pulp/plugin/pulpcore/plugin/stages/artifact_stages.py", line 323, in __call__
    Artifact.objects.bulk_create(artifacts_to_save)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/query.py", line 465, in bulk_create
    ids = self._batched_insert(objs_without_pk, fields, batch_size)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/query.py", line 1149, in _batched_insert
    inserted_id = self._insert(item, fields=fields, using=self.db, return_id=True)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/query.py", line 1136, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/models/sql/compiler.py", line 1289, in execute_sql
    cursor.execute(sql, params)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 100, in execute
    return super().execute(sql, params)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 68, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/vagrant/.virtualenvs/pulp/lib64/python3.6/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pulp_app_artifact_sha256_key" 
DETAIL:  Key (sha256)=(8ddc19f16526912237dd8af81971d5e4dd0587907234be2b83e249518d5b673f) already exists.


Related issues

Related to Pulp - Issue #4085: ContentUnitSaver stage is vulnerable to race conditions. MODIFIED Actions
Related to Docker Support - Refactor #4177: Update sync to use ArtifactSaver Stage MODIFIED Actions
Duplicated by Pulp - Issue #4086: ArtifactSaver stage is vulnerable to race conditions. CLOSED - DUPLICATE Actions

Associated revisions

Revision ac191479 View on GitHub
Added by dkliban@redhat.com about 1 year ago

Problem: ArtifactSaver stage doesn't handle IntegrityErrors

Solution: Use a QueryAndSaveArtifacts stage instead of ArtifactSaver

This patch introduces a new stage that first looks for existing artifact before trying to save any
artifacts.

re: #4060
https://pulp.plan.io/issues/4060

Revision ac191479 View on GitHub
Added by dkliban@redhat.com about 1 year ago

Problem: ArtifactSaver stage doesn't handle IntegrityErrors

Solution: Use a QueryAndSaveArtifacts stage instead of ArtifactSaver

This patch introduces a new stage that first looks for existing artifact before trying to save any
artifacts.

re: #4060
https://pulp.plan.io/issues/4060

Revision 676a3740 View on GitHub
Added by dkliban@redhat.com 11 months ago

Problem: bulk_create can fail with IntegrityError

Solution: add a manager that provides a bulk_get_or_create() method

This patch introduces a BulkCreateManager that provides the bulk_get_or_create method. This method
handles IntegrityErrors encountered during bulk_create() by inserting each object into the database
serially. When an IntegrityError or ValueError is encountered during serial saving of objects, the
object being saved is replaced with an instance from the database.

This patch introduces a mixin used by Artifact, Content, ContentArtifact, and RemoteArtifact models.
The mixin provides the q() method which returns a Q object that can be used to retreive the database
instance of the model.

closes #4060
https://pulp.plan.io/issues/4060

Revision 676a3740 View on GitHub
Added by dkliban@redhat.com 11 months ago

Problem: bulk_create can fail with IntegrityError

Solution: add a manager that provides a bulk_get_or_create() method

This patch introduces a BulkCreateManager that provides the bulk_get_or_create method. This method
handles IntegrityErrors encountered during bulk_create() by inserting each object into the database
serially. When an IntegrityError or ValueError is encountered during serial saving of objects, the
object being saved is replaced with an instance from the database.

This patch introduces a mixin used by Artifact, Content, ContentArtifact, and RemoteArtifact models.
The mixin provides the q() method which returns a Q object that can be used to retreive the database
instance of the model.

closes #4060
https://pulp.plan.io/issues/4060

Revision 676a3740 View on GitHub
Added by dkliban@redhat.com 11 months ago

Problem: bulk_create can fail with IntegrityError

Solution: add a manager that provides a bulk_get_or_create() method

This patch introduces a BulkCreateManager that provides the bulk_get_or_create method. This method
handles IntegrityErrors encountered during bulk_create() by inserting each object into the database
serially. When an IntegrityError or ValueError is encountered during serial saving of objects, the
object being saved is replaced with an instance from the database.

This patch introduces a mixin used by Artifact, Content, ContentArtifact, and RemoteArtifact models.
The mixin provides the q() method which returns a Q object that can be used to retreive the database
instance of the model.

closes #4060
https://pulp.plan.io/issues/4060

History

#1 Updated by CodeHeeler about 1 year ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 43

#2 Updated by dkliban@redhat.com about 1 year ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com

#3 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 43 to Sprint 44

#4 Updated by amacdona@redhat.com 12 months ago

  • Subject changed from QueryExistingArtifacts stage does not prevent duplicates of new Artifacts to QueryExistingArtifacts stage does not prevent duplicates within a stream
  • Description updated (diff)

#5 Updated by amacdona@redhat.com 12 months ago

  • Related to Issue #4085: ContentUnitSaver stage is vulnerable to race conditions. added

#6 Updated by jortel@redhat.com 12 months ago

There are several models that are created using bulk_create() where duplicates should not result in failure to create the batch. The use cases for these models is as follows:

1. The batch contains duplicates.
2. Race condition whereby separate workers are attempting to create the same thing.

In both cases, the resulting IntegretyError is logged and/or reported to the user and the entire operation fails. The desired result is for the bulk_create to ignore the failure and create the batch.

Due to the race condition in (2), the proposed solution is to fall back to using Model.save() in a nested transaction committing the entire batch in a single overall transaction. When model.save() raises an IntegrityError, it would be fetched using the natural key and yielded instead. The bulk_create() method can be overridden in a Manager subclass for provide this behavior. This manager may be assigned to affected models.

Basic algorithm (transactions omitted for clarity but they will be critical).

1. return super().bulk_create().
2. catch IntegrityError
3. foreach model in the batch: save() and yield.
4. catch IntegrityError
5. get() the model from the DB and yield instead.

For a general solution, it would help if the pulp models would include a property or method that returned a models.Q object used to query the DB (step:5). Perhaps: getq() or just a property named q(). This would also simplify the query stages.

Identified Models:

  • Content
  • Artifact
  • ContentArtifact
  • RemoteArtifact

There is code in the pipeline that relies on bulk_create() yielding the same model instances as passed in. This needs to be accounted for. Also, I believe there is work related to flattening the Content tables that has the same expectation.

#7 Updated by amacdona@redhat.com 12 months ago

The pattern above preserves the queue order, so that won't be a problem for the docker use case.

Just to outline the problem, I have a use case (docker sync) that needs the queue order to stay the same. The reason for this is that Docker has many-to-many relationships between content units. Currently, the implementation is to put this relationship in DeclarativeContent.extra_data, and the many-to-many is saved after the stage that saves the units. However, once Docker can use the new implementation of this issue, the many-to-many's can be handled in the QueryAndSave "post_save" hook. Even in this case, it is necessary to preserve the order.

In case that's not clear, here's an example.
2 Content units, A and B, that are related.
In a previous stage, A is created (but not saved), and creates B as a side effect. A.extra_data stores that it will eventually have a relationship with B. The output order for this stage is (B, A).
In the relationship stage (or this stage), B must come through first to guarantee that B has been saved before we try to create a relationship (which will happen immediately after saving A).

Note: The only core stage that is not "Q order safe" is the ArtifactDownload stage. (However in Docker, this is still safe because the "fast lane" preserves order of units in the fast lane, and that's enough in our case.)

#8 Updated by bmbouter 12 months ago

@jortel this plan sounds pretty good. The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced. Our usage of data attached to models as assigned attributes is not a good practice, but I remember we were doing it in a few places in the stages API to handle bulk_create with foreign keys between the models. Do we do this currently and is this done on the single content branch @dalley is working on?

Also +1 to making a property that returns a Q object for that instance. Can we call it q_obj so it would be like: Artifact.q_obj Also can we add it into Artifact, Content, RemoteArtifact, and ContentArtifact using a mixin class that defines q_obj or $otheridea?

#9 Updated by jortel@redhat.com 12 months ago

bmbouter wrote:

@jortel this plan sounds pretty good.

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced. Our usage of data attached to models as assigned attributes is not a good practice, but I remember we were doing it in a few places in the stages API to handle bulk_create with foreign keys between the models. Do we do this currently and is this done on the single content branch @dalley is working on?

Yes, I noted this in my comment as well.

Also +1 to making a property that returns a Q object for that instance. Can we call it q_obj so it would be like: Artifact.q_obj

A model.Q is an object. What is the purpose for the _obj suffix?

Also can we add it into Artifact, Content, RemoteArtifact, and ContentArtifact using a mixin class that defines q_obj or $otheridea?

#10 Updated by bmbouter 12 months ago

wrote:

bmbouter wrote:

@jortel this plan sounds pretty good.

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced. Our usage of data attached to models as assigned attributes is not a good practice, but I remember we were doing it in a few places in the stages API to handle bulk_create with foreign keys between the models. Do we do this currently and is this done on the single content branch @dalley is working on?

Yes, I noted this in my comment as well.

Oh right I do see that mentioned there. So what's the plan for those code areas? If we could identify them we could see if another implementation of that is feasible. In fact that could be done first to undo the reliance on that need as it's own thing instead of at the same time as this piece of work.

Also +1 to making a property that returns a Q object for that instance. Can we call it q_obj so it would be like: Artifact.q_obj

A model.Q is an object. What is the purpose for the _obj suffix?

Django's documentation writes the Model.Q as "the Q object" https://docs.djangoproject.com/en/2.1/topics/db/queries/#complex-lookups-with-q-objects so this piggybacks off that name. My concern with just q or Q is that since the stages also use queues in lots of places I was hoping to avoid something that could be interpreted as related to a Queue object from asyncio.

Also can we add it into Artifact, Content, RemoteArtifact, and ContentArtifact using a mixin class that defines q_obj or $otheridea?

#11 Updated by amacdona@redhat.com 12 months ago

bmbouter wrote:

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced.

I'm don't think this is a problem.

In rough psuedocode:

dc = DeclarativeContent()
If dc.content is not unique:
    dc.content = SomeContent.objects.get(pk=dc.content.pk)

Therefore, the dc data is preserved. If 2 dupes go through a single pipeline, each is contained by it's own dc.

#12 Updated by bmbouter 12 months ago

wrote:

bmbouter wrote:

The only possible issue could be if there is any extra data attached to the in-memory model being saved which would be lost when the model is queried and replaced.

I'm don't think this is a problem.

I'm not explaining myself well. I don't mean the extra_data attribute on the DeclarativeContent unit, I mean an extra attribute assigned to the model itself. Something like:

content = dc.content
content.new_attribute = 'i need this later'
for content in MyContent.bulk_create(new_attribute)
    print(content.new_attribute)  # Does this raise an AttributeError?

In rough psuedocode:

[...]

Therefore, the dc data is preserved. If 2 dupes go through a single pipeline, each is contained by it's own dc.

#13 Updated by bmbouter 12 months ago

During IRC convo we confirmed that currently in core, and in @dalley's upcoming single-content branch PR there is no reliance on this 'attribute attachment' implementation so I think core is good w/ the change.

I did have an idea to share, which is that instead of replacing the in-memory object we could take the db attributes we just fetched and assign them to the in-memory one and continue to use that.

I'll point out that instance attachment approach works with bulk_create so if we're replacing bulk_create on some of our models with our implementation, it would be good if we could not deviate from django's behavior.

#14 Updated by daviddavis 12 months ago

What's the benefit to overriding bulk_create instead of just defining a new method?

#15 Updated by daviddavis 12 months ago

I reproduced this with pulp_file. Here were the steps:

mkdir test && cd test
echo "test" > test1
echo "test" > test2
for file in `ls test*`; do echo $file,`sha256sum $file | awk '{ print $1 }'`,`stat -L -c '%s' $file`; done > PULP_MANIFEST
http :8000/pulp/api/v3/repositories/ name=test
http :8000/pulp/api/v3/remotes/file/ name=test url="file://`pwd`/PULP_MANIFEST" 
http :8000/pulp/api/v3/remotes/file/1/sync/ repository=/pulp/api/v3/repositories/1/

#16 Updated by rchan 11 months ago

  • Sprint changed from Sprint 44 to Sprint 45

#17 Updated by dkliban@redhat.com 11 months ago

daviddavis wrote:

What's the benefit to overriding bulk_create instead of just defining a new method?

Since there is always a race condition when multiple workers are syncing content, it seems like the plugin writer would always want to prevent an IntegrityError from failing the sync. Providing only one method for bulk creating would prevent the plugin writer from using the wrong method that doesn't account for the IntegrityError.

#18 Updated by dkliban@redhat.com 11 months ago

  • Status changed from ASSIGNED to MODIFIED

#19 Updated by ttereshc 11 months ago

  • Tags Pulp 3 added

#21 Updated by amacdona@redhat.com 11 months ago

  • Related to Refactor #4177: Update sync to use ArtifactSaver Stage added

#22 Updated by amacdona@redhat.com 11 months ago

  • Duplicated by Issue #4086: ArtifactSaver stage is vulnerable to race conditions. added

#23 Updated by daviddavis 6 months ago

  • Sprint/Milestone set to 3.0

#24 Updated by bmbouter 6 months ago

  • Tags deleted (Pulp 3)

Please register to edit this issue

Also available in: Atom PDF