Project

Profile

Help

Issue #4085

ContentUnitSaver stage is vulnerable to race conditions.

Added by jortel@redhat.com over 1 year ago. Updated 6 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Category:
-
Sprint/Milestone:
Start date:
Due date:
Severity:
2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
QA Contact:
Complexity:
Smash Test:
Sprint:

Description

The ContentUnitSaver stage is creating Content using save() and both ContentArtifact and RemoteArtifact using bulk_create() which will raise an IntegretyError on any constraint violation. The same content can be created concurrently during operations such as sync and upload when running more than 1 worker. The race condition exists between the QueryExistingContentUnits and ContentUnitSaver stages. As a result, the un-handled IntegretyError will cause one of the operations to fail.

The impact on users is that syncs will randomly fail with an IntegretyError which will be very concerning.


Related issues

Related to Pulp - Issue #4060: QueryExistingArtifacts stage does not prevent duplicates within a stream CLOSED - CURRENTRELEASE Actions
Related to Container Support - Refactor #4178: Update sync to use ContentSaver Stage CLOSED - CURRENTRELEASE Actions

Associated revisions

Revision 676a3740 View on GitHub
Added by dkliban@redhat.com over 1 year ago

Problem: bulk_create can fail with IntegrityError

Solution: add a manager that provides a bulk_get_or_create() method

This patch introduces a BulkCreateManager that provides the bulk_get_or_create method. This method handles IntegrityErrors encountered during bulk_create() by inserting each object into the database serially. When an IntegrityError or ValueError is encountered during serial saving of objects, the object being saved is replaced with an instance from the database.

This patch introduces a mixin used by Artifact, Content, ContentArtifact, and RemoteArtifact models. The mixin provides the q() method which returns a Q object that can be used to retreive the database instance of the model.

closes #4060 https://pulp.plan.io/issues/4060

Revision 3f5cc8f8 View on GitHub
Added by dkliban@redhat.com over 1 year ago

Problem: Artifact Saver fails to associate existing Artifacts

Solution: Update Declarative Artifacts with Artifacts returned by bulk_get_or_create()

The Content unit saver did not account for duplicate units at all. This patch also addresses that problem.

[noissue]

Revision db96e27f View on GitHub
Added by dkliban@redhat.com over 1 year ago

Problem: duplicate content can't by synced

Solution: fix the content unit saver stage

This patch addresses 2 problems:

  1. Error because an exception was being caught during a transaction

  2. The q() method did not provide the correct query for Content objects

closes #4170 https://pulp.plan.io/issues/4170

History

#1 Updated by jortel@redhat.com over 1 year ago

  • Tags Pulp 3 added

#2 Updated by CodeHeeler over 1 year ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 44

#3 Updated by bmbouter over 1 year ago

This is totally different on the single-content branch which converts the last call to bulk_create() and all of them occur in the transaction. So I think this is maybe NOTABUG. I missed triage so I can only comment now.

#4 Updated by jortel@redhat.com over 1 year ago

  • Description updated (diff)

#5 Updated by daviddavis over 1 year ago

  • Sprint deleted (Sprint 44)

Removing from the sprint due to lack of agreement on around how to handle this.

#6 Updated by jortel@redhat.com over 1 year ago

bmbouter wrote:

This is totally different on the single-content branch which converts the last call to bulk_create() and all of them occur in the transaction. So I think this is maybe NOTABUG. I missed triage so I can only comment now.

What do you mean by "single-content branch" ?

#7 Updated by amacdona@redhat.com over 1 year ago

  • Related to Issue #4060: QueryExistingArtifacts stage does not prevent duplicates within a stream added

#8 Updated by amacdona@redhat.com over 1 year ago

There has been some confusion with this one and https://pulp.plan.io/issues/4060. 4060 is a problem within a single task (sync) of duplicates existing in a stream at the same time. This problem is related, but distinct. As noted in the description, this problem exists when 2 or more tasks are happening at once. If a unit passes through QueryExisting as a non-dupe, and while it is waiting in the Queue for the save stage, and another task creates that unit, it this one becomes a dupe and will fail when it is bulk_saved.

I think it is important to document these problems separately since they occur for different reasons. Some possible solutions might fix both at once. For instance, with docker sync we solve both problems by implementing our own save stage which does not use bulk writes. This allows us to catch and handle integrity errors as they happen. Adding this stage to pulpcore would solve both problems at the expense of database write performance. Since db performance could be a bottleneck, it is worth considering other options if we have any-- but those options would need to take into account both problems (dupes in stream, parallel tasks).

#9 Updated by jortel@redhat.com over 1 year ago

@bmbouter, What makes you think that this cannot happen?

worker 1: query content ABC  (QueryStage)
  not-found

worker 2: query content ABC (QueryStage)
  not-found

worker 1: create content ABC  (CreateStage)
  Inserted

worker 2: create content ABC  (CreateStage)
  IntegrityError

#10 Updated by daviddavis over 1 year ago

@jortel, I agree it could happen. We saw the same problem when we worked on base_paths. We thought running both the query and the create in a single transaction would solve it but it did not:

https://github.com/pulp/pulp/pull/3380#issuecomment-380477101

#11 Updated by bmbouter over 1 year ago

Recapping some of the discussion from last week, we do need to handle IntegrityErrors-with-requery support in the ContentUnitSaver stage. This ticket can track that. We also need to handle it in the ArtifactSaver stage too, which I believe is tracked as issue https://pulp.plan.io/issues/4060

Does that sound right ^?

#12 Updated by daviddavis over 1 year ago

I thought @jortel was going to work on a general solution that could be applied to anywhere we call bulk_create and write a ticket for it.

#13 Updated by amacdona@redhat.com over 1 year ago

  • Related to Refactor #4178: Update sync to use ContentSaver Stage added

#14 Updated by dkliban@redhat.com over 1 year ago

  • Status changed from NEW to MODIFIED
  • Assignee set to dkliban@redhat.com

#15 Updated by daviddavis about 1 year ago

  • Sprint/Milestone set to 3.0.0

#16 Updated by bmbouter about 1 year ago

  • Tags deleted (Pulp 3)

#17 Updated by bmbouter 6 months ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Please register to edit this issue

Also available in: Atom PDF