Issue #4316

Content with the same natural key may be shared when not completely identical.

Added by jortel@redhat.com 9 months ago. Updated 6 months ago.

Status: NEW
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Severity: 2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible: No
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint:

Description

The Problem

During the content creation stages, content is de-duplicated by comparing the natural key of DeclarativeContent.content with that of Content found in the DB. Although the matched content has the same natural key, there is no guarantee that the full content definition is the same. The attributes could differ, as could the number or relative paths of the artifacts. This is unlikely, but it could happen. The concern is that content which may be created by multiple sources is silently shared without verifying that it is 100% identical.

Example:

Content (name=apache, version=1.0)
   |__(one.json)__ Artifact (digest=A)
   |__(two.json)__ Artifact (digest=B)

Content (name=apache, version=1.0)
   |__(one.json)__ Artifact (digest=A)
   |__(two.json)__ Artifact (digest=B)
   |__(three.json)__ Artifact (digest=C)

Content (name=apache, version=1.0)
   |__(files/one.json)__ Artifact (digest=A)
   |__(files/two.json)__ Artifact (digest=B)
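
The three cases above can be modeled in a few lines of Python. This is only an illustrative sketch: the `Content` dataclass, its `artifacts` mapping, and `natural_key()` are hypothetical stand-ins rather than Pulp's actual models. It shows that a natural-key comparison treats all three units as the same content even though their artifacts differ.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    """Hypothetical stand-in for a content unit (not Pulp's model)."""
    name: str
    version: str
    # Maps relative path -> artifact digest.
    artifacts: dict = field(default_factory=dict)

    def natural_key(self):
        # Assumption: the natural key is (name, version), as in the example.
        return (self.name, self.version)


first = Content("apache", "1.0", {"one.json": "A", "two.json": "B"})
second = Content("apache", "1.0",
                 {"one.json": "A", "two.json": "B", "three.json": "C"})
third = Content("apache", "1.0",
                {"files/one.json": "A", "files/two.json": "B"})

# All three match on natural key, so de-duplication would share them.
same_key = first.natural_key() == second.natural_key() == third.natural_key()

# Their artifact sets (count and relative paths) are not the same.
same_artifacts = first.artifacts == second.artifacts == third.artifacts
```

Here `same_key` is true while `same_artifacts` is false, which is exactly the situation that is currently not detected.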

Detection

This is the tough part. The primary goal is to detect occurrences and alert users.

Perhaps the Content could provide a comparison method that is used by the stage. The base implementation could compare the number of artifacts and their rel-paths. Plugin writers would override it in concrete content types to perform a deeper comparison as needed.

This comparison will come with some cost.
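
A rough sketch of the comparison method proposed above might look like the following. All names here (`Content`, `PackageContent`, `is_identical`, `find_mismatch`) are hypothetical and chosen for illustration; none of them are part of the existing plugin API. The base class checks only the relative paths (and therefore the artifact count), and a concrete type overrides the method to also compare digests and attributes.

```python
class Content:
    """Hypothetical base content type (not Pulp's actual class)."""

    natural_key_fields = ()

    def __init__(self, artifacts=None, **attrs):
        self.attrs = attrs
        # Maps relative path -> artifact digest.
        self.artifacts = dict(artifacts or {})

    def natural_key(self):
        return tuple(self.attrs[f] for f in self.natural_key_fields)

    def is_identical(self, other):
        # Base check: same set of relative paths. Equal key sets also
        # imply an equal number of artifacts.
        return self.artifacts.keys() == other.artifacts.keys()


class PackageContent(Content):
    """Hypothetical concrete type with a deeper comparison."""

    natural_key_fields = ("name", "version")

    def is_identical(self, other):
        # Deeper check: digests and all attributes must match as well.
        return (
            super().is_identical(other)
            and self.artifacts == other.artifacts
            and self.attrs == other.attrs
        )


def find_mismatch(new, existing):
    """Return a warning if the natural keys match but the content differs."""
    if new.natural_key() != existing.natural_key():
        return None
    if new.is_identical(existing):
        return None
    return f"Content {new.natural_key()} matched by natural key but differs"


existing = PackageContent({"one.json": "A", "two.json": "B"},
                          name="apache", version="1.0")
extra = PackageContent({"one.json": "A", "two.json": "B", "three.json": "C"},
                       name="apache", version="1.0")
warning = find_mismatch(extra, existing)
```

In this sketch the stage would call something like `find_mismatch()` only after a natural-key match, so the extra cost is limited to content that already exists in the DB. How the artifacts of the existing content would be loaded for the comparison is left open here.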

Remedies

Currently, the user would need to remove the offending content from all repositories and then delete it as part of orphan cleanup. Other ideas?

History

#1 Updated by mdellweg 9 months ago

I know this would radically change the data model, but we could stop reusing content across repositories altogether. Then the same combination of, say, [name, version, architecture] would only need to be consistent per repository. Artifacts and data in storage would still be reused, of course.

I thought this (or something very similar) was discussed elsewhere, but I cannot find the ticket.

#2 Updated by CodeHeeler 9 months ago

  • Triaged changed from No to Yes

#3 Updated by bmbouter 6 months ago

  • Tags deleted (Pulp 3)
