Issue #4316: Content with same natural key may be shared when not completely identical. - Pulp

Issue #4316

closed

Content with same natural key may be shared when not completely identical.

Added by jortel@redhat.com over 5 years ago. Updated over 3 years ago.

Status: CLOSED - WONTFIX

Priority: Normal

Severity: 2. Medium

Triaged: Yes

Description

The Problem

During the content creation stages, content is de-duplicated by comparing the natural key of DeclarativeContent.content with that of Content found in the DB. Matching natural keys do not guarantee that the full content definitions are the same: the attributes may differ, and so may the artifacts (their number and/or their rel-paths). This is unlikely, but it can happen. The concern is that content created by multiple sources is silently shared without verifying that it is 100% identical.

Example:

Content (name=apache, version=1.0)
   |__(one.json)__ Artifact (digest=A)
   |__(two.json)__ Artifact (digest=B)

Content (name=apache, version=1.0)
   |__(one.json)__ Artifact (digest=A)
   |__(two.json)__ Artifact (digest=B)
   |__(three.json)__ Artifact (digest=C)

Content (name=apache, version=1.0)
   |__(files/one.json)__ Artifact (digest=A)
   |__(files/two.json)__ Artifact (digest=B)
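
The three units above can be modelled in a few lines to show why a natural-key match alone is not enough. This is only a sketch: the `Content` class, its `natural_key()` method, and the `artifacts` mapping are hypothetical stand-ins written for illustration, not the pulpcore models.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    """Hypothetical stand-in for a content unit; not the pulpcore model."""

    name: str
    version: str
    # Maps relative path -> artifact digest (assumed representation).
    artifacts: dict = field(default_factory=dict)

    def natural_key(self):
        # Only these fields take part in de-duplication.
        return (self.name, self.version)


first = Content("apache", "1.0", {"one.json": "A", "two.json": "B"})
second = Content("apache", "1.0", {"one.json": "A", "two.json": "B", "three.json": "C"})
third = Content("apache", "1.0", {"files/one.json": "A", "files/two.json": "B"})

# All three units collapse to one natural key, so they would be shared.
same_key = first.natural_key() == second.natural_key() == third.natural_key()

# Yet no two of them have the same artifact layout.
all_differ = (
    first.artifacts != second.artifacts
    and first.artifacts != third.artifacts
    and second.artifacts != third.artifacts
)
```

Here `same_key` and `all_differ` are both true, which is exactly the situation the ticket is worried about.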

Detection

This is the tough part. The primary goal is to detect occurrences and alert users.

Perhaps Content could provide a comparison method that the stage calls. The base implementation could compare the number of artifacts and their rel-paths. Plugin writers would override it in concrete content types to perform a deeper comparison as needed.

This comparison will come with some cost.
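
A rough sketch of what such a comparison method might look like follows. Every name here (`Content`, `StrictContent`, `is_identical`, `check_match`) is an assumption made for illustration; none of it is existing pulpcore API, and the real stage would work with DB-backed models rather than plain dictionaries.

```python
class Content:
    """Hypothetical base class; not the pulpcore model."""

    def __init__(self, artifacts):
        # Maps relative path -> artifact digest (assumed representation).
        self.artifacts = dict(artifacts)

    def is_identical(self, other):
        # Base check: the same set of relative paths, which also
        # implies the same number of artifacts.
        return set(self.artifacts) == set(other.artifacts)


class StrictContent(Content):
    """Hypothetical plugin content type with a deeper comparison."""

    def is_identical(self, other):
        # Additionally require the same digest behind every relative path.
        return super().is_identical(other) and self.artifacts == other.artifacts


def check_match(declared, existing, warnings):
    """Record a warning instead of silently sharing mismatched content."""
    if declared.is_identical(existing):
        return True
    warnings.append(
        f"same natural key but different artifacts: "
        f"{sorted(declared.artifacts)} vs {sorted(existing.artifacts)}"
    )
    return False
```

The stage would call something like `check_match()` only after a natural-key hit, so the extra cost is limited to content that already exists in the DB.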

Remedies

Currently, the user would need to remove the offending content from all repositories and then delete it through orphan cleanup. Other ideas?

Updated by mdellweg over 5 years ago

I know this would radically change the data model, but we could stop reusing content across repositories altogether. Then a combination such as [name, version, architecture] would only need to be consistent within each repository. Artifacts and data in storage would still be reused, of course.

I thought this (or something very similar) was discussed elsewhere, but I cannot find the ticket.