Issue #4176
closedfrom_file() and from_metadata() create different data structures in mongodb
Description
When a debian package is added to mongodb using the from_file() function, then the REL_FIELDS will contain lists of dicts.
If they are added using the from_metadata() function instead, then the same information is stored in a single string as found in a Packages file instead.
Both functions are located in:
plugins/pulp_deb/plugins/db/models.py
Currently sync.py uses from_metadata, while importer.py uses from_file.
This can lead to structurally inconsistent databases.
While this is surely undesirable I am not aware of any known issues being caused by it.
I only noticed this issue because it causes a bug in a PR of mine: https://github.com/pulp/pulp_deb/pull/57
I will amend the PR to be able to handle both types of db entries.
In my estimation, to fully fix the problem such that all future db entries will be consistent, would require some significant work.
I am unsure if it is worth fixing at all, given that we all hope to switch to pulp3 in the not to distant future.
That being said, any code that accesses any REL_FIELDS in the db should be aware of this issue.
A test to explicitly test for compatibility with the two cases is also highly desirable, since the current test suite did not catch this issue in my PR.
Once I get around to it, I may add a complete example how some example package would end up in the db for each of the two functions.
Updated by quba42 almost 6 years ago
So I promised a reproducible example:
When I synchronize the following repository:
pulp-admin deb repo create --repo-id 'tuxedo' --feed 'https://deb.tuxedocomputers.com/debian/' --releases 'stretch' --components 'main' --architectures 'amd64' --relative-url 'tuxedo'
pulp-admin deb repo sync run --repo-id='tuxedo'
I can find the following in the mongodb:
mongo pulp_database
> db.units_deb.find({name:{$eq: "tuxedo-wmi-dkms"}}).pretty()
{
"_id" : "7394174d-43cd-49ef-97cd-9a57fb2e9023",
"pulp_user_metadata" : {
},
"_last_updated" : 1542880361,
"_storage_path" : "/var/lib/pulp/content/units/deb/fc/bebc0595c70921b38cc57477e5e8aa8491d5118e92478c5b534ca5e5029523/tuxedo-wmi-dkms_1.5.1_all.deb",
"downloaded" : true,
"name" : "tuxedo-wmi-dkms",
"version" : "1.5.1",
"architecture" : "all",
"checksumtype" : "sha256",
"checksum" : "e5446c4c327752ddc74f59bc9e8239d912aee6ef10b121d0690c2b5f68262057",
"size" : 23294,
"filename" : "tuxedo-wmi-dkms_1.5.1_all.deb",
"conflicts" : "clevo-laptop, clevo-wmi",
"depends" : "dkms (>= 1.95), gcc, make",
"_ns" : "units_deb",
"_content_type_id" : "deb",
"maintainer" : "Christoph Jaeger <cj@linux.com>",
"installed_size" : "103",
"section" : "misc",
"priority" : "optional",
"description" : "tuxedo-wmi driver in DKMS format."
}
>
When I manually upload the same package on a new clean pulp instance instead:
pulp-admin deb repo create --repo-id 'test'
pulp-admin deb repo uploads deb --repo-id 'test' -f ~/devel/tuxedo-wmi-dkms_1.5.1_all.deb
I can find the following in the mongodb:
mongo pulp_database
> db.units_deb.find({name:{$eq: "tuxedo-wmi-dkms"}}).pretty()
{
"_id" : "05d82509-8591-49c1-92d3-de2662cc2079",
"pulp_user_metadata" : {
},
"_last_updated" : 1542879900,
"_storage_path" : "/var/lib/pulp/content/units/deb/fc/bebc0595c70921b38cc57477e5e8aa8491d5118e92478c5b534ca5e5029523/tuxedo-wmi-dkms_1.5.1_all.deb",
"downloaded" : true,
"name" : "tuxedo-wmi-dkms",
"version" : "1.5.1",
"architecture" : "all",
"checksumtype" : "sha256",
"checksum" : "e5446c4c327752ddc74f59bc9e8239d912aee6ef10b121d0690c2b5f68262057",
"size" : 23294,
"filename" : "tuxedo-wmi-dkms_1.5.1_all.deb",
"breaks" : [ ],
"conflicts" : [
{
"name" : "clevo-laptop"
},
{
"name" : "clevo-wmi"
}
],
"depends" : [
{
"flag" : "GE",
"version" : "1.95",
"name" : "dkms"
},
{
"name" : "gcc"
},
{
"name" : "make"
}
],
"enhances" : [ ],
"pre_depends" : [ ],
"provides" : [ ],
"recommends" : [ ],
"replaces" : [ ],
"suggests" : [ ],
"_ns" : "units_deb",
"_content_type_id" : "deb",
"installed_size" : "103"
}
>
Synchronizing a repository uses from_metadata, while the manual package upload uses from_file.
Updated by quba42 almost 6 years ago
Update: After some more investigating, I have now developed a strong preference for modifying the from_file function such that all future db entries will store Debian relationship field strings (https://www.debian.org/doc/debian-policy/ch-relationships.html) rather than the same information structured as a list of dicts.
My reasons are the following:
- Debian relationship field strings are used in Packages files and are therefore a well known type to many tools.
- Disassembling and reassembling these strings adds a lot of unneeded complexity to pulp_deb and complexity is bug prone.
- The performance improvements I am trying to add (https://github.com/pulp/pulp_deb/pull/57) require Debian relationship strings.
- I believe I have found a case where the current string parsing mechanism loses information* (aka has a bug, see also number 2. above)
*I had a package that included "python:any" in a Debian relationship string and the "any" did not appear anywhere in the resultant db dict. (However, I am not sure what "python:any" means).
The main counter point is presumably that db fields are meant to structure information in a searchable way, but unless we have an actual use case for it I don't think it is worth the effort.
Important: This change would need to be accompanied by a db migration to ensure all existing entries are made consistent.
Updated by mihai.ibanescu@gmail.com almost 6 years ago
I am pretty sure this is a bug. I need to identify why it's happening to begin with. I agree the data should be consistent.
Updated by mihai.ibanescu@gmail.com almost 6 years ago
You were right on all the points. Indeed, using from_metadata will not run the dependency strings through DependencyParser.
The original idea was to store the dependency strings in a manner similar to pulp_rpm. For instance, rpm will store:
"requires": [
{
"release": null,
"epoch": null,
"version": null,
"flags": null,
"name": "/bin/sh"
},
{
"release": null,
"epoch": null,
"version": null,
"flags": null,
"name": "/bin/sh"
},
{
"release": null,
"epoch": null,
"version": null,
"flags": null,
"name": "chkconfig"
}
]
Also, you are right that the intention was to support writing a dependency solver that only queries pulp. Being able to quickly say "show me all packages requiring X", with or without additional operators/flags, just with pure pulp/mongo queries, was interesting at the time.
For that reason I would prefer the dictionaries to plain strings, even though I am not aware of anyone making pulp/mongo queries like the one described above.
Would it help if you had both the string representation and the parsed one?
Updated by mbucher almost 6 years ago
I do not think having rel_fields as original-string and parsed-dict is a good idea because of redundancy.
Unless someone really requires the parsed rel_fields, I'd also vote for storing the strings:
- it is less complex and therefore makes synching of repositories faster
- we can be sure that the Package file created from the DB has the same values for the rel_fields as the one used to create the DB-entries.
Updated by mihai.ibanescu@gmail.com almost 6 years ago
Let me check how we're using those fields internally.
Updated by quba42 almost 6 years ago
Though I am not a huge fan of storing information redundantly, I would prefer redundant storage to not having the plain strings.
I am simply not confident I could write a version of the DependencyParser (and it's reverse) that can ensure it will never lose any information.
I also agree, that if there is but one user/use case out there that requires the parsed dependencies than they need to be retained.
I do think that the majority of all pulp_deb users simply do not have the parsed dependencies right now, since the most common use case for pulp_deb is presumably syncing upstream repositories. This is at least anecdotal evidence that we might not need the parsed dependencies at all?
Ulitmately, I am willing to work on whatever solution we agree on, since I do want to get https://github.com/pulp/pulp_deb/pull/57 merged which is currently blocked by this issue.
Updated by mihai.ibanescu@gmail.com almost 6 years ago
I guess we can agree to disagree :-)
When I started modifying the plugin, the primary use case was for us to publish our own content. Syncing came after that, since mirroring is not my primary use case. I believe that's code contributed by mdellweg
Diversity is good, actually, it makes us keep both use cases in mind.
I can see how writing a parser that is guaranteed to not lose information is pretty much impossible, since I can't seem to find any good specs for it.
So, if you would not mind, let's add the unparsed strings side by side with the parsed ones. I would imagine we'd have to either append a suffix to the key name (like "depends-raw"), or move them to a "deeper" dict field like "raw-dependencies/depends", "raw-dependencies/requires", etc. What do you think?
Updated by quba42 almost 6 years ago
Not sure I dsagree, or rather I guess I am disagreeing to disagree ;-)
Jokes aside, I guess it was a bit presumptuous to infer from the use case I know of to those of other people.
I agree that given use cases for both parsed and plain strings, the best solution is simply storing both.
You also managed to preempt my next question, namely how to name and structure which fields going forward. If I have understood correctly there is some code/tools using the existing fields in the parsed form? If so it would be least disruptive to use the existing names/fields for the parsed/structured dependencies.
The changes in https://github.com/pulp/pulp_deb/pull/57 would work best with a set of separate top level fields containing a plain dependency string each. (Something like "depends-raw" etc.)
That being said, something like a "raw_relations" dict might make for more tidy db entries.
I am undecided between these competing considerations.
I will think on it some more, and try to ask around. (mdellweg, @mbucher)
Updated by quba42 almost 6 years ago
After bouncing some ideas off @mbucher (thanks for that!) I now propose the following:
- The existing "depends et al" fields should only ever store parsed dependency information going forward.
- There should be a new db field named something like "raw_control_fields" containing a dict of strings exactly as stored in the relevant .deb packages control file (if read from file) and exactly the corresponding fields from an upstream Packages file if synchronized from an upstream repository (using the from_metadata function).
- There should be a test (or tests) to ensure an upstream sync (from_metadata) and a manual package upload (from_file) will result in the same db entry for some given package.
There should be some configuration option or similar to disable dependency field parsing for usage scenarios where this feature is not needed. (The default behaviour should be to include the parsed strings.)
Further explanation for why a "raw_control_fields" dict is better than just a "raw_dependency_fields" dict:
The "raw_control_fields" dict will contain all control file fields irrespective of whether they are explicitly known to pulp_deb. This will automatically include any raw dependency strings. This way the db entries are guaranteed to contain all needed information for publishing repositories in a way that is compliant (at least with respect to control file fields) with the Debian Repository Fromat specification (https://wiki.debian.org/DebianRepository/Format).
Updated by mihai.ibanescu@gmail.com almost 6 years ago
That seems very reasonable.
Updated by quba42 almost 6 years ago
I have now opened a PR that aims to fix this issue: https://github.com/pulp/pulp_deb/pull/61
At the moment, the PR is just a basic implementation without attention to the tests, but I will keep working on it, and we can start discussing it.
After some more deliberation we also concluded that a configuration option that influences the contents of the data base was a bad idea. As a result, I have abandoned the final bullet from my previous post.
Added by quba42 almost 6 years ago
Added by quba42 almost 6 years ago
Revision 1c8ded30 | View on GitHub
Cleanup the PulpDeb class fields
Added by quba42 almost 6 years ago
Revision cf098bab | View on GitHub
Cleanup now unneeded legacy code
Added by quba42 almost 6 years ago
Revision 88f03861 | View on GitHub
Reuse the from_file tests for from_deb_file
Added by quba42 almost 6 years ago
Revision e6050737 | View on GitHub
fixup (revert the from_file name change)
Added by quba42 almost 6 years ago
Revision 082e7d1c | View on GitHub
fixup (improve the _check_for_required_fields function)
Added by quba42 almost 6 years ago
Revision bda466c4 | View on GitHub
fixup (ensure user_metadata can't be None)
Added by quba42 almost 6 years ago
Revision eefc0501 | View on GitHub
fixup (revert parameter name change)
Updated by quba42 almost 6 years ago
The PR is now merged into master.
Thanks @mihai for his timely and insightful review.
Updated by mdellweg over 5 years ago
- Status changed from NEW to POST
- Assignee set to quba42
Updated by ttereshc over 5 years ago
- Sprint/Milestone set to 2.19.0
- Platform Release set to 2.19.0
Updated by ttereshc over 5 years ago
- Status changed from POST to MODIFIED
Moving to MODIFIED state, since PR is merged, @x9c4 asked to add this one to the upcoming release.
I associated all the commits from the PR with this issue.
Updated by ttereshc over 5 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Ensure the db is used consistently