Project

Profile

Help

Issue #6347

closed

File content list does not return all unique content units

Added by sajha almost 5 years ago. Updated over 4 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Katello
Sprint:
Sprint 70
Quarter:

Description

https://pulp-file.readthedocs.io/en/latest/restapi.html#operation/content_file_files_list does not return all unique content units when used with a page size of ~2000 and there are multiple pages of results. To reproduce:

  1. Create and sync a large file repo (http://quartet.usersys.redhat.com/pub/fake-repos/large_file/)
  2. Retrieve content units for the repo using page_size 2000
  3. Out of 70,000 content units, roughly 69k unique records are returned.

Several units are repeated across pages leading to this discrepancy since we only query till offset 68000 with the 2000 page_size and 70k results don't consist of 70k unique units.


Files

test.sh (857 Bytes) test.sh lmjachky, 04/03/2020 11:36 PM
Actions #1

Updated by fao89 almost 5 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 68
Actions #2

Updated by rchan almost 5 years ago

  • Sprint changed from Sprint 68 to Sprint 69
Actions #3

Updated by jsherril@redhat.com over 4 years ago

  • Tags Katello-P1 added
  • Tags deleted (Katello-P2)
Actions #4

Updated by dkliban@redhat.com over 4 years ago

This may be a problem for other plugins such as pulp_container. We should see if we can add the sort to all content by default.

Actions #5

Updated by dkliban@redhat.com over 4 years ago

All Content models should be sorted by the 'pulp_created' field.

Actions #6

Updated by lmjachky over 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to lmjachky
Actions #7

Updated by dkliban@redhat.com over 4 years ago

  • Project changed from File Support to Pulp
  • Sprint/Milestone set to 3.3.0
Actions #8

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 69 to Sprint 70
Actions #9

Updated by lmjachky over 4 years ago

  • File test.sh test.sh added
  • Status changed from ASSIGNED to NEW
  • Assignee deleted (lmjachky)

I was able to reproduce the issue. I used the attached test to do so. The output of the test looked like this:

2000: 4000
4000: 6000
6000: 8000
8000: 10000
10000: 12000
12000: 14000
14000: 16000
16000: 18000
18000: 20000
20000: 22000
22000: 24000
24000: 25295
26000: 26597
28000: 27908
30000: 29271
32000: 30531
34000: 31868
36000: 33239
38000: 34526
40000: 35833
42000: 37146
44000: 38486
46000: 39804
48000: 41137
50000: 42435
52000: 43745
54000: 45055
56000: 46362
58000: 47703
60000: 49003
62000: 50342
64000: 51626
66000: 52978
68000: 54312
70000: 54312
Expected: 70000
Actual: 54312

The content started to be non-unique after 24,000 fetched units (the number probably depends upon a running system). And it did not matter whether I was retrieving the content units using page_size=2,000 or page_size=200.

I tried to sort synced content by the field "pulp_created" (as dkliban had mentioned) using the built-in Django ordering options (https://docs.djangoproject.com/en/3.0/ref/models/options/#django.db.models.Options.ordering), but with no success. I added "ordering = ['-pulp_created']" to the meta class of the model FileContent, however, the sync failed with the following error message:

"traceback": "  File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/rq/worker.py\", line 884, in perform_job\n   
 rv = job.perform()\n  File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/rq/job.py\", line 664, in perform\n   
 self._result = self._execute()\n  File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/rq/job.py\", line 670, in _execute\n 
   return self.func(*self.args, **self.kwargs)\n  File \"/home/fedora/devel/pulp_file/pulp_file/app/tasks/synchronizing.py\", line 45, 
in synchronize\n    dv.create()\n  File \"/home/fedora/devel/pulpcore/pulpcore/plugin/stages/declarative_version.py\", line 149, 
in create\n    loop.run_until_complete(pipeline)\n  File \"/home/fedora/devel/pulpcore/pulpcore/app/models/repository.py\", 
line 753, in __exit__\n    repository.finalize_new_version(self)\n  File \"/home/fedora/devel/pulp_file/pulp_file/app/models.py\", 
line 68, in finalize_new_version\n    validate_repo_version(new_version)\n  File 
\"/home/fedora/devel/pulpcore/pulpcore/plugin/repo_version_utils.py\", line 138, in validate_repo_version\n    
validate_duplicate_content(version)\n  File \"/home/fedora/devel/pulpcore/pulpcore/plugin/repo_version_utils.py\",
line 96, in validate_duplicate_content\n    ).distinct(*repo_key_fields).count()\n  
File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/models/query.py\", line 392, in count\n  
  return self.query.get_count(using=self.db)\n  File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/models/sql/query.py\",
 line 504, in get_count\n    number = obj.get_aggregation(using, ['__count'])['__count']\n 
 File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/models/sql/query.py\", line 489, 
in get_aggregation\n    result = compiler.execute_sql(SINGLE)\n  File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/models/sql/compiler.py\", line 1133, in execute_sql\n   
 cursor.execute(sql, params)\n 
 File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/backends/utils.py\", line 67, in execute\n  
  return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)\n 
 File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/backends/utils.py\", line 76, in _execute_with_wrappers\n  
  return executor(sql, params, many, context)\n  File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/backends/utils.py\",
 line 84, in _execute\n    return self.cursor.execute(sql, params)\n  
File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/utils.py\", line 89, in __exit__\n 
   raise dj_exc_value.with_traceback(traceback) from exc_value\n  
File \"/usr/local/lib/pulp/lib64/python3.7/site-packages/django/db/backends/utils.py\", line 84, in _execute\n   
 return self.cursor.execute(sql, params)\n",

    "description": "SELECT DISTINCT ON expressions must match initial ORDER BY expressions\nLINE 1: SELECT COUNT(*) FROM (SELECT DISTINCT ON (\"file_filecontent\"...\n   

I did not experience the above error message during the sync when I added that ordering to the model MasterModel. Still, Pulp did not return unique values. Specifying a default ordering (https://www.django-rest-framework.org/api-guide/filtering/#specifying-a-default-ordering) in FileContentViewSet did not help either way.

The only approach, that was working for me, was to create a custom class that handles the actual ordering, like shown in the commit https://github.com/lubosmj/pulp_file/commit/379ce9298307a4d8886a1cb1f4e2a5bd08906ef7. I just stole the workaround proposed here: https://github.com/encode/django-rest-framework/issues/6886#issuecomment-547120480. What is interesting here is that I did not need to order content by the field "pulp_created" and the test passed as supposed. Yet, It seems like the changes did break some functionality according to Travis (see the build https://travis-ci.org/github/lubosmj/pulp_file/builds/670764582, only one test failed, but I am not experiencing such a failure in my VM).

Also, there is an option to use a different pagination. Currently, we are using LimitOffsetPagination. There exists CursorPagination (https://www.django-rest-framework.org/api-guide/pagination/#cursorpagination) which requires that there is a unique, unchanging ordering of items in the result set.

Actions #10

Updated by fao89 over 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to fao89
Actions #11

Updated by pulpbot over 4 years ago

  • Status changed from ASSIGNED to POST
Actions #13

Updated by fao89 over 4 years ago

  • Assignee changed from fao89 to lmjachky

Added by Fabricio Aguiar over 4 years ago

Revision 5ef9ee83 | View on GitHub

Fixed non unique content units on content list

https://pulp.plan.io/issues/6347 closes #6347

Actions #14

Updated by Anonymous over 4 years ago

  • Status changed from POST to MODIFIED

Added by Fabricio Aguiar over 4 years ago

Revision 937fbddf | View on GitHub

Fixed non unique content units on content list

https://pulp.plan.io/issues/6347 closes #6347

(cherry picked from commit 5ef9ee839bb23d15f0754e44e53358789ce57eb0)

Actions #15

Updated by Anonymous over 4 years ago

Actions #17

Updated by bmbouter over 4 years ago

  • Status changed from MODIFIED to ASSIGNED
Actions #18

Updated by bmbouter over 4 years ago

  • Status changed from ASSIGNED to POST
  • Assignee changed from lmjachky to fao89

Added by Fabricio Aguiar over 4 years ago

Revision 8f6fb1bd | View on GitHub

StableOrderingFilter only for NamedModelViewSet

https://pulp.plan.io/issues/6347 fixes #6347

Actions #19

Updated by Anonymous over 4 years ago

  • Status changed from POST to MODIFIED
Actions #20

Updated by ttereshc over 4 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Actions #21

Updated by ggainey over 4 years ago

  • Tags Katello added
  • Tags deleted (Katello-P1)

Also available in: Atom PDF