Story #20
closed
As a user, my applicability data is calculated in parallel
Added by Anonymous almost 10 years ago. Updated over 5 years ago.
100%
Description
Our applicability algorithm would be straightforward to convert into a parallel operation, wherein each consumer's or each repo's applicability calculation runs as an independent Celery task. This would allow Pulp to calculate applicability up to n times faster, where n is the number of Celery workers available.
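A minimal sketch of the fan-out idea described above. It is not Pulp code: `calculate_applicability` is a hypothetical placeholder for one repo's calculation, and a standard-library thread pool stands in for Celery workers. It only illustrates that each unit of work depends on its own input and can therefore be handed to any available worker.

```python
from concurrent.futures import ThreadPoolExecutor


def calculate_applicability(repo_id):
    """Hypothetical stand-in for one repo's applicability calculation."""
    # Each call depends only on its own repo_id, so calls are independent.
    return {"repo_id": repo_id, "applicable_units": []}


def calculate_all(repo_ids, workers=4):
    """Fan the per-repo calculations out over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(calculate_applicability, repo_ids))


results = calculate_all(["repo-a", "repo-b", "repo-c"])
```

The n-fold speedup is a best case; it assumes the work divides evenly and that the workers do not contend for a shared resource such as the database.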
Related issues
Updated by rbarlow over 9 years ago
- Groomed set to No
- Sprint Candidate set to Yes
Updated by rbarlow over 9 years ago
It might be worth thinking about whether we can make a patch that will apply cleanly against 2.4 since there are users who are having problems with DB cursor timeouts. Patching against 2.6 might also be fine if we are comfortable requiring users to upgrade to a newer Pulp to fix this.
Updated by rbarlow over 9 years ago
On 06/04/2015 11:00 AM, Pulp wrote:
It might be worth thinking about whether we can make a patch that will
apply cleanly against 2.4 since there are users who are having problems
with DB cursor timeouts. Patching against 2.6 might also be fine if we
are comfortable requiring users to upgrade to a newer Pulp to fix this.
On second thought, this might have to be done with "spawned tasks", which
would change the API to the task. One way to avoid a
backwards-incompatible change would be to add an optional boolean to the API
call that lets the user state whether they want to do the calculation in
parallel or not; if the boolean isn't provided, we default to the
current behavior. Then, with Pulp 3.0, we can just change to always doing
it in parallel and drop the boolean.
--
Randy Barlow
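The optional-boolean idea from the comment above might look roughly like the following. This is a sketch under assumptions: the function name, the `parallel` parameter name, and the shape of the return value are all hypothetical and are not taken from Pulp's API.

```python
def regenerate_applicability(repo_ids, parallel=False):
    """Hypothetical API handler; `parallel` is the proposed optional boolean."""
    if parallel:
        # Opt-in behavior: one unit of work per repo.
        return {"mode": "parallel", "task_count": len(repo_ids)}
    # Default: the current behavior, a single task covering every repo.
    return {"mode": "serial", "task_count": 1}


default_report = regenerate_applicability(["repo-a", "repo-b"])
parallel_report = regenerate_applicability(["repo-a", "repo-b"], parallel=True)
```

Because the flag defaults to `False`, existing callers that never pass it keep getting the response they get today.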
Updated by dkliban@redhat.com over 9 years ago
Here is a possible implementation:
Define TaskMonitorTask as a regular Celery task that takes two parameters: 'parent_task_id' and 'tasks'. 'tasks' is a list of task IDs for the tasks that need to be monitored. The task checks the status of every task in the list and then updates the status of the parent task. If not all of the tasks are in a final state, the task dispatches itself again with the list of remaining tasks and the same parent task ID. Each re-dispatch is delayed by 5 minutes or another configurable value.
Define RepoProfileApplicabilityCalculation as a Celery task that takes an existing repo profile applicability and performs the work here [0].
Create a new RepoApplicabilityCalculationTask as a Pulp Task that dispatches one RepoProfileApplicabilityCalculation task for each repo applicability profile that needs to be updated. It then dispatches TaskMonitorTask, passing it the list of RepoProfileApplicabilityCalculation tasks that were dispatched and its own ID (the ID of the RepoApplicabilityCalculationTask).
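One pass of the proposed monitor could be sketched as below. Everything here is an assumption made for illustration: the state names, the `monitor_once` function, and the in-memory `statuses` dict that stands in for the stored task statuses. Re-dispatching with a delay is left to the caller, and failure handling for the parent is omitted.

```python
FINAL_STATES = {"finished", "error", "canceled"}  # assumed state names


def monitor_once(parent_task_id, task_ids, statuses):
    """Run one monitoring pass and return the task IDs still pending.

    `statuses` is a hypothetical mapping of task ID -> state.
    """
    remaining = [t for t in task_ids if statuses[t] not in FINAL_STATES]
    # The parent stays "running" until no child task remains.
    statuses[parent_task_id] = "running" if remaining else "finished"
    return remaining


statuses = {"parent": "waiting", "t1": "finished", "t2": "running"}
remaining = monitor_once("parent", ["t1", "t2"], statuses)
parent_after_first_pass = statuses["parent"]

statuses["t2"] = "finished"  # the second child completes before the next pass
remaining_after = monitor_once("parent", remaining, statuses)
```

In the proposal, a non-empty return value would trigger another dispatch of the monitor with that shorter list after the configured delay.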
Updated by mhrivnak over 9 years ago
- Groomed changed from No to Yes
Please review the final plan with the team before implementing.
This diff must apply cleanly on 2.6, but may have to be released with 2.7.
Updated by bmbouter over 9 years ago
@dkliban When you say RepoApplicabilityCalculationTask is a Pulp task, do you mean it inherits from Pulp's base Task? If so, it will be auto-marked as completed as soon as it finishes, because of the on_success and on_failure handlers that the base Task provides.
That behavior would need to be disabled somehow, and the final call to TaskMonitorTask would need to set the status explicitly. What can we do in that area?
Another important point to consider is apply_async versus apply_async_with_reservation. Does RepoApplicabilityCalculationTask need a reservation to ensure a repo operation doesn't happen underneath it? What do you think?
Updated by dkliban@redhat.com about 9 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
I looked into using Celery chords to do this work; however, I discovered that Celery chords rely on the results backend [0]. Since we are trying to move away from depending on the results backend, Brian and I have come up with a plan to introduce an implementation of ParallelTasks using the TaskStatus in the database. I'll update this story once I have the plan fully written out.
[0] http://blog.untrod.com/2015/03/how-celery-chord-synchronization-works.html
Updated by rbarlow about 9 years ago
On 07/21/2015 02:41 PM, Pulp wrote:
Since we are trying to move away from depending on the results backend
IMO, it's OK to use the broker as a results backend for this purpose.
Have you considered that since it may be easier?
Updated by dkliban@redhat.com about 9 years ago
- Blocked by Story #1205: As a developer I can dispatch a task that can dispatch a group of tasks added
Updated by dkliban@redhat.com about 9 years ago
- Blocks Story #1206: As an API user, I can get summary status for a task group added
Updated by dkliban@redhat.com about 9 years ago
- Blocks deleted (Story #1206: As an API user, I can get summary status for a task group)
Updated by dkliban@redhat.com about 9 years ago
- Blocked by Story #1206: As an API user, I can get summary status for a task group added
Updated by dkliban@redhat.com about 9 years ago
- Status changed from ASSIGNED to NEW
Updated by mhrivnak almost 9 years ago
- Assignee deleted (dkliban@redhat.com)
- Platform Release set to 2.8.0
Updated by mhrivnak almost 9 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
Updated by dkliban@redhat.com almost 9 years ago
- Status changed from ASSIGNED to POST
Added by dkliban@redhat.com almost 9 years ago
Revision 0ecc2dfd | View on GitHub
Parallelizes applicability regeneration for updated repository
This patch provides a new Celery task for performing applicability regeneration for a batch of applicability profiles. The ApplicabilityRegenerationManager dispatches a series of tasks with the same group id. Each task is dispatched with a list of up to 10 RepoProfileApplicabilities to reevaluate.
The API endpoint for generating content applicability for updated repositories changes as part of this patch. Instead of returning 202 with a call report, the server returns 202 with a group call report.
This patch does not make any changes to the algorithm used to calculate content applicability.
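The batching described in this commit message can be sketched as follows. This is not the code from the patch: `plan_batches` and the dict layout are hypothetical, and only the batch size of 10 and the shared group id come from the message itself.

```python
import uuid

BATCH_SIZE = 10  # the commit message says "up to 10" profiles per task


def plan_batches(profile_ids, batch_size=BATCH_SIZE):
    """Split profile IDs into batches that all carry one shared group id."""
    group_id = str(uuid.uuid4())
    return [
        {"group_id": group_id, "profile_ids": profile_ids[i:i + batch_size]}
        for i in range(0, len(profile_ids), batch_size)
    ]


batches = plan_batches(list(range(25)))
batch_sizes = [len(b["profile_ids"]) for b in batches]
```

With 25 profiles this plans three tasks, the last one holding the remaining 5. The shared group id is what lets a caller ask about the whole set of tasks at once.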
Updated by dkliban@redhat.com almost 9 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp:pulp|0ecc2dfdb9a2d5e1af2ed39c71ba387b2a2565b4.
Updated by dkliban@redhat.com over 8 years ago
- Blocked by deleted (Story #1205: As a developer I can dispatch a task that can dispatch a group of tasks)
Updated by dkliban@redhat.com over 8 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
https://pulp.plan.io/issues/20 closes #20