Story #2519

closed

Enable workers to record their own heartbeat records to the database

Added by dkliban@redhat.com over 5 years ago. Updated about 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
% Done: 100%
Estimated time:
Platform Release: 2.13.0
Groomed: Yes
Sprint Candidate: Yes
Tags: Pulp 2
Sprint: Sprint 14
Quarter:

Description

Problem: pulp_celerybeat writes heartbeat records to the database

Solution: All workers write their own records to the db instead of sending them through the message bus.

Without this, pulp-manage-db, while waiting for workers, could mistakenly continue if pulp_celerybeat was killed even though workers are still running. This is because the workers' timestamps would not be updated even though they are still heartbeating.
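
The failure mode can be illustrated with a minimal Python sketch. It assumes that the waiting process treats a worker as running when its last_heartbeat is recent. The function name, the use of plain dicts as records, and the 60-second threshold are hypothetical and are not taken from Pulp.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how old a heartbeat may be before the worker is
# considered gone. The value Pulp actually uses is not stated in this story.
MAX_HEARTBEAT_AGE = timedelta(seconds=60)


def workers_considered_running(records, now, max_age=MAX_HEARTBEAT_AGE):
    """Return the ids of workers whose last heartbeat is recent enough."""
    return [r["_id"] for r in records if now - r["last_heartbeat"] <= max_age]


now = datetime(2017, 4, 24, 15, 55, 0, tzinfo=timezone.utc)

# The worker is alive, but whatever wrote its record stopped five minutes ago,
# so the stored timestamp is stale.
stale = [{"_id": "reserved_resource_worker-0@host",
          "last_heartbeat": now - timedelta(minutes=5)}]

# The same worker writing its own record every few seconds.
fresh = [{"_id": "reserved_resource_worker-0@host",
          "last_heartbeat": now - timedelta(seconds=3)}]

print(workers_considered_running(stale, now))  # []
print(workers_considered_running(fresh, now))  # ['reserved_resource_worker-0@host']
```

In the stale case the running worker is invisible to the check, which is the situation this story is meant to prevent.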


Related issues

Blocks Pulp - Issue #2496: Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API still shows them as running (CLOSED - CURRENTRELEASE, dkliban@redhat.com)
Actions #1

Updated by dkliban@redhat.com over 5 years ago

  • Related to Issue #2496: Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API still shows them as running added
Actions #2

Updated by dkliban@redhat.com over 5 years ago

  • Tracker changed from Issue to Story
  • Subject changed from Celerebeat is responsible for writing all hearbeat records to the database to Enable workers to record their own heartbeat records to the database
  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
  • Sprint/Milestone set to 31
  • % Done set to 0
  • Groomed changed from No to Yes
Actions #3

Updated by bmbouter over 5 years ago

  • Tracker changed from Story to Issue
  • Subject changed from Enable workers to record their own heartbeat records to the database to Celerebeat is responsible for writing all hearbeat records to the database
  • Status changed from ASSIGNED to NEW
  • Assignee deleted (dkliban@redhat.com)
  • Sprint/Milestone deleted (31)
  • Severity set to 2. Medium
  • Triaged set to No
  • Groomed changed from Yes to No

Here is a quick test plan I would expect for this:

# Manually ensure there are no records in the workers table. In Mongo, db.workers.find().pretty() should return nothing.
For component in ['pulp_celerybeat', 'pulp_resource_manager', 'pulp_workers']:
    # start only that component
    # verify that only its records are returned by db.workers.find().pretty()
    # stop the component

Each component should be reporting its own records to the db, without another component having to be online.

Also note that it is important that Pulp generates the timestamps that are recorded. We've had bugs when Celery timestamps were used, due to inconsistent timezone handling in Celery.

Actions #4

Updated by bmbouter over 5 years ago

  • Tracker changed from Issue to Story
  • Subject changed from Celerebeat is responsible for writing all hearbeat records to the database to Enable workers to record their own heartbeat records to the database
  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
  • Sprint/Milestone set to 31
  • % Done set to 0
  • Groomed changed from No to Yes

Overwrote changes from Comment 2, so now I'm putting those back.

Actions #5

Updated by dkliban@redhat.com over 5 years ago

  • Status changed from ASSIGNED to POST
Actions #6

Updated by bmbouter over 5 years ago

  • Related to deleted (Issue #2496: Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API still shows them as running)
Actions #7

Updated by bmbouter over 5 years ago

  • Blocks Issue #2496: Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API still shows them as running added
Actions #8

Updated by dkliban@redhat.com over 5 years ago

  • Status changed from POST to ASSIGNED
Actions #9

Updated by dalley over 5 years ago

  • Assignee changed from dkliban@redhat.com to dalley
Actions #10

Updated by dkliban@redhat.com over 5 years ago

  • Sprint/Milestone changed from 31 to 32
Actions #11

Updated by dalley over 5 years ago

  • Status changed from ASSIGNED to POST
Actions #13

Updated by bmbouter over 5 years ago

After seeing this implementation, I now realize that we should remove `--heartbeat-interval=5` entirely from pulp_resource_manager and pulp_workers. We already have it removed from pulp_celerybeat, so there is nothing to do there. I've added checklist items since this story specifically makes those unnecessary.

Added by dalley over 5 years ago

Revision fd19f890

Workers write their own heartbeat records to database.

All workers will write their own records to the database instead of relying on pulp_celerybeat to do so for them using celery heartbeats.

This patch makes use of the Consumer blueprint that celery runs at the start time of a worker. An extra boot step has been added which sets a timer to periodically update the worker record in the database.

http://docs.celeryproject.org/en/master/userguide/extending.html
https://groups.google.com/d/msg/celery-users/3fs0ocREYqw/C7U1lCAp56sJ

closes #2519 https://pulp.plan.io/issues/2519
closes #2516 https://pulp.plan.io/issues/2516
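
The start/stop shape of such a periodic writer can be sketched in plain Python. This is not the patch itself: it uses a standard-library thread in place of Celery's boot step and timer, a dict in place of the workers collection, and a hypothetical class name and interval. It only shows the idea of a process recording its own heartbeat on a timer, with the timestamp generated by the application in UTC.

```python
import threading
from datetime import datetime, timezone


class HeartbeatWriter:
    """Periodically records this worker's own heartbeat (illustrative only)."""

    def __init__(self, worker_name, store, interval=5.0):
        self.worker_name = worker_name
        self.store = store          # stand-in for the workers collection
        self.interval = interval    # seconds between writes (assumed value)
        self._stop = threading.Event()
        self._thread = None

    def _record(self):
        # The timestamp is generated here, by the application, in UTC.
        self.store[self.worker_name] = datetime.now(timezone.utc)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.interval):
            self._record()

    def start(self):
        self._record()  # write once immediately so the record exists at startup
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


store = {}
writer = HeartbeatWriter("reserved_resource_worker-0@example", store, interval=0.05)
writer.start()
writer.stop()
print(sorted(store))  # ['reserved_resource_worker-0@example']
```

Because each process owns its writer, no other component has to be online for the record to stay current.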

Actions #15

Updated by dalley over 5 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #16

Updated by semyers about 5 years ago

  • Platform Release set to 2.13.0
Actions #17

Updated by pcreech about 5 years ago

  • Status changed from MODIFIED to 5
Actions #18

Updated by pthomas@redhat.com about 5 years ago

Verified manually.


1. When ['pulp_celerybeat', 'pulp_resource_manager', 'pulp_workers'] are all running

> db.workers.find().pretty()
{
    "_id" : "reserved_resource_worker-3@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:33.655Z")
}
{
    "_id" : "reserved_resource_worker-0@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:33.872Z")
}
{
    "_id" : "reserved_resource_worker-2@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:33.906Z")
}
{
    "_id" : "reserved_resource_worker-10@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:33.936Z")
}
{
    "_id" : "reserved_resource_worker-4@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:33.982Z")
}
{
    "_id" : "reserved_resource_worker-1@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.009Z")
}
{
    "_id" : "reserved_resource_worker-9@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.023Z")
}
{
    "_id" : "reserved_resource_worker-5@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.024Z")
}
{
    "_id" : "reserved_resource_worker-6@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.038Z")
}
{
    "_id" : "reserved_resource_worker-11@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.062Z")
}
{
    "_id" : "reserved_resource_worker-7@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.070Z")
}
{
    "_id" : "reserved_resource_worker-8@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.097Z")
}
{
    "_id" : "reserved_resource_worker-13@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.139Z")
}
{
    "_id" : "reserved_resource_worker-14@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.171Z")
}
{
    "_id" : "reserved_resource_worker-12@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.171Z")
}
{
    "_id" : "reserved_resource_worker-15@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.188Z")
}
{
    "_id" : "reserved_resource_worker-19@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.190Z")
}
{
    "_id" : "reserved_resource_worker-16@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.199Z")
}
{
    "_id" : "reserved_resource_worker-18@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.191Z")
}
{
    "_id" : "reserved_resource_worker-17@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:51:34.196Z")
}
Type "it" for more
> 
> 
> 
> 
2. With ['pulp_celerybeat', 'pulp_resource_manager', 'pulp_workers'] stopped

> db.workers.find().pretty()
> 
> 
3. sudo systemctl start httpd pulp_workers

> db.workers.find().pretty()
{
    "_id" : "reserved_resource_worker-5@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.386Z")
}
{
    "_id" : "reserved_resource_worker-2@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.401Z")
}
{
    "_id" : "reserved_resource_worker-4@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.429Z")
}
{
    "_id" : "reserved_resource_worker-0@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.448Z")
}
{
    "_id" : "reserved_resource_worker-6@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.471Z")
}
{
    "_id" : "reserved_resource_worker-1@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.474Z")
}
{
    "_id" : "reserved_resource_worker-13@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.517Z")
}
{
    "_id" : "reserved_resource_worker-12@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.537Z")
}
{
    "_id" : "reserved_resource_worker-3@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.544Z")
}
{
    "_id" : "reserved_resource_worker-16@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.585Z")
}
{
    "_id" : "reserved_resource_worker-11@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.591Z")
}
{
    "_id" : "reserved_resource_worker-15@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.611Z")
}
{
    "_id" : "reserved_resource_worker-9@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.623Z")
}
{
    "_id" : "reserved_resource_worker-18@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.656Z")
}
{
    "_id" : "reserved_resource_worker-10@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.655Z")
}
{
    "_id" : "reserved_resource_worker-14@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.665Z")
}
{
    "_id" : "reserved_resource_worker-8@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.673Z")
}
{
    "_id" : "reserved_resource_worker-17@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.686Z")
}
{
    "_id" : "reserved_resource_worker-21@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.689Z")
}
{
    "_id" : "reserved_resource_worker-20@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:53:20.714Z")
}
Type "it" for more
> 
> 
4. sudo systemctl stop httpd pulp_workers

> db.workers.find().pretty()
>

5. sudo systemctl start httpd pulp_resource_manager
> db.workers.find().pretty()
{
    "_id" : "resource_manager@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:54:02.656Z")
}

6. sudo systemctl stop httpd pulp_resource_manager

> db.workers.find().pretty()
> 
7. sudo systemctl start httpd pulp_celerybeat

> db.workers.find().pretty()
{
    "_id" : "scheduler@ibm-x3550m3-07.lab.eng.brq.redhat.com",
    "last_heartbeat" : ISODate("2017-04-24T15:55:15.017Z")
}

8. sudo systemctl stop httpd pulp_celerybeat

> db.workers.find().pretty()
> 
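
The manual checks above could be automated along these lines. This is a minimal sketch: the name prefixes are taken from the query output above, while the helper name is hypothetical and fetching the ids from the workers collection is left out.

```python
# Record-name prefixes as they appear in the query output above.
PREFIXES = {
    "pulp_workers": "reserved_resource_worker-",
    "pulp_resource_manager": "resource_manager@",
    "pulp_celerybeat": "scheduler@",
}


def only_component_present(component, record_ids):
    """True if at least one record exists and every record belongs to `component`."""
    prefix = PREFIXES[component]
    return bool(record_ids) and all(rid.startswith(prefix) for rid in record_ids)


host = "ibm-x3550m3-07.lab.eng.brq.redhat.com"

# Only the resource manager is running: the check passes.
print(only_component_present("pulp_resource_manager", ["resource_manager@" + host]))  # True

# A worker record shows up while only pulp_celerybeat should be running: the check fails.
print(only_component_present(
    "pulp_celerybeat",
    ["scheduler@" + host, "reserved_resource_worker-0@" + host],
))  # False
```

An empty result also fails the check, since a started component is expected to write at least one record.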
Actions #19

Updated by pcreech about 5 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #20

Updated by bmbouter about 4 years ago

  • Sprint set to Sprint 16
Actions #21

Updated by bmbouter about 4 years ago

  • Sprint changed from Sprint 16 to Sprint 14
Actions #22

Updated by bmbouter about 4 years ago

  • Sprint/Milestone deleted (32)
Actions #23

Updated by bmbouter about 3 years ago

  • Tags Pulp 2 added
