Issue #1380 (closed)

Pulp workers are going missing continually

Added by bmbouter over 8 years ago. Updated about 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 3. High
Version: 2.7.0
Platform Release: 2.7.1
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

A thread on pulp-list reported that pulp workers are going missing: https://www.redhat.com/archives/pulp-list/2015-November/msg00037.html

When workers continually go missing, their work is constantly cancelled, effectively making Pulp unusable. The problem does not appear to be broker specific; it is reported to affect RabbitMQ and Qpid alike. Workers appear to be discovered and lost repeatedly.

Perhaps this is related to a timezone/UTC issue?


Files

0001-droping-workers-test-fix.patch (2.14 KB), jluza, 11/19/2015 11:36 AM
rpm.list.diff (28.2 KB), jluza, 11/19/2015 02:49 PM

Related issues

Related to Pulp - Task #1400: Develop a plan for how Pulp's master branch should determine when a worker was last seen (CLOSED - WONTFIX)

Actions #1

Updated by bmbouter over 8 years ago

I'm setting up a reproducer against the 2.7.0 release on EL7.

Actions #2

Updated by bmbouter over 8 years ago

I did a Pulp 2.7.0 installation w/ Qpid on a yum-updated EL7.1 system. I was not able to reproduce the issue. Here are my pulp packages:

pulp-admin-client-2.7.0-0.8.rc.el7.noarch
pulp-server-2.7.0-0.8.rc.el7.noarch
pulp-rpm-plugins-2.7.0-0.8.rc.el7.noarch
pulp-puppet-admin-extensions-2.7.0-0.8.rc.el7.noarch
pulp-rpm-yumplugins-2.7.0-0.8.rc.el7.noarch
python-pulp-common-2.7.0-0.8.rc.el7.noarch
python-pulp-client-lib-2.7.0-0.8.rc.el7.noarch
pulp-consumer-client-2.7.0-0.8.rc.el7.noarch
pulp-puppet-plugins-2.7.0-0.8.rc.el7.noarch
pulp-rpm-admin-extensions-2.7.0-0.8.rc.el7.noarch
pulp-agent-2.7.0-0.8.rc.el7.noarch
python-pulp-puppet-common-2.7.0-0.8.rc.el7.noarch
python-kombu-3.0.24-10.pulp.el7.noarch
python-pulp-repoauth-2.7.0-0.8.rc.el7.noarch
python-pulp-bindings-2.7.0-0.8.rc.el7.noarch
pulp-puppet-handlers-2.7.0-0.8.rc.el7.noarch
pulp-rpm-consumer-extensions-2.7.0-0.8.rc.el7.noarch
pulp-selinux-2.7.0-0.8.rc.el7.noarch
python-pulp-agent-lib-2.7.0-0.8.rc.el7.noarch
python-pulp-oid_validation-2.7.0-0.8.rc.el7.noarch
pulp-puppet-consumer-extensions-2.7.0-0.8.rc.el7.noarch
pulp-rpm-handlers-2.7.0-0.8.rc.el7.noarch
python-isodate-0.5.0-4.pulp.el7.noarch
python-pulp-rpm-common-2.7.0-0.8.rc.el7.noarch

Using the mongo CLI, I captured the following contents of the 'workers' collection. That collection holds the worker heartbeat records that Pulp checks for stale ("old") entries. If a worker hasn't checked in within 5 minutes, it's considered missing and its work is cancelled, which is what seems to be happening here.
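
To make the check concrete, here is a minimal sketch (not Pulp's code; it only assumes the 'pulp_database'/'workers' names visible in the mongo session below, a locally reachable MongoDB, and pymongo being installed) of the 5-minute staleness test Pulp applies:

from datetime import datetime, timedelta
from pymongo import MongoClient  # assumes pymongo is available

db = MongoClient("localhost", 27017)["pulp_database"]
cutoff = datetime.utcnow() - timedelta(seconds=300)  # the 5-minute threshold

for worker in db.workers.find():
    stale = worker["last_heartbeat"] < cutoff
    print("%s  last_heartbeat=%s  %s" % (
        worker["_id"], worker["last_heartbeat"].isoformat(),
        "MISSING" if stale else "ok"))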

Here is what the workers collection looks like on my system that could not reproduce the issue:

[cloud-user@bmbouter-1380-reproducer yum.repos.d]$ mongo
MongoDB shell version: 2.6.11
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
Server has startup warnings: 
2015-11-18T17:00:30.789-0500 [initandlisten] 
2015-11-18T17:00:30.789-0500 [initandlisten] ** WARNING: Readahead for /var/lib/mongodb is set to 4096KB
2015-11-18T17:00:30.789-0500 [initandlisten] **          We suggest setting it to 256KB (512 sectors) or less
2015-11-18T17:00:30.789-0500 [initandlisten] **          http://dochub.mongodb.org/core/readahead
> use pulp_database
switched to db pulp_database
> db.workers.find().pretty()
{
        "_id" : "scheduler@bmbouter-1380-reproducer.example.com",
        "last_heartbeat" : ISODate("2015-11-18T22:08:34.368Z")
}
{
        "_id" : "reserved_resource_worker-2@bmbouter-1380-reproducer.example.com",
        "last_heartbeat" : ISODate("2015-11-18T23:09:16.665Z")
}
{
        "_id" : "reserved_resource_worker-3@bmbouter-1380-reproducer.example.com",
        "last_heartbeat" : ISODate("2015-11-18T23:09:16.664Z")
}
{
        "_id" : "reserved_resource_worker-0@bmbouter-1380-reproducer.example.com",
        "last_heartbeat" : ISODate("2015-11-18T23:09:16.664Z")
}
{
        "_id" : "reserved_resource_worker-1@bmbouter-1380-reproducer.example.com",
        "last_heartbeat" : ISODate("2015-11-18T23:08:53.192Z")
}
{
        "_id" : "resource_manager@bmbouter-1380-reproducer.example.com",
        "last_heartbeat" : ISODate("2015-11-18T23:09:04.383Z")
}

Can someone who is experiencing the bug connect to their MongoDB and paste the contents of their 'workers' collection to this ticket?
Could someone who is experiencing the bug also describe their timezone configuration? Perhaps this is a timezone-related issue.
Is anyone out there using 2.7.0 and not experiencing this issue (besides myself)?

Actions #3

Updated by chandlermelton over 8 years ago

I started experiencing this after trying to run updates over the weekend.
There was a dependency issue related to libqpid-proton.so.2.

CentOS 6.7 - no timezone changes

DB output:

db.workers.find().pretty()
{
        "_id" : "scheduler@reposrv",
        "last_heartbeat" : ISODate("2015-11-18T22:33:02.586Z")
}
{
        "_id" : "reserved_resource_worker-0@reposrv",
        "last_heartbeat" : ISODate("2015-11-18T17:33:16.554Z")
}
{
        "_id" : "reserved_resource_worker-1@reposrv",
        "last_heartbeat" : ISODate("2015-11-18T17:33:16.555Z")
}
{
        "_id" : "reserved_resource_worker-2@reposrv",
        "last_heartbeat" : ISODate("2015-11-18T17:33:16.555Z")
}
{
        "_id" : "reserved_resource_worker-3@reposrv",
        "last_heartbeat" : ISODate("2015-11-18T17:33:16.559Z")
}

Dependency issue:

Resolving Dependencies
--> Running transaction check
---> Package qpid-proton-c.x86_64 0:0.9-3.el6 will be updated
--> Processing Dependency: libqpid-proton.so.2()(64bit) for package: qpid-cpp-server-0.26-9.el6.x86_64
--> Processing Dependency: libqpid-proton.so.2()(64bit) for package: qpid-cpp-client-0.26-9.el6.x86_64
---> Package qpid-proton-c.x86_64 0:0.10-2.el6 will be an update
--> Finished Dependency Resolution
Error: Package: qpid-cpp-client-0.26-9.el6.x86_64 (@pulp-2-stable)
           Requires: libqpid-proton.so.2()(64bit)
           Removing: qpid-proton-c-0.9-3.el6.x86_64 (@epel)
               libqpid-proton.so.2()(64bit)
           Updated By: qpid-proton-c-0.10-2.el6.x86_64 (epel-6)
               Not found
Error: Package: qpid-cpp-server-0.26-9.el6.x86_64 (@pulp-2-stable)
           Requires: libqpid-proton.so.2()(64bit)
           Removing: qpid-proton-c-0.9-3.el6.x86_64 (@epel)
               libqpid-proton.so.2()(64bit)
           Updated By: qpid-proton-c-0.10-2.el6.x86_64 (epel-6)
               Not found

Installed packages:

grep pulp:

python-pulp-client-lib-2.7.0-1.el6.noarch
python-pulp-oid_validation-2.7.0-1.el6.noarch
pulp-nodes-consumer-extensions-2.7.0-1.el6.noarch
pulp-puppet-admin-extensions-2.7.0-1.el6.noarch
python-pulp-docker-common-1.0.2-1.el6.noarch
python-pulp-python-common-1.0.1-1.el6.noarch
python-rhsm-1.8.0-2.pulp.el6.x86_64
pulp-docker-plugins-1.0.2-1.el6.noarch
python-pulp-bindings-2.7.0-1.el6.noarch
python-pulp-puppet-common-2.7.0-1.el6.noarch
python-pulp-repoauth-2.7.0-1.el6.noarch
pulp-nodes-common-2.7.0-1.el6.noarch
pulp-rpm-plugins-2.7.0-1.el6.noarch
pulp-rpm-admin-extensions-2.7.0-1.el6.noarch
pulp-rpm-handlers-2.7.0-1.el6.noarch
python-kombu-3.0.24-10.pulp.el6.noarch
pulp-docker-admin-extensions-1.0.2-1.el6.noarch
mod_wsgi-3.4-2.pulp.el6.x86_64
pulp-python-admin-extensions-1.0.1-1.el6.noarch
python-pulp-common-2.7.0-1.el6.noarch
python-pulp-agent-lib-2.7.0-1.el6.noarch
pulp-admin-client-2.7.0-1.el6.noarch
pulp-server-2.7.0-1.el6.noarch
pulp-puppet-plugins-2.7.0-1.el6.noarch
pulp-puppet-tools-2.7.0-1.el6.noarch
pulp-rpm-consumer-extensions-2.7.0-1.el6.noarch
pulp-selinux-2.7.0-1.el6.noarch
python-isodate-0.5.0-4.pulp.el6.noarch
python-pulp-rpm-common-2.7.0-1.el6.noarch
pulp-consumer-client-2.7.0-1.el6.noarch
pulp-nodes-admin-extensions-2.7.0-1.el6.noarch
pulp-agent-2.7.0-1.el6.noarch
pulp-puppet-handlers-2.7.0-1.el6.noarch
pulp-puppet-consumer-extensions-2.7.0-1.el6.noarch
pulp-rpm-yumplugins-2.7.0-1.el6.noarch
pulp-python-plugins-1.0.1-1.el6.noarch

grep qpid:

python-gofer-qpid-2.6.6-2.el6.noarch
qpid-cpp-client-0.26-9.el6.x86_64
python-qpid-common-0.26-3.el6.noarch
python-qpid-qmf-0.26-2.el6.x86_64
qpid-cpp-server-store-0.26-9.el6.x86_64
python-qpid-0.26-3.el6.noarch
qpid-qmf-0.26-2.el6.x86_64
qpid-proton-c-0.9-3.el6.x86_64
qpid-cpp-server-0.26-9.el6.x86_64
Actions #4

Updated by bmbouter over 8 years ago

@chandlermelton Regarding the dependency issue, that is filed under #1340. For this issue, I think I can see why it's happening.

Your output and mine were taken close to each other in actual time (< 1 hour apart), I believe, yet my timestamps show times like

ISODate("2015-11-18T23:09:16.664Z")

versus yours like:

ISODate("2015-11-18T17:33:16.554Z")

Note that the scheduler gets its timestamps from inside Pulp code, so that one looks right. Make sure to compare the resource_manager and reserved_resource_worker entries.

The code that checks for missing workers is likely getting it wrong because of the non-UTC value. It looks like the timestamp from your system is wrong because (IIRC) that wasn't the UTC time when you captured your output, yet the timestamp format indicates it is UTC because it is in Zulu time (the 'Z' at the end).

Can you run this Python snippet and paste back the output?

from datetime import datetime, timedelta
oldest_heartbeat = datetime.utcnow() - timedelta(seconds=300)
oldest_heartbeat.isoformat()

Also, can you paste your db.workers.find().pretty() output once more, captured around the same moment as the output from the snippet above?

Is everyone experiencing this issue running CentOS?

The root cause, it would seem, is that Celery (which provides the heartbeats) is not reporting the heartbeat time in UTC when it is sent over.
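
To illustrate that suspicion (a hypothetical example with a made-up offset, not output from an affected system): if the heartbeat arrives as an epoch value that has already been shifted by a local UTC offset, interpreting it with utcfromtimestamp() makes the worker look hours behind real UTC time and therefore older than the 300-second threshold:

import time
from datetime import datetime, timedelta

UTC_OFFSET_HOURS = -5  # illustrative, e.g. US Eastern standard time

now = time.time()                               # true seconds since the epoch (UTC)
shifted = now + UTC_OFFSET_HOURS * 3600         # an epoch value already shifted to local time

last_heartbeat = datetime.utcfromtimestamp(shifted)      # what would get stored
oldest_allowed = datetime.utcnow() - timedelta(seconds=300)

print(last_heartbeat, oldest_allowed)
print("worker looks missing:", last_heartbeat < oldest_allowed)  # True whenever the shift exceeds 5 minutes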

Actions #5

Updated by bmbouter over 8 years ago

I strongly suspect this is affecting CentOS 2.7.0 users, but not RHEL or Fedora users. I couldn't reproduce on RHEL and I've run the 2.7.0 beta code in the past on Fedora without issue. Other users (including our QA) have run it on both as well without this issue.

Also, I just spoke with another developer who verbally reported that he briefly used CentOS with a 2.7.0 alpha/beta, and he noted that he had some strange workers-going-missing problems at the time.

What celery version do you have installed on CentOS w/ Pulp 2.7.0?

Actions #6

Updated by chandlermelton over 8 years ago

python: '2015-11-19T00:46:29.429320'

mongo:
{
"_id" : "scheduler@reposrv",
"last_heartbeat" : ISODate("2015-11-19T00:48:03.615Z")
}
{
"_id" : "reserved_resource_worker-0@reposrv",
"last_heartbeat" : ISODate("2015-11-18T19:48:20.951Z")
}
{
"_id" : "reserved_resource_worker-3@reposrv",
"last_heartbeat" : ISODate("2015-11-18T19:48:20.954Z")
}
{
"_id" : "reserved_resource_worker-1@reposrv",
"last_heartbeat" : ISODate("2015-11-18T19:48:20.945Z")
}
{
"_id" : "reserved_resource_worker-2@reposrv",
"last_heartbeat" : ISODate("2015-11-18T19:48:20.947Z")
}
{
"_id" : "resource_manager@reposrv",
"last_heartbeat" : ISODate("2015-11-18T19:48:00.515Z")
}

python-celery-3.1.11-1.el6.noarch

Actions #7

Updated by jluza over 8 years ago

bmbouter wrote:

A thread on pulp-list reported that pulp workers are going missing: https://www.redhat.com/archives/pulp-list/2015-November/msg00037.html

When workers continually go missing, their work is constantly cancelled, effectively making Pulp unusable. The problem does not appear to be broker specific; it is reported to affect RabbitMQ and Qpid alike. Workers appear to be discovered and lost repeatedly.

Perhaps this is related to a timezone/UTC issue?

So it is a timezone-related issue: for some reason, workers are reporting their heartbeats in a really strange timezone (neither local time nor UTC).

I did some investigation and found a workaround for this (attached).

If you don't apply the patch and only add the log output, you will see a strange difference between the dates. I'm cooperating with ipanova now to test this on a proper Pulp installation, because I'm currently running 2.7 in Docker.

So I can confirm there's something wrong with the timezone conversion in Celery. The strange part is the difference between timestamp and local_received in the Celery event.

Btw, I'm running on Fedora 21
python-celery-3.1.11-1.fc21.noarch
From repo : pulp-2-stable

Actions #8

Updated by millerjl1701 over 8 years ago

I saw in the email thread (after posting) that the discussion was moved here. Here is what appears to be the requested info based on the thread and the comments above:

- RHEL 6.7 x86_64 with all updates applied except for the qpid issue mentioned above
- qpid-cpp-server 0.26-9.el6.x86_64
- python-celery-3.1.11-1.el6.noarch
- All packages are installed from the pulp-2 stable repo as well.

db.workers.find().pretty()

{
"_id" : "",
"last_heartbeat" : ISODate("2015-11-19T12:44:00.543Z")
}
{
"_id" : "",
"last_heartbeat" : ISODate("2015-11-19T07:45:05.162Z")
}
{
"_id" : "",
"last_heartbeat" : ISODate("2015-11-19T07:45:05.156Z")
}
{
"_id" : "",
"last_heartbeat" : ISODate("2015-11-19T07:44:51.458Z")

Actions #9

Updated by ashbyj@imsweb.com over 8 years ago

For completeness, I'm running Pulp 2.7.0 on CentOS 6.6 using RabbitMQ 3.5.3. (Though I'm 95% sure this issue existed when I was using Qpid and upgraded to 2.7, which I wrote off as being part of the broken dependencies per #1340, so with that in mind I switched to rabbit.)

$ mongo
MongoDB shell version: 2.4.14
connecting to: test
>  use pulp_database
switched to db pulp_database

> db.workers.find().pretty()
{
        "_id" : "scheduler@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-19T13:40:00.331Z")
}
{
        "_id" : "reserved_resource_worker-0@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-19T09:40:22.604Z")
}
{
        "_id" : "reserved_resource_worker-2@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-19T09:40:22.607Z")
}
{
        "_id" : "reserved_resource_worker-3@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-19T09:40:22.610Z")
}
{
        "_id" : "reserved_resource_worker-1@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-19T09:40:22.609Z")
}
{
        "_id" : "resource_manager@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-19T09:40:17.371Z")
}
$ rpm -qa | grep pulp
python-pulp-client-lib-2.7.0-1.el6.noarch
pulp-rpm-plugins-2.7.0-1.el6.noarch
pulp-rpm-yumplugins-2.7.0-1.el6.noarch
mod_wsgi-3.4-2.pulp.el6.x86_64
python-pulp-rpm-common-2.7.0-1.el6.noarch
pulp-nodes-consumer-extensions-2.7.0-1.el6.noarch
pulp-agent-2.7.0-1.el6.noarch
pulp-python-admin-extensions-1.0.1-1.el6.noarch
pulp-admin-client-2.7.0-1.el6.noarch
pulp-server-2.7.0-1.el6.noarch
pulp-puppet-admin-extensions-2.7.0-1.el6.noarch
python-pulp-common-2.7.0-1.el6.noarch
python-pulp-agent-lib-2.7.0-1.el6.noarch
python-pulp-repoauth-2.7.0-1.el6.noarch
pulp-nodes-common-2.7.0-1.el6.noarch
pulp-puppet-plugins-2.7.0-1.el6.noarch
pulp-puppet-consumer-extensions-2.7.0-1.el6.noarch
pulp-rpm-handlers-2.7.0-1.el6.noarch
python-pulp-python-common-1.0.1-1.el6.noarch
python-pulp-bindings-2.7.0-1.el6.noarch
python-pulp-oid_validation-2.7.0-1.el6.noarch
pulp-nodes-admin-extensions-2.7.0-1.el6.noarch
pulp-rpm-consumer-extensions-2.7.0-1.el6.noarch
pulp-selinux-2.7.0-1.el6.noarch
pulp-python-plugins-1.0.1-1.el6.noarch
m2crypto-0.21.1.pulp-8.el6.x86_64
pulp-consumer-client-2.7.0-1.el6.noarch
pulp-rpm-admin-extensions-2.7.0-1.el6.noarch
python-isodate-0.5.0-4.pulp.el6.noarch
python-rhsm-1.8.0-2.pulp.el6.x86_64
python-pulp-puppet-common-2.7.0-1.el6.noarch
pulp-nodes-parent-2.7.0-1.el6.noarch
pulp-puppet-handlers-2.7.0-1.el6.noarch
python-kombu-3.0.24-10.pulp.el6.noarch
zzjason@pulp-btp-01:~> rpm -qa | grep celery
python-celery-3.1.11-1.el6.noarch
Actions #10

Updated by jluza over 8 years ago

Ina provided an f20 pulp-2.7 ansible machine, which doesn't suffer from dropping workers, so I ran rpm -qa | sort on both machines (my f21 docker and the f20 ansible one). The diff is attached. Notice the gofer, qpid, and celery versions there.

Actions #11

Updated by ashbyj@imsweb.com over 8 years ago

jluza I applied the patch to my instance and wanted to give feedback.

# apply patch
cd /usr/lib/python2.6/site-packages/pulp/server/
sudo wget https://pulp.plan.io/attachments/download/176/0001-droping-workers-test-fix.patch
sudo patch -p3 < 0001-droping-workers-test-fix.patch

# restart services
sudo service pulp_workers restart
sudo service pulp_resource_manager restart
sudo service pulp_celerybeat restart
sudo service goferd restart

and the "Worker x has gone missing" errors no longer show up, but instead show every 60 seconds the following:

Nov 19 10:24:03 pulp-btp-01 pulp: pulp.server.async.scheduler:ERROR: worker.last:2015-11-19 15:24:01.586000 worker.name:reserved_resource_worker-0@pulp-btp-01 oldest:2015-11-19 15:19:03.227505
Nov 19 10:24:03 pulp-btp-01 pulp: pulp.server.async.scheduler:ERROR: worker.last:2015-11-19 15:24:02.683000 worker.name:reserved_resource_worker-1@pulp-btp-01 oldest:2015-11-19 15:19:03.227505
Nov 19 10:24:03 pulp-btp-01 pulp: pulp.server.async.scheduler:ERROR: worker.last:2015-11-19 15:24:01.583000 worker.name:reserved_resource_worker-2@pulp-btp-01 oldest:2015-11-19 15:19:03.227505
Nov 19 10:24:03 pulp-btp-01 pulp: pulp.server.async.scheduler:ERROR: worker.last:2015-11-19 15:24:02.679000 worker.name:reserved_resource_worker-3@pulp-btp-01 oldest:2015-11-19 15:19:03.227505
Nov 19 10:24:03 pulp-btp-01 pulp: pulp.server.async.scheduler:ERROR: worker.last:2015-11-19 15:23:57.874000 worker.name:resource_manager@pulp-btp-01 oldest:2015-11-19 15:19:03.227505

Tasks remain stuck in the queue in a Waiting state.

Actions #12

Updated by jluza over 8 years ago

Yeah, that one is my debugging message; I included it in the patch to show the differences between the times. To remove it, just remove the _logger.error line from scheduler.py.

Actions #13

Updated by ashbyj@imsweb.com over 8 years ago

You mentioned the patch as a workaround, but it doesn't appear to fix the issue for me. Or are you still testing/working on a true fix?

Actions #14

Updated by jluza over 8 years ago

ashbyj@imsweb.com wrote:

and the "Worker x has gone missing" errors no longer show up, but instead show every 60 seconds the following:

From the above I understood that it solved the dropping-workers issue. Are you still experiencing the dropping problem?

I'm not a member of the Pulp team; however, I believe the true fix lies in choosing the right versions of the messaging-related packages. As I wrote earlier, the pulp-2.7 on f20 that I tested (big thanks to ipanova) didn't have this issue.

Actions #15

Updated by jluza over 8 years ago

Today I tried to install gofer-2.7.0-1.git.66.8a1ecd6.fc20 but it didn't help

Actions #16

Updated by ashbyj@imsweb.com over 8 years ago

Yes, I believe I am still experiencing the issues on CentOS 6 pulp 2.7. After applying the patch, I still get errors every 60 seconds. The only difference is that it is no longer hitting the "if worker.last_heartbeat < oldest_heartbeat_time" condition here:

# /usr/lib/python2.6/site-packages/pulp/server/async/scheduler.py
    def check_celery_processes(self):
...
        for worker in worker_list:
            _logger.error("worker.last:%s worker.name:%s oldest:%s" % (worker.last_heartbeat, worker.name, oldest_heartbeat_time))
            if worker.last_heartbeat < oldest_heartbeat_time:
                msg = _("Worker '%s' has gone missing, removing from list of workers") % worker.name
                _logger.error(msg)
                _delete_worker(worker.name)
...

Tasks get stuck in a waiting state as I assume they are never able to be served.

Anything you would like me to test on my end?

Actions #17

Updated by jluza over 8 years ago

The cause of the error message is this line:

_logger.error("worker.last:%s worker.name:%s oldest:%s" % (worker.last_heartbeat, worker.name, oldest_heartbeat_time))

I have a very bad habit of using error logs as my debug logs (because it's easier to search for them).

Can you remove the line mentioned above from your Pulp code and restart the Pulp services? Then check /var/log/messages and search for errors. Also, if you can, start your tasks, then connect to your mongo db and paste the output of db.workers.find() and db.reserved_resources.find().

Actions #18

Updated by ashbyj@imsweb.com over 8 years ago

No prob, we've all got bad habits :). I removed the _logger.error line and restarted pulp services and /var/log/messages looks clean now.

Here is the output you asked for.

$ date
Fri Nov 20 09:33:52 EST 2015
$ mongo
MongoDB shell version: 2.4.14
connecting to: test
> use pulp_database
switched to db pulp_database
> db.workers.find().pretty()
{
        "_id" : "scheduler@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-20T19:33:21Z")
}
{
        "_id" : "reserved_resource_worker-0@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-20T14:33:56.464Z")
}
{
        "_id" : "reserved_resource_worker-1@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-20T14:33:57.506Z")
}
{
        "_id" : "reserved_resource_worker-2@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-20T14:33:56.517Z")
}
{
        "_id" : "reserved_resource_worker-3@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-20T14:33:57.620Z")
}
{
        "_id" : "resource_manager@pulp-btp-01",
        "last_heartbeat" : ISODate("2015-11-20T14:33:52.011Z")
}
> db.reserved_resources.find()
>

db.reserved_resources.find() did not return anything as you can see.

Actions #19

Updated by mhrivnak over 8 years ago

  • Triaged changed from No to Yes
Actions #20

Updated by bmbouter over 8 years ago

From user reports, users running 2.7.0 on EL6 or CentOS6 will experience this problem, but users on EL7, CentOS7, or the fedoras will not.

I'm setting up an EL6 reproducer now. I'll investigate why the heartbeat times are in a different format there. It's likely a packaging difference. One thing I'm not clear on yet is why this shows up now with 2.7.0 and not with 2.6.x.

Actions #21

Updated by bmbouter over 8 years ago

I've reproduced the problem on EL6 with RabbitMQ. Every 60 seconds I receive these logs:

Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0@bmbouter-1380-reproducer-el6' has gone missing, removing from list of workers
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-0@bmbouter-1380-reproducer-el6 is missing. Canceling the tasks in its queue.
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-1@bmbouter-1380-reproducer-el6' has gone missing, removing from list of workers
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-1@bmbouter-1380-reproducer-el6 is missing. Canceling the tasks in its queue.
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-2@bmbouter-1380-reproducer-el6' has gone missing, removing from list of workers
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-2@bmbouter-1380-reproducer-el6 is missing. Canceling the tasks in its queue.
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-3@bmbouter-1380-reproducer-el6' has gone missing, removing from list of workers
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-3@bmbouter-1380-reproducer-el6 is missing. Canceling the tasks in its queue.
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.scheduler:ERROR: Worker 'resource_manager@bmbouter-1380-reproducer-el6' has gone missing, removing from list of workers
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.tasks:ERROR: The worker named resource_manager@bmbouter-1380-reproducer-el6 is missing. Canceling the tasks in its queue.
Nov 20 14:02:58 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.
Nov 20 14:03:18 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bmbouter-1380-reproducer-el6' discovered
Nov 20 14:03:19 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bmbouter-1380-reproducer-el6' discovered
Nov 20 14:03:21 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bmbouter-1380-reproducer-el6' discovered
Nov 20 14:03:22 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bmbouter-1380-reproducer-el6' discovered
Nov 20 14:03:26 bmbouter-1380-reproducer-el6 pulp: pulp.server.async.worker_watcher:INFO: New worker 'resource_manager@bmbouter-1380-reproducer-el6' discovered

Indeed my timestamps look like yours:

{
        "_id" : "scheduler@bmbouter-1380-reproducer-el6",
        "last_heartbeat" : ISODate("2015-11-20T19:05:29.056Z")
}
{
        "_id" : "reserved_resource_worker-0@bmbouter-1380-reproducer-el6",
        "last_heartbeat" : ISODate("2015-11-20T15:06:48.217Z")
}
{
        "_id" : "reserved_resource_worker-1@bmbouter-1380-reproducer-el6",
        "last_heartbeat" : ISODate("2015-11-20T15:06:49.811Z")
}
{
        "_id" : "reserved_resource_worker-2@bmbouter-1380-reproducer-el6",
        "last_heartbeat" : ISODate("2015-11-20T15:06:51.633Z")
}
{
        "_id" : "reserved_resource_worker-3@bmbouter-1380-reproducer-el6",
        "last_heartbeat" : ISODate("2015-11-20T15:06:52.955Z")
}
{
        "_id" : "resource_manager@bmbouter-1380-reproducer-el6",
        "last_heartbeat" : ISODate("2015-11-20T15:06:26.780Z")
}

Now to investigate why...

Actions #22

Updated by ashbyj@imsweb.com over 8 years ago

In the meantime, is there anything I can do to "force" a task to complete? I've got some servers I need to provision and registering/binding to the pulp server is a necessary step. Also, let me know if there is anything I can troubleshoot or investigate to help.

Actions #23

Updated by bmbouter over 8 years ago

I have an experimental fix for this. It's not a good idea to modify files installed with RPM, so be sure you can back out your change properly if you choose to try this. It would be good if someone could confirm that it resolves the issue. I've tested it, but it needs tests written and to be included in a build. In the meantime, if you know how to apply the following safely, here is the diff:

[bmbouter@localhost pulp]$ git diff
diff --git a/server/pulp/server/async/scheduler.py b/server/pulp/server/async/scheduler.py
index ac4bfe6..fc048cd 100644
--- a/server/pulp/server/async/scheduler.py
+++ b/server/pulp/server/async/scheduler.py
@@ -235,7 +235,7 @@ class Scheduler(beat.Scheduler):

         # this is not an event that gets sent anywhere. We process it
         # immediately.
-        scheduler_event = {'timestamp': time.time(),
+        scheduler_event = {'local_received': time.time(),
                            'type': 'scheduler-event',
                            'hostname': ("%s@%s" % (SCHEDULER_WORKER_NAME, platform.node()))}
         worker_watcher.handle_worker_heartbeat(scheduler_event)
diff --git a/server/pulp/server/async/worker_watcher.py b/server/pulp/server/async/worker_watcher.py
index 5052442..4ab3b6e 100644
--- a/server/pulp/server/async/worker_watcher.py
+++ b/server/pulp/server/async/worker_watcher.py
@@ -41,7 +41,7 @@ def _parse_and_log_event(event):
              (ie: 'worker-heartbeat'), and 'worker_name' is the name of the worker as a string.
     :rtype: dict
     """
-    event_info = {'timestamp': datetime.utcfromtimestamp(event['timestamp']),
+    event_info = {'timestamp': datetime.utcfromtimestamp(event['local_received']),
                   'type': event['type'],
                   'worker_name': event['hostname']}
Actions #24

Updated by lastmikoi over 8 years ago

I can confirm that I was blocked by this issue when I upgraded my CentOS 6-based Pulp from 2.6 to 2.7.

However, after applying bmbouter's patch to my local installation, the issue was solved and I was able to schedule tasks using pulp-admin.

Actions #25

Updated by millerjl1701 over 8 years ago

I can confirm (RHEL 6.7 x86_64, Pulp 2.7 from pulp-2 stable repo, qpid-cpp-server 0.26-9.el6.x86_64, python-celery-3.1.11-1.el6.noarch) that this fixes both the issue of the disappearing workers and the inability to schedule and run tasks.

Looks good from where I'm sitting. Thank you!

Actions #26

Updated by ashbyj@imsweb.com over 8 years ago

After applying the patch, I'm having trouble getting the pulp_celerybeat service to stay running.

# /var/log/messages
Dec  2 10:32:45 pulp-btp-01 pulp: celery.beat:INFO: beat: Starting...
Dec  2 10:32:45 pulp-btp-01 pulp: pulp.server.db.connection:INFO: Attempting to connect to localhost:27017
Dec  2 10:32:45 pulp-btp-01 pulp: pulp.server.async.scheduler:INFO: Worker Timeout Monitor Started
Dec  2 10:32:45 pulp-btp-01 pulp: pulp.server.db.connection:INFO: Write concern for Mongo connection: {'w': 1}
Dec  2 10:32:45 pulp-btp-01 pulp: pulp.server.async.scheduler:INFO: Event Monitor Starting
Dec  2 10:23:48 pulp-btp-01 pulp: kombu.mixins:INFO: Connected to amqp://pulp:**@rabbit-btp-dev.example.com:5672//
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216) beat raised exception <type 'exceptions.KeyError'>: KeyError('timestamp',)
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216) Traceback (most recent call last):
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)   File "/usr/lib/python2.6/site-packages/celery/apps/beat.py", line 112, in start_scheduler
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)     beat.start()
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)   File "/usr/lib/python2.6/site-packages/celery/beat.py", line 462, in start
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)     interval = self.scheduler.tick()
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)   File "/usr/lib/python2.6/site-packages/pulp/server/async/scheduler.py", line 241, in tick
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)     worker_watcher.handle_worker_heartbeat(scheduler_event)
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)   File "/usr/lib/python2.6/site-packages/pulp/server/async/worker_watcher.py", line 84, in handle_worker_heartbeat
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216)     update_one(set__last_heartbeat=datetime.utcfromtimestamp(event.get("local_received", event['timestamp'])), upsert=True)
Dec  2 10:23:48 pulp-btp-01 pulp: celery.beat:CRITICAL: (4209-17216) KeyError: 'timestamp'

and service pulp_celerybeat status then indicates stopped.

Maybe there are some old tasks that need to be flushed out? I'm CentOS 6 w/ rabbitmq.

UPDATE

Scratch that; I needed to revert the original debugging patch from earlier. pulp_celerybeat runs fine with no errors in the logs; however, new tasks are stuck in a Waiting state for me. I don't see anything in the logs. Connections to RabbitMQ appear successful all around, though. Any ideas?

UPDATE 2

and scratch all other comments. Gofer wasn't running on the consumers since I had shut it down due to this issue. I can confirm that the patch works for me on CentOS 6 w/ rabbitmq. Thanks!!

Actions #27

Updated by rbarlow over 8 years ago

  • Assignee changed from bmbouter to rbarlow
Actions #28

Updated by rbarlow over 8 years ago

  • Platform Release set to 2.7.1
  • Tags Easy Fix added

I spent a lot of time on this yesterday/today/tonight. I wanted to dig into Celery to determine why
the timestamp field is different from the local_received time. It seems that the timestamp field is
being adjusted to match the local timezone before it is being handed to Pulp, but then Pulp is
interpreting that timestamp with datetime.utcfromtimestamp(). Here is the Celery code that receives
the Events:

    def event_from_message(self, body, localize=True,
                           now=time.time, tzfields=_TZGETTER,
                           adjust_timestamp=adjust_timestamp,
                           CLIENT_CLOCK_SKEW=CLIENT_CLOCK_SKEW):
        type = body['type']
        if type == 'task-sent':
            # clients never sync so cannot use their clock value
            _c = body['clock'] = (self.clock.value or 1) + CLIENT_CLOCK_SKEW
            self.adjust_clock(_c)
        else:
            try:
                clock = body['clock']
            except KeyError:
                body['clock'] = self.forward_clock()
            else:
                self.adjust_clock(clock)

        if localize:
            try:
                offset, timestamp = tzfields(body)
            except KeyError:
                pass
            else:
                body['timestamp'] = adjust_timestamp(timestamp, offset)
        body['local_received'] = now()
        return type, body

Source: https://github.com/celery/celery/blob/def92d580fe7a2c42fcc5f47feed802fd6f7ff48/celery/events/__init__.py#L371-L396

Note the call to adjust_timestamp in there. Here is what that does:

def utcoffset(time=_time):
    if time.daylight:
        return time.altzone // 3600
    return time.timezone // 3600

def adjust_timestamp(ts, offset, here=utcoffset):
    return ts - (offset - here()) * 3600

Source: https://github.com/celery/celery/blob/def92d580fe7a2c42fcc5f47feed802fd6f7ff48/celery/utils/timeutils.py#L358-L365

Notice that the here() function will return the user's local offset from UTC and that offset is
being handed in from the message (i.e., the sender told you their local offset). This code is trying
to translate the sender's local time to the receiver's local time. The bizarre thing is that the
input ts and return value of this function were both the localized versions of the time since the
epoch. This means that it would be inappropriate to compare this value to the output of
time.time(), which is the number of seconds since the POSIX epoch and is therefore UTC by definition.
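
A small worked example of that arithmetic (hypothetical offsets, not taken from an affected system): with a sender reporting an offset of 5 hours and a receiver sitting at UTC, the adjusted value ends up five hours behind a true UTC epoch timestamp.

import time

# adjust_timestamp, adapted from the Celery code quoted above; here() is the
# receiver's own UTC offset in hours, positive west of UTC.
def adjust_timestamp(ts, offset, here):
    return ts - (offset - here()) * 3600

# Hypothetical scenario: the sender reports an offset of 5 (UTC-5, e.g. New York
# in winter) while the receiver computes an offset of 0 (UTC).
ts = time.time()
adjusted = adjust_timestamp(ts, offset=5, here=lambda: 0)
print(ts - adjusted)  # 18000.0 -> the adjusted "timestamp" lags real UTC by five hours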

Furthermore, here is how an Event is created:

def Event(type, _fields=None, __dict__=dict, __now__=time.time, **fields):
    """Create an event.
    An event is a dictionary, the only required field is ``type``.
    A ``timestamp`` field will be set to the current time if not provided.
    """
    event = __dict__(_fields, **fields) if _fields else fields
    if 'timestamp' not in event:
        event.update(timestamp=__now__(), type=type)
    else:
        event['type'] = type
    return event

Source: https://github.com/celery/celery/blob/def92d580fe7a2c42fcc5f47feed802fd6f7ff48/celery/events/__init__.py#L50-L62

And this is how an Event gets published:

    def publish(self, type, fields, producer,
                blind=False, Event=Event, **kwargs):
        """Publish event using a custom :class:`~kombu.Producer`
        instance.
        :param type: Event type name, with group separated by dash (`-`).
        :param fields: Dictionary of event fields, must be json serializable.
        :param producer: :class:`~kombu.Producer` instance to use,
            only the ``publish`` method will be called.
        :keyword retry: Retry in the event of connection failure.
        :keyword retry_policy: Dict of custom retry policy, see
            :meth:`~kombu.Connection.ensure`.
        :keyword blind: Don't set logical clock value (also do not forward
            the internal logical clock).
        :keyword Event: Event type used to create event,
            defaults to :func:`Event`.
        :keyword utcoffset: Function returning the current utcoffset in hours.
        """
        clock = None if blind else self.clock.forward()
        event = Event(type, hostname=self.hostname, utcoffset=utcoffset(),
                      pid=self.pid, clock=clock, **fields)
        with self.mutex:
            return self._publish(event, producer,
                                 routing_key=type.replace('-', '.'), **kwargs)

Source: https://github.com/celery/celery/blob/def92d580fe7a2c42fcc5f47feed802fd6f7ff48/celery/events/__init__.py#L173-L197

Note the call to that same utcoffset() in there. Thus, I believe I have established that in Pulp
the timestamp field is the number of seconds since 1970-01-01T00:00:00 in the receiver's local time
zone, not UTC.

The issue now is what Pulp does with this value:

def _parse_and_log_event(event):
    """
    Parse and return the event information we are interested in. Also log it.
    A new dict is returned containing the keys 'timestamp', 'type', and 'worker_name'. The
    only data transformation here is on the timestamp. The timestamp arrives as seconds since
    the epoch, and is converted to UTC time, and returned as a naive datetime.datetime object.
    Logging is done through a call to _log_event()
    :param event: A celery event
    :type event: dict
    :return: A dict containing the keys 'timestamp', 'type', and 'worker_name'. 'timestamp'
             is a naive datetime.datetime reported in UTC. 'type' is the event name as a string
             (ie: 'worker-heartbeat'), and 'worker_name' is the name of the worker as a string.
    :rtype: dict
    """
    event_info = {'timestamp': datetime.utcfromtimestamp(event['timestamp']),
                  'type': event['type'],
                  'worker_name': event['hostname']}
    _log_event(event_info)
    return event_info

Source: https://github.com/pulp/pulp/blob/1114433f8727492c5bfb0359ac378e97be0565f0/server/pulp/server/async/worker_watcher.py#L27-L48

Note that Pulp is interpreting the timestamp as the number of seconds since the epoch, but has not
accounted for the local offset. You may have noticed that this phenomenon does not happen to the
scheduler's timestamps. This is because the scheduler heartbeat event is generated by Pulp's own
code inside its tick() method, and it uses time.time() to generate that value (and thus is truly a
UTC timestamp):

        # this is not an event that gets sent anywhere. We process it
        # immediately.
        scheduler_event = {'timestamp': time.time(),
                           'type': 'scheduler-event',
                           'hostname': celerybeat_name}
        worker_watcher.handle_worker_heartbeat(scheduler_event)

Source: https://github.com/pulp/pulp/blob/1114433f8727492c5bfb0359ac378e97be0565f0/server/pulp/server/async/scheduler.py#L248-L253

I had been wondering why this phenomenon does not happen in the Pulp development environment. The
Vagrant environment is set to UTC for its local time:

$ python
Python 2.7.10 (default, Sep  8 2015, 17:20:17)
[GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> time.timezone
0

This made reproducing this difficult, as I do not experience this issue at all ☺

I can now also explain what the local_received attribute is and why it works. The local_received is
also a floating point timestamp and it is placed on the Event by the receiver instead of the sender.
The difference in time between timestamp and local_received is used to measure lag, and Celery will
send warning messages if this difference becomes too large.

If you noticed in the first code I quoted above, the code that sets the received time does this:

        body['local_received'] = now()

now() is just time.time() (from the method signature), which returns the number of seconds since the
epoch in UTC. Therefore by switching to local_received, we've avoided the problem because the time
is truly in UTC.
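
A minimal sketch of that difference (a fabricated event dict, not a live Celery event): interpreting both fields with utcfromtimestamp() shows why local_received stays consistent with utcnow() while a shifted timestamp does not.

import time
from datetime import datetime

# Fabricated event: 'timestamp' has been shifted by a bogus 5-hour offset,
# while 'local_received' was stamped by the receiver with time.time() (true UTC epoch).
event = {
    'type': 'worker-heartbeat',
    'hostname': 'reserved_resource_worker-0@example',
    'timestamp': time.time() - 5 * 3600,
    'local_received': time.time(),
}

print(datetime.utcfromtimestamp(event['timestamp']))       # five hours in the past
print(datetime.utcfromtimestamp(event['local_received']))  # matches datetime.utcnow()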

If you've been following along this far, I hope you are appreciating the level of digging it took me
to get to this!

What is the true solution? We could use Celery's adjust_timestamp algorithm to get the timestamp
back into UTC and continue using the timestamp attribute of the Event. Or we could use Brian's
proposed patch which already does have the UTC time we so desire.

The other difference between the two values is that timestamp is the time that the Event was
created by the sender, and local_received is the time that the Event was received by the listener.
The difference between these indicates the combination of lag in the system and clock drifts
between the sender and receiver. The purpose of these messages is to deal with workers that have
disappeared for whatever reason. If the system is laggy or drift is significant, going with
local_received will be more lenient on deleting the workers, and timestamp will be less lenient. I
personally lean towards choosing to be more lenient here, but if anyone can offer a counterargument
I certainly welcome you to do so.

All this is to say that I am comfortable with accepting Brian's patch to solve this issue now that I
feel that I can explain what is happening. It is very late and time for me to go to bed now.
Tomorrow I will formalize the tests for this patch and submit a pull request. In the meantime, I
will ponder opening a bug with Celery about this as the two timestamps should at least be in the
same timezone. In my opinion, it's easiest to drop the time zone stuff altogether and ask Celery
to use UTC everywhere for consistency and simplicity.

Actions #29

Updated by rbarlow over 8 years ago

I got one important detail incorrect - though the timestamp is getting transmitted with the timezone offset, it turns out that the timestamp itself is still getting generated with time.time() which is indeed UTC. The API including the offset with the timestamp is confusing and caused me to think things were different than they were. After further inspection, I was unable to find anywhere that the sender combined its offset with the output of time.time() to set the timestamp and so I believe the timestamp is UTC.

I configured my Vagrant system to use the UTC-5 timezone and I do see the correct UTC time being recorded in my database (which also happens when UTC is my timezone).

I can still think of a few ways that this problem could be happening:

0) The adjust_timestamp() function will cause a problem if the sender and the receiver don't have the same timezone configured. For example, if the sender were in UTC - 5 and the receiver were in UTC - 0, this code will end up subtracting 5 hours from the timestamp and we will think the worker is missing. However, AFAIK nobody has mentioned running their workers on different boxes, so this might not be the problem.

1) Late delivery could cause problems. If the beats were stalled for over five minutes and then delivered, the timestamp would show as being five minutes old. This would cause Pulp to think the worker was missing. I doubt this is the cause though.

2) Out of order delivery could cause weirder problems, but I don't believe the broker will do that, especially not this drastically.

I noticed that Brian's comment #21 has something quite odd: there are three different timezones represented in the same comment! Note that the log messages are at 14:02, the scheduler is at 19:05, and the workers are at 15:06. We know that the scheduler's timestamp isn't doing anything weird, so let's assume that it is about 19:00 UTC. Brian's log messages are at about 14:00 which is the UTC-5 timezone that I know that he frequents. But the very strange thing is that the workers are appearing to be at UTC-4, which is 15:00. What is up with that? There is certainly something unusual going on, and I strongly suspect that it is related to the adjust_timestamp() and utcoffset() functions in Celery. I still haven't been able to reproduce this problem in a system of my own so I haven't been able to get in there and see what exactly is happening. Theoretically that code should work as long as the sender and receiver think they are in the same time zone as far as I can understand it.

Actions #30

Updated by rbarlow over 8 years ago

I found the issue!

https://github.com/celery/celery/issues/1802#issuecomment-161916587

I do think that Celery should fix that line of code (it was a doozy to figure out and a nice facepalm when I did!), and I suggested that they drop trying to deal with translating local time all over the place and just use UTC everywhere (much simpler!).

All of that said, I still think the local_received timestamp will do what we need here, and it gives a little more flexibility for clock drift situations too.

Actions #31

Updated by lastmikoi over 8 years ago

rbarlow: I must say that this is a very nice investigation that you made, quite impressive to be honest!

As for the whole date issue: since I had the bad luck of having to deal with timezone information in Python as my first job, and managed to get out of it alive, I think I can say that the only sane way of dealing with time-related information is to use timezone-aware data. That means never using raw timestamps for non-UTC times, but rather another data type such as an ISO 8601-formatted string (such as the one returned by datetime.datetime.isoformat()), and avoiding conversions between raw timestamps and timezone-aware data as much as possible.
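
A small sketch of that advice (plain standard-library Python, assuming Python 3.7+ for datetime.timezone and fromisoformat; not Pulp code): keep datetimes timezone-aware and exchange them as ISO 8601 strings rather than raw epoch numbers.

from datetime import datetime, timezone  # assumes Python 3.7+

# Produce a timezone-aware UTC datetime and serialize it as ISO 8601.
now = datetime.now(timezone.utc)
wire_value = now.isoformat()            # e.g. '2015-11-20T19:05:29.056000+00:00'

# The receiver can parse it back without guessing which zone it was in.
parsed = datetime.fromisoformat(wire_value)
assert parsed.tzinfo is not None        # still timezone-aware after the round trip
print(wire_value, parsed)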

Added by rbarlow over 8 years ago

Revision c429f7c7 | View on GitHub

Use Celery's local_received time for worker availability.

Pulp had been using the timestamp attribute of Celery's Events to determine when the workers were most recently available. Due to a bug in Celery, this caused an off-by-one-hour bug for users who live in a locale that observes daylight savings time but is not currently on daylight savings time (such as New York in the winter):

https://github.com/celery/celery/issues/1802#issuecomment-161916587

This fixes the issue by using the local_received time which is computed in the receiver by Celery with time.time() without any timezone shifting attempts.

https://pulp.plan.io/issues/1380

fixes #1380

Actions #32

Updated by rbarlow over 8 years ago

  • Status changed from ASSIGNED to POST
  • Tags deleted (Easy Fix)

https://github.com/pulp/pulp/pull/2222

I'm removing Easy Fix. This was far too difficult to figure out.

Actions #33

Updated by rbarlow over 8 years ago

I've spent this morning investigating whether this bug affects Katello-2.2.0, which ships with Pulp 2.6.0 (and a bunch of patches on top of it). I was surprised to see that the running server did not seem to be experiencing this bug, despite the fact that it uses the same 'timestamp' attribute from the Celery event and the same version of Celery. I spent some time digging around but no matter what I checked I couldn't explain why the system wasn't experiencing this bug.

Then I remembered a note from the Celery bug:

https://github.com/celery/celery/issues/1802#issuecomment-44646818

This was an important detail that I forgot to mention in my previous analysis, because the true reason that that line in the Celery library causes this issue is that the workers import Django and the scheduler does not import Django. Django alters the way that the time and datetime modules work when the USE_TZ setting is True, and Pulp does have that set to True:

https://github.com/pulp/pulp/blob/122e47ec6ad8719156e74795344184e34dc679f4/server/pulp/server/webservices/settings.py#L61

This causes the time offset on the worker to be different than the time offset on the scheduler, which causes the worker watcher to believe the worker has been missing for an hour. I had discovered all of this on Friday morning at late thirty but forgot to mention it in my analysis.
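
As an illustration only (plain Python, not Django or Pulp code; it just shows the underlying mechanism) of how two processes on the same box can end up disagreeing about the UTC offset when something changes one process's TZ handling:

import os
import time

def report(label):
    # time.timezone is the process-wide non-DST offset west of UTC, in seconds.
    print(label, "timezone offset:", time.timezone, "daylight:", time.daylight)

report("before")

# If a library sets TZ and calls tzset(), it changes what the time module reports
# for this process only; another process (e.g. the scheduler) is unaffected.
os.environ["TZ"] = "America/New_York"
time.tzset()          # POSIX-only; not available on Windows

report("after")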

Thus this bug was caused by the coincidental combination of these three things happening at roughly the same time:

0) Many parts of the world switched from daylight savings time to standard time during November,
1) Celery had the bug where it misinterpreted the meaning of that attribute in the Python time library which would only be incorrect for users who have the server's locale set to regions that do observe daylight savings time but are currently in standard time, and
2) Pulp 2.7.0 was released in November around the same time as all of the daylight savings time switches, and is the first Pulp release to use Django.

Though the Celery bug was indeed present in Pulp 2.6, it would only cause problems for Pulp if Django were being used!

Ironically, the Pulp QE process also happened primarily while most of the engineers were in daylight savings time.

Actions #34

Updated by chandlermelton over 8 years ago

Thanks rbarlow!

As a user, what will be the process for me to get this resolved?
Will this be part of a future release I can install with an RPM?

Actions #35

Updated by rbarlow over 8 years ago

Hello @chandlermelton!

The Pulp team plans to release a 2.7.1 version "very soon" to resolve this issue.

In the meantime, you can work around this issue a few ways:

0) You can change your system's locale to UTC. We can't comment on what effect this will have on other software you are running on your system, but if you are comfortable with using UTC that will solve the issue even without the patch. One could argue that UTC is a great default for servers ☺

1) You can wait until the Pulp team releases a 2.7.1 beta, which will likely only contain the change for this issue. This will likely happen "very soon" (no promises!).

2) You can attempt to manually apply the patch for this issue, which is quite similar though not identical to bmbouter's patch in https://pulp.plan.io/issues/1380#note-23

Actions #36

Updated by rbarlow over 8 years ago

  • Related to Task #1400: Develop a plan for how Pulp's master branch should determine when a worker was last seen added
Actions #37

Updated by rbarlow over 8 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #38

Updated by dkliban@redhat.com over 8 years ago

Just did a scratch build of 2.7.1 beta and verified on EL6 that workers are all reporting very similar times of last heartbeat:

> db.workers.find()
{ "_id" : "reserved_resource_worker-0@rhel6-np-qeos-55562", "last_heartbeat" : ISODate("2015-12-08T04:52:43.981Z") }
{ "_id" : "reserved_resource_worker-1@rhel6-np-qeos-55562", "last_heartbeat" : ISODate("2015-12-08T04:52:44.717Z") }
{ "_id" : "resource_manager@rhel6-np-qeos-55562", "last_heartbeat" : ISODate("2015-12-08T04:52:21.238Z") }
{ "_id" : "scheduler@rhel6-np-qeos-55562", "last_heartbeat" : ISODate("2015-12-08T04:52:16.259Z") }
Actions #39

Updated by bmbouter about 8 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Actions #40

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
