Story #3135

As a user, I have a setting to mitigate when workers go missing under heavy loads

Added by daviddavis about 4 years ago. Updated over 2 years ago.

Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release: 2.16.0
Sprint: Sprint 31
Tags: Pulp 2

The Satellite team just upgraded their internal Satellite server from Pulp 2.8 to 2.13, and when they publish a set of content views or sync a large number of repos, workers seem to disappear. The task that seems to be running is 'pulp.server.managers.repo.unit_association.unassociate_by_criteria'.

531852.txt (23.9 KB) - Log output - daviddavis, 11/14/2017 08:34 PM
532826.txt (104 KB) - More logging with heartbeats recorded - daviddavis, 11/16/2017 03:34 PM

Associated revisions

Revision ca053770 View on GitHub
Added by daviddavis almost 4 years ago

Add warning and config variables for heartbeat timeout/interval

fixes #3135


#1 Updated by daviddavis about 4 years ago

Attaching log output

#2 Updated by daviddavis about 4 years ago

  • Version changed from 2.13.0 to 2.13.3

#3 Updated by daviddavis about 4 years ago

Between 2.8 and 2.13 we set the values for timeouts and heartbeats to be much more aggressive[1]. This might be why workers are being marked as missing in 2.13 under a heavy load. The calls to _record_heartbeat might be backed up.

@ehelms bumped up PULP_PROCESS_TIMEOUT_INTERVAL to 300 and PULP_PROCESS_HEARTBEAT_INTERVAL to 60 and saw that the problem went away.


#4 Updated by daviddavis about 4 years ago

I ran 10 instances of mongo-perf[0] on my local dev box (2 CPU cores and 16GB of memory) and saw the time between heartbeats increase to 10 seconds. I think changing the worker timeout from 300 to 25 is the likely culprit.


#5 Updated by daviddavis about 4 years ago

Attaching a new Satellite log. This one has heartbeats being logged. In it, you can see heartbeats starting to be recorded every 25s+ around 2017-11-16 00:46.

#6 Updated by daviddavis about 4 years ago

We're seeing the handle_worker_heartbeat function take up to two minutes to execute while the pulp.server.managers.repo.unit_association.associate_from_repo task is running. We turned on query profiling[0] but aren't seeing any queries taking longer than 1000 ms.


#7 Updated by dalley about 4 years ago

  • Sprint/Milestone set to 47
  • Severity changed from 2. Medium to 3. High
  • Triaged changed from No to Yes

#8 Updated by daviddavis about 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#9 Updated by daviddavis about 4 years ago

  • Description updated (diff)

#10 Updated by daviddavis about 4 years ago

We looked at the cprofile data for the handle_worker_heartbeat function by looking at a particular call that took 10 seconds. The time was pretty evenly split between the read call (Worker.objects(name=worker_name).first()) and the write call (Worker.objects(name=worker_name).update_one(...)). Analyzing the slow query data, it doesn't look like any particular query is slow; however, many tasks get stuck waiting for a lock (see the timeAcquiringMicros field) for over a second.

Looking at the mongostat output, it seems like queues are quite large with 20+ clients waiting to read from the database during times of high load. Also, the 'locked db' column indicates that the pulp_database is often locked 100% of the time during these one second intervals.

Doing a little digging, it looks like the version of MongoDB being used (2.6) locks the entire database when a write is issued. MongoDB 3.0 introduced collection-level locking [0] [1]. The queries being called for unassociate_by_criteria are probably locking the pulp_database and not allowing the workers to update their heartbeat in the required timeframe. I think next we should try upgrading MongoDB to 3.0+ to see if that resolves the problem.

[1] - Note that there was no 2.8 release. The first release after 2.6 was 3.0.

#11 Updated by bmbouter about 4 years ago

I think we understand what causes this on the system it was experienced on. The recommendation there was to upgrade to Mongo 3+.

There are two other things I think we can do to improve the codebase:

1) Add this recommendation to the docs. Specifically, the installation guide should have a Sphinx note. Also possibly a troubleshooting section named 'Workers go missing under heavy loads'. The install guide could link to the troubleshooting section for a longer explanation.

2) We should add a timer to the record heartbeat method and if it takes longer than 5 seconds it should log at the WARNING level. This will alert users that their system is unhealthy in the logs.
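The timer-plus-warning idea above can be sketched as a small wrapper. This is illustrative only: `record_heartbeat_with_warning` and its `record_heartbeat` argument are hypothetical stand-ins, not the actual pulp.server API, and the 5-second threshold is the value proposed in this comment.

```python
import logging
import time

_logger = logging.getLogger(__name__)

# Threshold proposed above; past this, the system is considered unhealthy.
HEARTBEAT_WARNING_THRESHOLD = 5


def record_heartbeat_with_warning(record_heartbeat, worker_name):
    """Call a heartbeat-recording callable and warn if it runs too slowly.

    ``record_heartbeat`` stands in for Pulp's real recording logic; this
    wrapper only times the call and logs at WARNING when it exceeds the
    threshold.
    """
    start = time.time()
    record_heartbeat(worker_name)
    elapsed = time.time() - start
    if elapsed > HEARTBEAT_WARNING_THRESHOLD:
        _logger.warning(
            "Recording the heartbeat for worker %s took %.1f seconds; the "
            "database may be overloaded.", worker_name, elapsed)
    return elapsed
```

A fast heartbeat call returns well under the threshold and logs nothing; only a slow database round trip would trigger the warning.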

#12 Updated by rchan about 4 years ago

  • Sprint/Milestone changed from 47 to 48

#13 Updated by rchan almost 4 years ago

  • Sprint/Milestone changed from 48 to 52

#14 Updated by almost 4 years ago

A user in #pulp-dev reported that his Katello 3.5 installation with Pulp 2.14 was failing to sync a repository with 50,000 RPMs. Each attempt to sync would end with error messages about workers going missing. After he switched the timings to the values suggested in comment 3, the sync worked.

As part of a fix for this issue, we should expose these timeout settings in /etc/pulp/server.conf. When MongoDB is not keeping up with the writes/reads, a warning should be logged stating that these timeout values need to be adjusted.

#15 Updated by bmbouter almost 4 years ago

With an upstream user also affected I'm +1 on allowing users to tune the system by having settings. Also +1 to having the warnings suggest to the user to adjust these timings.

Currently, PULP_PROCESS_HEARTBEAT_INTERVAL is set to 5 and PULP_PROCESS_TIMEOUT_INTERVAL is set to 25. If users don't set these values correctly, Pulp won't work correctly, so if we expose them directly we would have to add some checks or warnings.

To avoid implementing those special safety checks, what if instead of exposing these settings directly, we exposed it as a single setting called max_failover_time? This setting would default to the existing 30 seconds we do today. We could then derive the setting values that would produce correct behavior as follows:
PULP_PROCESS_HEARTBEAT_INTERVAL = max_failover_time / 2
PULP_PROCESS_TIMEOUT_INTERVAL = max_failover_time - PULP_PROCESS_HEARTBEAT_INTERVAL


The division by two makes the heartbeats occur fast enough that even if the reader and the writer are totally out of sync but checking at the same interval, the failover will still occur within max_failover_time. Specifically, this accounts for the case where the read of worker records occurs just before the write of a worker heartbeat and then the worker fails; failover still needs to occur within max_failover_time.

The timeout interval is derived from the heartbeat interval and the max_failover_time value. See how that works with the existing timings: max_failover_time = PULP_PROCESS_HEARTBEAT_INTERVAL + PULP_PROCESS_TIMEOUT_INTERVAL, specifically 30 = 5 + 25. We should continue to derive the timeout interval in this way for correctness.

Having PULP_PROCESS_HEARTBEAT_INTERVAL = max_failover_time / 2 should work, but there isn't much grace period if a write is delayed. If a write is delayed at all, there will be a gap in time where the existing record has aged out, but the new write hasn't updated it yet. To handle this, make the heartbeats even faster relative to the record age time. For example something like:
PULP_PROCESS_HEARTBEAT_INTERVAL = max_failover_time / 3


For even more safety we could select:
PULP_PROCESS_HEARTBEAT_INTERVAL = max_failover_time / 4


If we select even more safety, we'll end up with the current timings of today:

PULP_PROCESS_HEARTBEAT_INTERVAL = max_failover_time / 5  # This is 5 seconds

I think we should pick the safety factor of 3, 4, or 5 and not let the user adjust that. The user can continue to solve slow db issues by adjusting max_failover_time.

What do others think about Pulp switching to this approach for setting timings?
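The derivation proposed above can be sketched as a small helper. Assumptions: `derive_intervals` is an illustrative name, not real Pulp code; the safety factor divides max_failover_time to give the heartbeat interval, and the timeout interval is max_failover_time minus the heartbeat interval, as described in the comment.

```python
def derive_intervals(max_failover_time, safety_factor):
    """Derive both timing settings from a single user-facing value.

    Returns (heartbeat_interval, timeout_interval). With the proposed
    scheme, heartbeat + timeout always equals max_failover_time, matching
    the existing relationship 30 = 5 + 25.
    """
    heartbeat_interval = max_failover_time / float(safety_factor)
    timeout_interval = max_failover_time - heartbeat_interval
    return heartbeat_interval, timeout_interval
```

With a safety factor of 2 the two intervals are equal; larger factors shrink the heartbeat interval and grow the grace period before a worker is declared missing.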

#16 Updated by daviddavis almost 4 years ago

I like the idea of exposing only a single config variable, and I think the greater the difference between PULP_PROCESS_HEARTBEAT_INTERVAL and PULP_PROCESS_TIMEOUT_INTERVAL, the better. I would probably go for 4 or 5.

There's one tweak I would make to the formula, though: I would just set PULP_PROCESS_TIMEOUT_INTERVAL equal to max_failover_time. Otherwise, I think this could create some confusion for users (i.e. users could be confused when they see workers timing out in under 30 seconds when they have max_failover_time set to 30 seconds).

Also, I see that there's a MIGRATION_WAIT_TIME constant which we probably want to compute now that we allow users to adjust PULP_PROCESS_HEARTBEAT_INTERVAL. It should probably be 92 or PULP_PROCESS_HEARTBEAT_INTERVAL+2, whichever is greater.

#17 Updated by bmbouter almost 4 years ago

If we set PULP_PROCESS_TIMEOUT_INTERVAL equal to max_failover_time, then in the case where the timeout check occurs just a tiny amount of time before the record is 30 seconds old, the record won't be recognized as old until the next check, which would be at 30 + PULP_PROCESS_HEARTBEAT_INTERVAL. So I think we have to reduce max_failover_time by PULP_PROCESS_HEARTBEAT_INTERVAL; otherwise our failover times could be larger than max_failover_time. What about that case?

Also +1 to using the safety factor of 4.

The MIGRATION_WAIT_TIME part is definitely related, but I'm not totally sure what to do. Its purpose is to detect whether pre-upgrade workers are still running. The 92 was selected based on pre-2.12 timings (iirc). This is challenging because with this feature change the timings could need to be higher (which sounds good), but Pulp can't know whether those timings are the same ones as the pre-upgrade version or whether the user just set them. Given that, I don't think there is one way to handle this that covers all cases correctly. If that is true, then I think we should leave this aspect alone and keep it at 92. What do others think about this?

#18 Updated by daviddavis almost 4 years ago

To sum up the discussion with @dkliban, bmbouter, and dalley: we're going to expose a config variable failover_time that will allow users to set a maximum failover time. I think we're going to note in the config file that it should serve as a maximum in most cases. Also, I think we're going to set a default value of 30 but recommend that perhaps Katello use 60 or higher due to the nature of their application and running multiple dbs/apps on a single box.

I think we'll use the aforementioned formulas for computing PULP_PROCESS_TIMEOUT_INTERVAL and PULP_PROCESS_HEARTBEAT_INTERVAL with a value of 5:

PULP_PROCESS_HEARTBEAT_INTERVAL = max_failover_time / 5  # This is 5 seconds

We'll also have a warning in the handle_worker_heartbeat method that'll instruct users to increase the failover time if the method exceeds PULP_PROCESS_HEARTBEAT_INTERVAL.

I think we'd still like to recommend that users (both of Pulp and Katello) run MongoDB on SSDs rather than spinning disks, as MongoDB recommends SSDs and claims it can run 100 times faster on them[0]. If Katello wants to support HDDs, then we should probably consider using a failover_time of 300+ seconds.


#19 Updated by daviddavis almost 4 years ago

  • Status changed from ASSIGNED to POST

#21 Updated by rchan almost 4 years ago

  • Sprint/Milestone changed from 52 to 53

#22 Updated by daviddavis almost 4 years ago

I'm trying to implement the solution we came up with but am having a bit of a problem.

My first thought was to set the constants PULP_PROCESS_HEARTBEAT_INTERVAL and PULP_PROCESS_TIMEOUT_INTERVAL in pulp.common.constants[0] to config.get(...). The problem is that the config module imports constants for DEFAULT_CA_PATH[1].

My next thought was to create some functions in worker_watcher (e.g. worker_heartbeat_interval(), worker_timeout_interval()), but the worker_watcher module imports Worker[2], which in turn imports a queryset module that needs the worker timeout value[3]. This creates a circular import.

At this point, I am thinking of either defining the constants in pulp.server.constants[4] or creating a new pulp.async.constants module.
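One way to break the cycle described above is to put the derived timings in a standalone module that reads configuration lazily, at call time rather than import time, so it imports nothing from the config or model modules. This is only a sketch of that idea; the function names, the dict-based `config` argument, and the fixed safety factor of 5 are illustrative, not the actual Pulp 2 API.

```python
# Sketch of a standalone timings module (e.g. something like the proposed
# pulp.server.constants home). It has no imports from config or models,
# so nothing here can participate in a circular import.

_DEFAULT_FAILOVER_TIME = 30  # seconds; the default discussed above
_SAFETY_FACTOR = 5


def _failover_time(config):
    """Look up failover_time at call time; fall back to the default."""
    if config is None:
        return _DEFAULT_FAILOVER_TIME
    return config.get("failover_time", _DEFAULT_FAILOVER_TIME)


def heartbeat_interval(config=None):
    """Heartbeat interval derived from the configured failover time."""
    return _failover_time(config) / float(_SAFETY_FACTOR)


def timeout_interval(config=None):
    """Timeout interval: failover time minus the heartbeat interval."""
    return _failover_time(config) - heartbeat_interval(config)
```

Because the config lookup happens inside the functions, the module can be imported from both the queryset code and worker_watcher without either needing the other at import time.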


#23 Updated by daviddavis almost 4 years ago

  • Status changed from POST to MODIFIED

#24 Updated by bmbouter almost 4 years ago

  • Tracker changed from Issue to Story
  • Subject changed from Workers go missing under heavy loads to As a user, I have a setting to mitigate when workers go missing under heavy loads
  • % Done set to 0

Switching to a story so it isn't included in a z-stream release.

#25 Updated by bmbouter over 3 years ago

  • Sprint set to Sprint 31

#26 Updated by bmbouter over 3 years ago

  • Sprint/Milestone deleted (53)

#27 Updated by ttereshc over 3 years ago

  • Platform Release set to 2.16.0

#28 Updated by bmbouter over 3 years ago

  • Status changed from MODIFIED to 5

#29 Updated by bmbouter over 3 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#30 Updated by bmbouter over 2 years ago

  • Tags Pulp 2 added
