Issue #8180

Content serving performance tuning

Added by adam.winberg@smhi.se 6 months ago. Updated about 1 month ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

We had been using a Pulp 2 installation serving yum repos to ~1200 clients. We have now moved to a Pulp 3 installation, and when we first made the move, client yum runs were very slow. I mean timeout-slow, grinding our Puppet runs to a halt and forcing me to revert to our Pulp 2 instance.

Pulp 3 is behind an Apache reverse proxy, and I'm seeing errors like:

 AH01102: error reading status line from remote server 127.0.0.1:24816

There is no shortage of CPU/RAM on the server. Pulp is connected to a remote PostgreSQL server, which was my first suspect, but after quite a lot of troubleshooting and testing we concluded that the problem seemed to be the Pulp server itself.

After further debugging I found additional error logs from Apache:

AH03490: scoreboard is full, not at MaxRequestWorkers. Increase ServerLimit.

and

AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

I'm using mpm_event in Apache, where the default ServerLimit is 16; I increased this to 30. I also increased MaxRequestWorkers from the default of 400 to 750.
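A minimal sketch of the corresponding mpm_event override (the file path is an assumption; with the default ThreadsPerChild of 25, ServerLimit 30 allows up to 30 * 25 = 750 threads, which matches the MaxRequestWorkers value above):

```apache
# Assumed override file, e.g. /etc/httpd/conf.d/mpm_tuning.conf on RHEL 8.
# ServerLimit caps the number of child processes; MaxRequestWorkers caps
# the total number of simultaneously serving threads across all children.
<IfModule mpm_event_module>
    ServerLimit           30
    MaxRequestWorkers    750
</IfModule>
```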

I also increased the number of aiohttp workers for the pulpcore-content service from 2 to 6. This was more of a wild guess based on the aiohttp documentation, and I don't know whether it is relevant for content serving in Pulp.
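For reference, a hedged sketch of where that worker count lives in an RPM install; this assumes pulpcore-content is a systemd service launching gunicorn with the aiohttp worker class, and the exact ExecStart line must be copied from your installed unit file:

```ini
# Assumed systemd drop-in: /etc/systemd/system/pulpcore-content.service.d/override.conf
# The empty ExecStart= clears the unit's original value before replacing it.
[Service]
ExecStart=
ExecStart=/usr/bin/gunicorn pulpcore.content:server \
          --bind 127.0.0.1:24816 \
          --worker-class aiohttp.GunicornWebWorker \
          --workers 6
```

After editing, run `systemctl daemon-reload` and restart the service for the change to take effect.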

After these measures, performance was a lot better. Not quite as good as Pulp 2, and I still get an occasional failed Puppet run because of dnf, but it's pretty good, good enough. I was under some time pressure to get our Pulp 3 environment up and running and therefore introduced both measures at the same time, so unfortunately I can't say whether the increase in content workers had any effect. The Apache errors are gone, though, which I guess makes quite a difference.

I looked for information about Pulp 3 content-serving tuning and for any existing issues on the topic, but found nothing. Some pointers in the documentation about how to tune content serving would be really nice.

Using an RPM-based installation on RHEL 8 with:

httpd-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64
python3-pulp-rpm-3.7.0-1.el8.noarch
python3-pulpcore-3.7.3-1.el8.noarch

Related issues

Related to Pulp - Task #6928: Measure Pulp's ability to scale to high #s of client requests (NEW)

Related to Pulp - Task #8804: [EPIC] Use Redis to add caching abilities to Pulp (NEW)

Related to Pulp - Task #8805: Cache the responses of the content app (CLOSED - CURRENTRELEASE)


History

#1 Updated by adam.winberg@smhi.se 6 months ago

I got rid of the Apache errors regarding ServerLimit and MaxRequestWorkers, but I still get occasional errors:

AH01102: error reading status line from remote server 127.0.0.1:24816

I tried increasing my gunicorn/aiohttp workers even further, but it made no difference.

I then googled into this aiohttp issue: https://github.com/aio-libs/aiohttp/issues/2687

where one user with similar problems solved them by adding 'disablereuse=on' to the ProxyPass directive in Apache. I tried it and it seems to work for me as well. Since I proxy to localhost, I don't think this parameter has any significant performance impact; none that I have noticed, anyway.
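For anyone hitting the same thing, a sketch of the relevant vhost fragment (the paths are assumptions; port 24816 is the pulpcore-content backend from the error messages above):

```apache
# disablereuse=on makes mod_proxy open a fresh backend connection per
# request instead of reusing pooled connections that the aiohttp backend
# may have already closed, which is what triggers AH01102.
ProxyPass        /pulp/content http://127.0.0.1:24816/pulp/content disablereuse=on
ProxyPassReverse /pulp/content http://127.0.0.1:24816/pulp/content
```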

All in all the performance is quite acceptable for me now.

#2 Updated by dalley 6 months ago

  • Project changed from RPM Support to Pulp

#3 Updated by dalley 6 months ago

I did a little bit of load testing using Locust.io. Here was my setup.

Install the "locust" package and run the following script with "locust -f $script_name". Connect to the WebUI from your browser with the address "127.0.0.1:8089", using the appropriate address if you're running it from the VM. I did not use the VM.

import random
from urllib.parse import urljoin
from collections import namedtuple
from gettext import gettext as _


from locust import HttpUser, task, between

# REPO_LOCATION = "http://pulp2.dev/pulp/isos/file20k/"  # Pulp 2
REPO_LOCATION = "http://pulp2-nightly-pulp3-source-centos7/pulp/content/file20k/"  # Pulp 3

# LOCAL_REPO_PATH = "/pulp/isos/file20k/"    # Pulp 2
# LOCAL_REPO_PATH = "/pulp/content/file20k/"   # Pulp 3


Line = namedtuple("Line", ("number", "content"))


class Entry:
    """
    Manifest entry.

    Format: <relative_path>,<digest>,<size>.
    Lines beginning with `#` are ignored.

    Attributes:
        relative_path (str): A relative path.
        digest (str): The file sha256 hex digest.
        size (int): The file size in bytes.

    """

    def __init__(self, relative_path, digest, size):
        """
        Create a new Entry.

        Args:
            relative_path (str): A relative path.
            digest (str): The file sha256 hex digest.
            size (int): The file size in bytes.

        """
        self.relative_path = relative_path
        self.digest = digest
        self.size = size

    @staticmethod
    def parse(line):
        """
        Parse the specified line from the manifest into an Entry.

        Args:
            line (Line): A line from the manifest.

        Returns:
            Entry: An entry.

        Raises:
            ValueError: on parsing error.

        """
        part = [s.strip() for s in line.content.split(",")]
        if len(part) != 3:
            raise ValueError(
                _("Error: manifest line:{n}: " "must be: <relative_path>,<digest>,<size>").format(
                    n=line.number
                )
            )
        return Entry(relative_path=part[0], digest=part[1], size=int(part[2]))

    def __str__(self):
        """
        Returns a string representation of the Manifest Entry.

        Returns:
            str: format: "<relative_path>,<digest>,<size>"

        """
        fields = [self.relative_path, self.digest]
        if isinstance(self.size, int):
            fields.append(str(self.size))
        return ",".join(fields)


class Manifest:
    """
    A file manifest.

    Describes files contained within the directory.

    Attributes:
        relative_path (str): A relative path to the manifest.

    """

    def __init__(self, relative_path):
        """
        Create a new Manifest.

        Args:
            relative_path (str): A relative path to the manifest.

        """
        self.relative_path = relative_path

    @staticmethod
    def parse(manifest_str):
        """
        Parse a manifest string and yield entries.

        Yields:
            Entry: for each line.

        """
        for n, line in enumerate(manifest_str.splitlines(), 1):
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                continue
            yield Entry.parse(Line(number=n, content=line))


class PulpLoadTestClient(HttpUser):
    # wait_time = between(0.5, 3.0)

    def on_start(self):
        """ on_start is called when a Locust start before any task is scheduled """
        pass

    def on_stop(self):
        """ on_stop is called when the TaskSet is stopping """
        pass

    @task(1)
    def clone_file20k_repository(self):
        response = self.client.get("PULP_MANIFEST")

        entries = list(Manifest.parse(response.text))

        random.shuffle(entries)

        for entry in entries:
            self.client.get(entry.relative_path)

The WebUI will ask you what URL to test, as well as how many workers to use and how fast to spawn them. Use the URL of the repo hosted in Pulp 2 or Pulp 3, e.g. "http://pulp2.dev/pulp/isos/file20k/" or "http://pulp3-source-fedora32.localhost.example.com/pulp/content/file20k/"

I used this repo specifically: https://fixtures.pulpproject.org/file-perf/
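The PULP_MANIFEST files those fixtures serve are plain comma-separated text, which is all the script's Manifest/Entry classes do with them; a minimal stand-alone sketch of that parsing step (the sample manifest content is made up for illustration):

```python
# Minimal stand-alone version of the script's manifest parsing.
SAMPLE = """\
# comment lines are skipped
1.iso,aabbcc,2048

2.iso,ddeeff,4096
"""

def parse_manifest(text):
    """Yield (relative_path, digest, size) tuples, skipping blanks and comments."""
    for number, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = [s.strip() for s in line.split(",")]
        if len(parts) != 3:
            raise ValueError(
                f"manifest line {number}: must be <relative_path>,<digest>,<size>"
            )
        yield parts[0], parts[1], int(parts[2])

entries = list(parse_manifest(SAMPLE))
print(entries)  # → [('1.iso', 'aabbcc', 2048), ('2.iso', 'ddeeff', 4096)]
```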

I don't want to overgeneralize from my hardware or from this one specific benchmark (which is likely a little pathological due to the tiny files), but I can definitely see that there is a very large gap between Pulp 2 and Pulp 3 content serving performance at the moment, in terms of throughput, latency, and CPU consumption.

I also confirmed that increasing the worker count helped, but there are probably other things we can do in the code to improve the situation.

#4 Updated by daviddavis 6 months ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 90

#5 Updated by daviddavis 6 months ago

  • Priority changed from Normal to High
  • Severity changed from 2. Medium to 3. High

#6 Updated by dalley 6 months ago

New version

import random
from urllib.parse import urljoin
from collections import namedtuple
from gettext import gettext as _


from locust import HttpUser, task, between

# REPO_LOCATION = "http://pulp2.dev/pulp/isos/file20k/"  # Pulp 2
REPO_LOCATION = "http://pulp2-nightly-pulp3-source-centos7/pulp/content/file20k/"  # Pulp 3

# LOCAL_REPO_PATH = "/pulp/isos/file20k/"    # Pulp 2
# LOCAL_REPO_PATH = "/pulp/content/file20k/"   # Pulp 3


Line = namedtuple("Line", ("number", "content"))


class Entry:
    """
    Manifest entry.

    Format: <relative_path>,<digest>,<size>.
    Lines beginning with `#` are ignored.

    Attributes:
        relative_path (str): A relative path.
        digest (str): The file sha256 hex digest.
        size (int): The file size in bytes.

    """

    def __init__(self, relative_path, digest, size):
        """
        Create a new Entry.

        Args:
            relative_path (str): A relative path.
            digest (str): The file sha256 hex digest.
            size (int): The file size in bytes.

        """
        self.relative_path = relative_path
        self.digest = digest
        self.size = size

    @staticmethod
    def parse(line):
        """
        Parse the specified line from the manifest into an Entry.

        Args:
            line (Line): A line from the manifest.

        Returns:
            Entry: An entry.

        Raises:
            ValueError: on parsing error.

        """
        part = [s.strip() for s in line.content.split(",")]
        if len(part) != 3:
            raise ValueError(
                _("Error: manifest line:{n}: " "must be: <relative_path>,<digest>,<size>").format(
                    n=line.number
                )
            )
        return Entry(relative_path=part[0], digest=part[1], size=int(part[2]))

    def __str__(self):
        """
        Returns a string representation of the Manifest Entry.

        Returns:
            str: format: "<relative_path>,<digest>,<size>"

        """
        fields = [self.relative_path, self.digest]
        if isinstance(self.size, int):
            fields.append(str(self.size))
        return ",".join(fields)


class Manifest:
    """
    A file manifest.

    Describes files contained within the directory.

    Attributes:
        relative_path (str): A relative path to the manifest.

    """

    def __init__(self, relative_path):
        """
        Create a new Manifest.

        Args:
            relative_path (str): A relative path to the manifest.

        """
        self.relative_path = relative_path

    @staticmethod
    def parse(manifest_str):
        """
        Parse a manifest string and yield entries.

        Yields:
            Entry: for each line.

        """
        for n, line in enumerate(manifest_str.splitlines(), 1):
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                continue
            yield Entry.parse(Line(number=n, content=line))


class PulpLoadTestClient(HttpUser):
    # wait_time = between(0.5, 3.0)

    def on_start(self):
        """ on_start is called when a Locust start before any task is scheduled """
        pass

    def on_stop(self):
        """ on_stop is called when the TaskSet is stopping """
        pass

    @task(1)
    def emulate_rpm_repository(self, num_pkgs=None):
        response = self.client.get("PULP_MANIFEST")
        entries = list(Manifest.parse(response.text))

        metadata_files = []
        packages = []

        for entry in entries:
            if entry.relative_path.startswith("repodata/"):
                metadata_files.append(entry)
            elif entry.relative_path.startswith("packages/"):
                packages.append(entry)

        for metadata_file in metadata_files:
            self.client.get(metadata_file.relative_path)

        # Is a random sample actually what we want? Many clients will likely
        # be requesting the same set of packages.
        if num_pkgs:
            packages = random.sample(packages, num_pkgs)

        for pkg in packages:
            self.client.get(pkg.relative_path)

#7 Updated by dalley 5 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley

#8 Updated by rchan 5 months ago

  • Sprint changed from Sprint 90 to Sprint 91

#9 Updated by rchan 5 months ago

  • Sprint changed from Sprint 91 to Sprint 92

#10 Updated by rchan 4 months ago

  • Sprint changed from Sprint 92 to Sprint 93

#11 Updated by rchan 4 months ago

  • Sprint changed from Sprint 93 to Sprint 94

#12 Updated by dalley 4 months ago

  • Related to Task #6928: Measure Pulp's ability to scale to high #s of client requests added

#13 Updated by rchan 3 months ago

  • Sprint changed from Sprint 94 to Sprint 95

#14 Updated by rchan 3 months ago

  • Sprint changed from Sprint 95 to Sprint 96

#15 Updated by rchan 3 months ago

  • Sprint changed from Sprint 96 to Sprint 97

#16 Updated by gerrod 2 months ago

  • Related to Task #8804: [EPIC] Use Redis to add caching abilities to Pulp added

#17 Updated by rchan about 2 months ago

  • Sprint changed from Sprint 97 to Sprint 98

#18 Updated by dalley about 2 months ago

  • Assignee changed from dalley to gerrod

Gerrod's caching PR will likely solve this problem for good. His PR gets some fantastic results in my initial testing.

#19 Updated by dalley about 2 months ago

  • Related to Task #8805: Cache the responses of the content app added

#20 Updated by rchan about 1 month ago

  • Sprint changed from Sprint 98 to Sprint 99

#21 Updated by dalley about 1 month ago

A summary of the various changes we've made in the process of improving the content app:

All released presently

  • Caching responses, to avoid making repeated database queries for commonly-requested files

Coming in 3.14

#22 Updated by dalley about 1 month ago

  • Sprint/Milestone set to 3.14.0

#23 Updated by dalley about 1 month ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (gerrod)
  • Priority changed from High to Normal
  • Sprint/Milestone deleted (3.14.0)
  • Sprint deleted (Sprint 99)

This issue is being repurposed to refer specifically to "fine-tuning" of various parameters, to determine the optimal defaults. The caching changes and other performance improvements are tracked in other issues, and they're moving forward soon.

Since the caching will improve performance significantly, the fine-tuning will be a little less important, so I'm dropping the priority.

#24 Updated by dalley about 1 month ago

  • Severity changed from 3. High to 2. Medium
