Project

Profile

Help

Issue #3357

closed

docker_distributor_web is racy, not atomic

Added by rmcgover about 6 years ago. Updated almost 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Start date:
Due date:
Estimated time:
Severity:
3. High
Version - Docker:
3.1.0
Platform Release:
2.15.3
Target Release - Docker:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 33
Quarter:

Description

docker_distributor_web's publish of the redirect file (used by crane) is:

- racy: there is a period of time during publish where neither old nor new file can be accessed
- not atomic: if publish fails part way through, the system can be left with no published redirect file, leaving the repo broken in crane indefinitely

This behavior comes from the order of operations here:

plugins/pulp_docker/plugins/distributors/publish_steps.py:
class V2WebPublisher(publish_step.PublishStep):
    """
    This class performs the work of publishing a v2 Docker repository.
    """
    def __init__(self, repo, publish_conduit, config, repo_content_unit_q=None):
    (...)
        app_file = configuration.get_redirect_file_name(repo)
        app_publish_location = os.path.join(
            configuration.get_app_publish_dir(config, docker_api_version), app_file)
        self.working_dir = os.path.join(self.get_working_dir(), docker_api_version)
        misc.mkdir(self.working_dir)
        self.web_working_dir = os.path.join(self.get_working_dir(), 'web')
        master_publish_dir = configuration.get_master_publish_dir(repo, config, docker_api_version)
        atomic_publish_step = publish_step.AtomicDirectoryPublishStep(
            self.get_working_dir(), [('', publish_dir), (app_file, app_publish_location)],
            master_publish_dir, step_type=constants.PUBLISH_STEP_OVER_HTTP)
        atomic_publish_step.description = _('Making v2 files available via web.')
        self.add_child(PublishBlobsStep(repo_content_unit_q=repo_content_unit_q))
        self.publish_manifests_step = PublishManifestsStep(
            repo_content_unit_q=repo_content_unit_q)
        self.add_child(self.publish_manifests_step)
        self.publish_manifest_lists_step = PublishManifestListsStep(
            repo_content_unit_q=repo_content_unit_q)
        self.add_child(self.publish_manifest_lists_step)
        self.add_child(PublishTagsStep())
        self.add_child(atomic_publish_step)
        self.add_child(RedirectFileStep(app_publish_location))

Note that the atomic_publish_step is second-last step, and then RedirectFileStep happens after that, writing to a file within the publish directory.

Example: I have a repo named redhat-e2e-container-e2e-container-test-product whose redirect file read by crane is /var/lib/pulp/published/docker/v2/app/redhat-e2e-container-e2e-container-test-product.json - a symlink to /var/lib/pulp/published/docker/v2/master/redhat-e2e-container-e2e-container-test-product/<timestamp>/redhat-e2e-container-e2e-container-test-product.json

In the publish sequence above, the last two steps are:

1. Rewrite symlinks, replacing old with new publish directory; includes rewrite of /var/lib/pulp/published/docker/v2/app/redhat-e2e-container-e2e-container-test-product.json to /var/lib/pulp/published/docker/v2/master/redhat-e2e-container-e2e-container-test-product/<this_publish_timestamp>/redhat-e2e-container-e2e-container-test-product.json
2. Write redirect file to /var/lib/pulp/published/docker/v2/web/redhat-e2e-container-e2e-container-test-product/redhat-e2e-container-e2e-container-test-product.json

Therefore there's a period of time between steps 1 and 2 where the redirect file doesn't exist. Also, if step 2 fails for any reason, the redirect file will be left missing indefinitely. In either case, if crane loads redirect files during that time, it will behave as though the repo doesn't exist.

To reproduce:

1. Publish a docker repo X, ensure it has redirect file and can be pulled by crane.

2. Set up next publish to fail during write of redirect file.

Easiest way to do this involves patching the source to trigger a failure:

diff --git a/plugins/pulp_docker/plugins/distributors/publish_steps.py b/plugins/pulp_docker/plugins/distributors/publish_steps.py
index 365e229..e4eb10a 100644
--- a/plugins/pulp_docker/plugins/distributors/publish_steps.py
+++ b/plugins/pulp_docker/plugins/distributors/publish_steps.py
@@ -209,6 +209,7 @@ class RedirectFileStep(publish_step.PublishStep):
         """
         Publish the JSON file for Crane.
         """
+        raise RuntimeError("Simulated error")
         registry = configuration.get_repo_registry_id(self.get_repo(), self.get_config())
         redirect_url = configuration.get_redirect_url(self.get_config(), self.get_repo(), 'v2')

(Similarly a time.sleep could be introduced to simulate "writing the redirect file is slow" case)

Ensure Pulp services are restarted after the patch.

3. Trigger a publish of repo X and wait for it to fail.

4. Check redirect file under /var/lib/pulp/published/docker/v2/app for repo X. Check pull from crane.

Actual results

Redirect file for repo X does not exist at expected path.

Content of repo can't be pulled from crane. (Note: since crane loading of files is based on async polling, it could take a little while for this problem to appear)

Expected result

Redirect file for repo X exists at expected path, with content same as prior to the failed publish.

Content of repo can be pulled from crane, using the content same as prior to the failed publish.

Also available in: Atom PDF