Issue #3357
Updated by rmcgover almost 7 years ago
docker_distributor_web's publish of the redirect file (used by crane) is: - racy: there is a period of time during publish where neither old nor new file can be accessed - not atomic: if publish fails part way through, the system can be left with no published redirect file, leaving the repo broken in crane indefinitely This behavior comes from the order of operations here: plugins/pulp_docker/plugins/distributors/publish_steps.py: <pre> class V2WebPublisher(publish_step.PublishStep): """ This class performs the work of publishing a v2 Docker repository. """ def __init__(self, repo, publish_conduit, config, repo_content_unit_q=None): (...) app_file = configuration.get_redirect_file_name(repo) app_publish_location = os.path.join( configuration.get_app_publish_dir(config, docker_api_version), app_file) self.working_dir = os.path.join(self.get_working_dir(), docker_api_version) misc.mkdir(self.working_dir) self.web_working_dir = os.path.join(self.get_working_dir(), 'web') master_publish_dir = configuration.get_master_publish_dir(repo, config, docker_api_version) atomic_publish_step = publish_step.AtomicDirectoryPublishStep( self.get_working_dir(), [('', publish_dir), (app_file, app_publish_location)], master_publish_dir, step_type=constants.PUBLISH_STEP_OVER_HTTP) atomic_publish_step.description = _('Making v2 files available via web.') self.add_child(PublishBlobsStep(repo_content_unit_q=repo_content_unit_q)) self.publish_manifests_step = PublishManifestsStep( repo_content_unit_q=repo_content_unit_q) self.add_child(self.publish_manifests_step) self.publish_manifest_lists_step = PublishManifestListsStep( repo_content_unit_q=repo_content_unit_q) self.add_child(self.publish_manifest_lists_step) self.add_child(PublishTagsStep()) self.add_child(atomic_publish_step) self.add_child(RedirectFileStep(app_publish_location)) </pre> Note that the atomic_publish_step is second-last step, and then RedirectFileStep happens *after* that, writing to a file within the publish directory. Example: I have a repo named redhat-e2e-container-e2e-container-test-product whose redirect file read by crane which is /var/lib/pulp/published/docker/v2/app/redhat-e2e-container-e2e-container-test-product.json published to /var/lib/pulp/published/docker/v2/web/redhat-e2e-container-e2e-container-test-product - a symlink to /var/lib/pulp/published/docker/v2/master/redhat-e2e-container-e2e-container-test-product/<timestamp>/redhat-e2e-container-e2e-container-test-product.json /var/lib/pulp/published/docker/v2/master/redhat-e2e-container-e2e-container-test-product/<timestamp> In the publish sequence above, the last two steps are: 1. Rewrite symlinks, replacing old with new publish directory; includes rewrite of /var/lib/pulp/published/docker/v2/app/redhat-e2e-container-e2e-container-test-product.json /var/lib/pulp/published/docker/v2/web/redhat-e2e-container-e2e-container-test-product symlink from <old_timestamp> to /var/lib/pulp/published/docker/v2/master/redhat-e2e-container-e2e-container-test-product/<this_publish_timestamp>/redhat-e2e-container-e2e-container-test-product.json <this_publish_timestamp> 2. Write redirect file to /var/lib/pulp/published/docker/v2/web/redhat-e2e-container-e2e-container-test-product/redhat-e2e-container-e2e-container-test-product.json Therefore there's a period of time between steps 1 and 2 where the redirect file doesn't exist. Also, if step 2 fails for any reason, the redirect file will be left missing indefinitely. In either case, if crane loads redirect files during that time, it will behave as though the repo doesn't exist. h2. To reproduce: 1. Publish a docker repo X, ensure it has redirect file and can be pulled by crane. 2. Set up next publish to fail during write of redirect file. Easiest way to do this involves patching the source to trigger a failure: <pre> diff --git a/plugins/pulp_docker/plugins/distributors/publish_steps.py b/plugins/pulp_docker/plugins/distributors/publish_steps.py index 365e229..e4eb10a 100644 --- a/plugins/pulp_docker/plugins/distributors/publish_steps.py +++ b/plugins/pulp_docker/plugins/distributors/publish_steps.py @@ -209,6 +209,7 @@ class RedirectFileStep(publish_step.PublishStep): """ Publish the JSON file for Crane. """ + raise RuntimeError("Simulated error") registry = configuration.get_repo_registry_id(self.get_repo(), self.get_config()) redirect_url = configuration.get_redirect_url(self.get_config(), self.get_repo(), 'v2') </pre> (Similarly a time.sleep could be introduced to simulate "writing the redirect file is slow" case) Ensure Pulp services are restarted after the patch. 3. Trigger a publish of repo X and wait for it to fail. 4. Check redirect file under /var/lib/pulp/published/docker/v2/app contents of /var/lib/pulp/published/docker/v2/web/ for repo X. Check pull from crane. h2. Actual results Redirect file for repo X does not exist at expected path. Content of repo can't be pulled from crane. (Note: since crane loading of files is based on async polling, it could take a little while for this problem to appear) h2. Expected result Redirect file for repo X exists at expected path, with content same as prior to the failed publish. Content of repo can be pulled from crane, using the content same as prior to the failed publish.