Project

Profile

Help

Story #4689

As a user, Pulp logs have correlation ids in the logs of all services

Added by dkliban@redhat.com over 2 years ago. Updated 10 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Katello
Sprint:
Sprint 87
Quarter:

Description

Motivation

Pulp as a multi-process system needs to make debugging easy. Currently it is hard to:

  1. Identify which log statements are related to a specific API request
  2. Filter out all of those that aren't
  3. Do this across multiple processes

Additionally, there is a use case where an external system wants to provide the correlation id, e.g. Katello wants to submit the correlation ID as part of the request.

Solution

Use django-cid by default, and document that it's available for users to use.

Implementation

Enabling in Pulp itself

  1. Add django-cid as an installed app
  2. Reconfigure the default Django logger config in settings.py to incorporate the CID
  3. Add user documentation identifying the types of things they can additionally configure
  4. Add a test to assert Pulp viewset logs contain an autogenerated CID

Enabling in the App

  1. Add a test to assert content app logs contain an autogenerated CID

Enabling in Tasking System

  1. Create the field on task and set it: Add a field to the Task model called logging_cid. Have the enqueue_with_reservation function set this field when it creates a Task object with the value of get_cid.

  2. Have the CID log on the resource manager itself: Have the _queue_reserved_task method call set_cid() with the value of the new logging_cid from the inner_task. This will enable the correlation ID in the resource_manager logs for that task also.

  3. Have all tasks also use the CID: Have the perform_job call set_cid() with the value of the logging_cid from the task. This will enable the correlation ID in all pulp task logs without them having to change their signatures.

  4. Lastly, expose the logging_cid in the task API for users to see.


Related issues

Related to Pulp - Task #3289: Support for X-Correlation-ID HTTP REST headerCLOSED - WONTFIX

<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>
Related to Pulp - Story #2654: As a User I want Pulp3 to log the first 8 characters of a task id in every log statement emitted from a running taskCLOSED - WONTFIX

<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>
Related to Pulp - Task #7792: Update the gunicorn service file to log correlation idCLOSED - CURRENTRELEASE

<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>

Associated revisions

Revision ca7dbe15 View on GitHub
Added by daviddavis 10 months ago

Add support for correlation_id to pulp

fixes #4689

History

#1 Updated by bmbouter over 2 years ago

  • Tags deleted (Pulp 3)

#2 Updated by jsherril@redhat.com about 2 years ago

  • Tags Katello-P3 added

#3 Updated by bmbouter over 1 year ago

  • Related to Task #3289: Support for X-Correlation-ID HTTP REST header added

#5 Updated by ggainey over 1 year ago

  • Priority changed from Normal to Low

#6 Updated by ggainey over 1 year ago

  • Tags Katello added
  • Tags deleted (Katello-P3)

#7 Updated by bmbouter over 1 year ago

  • Subject changed from As a user, I can configure pulp to accept a correlation id header to As a user, Pulp logs have correlation ids in the logs of all services
  • Description updated (diff)

#8 Updated by bmbouter over 1 year ago

  • Checklist item deleted (documentation on how to enable and configure)

#9 Updated by daviddavis over 1 year ago

For this kwarg _pulp_logging_cid, will the code calling enqueue_with_reservation have to manually set this if it wants task logging to use this CID?

#10 Updated by bmbouter over 1 year ago

daviddavis wrote:

For this kwarg _pulp_logging_cid, will the code calling enqueue_with_reservation have to manually set this if it wants task logging to use this CID?

The caller of enqueue_with_reservation will not have to specify the CID, the CID will automatically and transparently be added by enqueue_with_reservation itself. The signature of enqueue_with_reservation won't be adjusted with this change so rolling it out will require no other code to change. Also when rolled out it, will be enabled automatically on all tasks.

#11 Updated by bmbouter over 1 year ago

  • Description updated (diff)

#12 Updated by bmbouter over 1 year ago

  • Description updated (diff)

#13 Updated by bmbouter over 1 year ago

  • Description updated (diff)

#14 Updated by daviddavis over 1 year ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

#15 Updated by rchan over 1 year ago

  • Sprint set to Sprint 74

#16 Updated by daviddavis over 1 year ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis
  • Priority changed from Low to Normal

#17 Updated by rchan over 1 year ago

  • Sprint changed from Sprint 74 to Sprint 75

#18 Updated by daviddavis over 1 year ago

Any idea how to handle the test? Not sure how to use functional tests since tests might be running on a different box than Pulp. Perhaps a unit test that uses mock somehow?

#19 Updated by bmbouter over 1 year ago

daviddavis wrote:

Any idea how to handle the test? Not sure how to use functional tests since tests might be running on a different box than Pulp. Perhaps a unit test that uses mock somehow?

I believe pulp-smash has a utility that allows you to run commands against the host box. You can see an example of pulp-smash code checking for root on the remote box for example here: https://github.com/pulp/pulp-smash/blob/8b0dbd3febb7a3a30367277c4f838c52a13b4642/pulp_smash/cli.py#L40

#20 Updated by rchan over 1 year ago

  • Sprint changed from Sprint 75 to Sprint 76

#21 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 76 to Sprint 77

#22 Updated by daviddavis about 1 year ago

  • Description updated (diff)

#23 Updated by daviddavis about 1 year ago

  • Description updated (diff)

#24 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 77 to Sprint 78

#25 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 78 to Sprint 79

#26 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 79 to Sprint 80

#27 Updated by daviddavis about 1 year ago

  • Related to Story #2654: As a User I want Pulp3 to log the first 8 characters of a task id in every log statement emitted from a running task added

#28 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 80 to Sprint 81

#29 Updated by rchan about 1 year ago

  • Sprint changed from Sprint 81 to Sprint 82

#30 Updated by daviddavis 12 months ago

I modified Pulp per the design in the description. However, there are some quirks. Here are some logs using an autogenerated CID and a CID set by the correlation id header.

Autogenerated CID (4bca4df4-d69d-4c4b-8542-fee55209af68)

Here you can see that gunicorn doesn't log the correlation_id which is unfortunate as the correlation_id cannot be matched to the request. I'm guessing it's not possible to have gunicorn log the CID as the CID gets generated/set in pulp and guincorn logs its output before execution reaches Pulp. So we'll probably have to have Pulp output the request in its logs.

The resource manager has its own generated CID which isn't going to make sense to a user viewing the logs. Also, the worker has its own CID before it picks up the task. Then it uses the task's CID which is perhaps surprising to a user.

Sep 29 20:14:22 pulp3-source-fedora.crake.example.com rq[31262]: pulp (c68135fc-5b5c-487f-86fa-a98e01f73d96): rq.worker:INFO: resource-manager: 986e0d83-db66-480c-a94d-ec6a09f9d119
Sep 29 20:14:22 pulp3-source-fedora.crake.example.com gunicorn[31259]: 127.0.0.1 - admin [29/Sep/2020:20:14:22 +0000] "POST /pulp/api/v3/repositories/file/file/aef43617-a5cf-4e7c-b8dd-52badeec6e7e/versions/0/repair/ HTTP/1.0" 202 67 "-" "HTTPie/2.0.0"
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (a4c84ce9-adce-42f4-811d-814d69865da0): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: 868422b5-673b-4ac3-a63d-c2034ab061d7
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31262]: pulp (c68135fc-5b5c-487f-86fa-a98e01f73d96): rq.worker:INFO: resource-manager: Job OK (986e0d83-db66-480c-a94d-ec6a09f9d119)
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (4bca4df4-d69d-4c4b-8542-fee55209af68): pulpcore.app.tasks.repository:INFO: Repairing version 0 of repository test
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (4bca4df4-d69d-4c4b-8542-fee55209af68): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: Job OK (868422b5-673b-4ac3-a63d-c2034ab061d7)
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (a4c84ce9-adce-42f4-811d-814d69865da0): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: af9570a0-d648-4120-ba86-7c4b830c3f67
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (a4c84ce9-adce-42f4-811d-814d69865da0): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: Job OK (af9570a0-d648-4120-ba86-7c4b830c3f67)

Setting CID to MY_CORRELATION_ID using HTTP_X_CORRELATION_ID header

I can probably replace None with an empty string. This log perhaps makes more sense but again gunicorn isn't logging with the CID.

Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31170]: pulp (None): rq.worker:INFO: resource-manager: 17aef9a7-5aba-4e6f-a3cf-c44d46b5e865
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com gunicorn[31168]: 127.0.0.1 - admin [29/Sep/2020:20:10:41 +0000] "POST /pulp/api/v3/repositories/file/file/aef43617-a5cf-4e7c-b8dd-52badeec6e7e/versions/0/repair/ HTTP/1.0" 202 67 "-" "HTTPie/2.0.0"
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (None): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: 20ab0ef0-faeb-482c-bae7-62cee69c7ca3
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31170]: pulp (None): rq.worker:INFO: resource-manager: Job OK (17aef9a7-5aba-4e6f-a3cf-c44d46b5e865)
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (MY_CORRELATION_ID): pulpcore.app.tasks.repository:INFO: Repairing version 0 of repository test
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (MY_CORRELATION_ID): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: Job OK (20ab0ef0-faeb-482c-bae7-62cee69c7ca3)
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (None): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: 60ee9923-7aba-472c-8379-010542704557
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (None): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: Job OK (60ee9923-7aba-472c-8379-010542704557)

#31 Updated by rchan 12 months ago

  • Sprint changed from Sprint 82 to Sprint 83

#32 Updated by bmbouter 12 months ago

What if we customize the gunicorn response format and have the CID be included as a response header. I see those are gunicorn logging format option here. Since gunicorn includes data only known by the response, I think it is waiting to log until the response is available.

The benefit of a response header is even auto-generated ones would be shown in the logs. It would probably be valuable for the client to have the CID sent all the way back to them so this creates value there too.

I was thinking also that in all cases there would be a CID. The tasking system I was thinking would have that be a tasking system param and the tasking system would always correlate it's task CID with the dispatcher of that task (either another task or a viewset who autogenerated it). Is that also planned to be part of this work too?

#33 Updated by daviddavis 12 months ago

What if we customize the gunicorn response format and have the CID be included as a response header. I see those are gunicorn logging format option here. Since gunicorn includes data only known by the response, I think it is waiting to log until the response is available.

The benefit of a response header is even auto-generated ones would be shown in the logs. It would probably be valuable for the client to have the CID sent all the way back to them so this creates value there too.

This makes sense and I think it should work.

I was thinking also that in all cases there would be a CID. The tasking system I was thinking would have that be a tasking system param and the tasking system would always correlate it's task CID with the dispatcher of that task (either another task or a viewset who autogenerated it). Is that also planned to be part of this work too?

I'm not sure I understand. Let me give an example: when the resource manager or worker starts up and starts logging (before it's been given a task), what should its CID be? What should it be after it has finished a task?

#34 Updated by bmbouter 12 months ago

daviddavis wrote:

I'm not sure I understand. Let me give an example: when the resource manager or worker starts up and starts logging (before it's been given a task), what should its CID be? What should it be after it has finished a task?

I see what you mean, and I didn't consider those log statements. In those I expect it should show no CID. Anytime a task is being run the CID would be there.

#35 Updated by rchan 12 months ago

  • Sprint changed from Sprint 83 to Sprint 84

#36 Updated by rchan 11 months ago

  • Sprint changed from Sprint 84 to Sprint 85

#37 Updated by daviddavis 11 months ago

The problem I am running into with django-cid is that it generates uuids when you go to get the cid and log messages[0] which is problematic because in some cases (eg when a worker starts before it has a task) we don't want it to generate a cid. Instead the middleware should handle this so that it only generates correlation IDs once when a request is received.

I also looked at django-guid[1] and I think it might work instead.

[0] https://github.com/Polyconseil/django-cid/blob/0dc7fc50d8aff469b989fb8b4b42242f2989ef1c/cid/locals.py#L15-L28 [1] https://github.com/snok/django-guid

#38 Updated by daviddavis 11 months ago

  • Related to Task #7792: Update the gunicorn service file to log correlation id added

#39 Updated by pulpbot 11 months ago

  • Status changed from ASSIGNED to POST

#40 Updated by daviddavis 11 months ago

  • Sprint/Milestone set to 3.9.0

#41 Updated by rchan 11 months ago

  • Sprint changed from Sprint 85 to Sprint 86

#42 Updated by daviddavis 10 months ago

  • Sprint/Milestone deleted (3.9.0)

#43 Updated by rchan 10 months ago

  • Sprint changed from Sprint 86 to Sprint 87

#44 Updated by daviddavis 10 months ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#45 Updated by daviddavis 10 months ago

  • Sprint/Milestone set to 3.9.0

#46 Updated by pulpbot 10 months ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Please register to edit this issue

Also available in: Atom PDF