Story #4689
closedAs a user, Pulp logs have correlation ids in the logs of all services
100%
Description
Motivation¶
Pulp as a multi-process system needs to make debugging easy. Currently it is hard to:
- Identify which log statements are related to a specific API request
- Filter out all of those that aren't
- Do this across multiple processes
Additionally, there is a use case where an external system wants to provide the correlation id, e.g. Katello wants to submit the correlation ID as part of the request.
Solution¶
Use django-cid by default, and document that it's available for users to use.
Implementation¶
Enabling in Pulp itself¶
- Add django-cid as an installed app
- Reconfigure the default Django logger config in settings.py to incorporate the CID
- Add user documentation identifying the types of things they can additionally configure
- Add a test to assert Pulp viewset logs contain an autogenerated CID
Enabling in the App¶
- Add a test to assert content app logs contain an autogenerated CID
Enabling in Tasking System¶
-
Create the field on task and set it: Add a field to the Task model called
logging_cid
. Have theenqueue_with_reservation
function set this field when it creates a Task object with the value ofget_cid
. -
Have the CID log on the resource manager itself: Have the
_queue_reserved_task
method callset_cid()
with the value of the newlogging_cid
from theinner_task
. This will enable the correlation ID in the resource_manager logs for that task also. -
Have all tasks also use the CID: Have the
perform_job
callset_cid()
with the value of thelogging_cid
from the task. This will enable the correlation ID in all pulp task logs without them having to change their signatures. -
Lastly, expose the
logging_cid
in the task API for users to see.
Related issues
Updated by bmbouter almost 5 years ago
- Related to Task #3289: Support for X-Correlation-ID HTTP REST header added
Updated by ggainey over 4 years ago
- Tags Katello added
- Tags deleted (
Katello-P3)
Updated by bmbouter over 4 years ago
- Subject changed from As a user, I can configure pulp to accept a correlation id header to As a user, Pulp logs have correlation ids in the logs of all services
- Description updated (diff)
Updated by daviddavis over 4 years ago
For this kwarg _pulp_logging_cid
, will the code calling enqueue_with_reservation
have to manually set this if it wants task logging to use this CID?
Updated by bmbouter over 4 years ago
daviddavis wrote:
For this kwarg
_pulp_logging_cid
, will the code callingenqueue_with_reservation
have to manually set this if it wants task logging to use this CID?
The caller of enqueue_with_reservation
will not have to specify the CID, the CID will automatically and transparently be added by enqueue_with_reservation
itself. The signature of enqueue_with_reservation
won't be adjusted with this change so rolling it out will require no other code to change. Also when rolled out it, will be enabled automatically on all tasks.
Updated by daviddavis over 4 years ago
- Groomed changed from No to Yes
- Sprint Candidate changed from No to Yes
Updated by daviddavis over 4 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
- Priority changed from Low to Normal
Updated by daviddavis over 4 years ago
Any idea how to handle the test? Not sure how to use functional tests since tests might be running on a different box than Pulp. Perhaps a unit test that uses mock somehow?
Updated by bmbouter over 4 years ago
daviddavis wrote:
Any idea how to handle the test? Not sure how to use functional tests since tests might be running on a different box than Pulp. Perhaps a unit test that uses mock somehow?
I believe pulp-smash has a utility that allows you to run commands against the host box. You can see an example of pulp-smash code checking for root on the remote box for example here: https://github.com/pulp/pulp-smash/blob/8b0dbd3febb7a3a30367277c4f838c52a13b4642/pulp_smash/cli.py#L40
Updated by daviddavis about 4 years ago
- Related to Story #2654: As a User I want Pulp3 to log the first 8 characters of a task id in every log statement emitted from a running task added
Updated by daviddavis about 4 years ago
I modified Pulp per the design in the description. However, there are some quirks. Here are some logs using an autogenerated CID and a CID set by the correlation id header.
Autogenerated CID (4bca4df4-d69d-4c4b-8542-fee55209af68)¶
Here you can see that gunicorn doesn't log the correlation_id which is unfortunate as the correlation_id cannot be matched to the request. I'm guessing it's not possible to have gunicorn log the CID as the CID gets generated/set in pulp and guincorn logs its output before execution reaches Pulp. So we'll probably have to have Pulp output the request in its logs.
The resource manager has its own generated CID which isn't going to make sense to a user viewing the logs. Also, the worker has its own CID before it picks up the task. Then it uses the task's CID which is perhaps surprising to a user.
Sep 29 20:14:22 pulp3-source-fedora.crake.example.com rq[31262]: pulp (c68135fc-5b5c-487f-86fa-a98e01f73d96): rq.worker:INFO: resource-manager: 986e0d83-db66-480c-a94d-ec6a09f9d119
Sep 29 20:14:22 pulp3-source-fedora.crake.example.com gunicorn[31259]: 127.0.0.1 - admin [29/Sep/2020:20:14:22 +0000] "POST /pulp/api/v3/repositories/file/file/aef43617-a5cf-4e7c-b8dd-52badeec6e7e/versions/0/repair/ HTTP/1.0" 202 67 "-" "HTTPie/2.0.0"
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (a4c84ce9-adce-42f4-811d-814d69865da0): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: 868422b5-673b-4ac3-a63d-c2034ab061d7
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31262]: pulp (c68135fc-5b5c-487f-86fa-a98e01f73d96): rq.worker:INFO: resource-manager: Job OK (986e0d83-db66-480c-a94d-ec6a09f9d119)
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (4bca4df4-d69d-4c4b-8542-fee55209af68): pulpcore.app.tasks.repository:INFO: Repairing version 0 of repository test
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (4bca4df4-d69d-4c4b-8542-fee55209af68): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: Job OK (868422b5-673b-4ac3-a63d-c2034ab061d7)
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (a4c84ce9-adce-42f4-811d-814d69865da0): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: af9570a0-d648-4120-ba86-7c4b830c3f67
Sep 29 20:14:23 pulp3-source-fedora.crake.example.com rq[31263]: pulp (a4c84ce9-adce-42f4-811d-814d69865da0): rq.worker:INFO: 31263@pulp3-source-fedora.crake.example.com: Job OK (af9570a0-d648-4120-ba86-7c4b830c3f67)
Setting CID to MY_CORRELATION_ID using HTTP_X_CORRELATION_ID header¶
I can probably replace None
with an empty string. This log perhaps makes more sense but again gunicorn isn't logging with the CID.
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31170]: pulp (None): rq.worker:INFO: resource-manager: 17aef9a7-5aba-4e6f-a3cf-c44d46b5e865
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com gunicorn[31168]: 127.0.0.1 - admin [29/Sep/2020:20:10:41 +0000] "POST /pulp/api/v3/repositories/file/file/aef43617-a5cf-4e7c-b8dd-52badeec6e7e/versions/0/repair/ HTTP/1.0" 202 67 "-" "HTTPie/2.0.0"
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (None): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: 20ab0ef0-faeb-482c-bae7-62cee69c7ca3
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31170]: pulp (None): rq.worker:INFO: resource-manager: Job OK (17aef9a7-5aba-4e6f-a3cf-c44d46b5e865)
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (MY_CORRELATION_ID): pulpcore.app.tasks.repository:INFO: Repairing version 0 of repository test
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (MY_CORRELATION_ID): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: Job OK (20ab0ef0-faeb-482c-bae7-62cee69c7ca3)
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (None): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: 60ee9923-7aba-472c-8379-010542704557
Sep 29 20:10:41 pulp3-source-fedora.crake.example.com rq[31169]: pulp (None): rq.worker:INFO: 31169@pulp3-source-fedora.crake.example.com: Job OK (60ee9923-7aba-472c-8379-010542704557)
Updated by bmbouter about 4 years ago
What if we customize the gunicorn response format and have the CID be included as a response header. I see those are gunicorn logging format option here. Since gunicorn includes data only known by the response, I think it is waiting to log until the response is available.
The benefit of a response header is even auto-generated ones would be shown in the logs. It would probably be valuable for the client to have the CID sent all the way back to them so this creates value there too.
I was thinking also that in all cases there would be a CID. The tasking system I was thinking would have that be a tasking system param and the tasking system would always correlate it's task CID with the dispatcher of that task (either another task or a viewset who autogenerated it). Is that also planned to be part of this work too?
Updated by daviddavis about 4 years ago
What if we customize the gunicorn response format and have the CID be included as a response header. I see those are gunicorn logging format option here. Since gunicorn includes data only known by the response, I think it is waiting to log until the response is available.
The benefit of a response header is even auto-generated ones would be shown in the logs. It would probably be valuable for the client to have the CID sent all the way back to them so this creates value there too.
This makes sense and I think it should work.
I was thinking also that in all cases there would be a CID. The tasking system I was thinking would have that be a tasking system param and the tasking system would always correlate it's task CID with the dispatcher of that task (either another task or a viewset who autogenerated it). Is that also planned to be part of this work too?
I'm not sure I understand. Let me give an example: when the resource manager or worker starts up and starts logging (before it's been given a task), what should its CID be? What should it be after it has finished a task?
Updated by bmbouter about 4 years ago
daviddavis wrote:
I'm not sure I understand. Let me give an example: when the resource manager or worker starts up and starts logging (before it's been given a task), what should its CID be? What should it be after it has finished a task?
I see what you mean, and I didn't consider those log statements. In those I expect it should show no CID. Anytime a task is being run the CID would be there.
Updated by daviddavis about 4 years ago
The problem I am running into with django-cid is that it generates uuids when you go to get the cid and log messages[0] which is problematic because in some cases (eg when a worker starts before it has a task) we don't want it to generate a cid. Instead the middleware should handle this so that it only generates correlation IDs once when a request is received.
I also looked at django-guid[1] and I think it might work instead.
[0] https://github.com/Polyconseil/django-cid/blob/0dc7fc50d8aff469b989fb8b4b42242f2989ef1c/cid/locals.py#L15-L28 [1] https://github.com/snok/django-guid
Updated by daviddavis about 4 years ago
- Related to Task #7792: Update the gunicorn service file to log correlation id added
Updated by pulpbot about 4 years ago
- Status changed from ASSIGNED to POST
Added by daviddavis almost 4 years ago
Updated by daviddavis almost 4 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulpcore|ca7dbe15de91e13605f074ee7c9349e58bcef059.
Updated by pulpbot almost 4 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Add support for correlation_id to pulp
fixes #4689