Use (End-to-End) Tracing or Correlation IDs

Over the last few years, I've been supporting a lot of production services in a personal and professional capacity.

While supporting them, I've inevitably had to track down problems across various sets of logging and monitoring, and have found that there's one thing that makes the process much, much easier to diagnose - the use of a unique identifier per interaction which is exposed to logging and monitoring.

This unique identifier is one that is constant across a whole interaction in a single (micro)service, and allows for an easy way to get a view of everything happening for that interaction with the service, which depending on how many logs your service produces and how much traffic is received, could quite considerably reduce the overhead of filtering logs.

This unique ID is hopefully shared across the UIs and various services that make up the system being interacted with, with each service replaying it on to the next, so you have a way of viewing all logs that correlate to a single customer interaction end-to-end.

This ID is commonly referred to as a "tracking ID", "tracing ID", "request ID", "correlation ID" or "context ID", but it may be something else that you're familiar with. It's often a UUID, but could be anything with sufficient randomness.

This ID is something I've seen generally made a required header in the service's contract, ensuring that these are definitely being sent, but it's also possible to not enforce it, and generate a fresh one if it's not already being sent to you in a valid format, so you at least have something to diagnose from your side, even if you can't piece it together with the client's view.

A benefit of this pattern, as well, is that it can make component or system testing much easier, as you know that a specific ID is going to be used and can therefore set up expectations more easily.

You may come up with clients who don't want to perform the work to start sending this header, such as a client we had that was sending it, but it was static and they didn't have the time to prioritise to resolve it. When I explained that each issue that was related to their integration would take at least 2-3 times longer to resolve (and be very costly to everyone involved) - if we even could track things down - they realised that actually they should really be amending their configuration.

Something I've not done before, but have seen others recommend is to make sure that this ID is also returned in HTTP response headers, so it's clear to a consumer of the service, as well as an operator looking through logs.

Example using Spring Boot

As an example, let's see what this would look like for a Spring filter:

@Component
public class CorrelationIdFilter extends OncePerRequestFilter {
  // UUIDv4, matching either case, but depends on what format you want to use
  private static final Pattern UUID_PATTERN =
      Pattern.compile("([a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){4}[a-fA-F0-9]{8})");

  @Override
  protected void doFilterInternal(
      @NonNull HttpServletRequest request,
      @NonNull HttpServletResponse response,
      @NonNull FilterChain filterChain)
      throws ServletException, IOException {

    String correlationId = request.getHeader("correlation-id");
    if (null == correlationId || !UUID_PATTERN.matcher(correlationId).matches()) {
      // only allow UUIDs, if it's not valid according to our contract, allow it to be rewritten
      // alternatively, we would rejct the request with an HTTP 400 Bad Request, as a client
      // hasn't fulfilled the contract
      correlationId = UUID.randomUUID().toString();
    }
    // make sure that the Mapped Diagnostic Context (MDC) has the `correlationId` so it can then
    // be populated in the logs
    try (MDC.MDCCloseable closable = MDC.putCloseable("correlationId", correlationId)) {
      filterChain.doFilter(request, response);
    }
  }
}

If you were using a structured logging format), you'd have a log outputted such as:

{
  "@timestamp": "2021-11-21T18:50:57.467+00:00",
  "@version": 1,
  "level": "WARN",
  "logger_name": "me.jvt.www.api.micropub.exception.MicropubExceptionHandler",
  "mdc": {
    "correlationId": "0dbde4f9-dbcb-49e6-9258-3a377d4696ad"
  },
  "message": "An invalid request was rejected for reason: Query `` is not supported",
  "source_host": "micropub-845978f994-tvpft",
  "thread_name": "http-nio-8080-exec-4"
}

Or an unstructured logging format:

2021-05-31 14:39:04.872  ERROR 1167226 --- [o-auto-1-exec-1] c.e.s.HelloController : c8633195-2ed1-45fa-ae3d-5afb3ce33c82 An unexpected error occured, cause: ...

Written by Jamie Tanna's profile image Jamie Tanna on , and last updated on .

Content for this article is shared under the terms of the Creative Commons Attribution Non Commercial Share Alike 4.0 International, and code is shared under the Apache License 2.0.

#production #incident-management #supportability.

This post was filed under articles.

Interactions with this post

Interactions with this post

Below you can find the interactions that this page has had using WebMention.

Have you written a response to this post? Let me know the URL:

Do you not have a website set up with WebMention capabilities? You can use Comment Parade.