https://spring.io/blog/2016/02/15/distributed-tracing-with-spring-cloud-sleuth-and-spring-cloud-zipkin

I was inspired to write this particular post by Adrian Cole - an OpenZipkin contributor and a contributor to Spring Cloud Sleuth and Zipkin - and his epic talk introducing Zipkin. There’s so much good stuff in that talk, so once you’re done reading this, go watch it!

Advances in technology and cloud computing have made it easier than ever to stand up and deploy new services. Cloud computing lets us automate away the pain of provisioning, shrinking the time to stand up a new service from days or weeks (gasp!) to minutes. This increase in velocity in turn enables us to be more agile and to think in terms of smaller batches of independently deployable services. The proliferation of new services, however, complicates reasoning about system-wide and request-specific performance characteristics.

When all of an application’s functionality lives in a monolith - what we call an application written as one large, unbroken deployable like a .war or .ear - it’s much easier to reason about where things have gone wrong. Is there a memory leak? It’s in the monolith. Is a component not handling requests correctly? It’s in the monolith. Messages getting dropped? Also probably in the monolith. Distribution changes everything.

Systems behave differently under load and at scale. The specification of a system’s behavior often diverges from its actual behavior, and the actual behavior may itself vary across contexts. It is important to contextualize requests as they transit through a system. It’s also important to be able to talk about the nature of a specific request and to understand that request’s behavior relative to the general behavior of similar requests in the past minute, hour, day, or whatever other interval provides a statistically significant sample. Context helps us establish whether a request was abnormal and whether it merits attention. You can’t trace bugs in a system until you’ve established a baseline for what normal is. How long is long? For some systems it might be microseconds; for others it might be seconds or minutes!

In this post, we’ll look at how Spring Cloud Sleuth, which supports distributed tracing, can help us establish this context and better understand a system’s actual behavior, not just its specified behavior.

Finding Clues with Spring Cloud Sleuth

Tracing is simple, in theory. As a request flows from one component to another in a system, through ingress and egress points, tracers add logic where possible to perpetuate a unique trace ID that’s generated when the first request is made. As a request arrives at a component along its journey, a new span ID is assigned for that component and added to the trace. A trace represents the whole journey of a request, and a span is each individual hop along the way. Spans may contain tags, or metadata, that can be used to later contextualize the request. Spans typically contain common tags like start and stop timestamps, though it’s easy to associate semantically relevant tags, like a business entity ID, with a span.
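To make that concrete, here is a minimal sketch of creating and tagging a span by hand, assuming the Spring Cloud Sleuth 1.x Tracer API (the CheckoutService component and the order.id tag are hypothetical, purely for illustration):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.stereotype.Component;

@Component
public class CheckoutService {

	@Autowired
	private Tracer tracer;

	public void checkout(String orderId) {
		// start a new span within the current trace for this unit of work
		Span span = this.tracer.createSpan("checkout");
		try {
			// attach a semantically relevant tag to the span
			this.tracer.addTag("order.id", orderId);
			// ... the actual work goes here ...
		}
		finally {
			// closing the span records its stop timestamp
			this.tracer.close(span);
		}
	}
}

The automatic instrumentation described next does the same bookkeeping for you at common ingress and egress points.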

Spring Cloud Sleuth (org.springframework.cloud:spring-cloud-starter-sleuth), once added to the CLASSPATH, automatically instruments common communication channels: requests made with the RestTemplate, HTTP requests received at Spring MVC controllers, requests that pass through a Netflix Zuul microproxy, and requests sent over messaging technologies like Apache Kafka or RabbitMQ (or any other Spring Cloud Stream binder).
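One detail worth noting: the RestTemplate instrumentation works on RestTemplate beans in the application context. As a sketch, assuming Sleuth 1.x (the RestClientConfiguration class name is arbitrary), expose the client as a bean rather than calling new RestTemplate() at the call site:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestClientConfiguration {

	// Sleuth adds an interceptor to RestTemplate beans it finds in the context,
	// so outbound requests made through this bean carry the trace and span ID
	// headers automatically.
	@Bean
	public RestTemplate restTemplate() {
		return new RestTemplate();
	}
}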

Spring Cloud Sleuth sets up useful log formatting for you that prints the trace ID and the span ID. Assuming you’re running Spring Cloud Sleuth-enabled code in a microservice whose spring.application.name is my-service-id, you will see something like this in the logging for your microservice:

2016-02-11 17:12:45.404  INFO [my-service-id,73b62c0f90d11e06,73b62c0f90d11e06,false] 85184 --- [nio-8080-exec-1] com.example.MySimpleComponentMakingARequest     : ...

In that example, my-service-id is the spring.application.name, 73b62c0f90d11e06 is the trace ID, and 73b62c0f90d11e06 is the span ID (identical here because this is the root span of the trace). This information is very useful. You can publish your logs to log analysis and manipulation tools like Elasticsearch and Splunk. There are various ways to get the data there. Logstash, for example, is a log publisher that will write to Elasticsearch. Cloud Foundry automatically aggregates logs from all instances of a service into a unified stream through a tool called the Loggregator, which can then forward them to any syslog-compatible service, including tools like Splunk or Papertrail. Whatever approach you take, you can do interesting queries if you have all the logs, and the trace information, in a single place available for query and analysis.

Spring Cloud Sleuth also makes this information available to any Spring Cloud Sleuth-aware Spring application by simply injecting the SpanAccessor. You can also use this to instrument your own components that aren’t already instrumented by Spring Cloud so that they can perpetuate trace information. Naturally, each tracer is going to be different, but Spring Cloud Sleuth’s code itself (e.g.: TraceFeignClientAutoConfiguration) is suggestive of how a typical tracer works:

...
	@Autowired
	private SpanAccessor spanAccessor;

	...

	Span span = this.spanAccessor.getCurrentSpan();

	...

	template.header(Span.TRACE_ID_NAME, Span.toHex(span.getTraceId()));
	setHeader(template, Span.SPAN_NAME_NAME, span.getName());
	setHeader(template, Span.SPAN_ID_NAME, Span.toHex(span.getSpanId()));
	...
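In the same spirit, here is a minimal sketch of what perpetuating trace information from one of your own, not-yet-instrumented components might look like, using the SpanAccessor and the Span constants shown above (the MessageHeaderEnricher class and the outbound header map are hypothetical):

import java.util.Map;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.SpanAccessor;
import org.springframework.stereotype.Component;

@Component
public class MessageHeaderEnricher {

	@Autowired
	private SpanAccessor spanAccessor;

	public void addTraceHeaders(Map<String, String> outboundHeaders) {
		Span span = this.spanAccessor.getCurrentSpan();
		if (span == null) {
			return; // nothing to propagate outside of a trace
		}
		// copy the current trace and span IDs onto the outgoing message
		outboundHeaders.put(Span.TRACE_ID_NAME, Span.toHex(span.getTraceId()));
		outboundHeaders.put(Span.SPAN_ID_NAME, Span.toHex(span.getSpanId()));
	}
}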

How Much Data is Enough?

Which requests should be traced? Ideally, you’ll want enough data to see trends reflective of live, operational traffic. You don’t want to overwhelm your logging and analysis infrastructure, though. Some organizations may keep only one request out of every ten, or every thousand, or every million! By default, the threshold is 10%, or .1, though you may override it by specifying a sampling percentage.
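Here is a minimal sketch of overriding the default, assuming Sleuth 1.x’s sampler support (the SamplingConfiguration class name is arbitrary). AlwaysSampler keeps every request, which is handy in development:

import org.springframework.cloud.sleuth.sampler.AlwaysSampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SamplingConfiguration {

	// trace every request; in production you would more likely tune the
	// percentage-based sampler (the spring.sleuth.sampler.percentage
	// property in 1.x) instead of keeping everything
	@Bean
	public AlwaysSampler defaultSampler() {
		return new AlwaysSampler();
	}
}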