Continuous monitoring of pinned threads with Spring Boot and JFR

#java Apr 17, 2024 8 min Mike Kowalski

In my previous article about Java Virtual Threads I focused on different pinning cases caused by the synchronized construct. Some of the presented cases were far from evident just from looking at the code. That’s why I believe that actively monitoring our apps for pinning is a must.

We don’t know if pinning happens in our app if we’re not monitoring it. We don’t know how frequently it occurs. We don’t know for how long our threads are pinned. Without monitoring, we can only speculate about how pinning affects our application’s performance. I prefer measuring to guessing.

In this article, I’d like to show how you can monitor your Spring Boot apps for pinned threads. Using JFR (Java Flight Recorder) Event Streaming will allow us to detect thread pinning cases close to real-time with a minimal performance footprint. While the presented solution will rely on Spring Boot, porting it to your framework of choice should be straightforward. Those who are impatient can find the complete example on my GitHub.

Note: If you haven’t heard about the JFR before, have a quick look at the JDK Mission Control User Guide first.

Let’s start simple, to better understand why we need to complicate our lives in the first place.

Simple pinned threads logging

Probably the easiest way of monitoring thread pinning is passing the -Djdk.tracePinnedThreads=full command line option. When pinning occurs, the JVM simply prints out a stack trace, helping us understand where it came from. Such stack traces look like this:

Thread[#26,ForkJoinPool-1-worker-1,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:185)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.parkNanos(VirtualThread.java:631)
    java.base/java.lang.VirtualThread.sleepNanos(VirtualThread.java:803)
    java.base/java.lang.Thread.sleep(Thread.java:507)
    com.mikemybytes.playground.JfrEventsApp$Synchronizer.sync(JfrEventsApp.java:65) <== monitors:1
    com.mikemybytes.playground.JfrEventsApp.lambda$main$1(JfrEventsApp.java:42)
    java.base/java.lang.VirtualThread.run(VirtualThread.java:311)
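
Enabling this option is just a matter of an extra JVM argument when starting the app (the jar name below is only an example):

java -Djdk.tracePinnedThreads=full -jar jfr-thread-pinning-spring-boot.jar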

What I like about this approach is its extreme simplicity. There’s no need to write even a single line of Java code!

Yet, when I tried this approach with one of my Spring Boot apps running on Java 21, the application started to hang unexpectedly. This turned out to be a known problem caused by a JDK issue. The fix has already landed in Java 22, but those stuck with version 21 might need to look further.

Unfortunately, this simple approach has a few limitations. First, it tells us nothing about the duration of the pinning. The longer it takes, the higher its potential performance impact. Second, as pointed out by Todd Ginsberg, this approach seems to report only the first occurrence of each location. In other words, every unique pinning-related stack trace gets logged only once. As a consequence, we may not be able to count the number of occurrences or correlate them with performance issues observed at a specific point in time. Last but not least, we don’t have any control over how such events are logged.

Towards continuous monitoring

I’m happily running some Virtual Threads-powered apps already. What I’ve been looking for was a way of detecting thread pinning in real time. I wanted to know the exact duration of pinning events and the details about their cause. I also wanted to be able to present pinning on my Grafana dashboard, so that whenever something bad happens I can simply look for a correlation. Finally, I wanted to integrate this solution with the existing Spring Boot infrastructure.

Together with Virtual Threads, Java 21 came with a new JFR event called VirtualThreadPinned. This event is emitted whenever pinning occurs. It also carries metadata like the stack trace (to be more precise, the recorded frames) and the duration of the event. That’s why it felt like a perfect candidate for my monitoring solution.

Java 14 introduced JFR (Java Flight Recorder) Event Streaming, which allows consuming the JFR events from within the application itself. This gives us a lot of flexibility when it comes to processing these events. In our case, we would log them using our logger of choice and expose their duration as an observable metric.

In this post, I’ll guide you through the most important building blocks of this solution. The full code of the example jfr-thread-pinning-spring-boot application can be found on GitHub. This simple Spring Boot app exposes a /pinning endpoint which intentionally causes thread pinning via synchronized, so we can see the monitoring solution in action.
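
As a rough sketch, the endpoint boils down to sleeping inside a synchronized block while handling the request on a virtual thread. The code below is illustrative only (not the exact code from the repository), and it assumes spring.threads.virtual.enabled=true, so that requests are handled on virtual threads:

package com.mikemybytes.jfr.pinning;

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

import java.time.Duration;

@RestController
class ThreadPinningController {

    private final Object lock = new Object();

    @PostMapping("/pinning")
    void pinCarrierThread() {
        // sleeping while holding a monitor pins the virtual thread
        // to its carrier thread (Java 21 behavior)
        synchronized (lock) {
            sleep(Duration.ofMillis(250));
        }
    }

    private void sleep(Duration duration) {
        try {
            Thread.sleep(duration);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}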

Recording JFR events in the background

The first step is to start listening to JFR events from within the Spring Boot app. Implementing the SmartLifecycle interface allows us to start/stop the recording session together with our application. Let’s have a look at the code:

package com.mikemybytes.jfr;

import jdk.jfr.consumer.RecordingStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.SmartLifecycle;

import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

public class JfrEventLifecycle implements SmartLifecycle {

    private static final Logger log = LoggerFactory.getLogger(JfrEventLifecycle.class);

    private final AtomicBoolean running = new AtomicBoolean(false);
    private RecordingStream rs;

    @Override
    public void start() {
        if (!isRunning()) {
            running.set(true);

            rs = new RecordingStream(); // (1)
            rs.enable("jdk.VirtualThreadPinned").withStackTrace(); // (2)
            rs.onEvent(
                    "jdk.VirtualThreadPinned",
                    ev -> log.warn("Thread pinning detected! {}", ev) // (3)
            );

            // prevents long-running app from memory leaks
            rs.setMaxAge(Duration.ofSeconds(10)); // (4)

            rs.startAsync(); // (5)
        }
    }

    @Override
    public void stop() {
        if (isRunning()) {
            rs.close();
            running.set(false);
        }
    }

    @Override
    public boolean isRunning() {
        return running.get();
    }
}

The start() method is the most interesting one. We first initialize the JFR recording session by creating a RecordingStream (1). At this point, we are not listening to any JFR events yet.

With (2) we’re explicitly enabling jdk.VirtualThreadPinned event as something we’re interested in. Using .withStackTrace() ensures that the stack trace will be recorded as well.

With (3) we are providing a simple handler function. It simply logs the received RecordedEvent object containing the details about the recorded JFR event. We will complicate this step a bit more later.

Step (4) makes sure that our app won’t be accumulating events forever, which could turn into a memory leak. A window of 10 seconds should be just enough for our purposes. We don’t need to store the history, as we need the recorded events only for a very short time.

Finally, with (5) we are starting the asynchronous recording session that doesn’t block further execution. Our app is now listening to the defined set of JFR events.
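
One more thing worth remembering: JfrEventLifecycle must be registered as a Spring bean, or its lifecycle callbacks will never fire. A minimal configuration could look like this:

package com.mikemybytes.jfr;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class JfrEventConfiguration {

    @Bean
    JfrEventLifecycle jfrEventLifecycle() {
        return new JfrEventLifecycle();
    }
}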

Calling POST /pinning at this point causes an intentional pinning event that gets printed out as expected. Yet, the output looks slightly different from what we would get with -Djdk.tracePinnedThreads=full:

2024-04-14T16:58:30.439+02:00  WARN 14151 --- [jfr-thread-pinning-spring-boot] [ Event Stream 1] c.m.jfr.pinning.JfrEventLifecycle        : Thread pinning detected! jdk.VirtualThreadPinned {
  startTime = 16:58:30.051 (2024-04-14)
  duration = 251 ms
  eventThread = "web-vt-729a03c1-5456-4686-90b2-b2659291752f" (javaThreadId = 69, virtual)
  stackTrace = [
    java.lang.VirtualThread.parkOnCarrierThread(boolean, long) line: 677
    java.lang.VirtualThread.parkNanos(long) line: 636
    java.lang.VirtualThread.sleepNanos(long) line: 793
    java.lang.Thread.sleep(long) line: 507
    com.mikemybytes.jfr.pinning.ThreadPinningController.sleep(Duration) line: 36
    ...
  ]
}

As promised earlier, we finally see how long the pinning took (duration). We can find the exact thread (eventThread) that caused that event too. Unfortunately, the stackTrace is limited only to the top 5 entries by default. In many cases, this won’t be enough. Luckily, we can customize the output according to our needs.

If you analyze the whole log entry carefully, you will notice a delay between the recorded event time (16:58:30.051) and our log entry (16:58:30.439). That delay is why this solution is close to real-time monitoring, but not truly real-time. In my opinion, the delay is small enough for our purposes (it’s usually close to 500 ms). We just have to be aware of it when analyzing the collected pinning data.

Customizing pinning event data

Let’s extract the event handling part to a dedicated JfrVirtualThreadPinnedEventHandler. This way we will decouple the event-specific logic from the JFR recording session itself.

Since we already know that presenting only the top 5 stack trace entries won’t be enough, we should tweak the output a bit. Unfortunately, the RecordedObject class (which RecordedEvent extends) hardcodes the stack depth. We also can’t use the PrettyWriter class that is used under the hood, as it’s not exposed by the jdk.jfr module… As a result, we have to provide our own stack trace formatting method.

Additionally, we want our handler class to expose the pinning duration as a Micrometer Timer. With the Prometheus and Grafana integration in place, we could then present it together with other application metrics. Since a Timer acts as a Counter too, we also get the number of registered events.

The implementation of our handler could look like this:

package com.mikemybytes.jfr.pinning;

import io.micrometer.core.instrument.MeterRegistry;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.stream.Collectors;

@Component
class JfrVirtualThreadPinnedEventHandler {

    // improvement idea: expose as a configuration
    private static final int STACK_TRACE_MAX_DEPTH = 25;

    private final Logger log = LoggerFactory.getLogger(JfrVirtualThreadPinnedEventHandler.class);

    private final MeterRegistry meterRegistry;

    JfrVirtualThreadPinnedEventHandler(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    void handle(RecordedEvent event) {
        // marked as nullable in Javadoc
        var thread = event.getThread() != null ? event.getThread().getJavaName() : "<unknown>";
        var duration = event.getDuration();
        var startTime = LocalDateTime.ofInstant(event.getStartTime(), ZoneId.systemDefault());
        var stackTrace = formatStackTrace(event.getStackTrace(), STACK_TRACE_MAX_DEPTH);

        log.warn(
                "Thread '{}' pinned for: {}ms at {}, stacktrace: \n{}",
                thread,
                duration.toMillis(),
                startTime,
                stackTrace
        );

        var timer = meterRegistry.timer("jfr.thread.pinning");
        timer.record(duration);
    }

    private String formatStackTrace(RecordedStackTrace stackTrace, int maxDepth) {
        if (stackTrace == null) {
            return "\t<not available>";
        }
        String formatted = "\t" + stackTrace.getFrames().stream()
                .limit(maxDepth)
                .map(JfrVirtualThreadPinnedEventHandler::formatStackTraceFrame)
                .collect(Collectors.joining("\n\t"));
        if (maxDepth < stackTrace.getFrames().size()) {
            return formatted + "\n\t(...)"; // truncated
        }
        return formatted;
    }

    private static String formatStackTraceFrame(RecordedFrame frame) {
        return frame.getMethod().getType().getName() + "#" + frame.getMethod().getName() + ": " + frame.getLineNumber();
    }

}
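
With the handler extracted, the recording session simply delegates to it. A sketch of the change in JfrEventLifecycle, assuming the handler is passed in via constructor injection:

    private final JfrVirtualThreadPinnedEventHandler pinnedEventHandler;

    JfrEventLifecycle(JfrVirtualThreadPinnedEventHandler pinnedEventHandler) {
        this.pinnedEventHandler = pinnedEventHandler;
    }

    // in start(), the logging lambda gets replaced with:
    rs.onEvent("jdk.VirtualThreadPinned", pinnedEventHandler::handle);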

If we cause thread pinning via the endpoint once more, we will see our custom output formatting in action:

2024-04-14T17:18:32.460+02:00  WARN 14322 --- [jfr-thread-pinning-spring-boot] [ Event Stream 1] m.j.p.JfrVirtualThreadPinnedEventHandler : Thread 'web-vt-01aad770-6783-4b53-b8a9-21e5a812fa0e' pinned for: 250ms at 2024-04-14T17:18:32.095766166, stacktrace: 
	java.lang.VirtualThread#parkOnCarrierThread: 677
	java.lang.VirtualThread#parkNanos: 636
	java.lang.VirtualThread#sleepNanos: 793
	java.lang.Thread#sleep: 507
	com.mikemybytes.jfr.pinning.ThreadPinningController#sleep: 36
	com.mikemybytes.jfr.pinning.ThreadPinningController#lambda$pinCarrierThread$0: 27
	com.mikemybytes.jfr.pinning.ThreadPinningController$$Lambda+0x000000b00155f6d8.1266805443#run: -1
	java.lang.Thread#runWith: 1596
	java.lang.VirtualThread#run: 309
	java.lang.VirtualThread$VThreadContinuation$1#run: 190
	jdk.internal.vm.Continuation#enter0: 320
	jdk.internal.vm.Continuation#enter: 312
	jdk.internal.vm.Continuation#enterSpecial: -1

Just as intended, we are no longer limited to only the top 5 stack trace entries. Yet, it’s important to have a reasonable limit here, as for some recursive calls the stack trace could be very, very long.

Last but not least, let’s examine the /actuator/prometheus endpoint output, looking for our jfr.thread.pinning timer. Keep in mind that dots will be replaced with underscores here, so we’re looking for jfr_thread_pinning:

(...)
# HELP jfr_thread_pinning_seconds_max  
# TYPE jfr_thread_pinning_seconds_max gauge
jfr_thread_pinning_seconds_max 0.25093325
# HELP jfr_thread_pinning_seconds  
# TYPE jfr_thread_pinning_seconds summary
jfr_thread_pinning_seconds_count 1.0
jfr_thread_pinning_seconds_sum 0.25093325
(...)

I’ll leave causing a thread pinning multiple times to see how the output changes as an exercise for you 😉

Next steps

The presented solution is just a starting point. You can play with the logging output as much as you want. You may introduce a Micrometer histogram to get more detailed data about the registered pinning occurrences over time. You will also need to extend your dashboards so the recorded data is visible there.
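
For instance, Micrometer’s Timer builder can publish percentile histogram buckets to Prometheus. A sketch of how the timer from our handler could be adjusted (requires importing io.micrometer.core.instrument.Timer):

    var timer = Timer.builder("jfr.thread.pinning")
            .publishPercentileHistogram() // emit histogram buckets for Prometheus
            .register(meterRegistry);
    timer.record(duration);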

Also, you may want to come up with a custom “signature” for each location causing pinning. It could then be used as a tag for each event reported via metrics. This way, you could easily group pinning occurrences coming from the same location. Such a signature could be a simple hash over the recorded stack trace, or something fancier.
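
A naive sketch of that idea, reusing the formatted stack trace from our handler (hypothetical code; keep in mind that high-cardinality tags can strain your metrics backend):

    // derive a stable "location" signature from the formatted stack trace
    String signature = Integer.toHexString(stackTrace.hashCode());
    meterRegistry.timer("jfr.thread.pinning", "location", signature)
            .record(duration);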

Finally, there are many other JFR events available for collecting data about resource consumption, GC activity, and more. With the infrastructure we’ve already created, adding support for additional events should be straightforward.
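
For example, subscribing to a few more standard JDK events is just a matter of extra lines in start() (the handlers below are placeholders):

    rs.enable("jdk.GarbageCollection");
    rs.onEvent("jdk.GarbageCollection",
            ev -> log.info("GC took: {} ms", ev.getDuration().toMillis()));

    rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(5));
    rs.onEvent("jdk.CPULoad",
            ev -> log.info("Machine CPU load: {}", ev.getFloat("machineTotal")));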

Mike Kowalski

Software engineer believing in craftsmanship and the power of fresh espresso. Writing in & about Java, distributed systems, and beyond. Mikes his own opinions and bytes.