Skip to content

JCQueue: self-referential ThreadLocal inserters can leak JCQueue instances on long-lived/shared producer threads #8810

Description

@GGraziadei

Summary

JCQueue stores its per-producer-thread batching inserters in instance-field ThreadLocals:

private final ThreadLocal<BatchInserter> thdLocalBatcher = new ThreadLocal<BatchInserter>();
private final ThreadLocal<DynamicBatchInserter> thdLocalDynamicBatcher = new ThreadLocal<DynamicBatchInserter>();

Each inserter holds a strong reference back to its owning JCQueue (this.queue). This forms a reference cycle that prevents the JCQueue from being garbage collected for as long as any producer thread that ever published to it stays alive -- even after every other reference to the queue is dropped.

Mechanism

ThreadLocalMap holds its keys weakly but its values strongly. The cycle defeats the weak key:

Thread
  -> ThreadLocalMap
       - key:   ThreadLocal           (weak ref)   [also strongly held as a JCQueue field]
       - value: DynamicBatchInserter  (strong)
                  -> queue: JCQueue    (strong)
                       -> thdLocalDynamicBatcher: ThreadLocal (strong field)

  The key is therefore strongly reachable via: value -> queue -> field

The normal cleanup path for an instance-field ThreadLocal -- the key becoming weakly-reachable once the owning object dies, which then lets the entry be expunged -- never triggers, because value -> queue -> field keeps the key strongly reachable. As long as the producer thread lives, the JCQueue (and its metrics, batch buffers, etc.) cannot be collected.

Impact

  • Production workers: effectively no impact. Storm runs each worker as its own JVM process; producer threads and the JCQueue share a lifetime and both die when the worker process exits. The retained memory is reclaimed by process teardown.
  • Shared/static thread pools: real leak. Where producer threads outlive individual queues -- e.g. LocalCluster, embedded deployments, or test harnesses that repeatedly start/stop topologies on long-lived threads -- each killed topology can strand its JCQueue instances on those threads' ThreadLocalMaps. Over many start/stop cycles this accumulates.

Mitigations that do NOT work

Two seemingly obvious fixes are ineffective and should be avoided:

  1. static ThreadLocal<WeakHashMap<JCQueue, Inserter>> (composite weak key).
    The value (Inserter) strongly references the JCQueue that is the map key. This is the classic WeakHashMap value-references-key leak: the key never becomes weakly-reachable, so it is never evicted. The cycle is relocated, not broken.

  2. Cleanup in JCQueue.close() via ThreadLocal.remove().
    close() runs on whatever thread invokes it (typically the consumer/cleanup thread). ThreadLocal.remove() only clears the calling thread's entry. There is no API to clear other producer threads' ThreadLocalMap entries -- which are exactly the ones retaining the queue. So this is cosmetic for the actual leak.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions