Excited to announce our new NNsight version, nnsight v0.5.13!
This release re-integrates support for vLLM into NNsight and introduces performance improvements.
To learn more, check out the release notes below and the vLLM tutorial.
Please use this thread to provide feedback on vLLM integration and any other issues concerning this release.
Release Notes:
1. nnsight support for vLLM inference has been completely refactored and works with the latest version of vLLM, including tensor parallelism. This enables fast inference on multi-GPU models with NNsight interventions!
if __name__ == "__main__":
    from nnsight.modeling.vllm import VLLM

    model = VLLM("meta-llama/Llama-3.1-8B", dispatch=True, tensor_parallel_size=2)

    with model.trace(
        "The Eiffel Tower is located in the city of",
        temperature=0.8,
        max_tokens=30,
    ) as tracer:
        activations = list().save()
        logits = list().save()
        samples = list().save()

        with tracer.iter[:30]:
            activations.append(model.model.layers[16].mlp.down_proj.output[0].cpu())
            logits.append(model.logits.output)
            samples.append(model.samples.output)

        output = tracer.result.save()
Feedback on our vLLM integration would be much appreciated.
Works with vLLM >= 0.12.
2. Optimizations to interleaving yield performance improvements across the board, most noticeable when performing a large number of interventions.
In addition, there are three config flags you can set that yield even greater improvements, but they require code changes or are more experimental.
from nnsight import CONFIG as NNSIGHT_CONFIG
NNSIGHT_CONFIG.APP.PYMOUNT = False
NNSIGHT_CONFIG.APP.CROSS_INVOKER = False
NNSIGHT_CONFIG.APP.TRACE_CACHING = True
- PYMOUNT: Turning this flag off removes the ability to call .save() on arbitrary objects; instead you will need to call nnsight.save:
from nnsight import save

with model.trace("Hello world"):
    output = save(model.output)
Mounting and un-mounting .save() onto Python objects has some performance cost.
- CROSS_INVOKER: Turning this off prevents sharing variables between invokers. This sharing has a performance cost, and most people don't use it anyway, so you should probably turn it off.
with model.trace() as tracer:
    with tracer.invoke("Hello world"):
        hs = model.model.layers[0].output
    with tracer.invoke("Hello world"):
        model.model.layers[1].output = hs  # UnboundVariable: hs is not defined (with CROSS_INVOKER off)
- TRACE_CACHING: This caches the source code of your trace, making future lookups much faster. If you have a trace inside a loop or in a function called more than once, you'll see a significant improvement.
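As a sketch of where TRACE_CACHING pays off (using the CONFIG flag and trace API shown above; the helper function and loop here are hypothetical, not part of the release):

```python
from nnsight import CONFIG

# Experimental: cache the parsed source of each trace body.
CONFIG.APP.TRACE_CACHING = True

# Hypothetical helper: the trace body is identical on every call,
# so after the first call its source lookup is served from the cache.
def layer0_hidden(model, prompt):
    with model.trace(prompt):
        return model.model.layers[0].output.save()

# Repeated calls are where the caching helps:
# for prompt in prompts:
#     hs = layer0_hidden(model, prompt)
```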