Excited to announce our new NNsight version, nnsight v0.5.13!
This release re-integrates support for vLLM into NNsight and introduces performance improvements.
To learn more, check out the release notes below and the vLLM tutorial.
Please use this thread to provide feedback on vLLM integration and any other issues concerning this release.
Release Notes:
1. nnsight support for vLLM inference has been completely refactored and works with the latest version of vLLM, including tensor parallelism. This enables fast inference on multi-GPU models with NNsight interventions!
if __name__ == "__main__":
    from nnsight.modeling.vllm import VLLM

    model = VLLM("meta-llama/Llama-3.1-8B", dispatch=True, tensor_parallel_size=2)

    with model.trace(
        "The Eiffel Tower is located in the city of",
        temperature=0.8,
        max_tokens=30,
    ) as tracer:
        activations = list().save()
        logits = list().save()
        samples = list().save()

        with tracer.iter[:30]:
            activations.append(model.model.layers[16].mlp.down_proj.output[0].cpu())
            logits.append(model.logits.output)
            samples.append(model.samples.output)

        output = tracer.result.save()
Feedback on our vLLM integration would be much appreciated.
Works with vLLM >= 0.12.
2. Optimizations to interleaving yield performance improvements across the board, most noticeable when performing a large number of interventions.
In addition, there are three config flags you can set that yield even greater improvements, but they require code changes or are more experimental.
from nnsight import CONFIG as NNSIGHT_CONFIG
NNSIGHT_CONFIG.APP.PYMOUNT = False
NNSIGHT_CONFIG.APP.CROSS_INVOKER = False
NNSIGHT_CONFIG.APP.TRACE_CACHING = True
- PYMOUNT: Turning this flag off removes the ability to call .save() on arbitrary objects; instead you will need to call nnsight.save:
from nnsight import save

with model.trace("Hello world"):
    output = save(model.output)
Mounting and un-mounting .save() onto Python objects has some performance cost.
- CROSS_INVOKER: Turning this off prevents sharing variables between invokers. This sharing has a performance cost, and most people don't use it anyway, so you should probably turn it off.
with model.trace() as tracer:
    with tracer.invoke("Hello world"):
        hs = model.model.layers[0].output
    with tracer.invoke("Hello world"):
        model.model.layers[1].output = hs  # UnboundVariable: hs is not defined (with CROSS_INVOKER off)
- TRACE_CACHING: This caches the source code of your trace, making future lookups much faster. If you have a trace inside a loop or in a function called more than once, you'll see a significant improvement.
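As a sketch of where TRACE_CACHING pays off (using the CONFIG flag and trace API shown above; the helper function and loop here are hypothetical, not part of the release):

```python
from nnsight import CONFIG

# Experimental: cache the parsed source of each trace body.
CONFIG.APP.TRACE_CACHING = True

# Hypothetical helper: the trace body is identical on every call,
# so after the first call its source lookup is served from the cache.
def layer0_hidden(model, prompt):
    with model.trace(prompt):
        return model.model.layers[0].output.save()

# Repeated calls are where the caching helps:
# for prompt in prompts:
#     hs = layer0_hidden(model, prompt)
```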