Direct Logit Attribution

AdityaSingh · May 3, 2025, 1:12pm

Is there a way to do direct logit attribution in nnsight?

TransformerLens has an apply_ln_to_stack function which linearizes the final layernorm (if a model has it), and allows one to decompose the logits into the contribution from each layer in the model.

Is there an analog of this in nnsight, or any suggested workaround?

michael · May 5, 2025, 1:39pm

Hi, currently there is no direct way of doing this in NNsight. By design, NNsight wraps around arbitrary PyTorch models, and our focus up until now has been more on creating a framework that is model agnostic. It’s a lower level of abstraction than TransformerLens: it gives up a consistent syntax across models in exchange for not having to reimplement each model we support.

Thus, defining high-level functions like apply_ln_to_stack for arbitrary models is tricky, as it would require building upon a layer of abstraction we have yet to develop. That said, this is something we are actively thinking about and hope to make progress on in the not-so-distant future.

For now, my recommendation is to try implementing it yourself, perhaps using a TransformerLens model as a reference. If you get stuck along the way, there will always be someone to help, either here or on our Discord.
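As a starting point, here is a minimal sketch of the underlying idea, not TransformerLens's implementation and not an NNsight API. It assumes the final residual stream is a sum of per-component vectors (embedding plus each layer's output) followed by a standard LayerNorm with weight gamma and bias beta and an unembedding matrix W_U. All names here (direct_logit_attribution, components, gamma, beta, W_U) are hypothetical, and the data is random. It is written in plain Python so it runs without dependencies; with real activations you would collect the per-layer outputs from your model and use tensor operations instead.

```python
import math
import random


def center(v):
    """Subtract the mean (this step of LayerNorm is linear)."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def ln_scale(v, eps=1e-5):
    """Standard deviation term that LayerNorm divides by."""
    c = center(v)
    return math.sqrt(sum(x * x for x in c) / len(c) + eps)


def layer_norm(v, gamma, beta, eps=1e-5):
    """Ordinary LayerNorm over a single vector."""
    s = ln_scale(v, eps)
    return [g * x / s + b for x, g, b in zip(center(v), gamma, beta)]


def unembed(v, W_U):
    """Project a d_model vector to logits; W_U is [d_model][d_vocab]."""
    return [sum(v[i] * W_U[i][j] for i in range(len(v)))
            for j in range(len(W_U[0]))]


def direct_logit_attribution(components, gamma, beta, W_U, eps=1e-5):
    """Split the logits into one contribution per residual component.

    The LayerNorm scale is computed once from the full residual stream
    and then treated as a constant, which makes the map linear.
    """
    resid = [sum(col) for col in zip(*components)]
    s = ln_scale(resid, eps)  # frozen scale from the full residual
    per_component = [
        unembed([g * x / s for x, g in zip(center(comp), gamma)], W_U)
        for comp in components
    ]
    bias_term = unembed(beta, W_U)  # LayerNorm bias, shared by all components
    return per_component, bias_term


# Toy check with random data (assumed shapes, not a real model).
random.seed(0)
d_model, d_vocab, n_components = 8, 5, 4
components = [[random.gauss(0, 1) for _ in range(d_model)]
              for _ in range(n_components)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d_model)]
beta = [random.gauss(0, 0.1) for _ in range(d_model)]
W_U = [[random.gauss(0, 1) for _ in range(d_vocab)] for _ in range(d_model)]

per_component, bias_term = direct_logit_attribution(components, gamma, beta, W_U)

# The contributions plus the bias term should add up to the real logits.
resid = [sum(col) for col in zip(*components)]
full_logits = unembed(layer_norm(resid, gamma, beta), W_U)
reconstructed = [sum(col) + b for col, b in zip(zip(*per_component), bias_term)]
max_error = max(abs(a - b) for a, b in zip(reconstructed, full_logits))
print("max reconstruction error:", max_error)
```

The final check is worth keeping when you port this to a real model: if the per-component logits plus the bias term do not sum to the model's actual logits, something in the decomposition (a missed component, a different norm such as RMSNorm, or an unembedding bias) is unaccounted for.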

Hope this answers your question!