Large system prompt causes OOM errors on NDIF

Hi, I'm interested in testing out longer system prompts, which make the context length very large. For example, I'm testing a private system prompt that, when added to my input, leads to a context length of 67428 tokens. This causes OOM errors on NDIF.

Hi, one thing we could do at some point is host 405B in 8-bit precision. We would need to inform other users, but this is definitely something we could discuss.

Additionally, we would definitely be open to hosting more long-context models! Feel free to list any that meet your requirements :slight_smile:

We already discussed these details a fair bit on Discord, but for convenience, here is a Hugging Face article on KV caching: KV cache strategies. It would probably be worth experimenting a bit with different setups.

You can configure your caching strategy directly in .generate(), e.g.:

with model.session(remote=True):
    with model.generate("ayy", max_new_tokens=100, cache_implementation="offloaded") as tracer:
        out = model.generator.output.save()

Let me know if you run into any issues / how things go!

This is mad cool! Thanks for teaching me something new. I’ll test this out and report back. :saluting_face:

Tried this today, and it didn't work. I also tried completely disabling the cache, but that didn't work either.


When you say it didn't work, do you mean it didn't seem to reduce the VRAM usage, or that the request failed? Either way, I can look further into it.

Thanks! I don't have access to the VRAM usage, but the request failed with a CUDA OOM.

Hmm, I just tried comparing cache_implementation="dynamic" (the default) with cache_implementation="offloaded", and the latter used far less VRAM. Are you able to generate more tokens before hitting the OOM with "offloaded"?

To help with diagnosing your issue, would you be able to do the following:

  • Share a reproducible code snippet of your generation setup (just to be sure we're on the same page and to catch any accidental memory leaks)
  • Try running your code with cache_implementation="offloaded" and report back how many tokens you were able to generate before OOM.
  • Similarly, try running your code with use_cache=False passed to .generate() and report how many tokens before OOM.
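To make the "how many tokens before OOM" comparison systematic, here is a minimal sketch of a doubling search. This is not NDIF-specific: `try_generate` is a hypothetical stand-in you would wire up to a remote `.generate()` call with the cache setting under test, returning True if the request succeeded and False on a CUDA OOM.

```python
def max_tokens_before_oom(try_generate, start=256, limit=8192):
    """Double max_new_tokens until failure; return the largest value that succeeded."""
    n, best = start, 0
    while n <= limit:
        if try_generate(n):  # e.g. wraps a remote .generate(..., max_new_tokens=n)
            best = n
            n *= 2
        else:
            break
    return best

# Stand-in for a real call: pretend generation OOMs past 1000 new tokens.
fake = lambda n: n <= 1000
print(max_tokens_before_oom(fake))  # 512
```

Running this once per cache setting ("dynamic", "offloaded", and use_cache=False) would give directly comparable numbers.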

I'm somewhat optimistic that with our cluster setup it could be possible to generate 128K tokens (even with unquantized 405B), and I want to see how far we can get before jumping to hosting an INT8 405B model. I read a Hugging Face article outlining the memory requirements for 128K tokens, and our 405B node technically has enough extra memory (though only in aggregate across 8 devices, which is why I think you still get the OOM).
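For a rough sense of scale, here is a back-of-envelope KV-cache calculation. The config values are my assumptions based on the published Llama 3.1 405B architecture (126 decoder layers, 8 grouped-query KV heads, head dimension 128, bf16 cache); double-check them against the actual model config before relying on the numbers.

```python
# Assumed Llama 3.1 405B config values -- verify against the model card.
layers = 126      # decoder layers
kv_heads = 8      # grouped-query-attention KV heads
head_dim = 128    # per-head dimension
bytes_per = 2     # bf16
tokens = 128 * 1024

# Factor of 2 covers both the K and V tensors per layer.
per_token = 2 * layers * kv_heads * head_dim * bytes_per
total_gb = per_token * tokens / 1024**3
print(f"{per_token} bytes/token, {total_gb:.1f} GiB total")
# → 516096 bytes/token, 63.0 GiB total (about 8 GiB per device across 8 GPUs)
```

Under these assumptions the full-context cache fits in aggregate but adds several GiB per device on top of the weights and activations, which is consistent with OOM on the most loaded device.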

Thank you. I'll do this on Thursday, once the 405B model is back in rotation!