I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch.

The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform

Try out the model for yourself: mni-ml.github.io/demos/transfor

Built with . Check out the repos and blog if you want to learn more.

Shoutout to for the compute credits allowing me to train on 2 A100 GPUs without going broke

cc
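For anyone wondering what an AdamW kernel has to compute per parameter, here is a minimal CPU reference sketch in Rust. It is not the framework's code: `AdamWConfig` and `adamw_step` are hypothetical names made up for illustration, and it assumes the standard AdamW formulation with bias correction and decoupled weight decay.

```rust
/// Hyperparameters for one AdamW update.
/// (Illustrative struct, not part of the framework's API.)
#[derive(Clone, Copy)]
struct AdamWConfig {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    weight_decay: f32,
}

/// Applies one AdamW step in place. `step` is 1-based.
/// `m` and `v` hold the running first and second moment estimates.
fn adamw_step(
    params: &mut [f32],
    grads: &[f32],
    m: &mut [f32],
    v: &mut [f32],
    step: i32,
    cfg: AdamWConfig,
) {
    assert!(step >= 1, "step is 1-based");
    assert_eq!(params.len(), grads.len());
    assert_eq!(params.len(), m.len());
    assert_eq!(params.len(), v.len());

    // Bias-correction terms compensate for m and v starting at zero.
    let bias1 = 1.0 - cfg.beta1.powi(step);
    let bias2 = 1.0 - cfg.beta2.powi(step);

    for i in 0..params.len() {
        let g = grads[i];
        m[i] = cfg.beta1 * m[i] + (1.0 - cfg.beta1) * g;
        v[i] = cfg.beta2 * v[i] + (1.0 - cfg.beta2) * g * g;
        let m_hat = m[i] / bias1;
        let v_hat = v[i] / bias2;

        // Decoupled weight decay: shrink the parameter directly
        // instead of adding the decay term to the gradient.
        params[i] -= cfg.lr * cfg.weight_decay * params[i];
        params[i] -= cfg.lr * m_hat / (v_hat.sqrt() + cfg.eps);
    }
}
```

Every element is updated independently, which is why this loop maps naturally onto one GPU thread per parameter. How the framework's actual kernel is organized is not stated in the post.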

respect. writing your own framework is one thing, getting a full train run to converge is the part that usually gets ugly. what broke first for you, kernels or optimizer math?
Fused LayerNorm and GELU kernels are a great optimization. Did you use Triton or raw CUDA?
thanks for the amazing work! Gave me some things to work on haha - started making my first auto grad
Aadi, you are incredibly talented. Genuinely, keep it up. Can’t wait to see what you will make!
rust + cuda for the win, but i bet that webgpu fallback is where you'll actually spend most of your debugging time. also impressive that you managed to squeeze in a full transformer from scratch without relying on existing libraries
Building from scratch while juggling everything else sounds brutal. What made you choose Rust over something faster to prototype with? The CUDA kernel work must have been a nightmare to debug.
☝️ This is what the early computer engineers who invented things and changed the field forever did back in the ’90s. They did surgery down to the 0s and 1s, and no AI could ever easily outrun the optimiser of AI. This gen should be learning algebra/AI like this. 🫡
training a transformer from scratch used to be frontier lab work. now random devs ship full implementations with custom CUDA kernels... Rust backends... BPE tokenizers built from zero as side projects. what took entire teams 2 years ago is now weekend hobby territory. the
I feel like asking how you learnt and built, but I fear that I already know, and have simply failed to execute over and over.
I used to be under the assumption that everything was going to happen on the GPU itself, but after seeing this video, I believe that there are a few actions where the CPU is also needed to transfer data. You all made a library which exclusively uses the GPU, which I think is
Custom CUDA kernels… it took me a week just to wrap my head around basic CUDA, and you did all this in a week 😭 insane work
It's been my dream for a long time to do something like that, impressive work. Happy for you guys, and glad to see you got so much recognition from the big giants. Just great.
Why do you need A100s to train a 12 million parameter model? I was able to train a 30 million parameter model quite easily on a 4070
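To put rough numbers on the memory side of this question, here is a back-of-envelope sketch in Rust. It assumes fp32 for everything and counts only the four per-parameter buffers AdamW training needs (weights, gradients, first moment, second moment). The post does not state the precision used, and activation memory, which depends on batch size and sequence length, is left out. The function names are hypothetical.

```rust
/// Bytes of per-parameter training state under AdamW:
/// weights + gradients + first moment + second moment = 4 buffers.
/// Activations and framework overhead are deliberately excluded.
fn adamw_state_bytes(num_params: u64, bytes_per_value: u64) -> u64 {
    const BUFFERS: u64 = 4;
    num_params * bytes_per_value * BUFFERS
}

/// Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
fn bytes_to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

Under these assumptions, `adamw_state_bytes(12_000_000, 4)` gives 192,000,000 bytes (about 183 MiB), and the 30M-parameter case gives 480,000,000 bytes (about 458 MiB). Parameter state alone is small for either model, so any difference in hardware needs would come from activations, batch size, or training time rather than from the weights themselves.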
scaling past 10k concurrent users will expose the real kernel perf bottlenecks i bet
Writing flash attention CUDA kernels from scratch for a 12M model is probably the fastest way to actually understand transformers. The WebGPU fallback is a nice touch too. Most projects at this scale don't bother with non-NVIDIA.
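The numerical idea at the core of flash attention is the online softmax: keys and values are processed in blocks while a running max and running denominator are maintained, so the full score row never has to be stored. The sketch below is a CPU illustration in Rust for a single query row, with a naive version for comparison. It is not the framework's kernel, all function names are hypothetical, and a real GPU implementation also tiles the queries and keeps the blocks in on-chip memory.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reference: materializes every score, then applies softmax and sums values.
fn attention_row_naive(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = weights.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x / denom;
        }
    }
    out
}

/// Streaming version: visits keys/values `block` at a time and keeps only
/// a running max, a running denominator, and an unnormalized accumulator.
fn attention_row_streaming(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    block: usize,
) -> Vec<f32> {
    assert!(block > 0 && !keys.is_empty() && keys.len() == values.len());
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for start in (0..keys.len()).step_by(block) {
        let end = (start + block).min(keys.len());
        let scores: Vec<f32> = keys[start..end].iter().map(|k| dot(q, k) * scale).collect();
        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);

        // Rescale what was accumulated under the old max.
        // On the first block this is exp(-inf) = 0, applied to zeros.
        let correction = (running_max - new_max).exp();
        denom *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }

        for (s, v) in scores.iter().zip(&values[start..end]) {
            let w = (s - new_max).exp();
            denom += w;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += w * x;
            }
        }
        running_max = new_max;
    }
    acc.iter().map(|a| a / denom).collect()
}
```

Both functions should agree up to floating-point rounding for any block size. The streaming one only ever holds `block` scores at a time, which is the property that lets a kernel avoid writing the full attention matrix to memory.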
the framework from scratch thing proves the bottleneck was never the model. it was always the harness