Machine Learning Systems Intern
Job Description
Hybrid SSM‑Transformer models have a distinctive on‑chip memory profile, because their two layer types place opposite demands on capacity:
- SSM layers compress sequence history into a fixed‑size recurrent state
- Attention layers store key‑value caches that grow with context length
This leads to an important design question:
For a given model configuration and maximum context length, can on‑chip SRAM be sized so that inference runs entirely on chip—eliminating the need for slower off‑chip HBM or DRAM?
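As a first‑order illustration of the trade‑off, the two footprints can be compared with a back‑of‑envelope model. All layer counts, dimensions, and the 16‑bit storage assumption below are hypothetical, not a committed model or chip design point:

```python
# First-order memory model; every value here is an illustrative
# assumption, not a committed model or chip design point.

DTYPE_BYTES = 2  # assume 16-bit storage for states and caches

def ssm_state_bytes(n_ssm_layers: int, d_inner: int, d_state: int) -> int:
    """SSM recurrent state: fixed size, independent of context length."""
    return n_ssm_layers * d_inner * d_state * DTYPE_BYTES

def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, context_len: int) -> int:
    """Attention KV cache: K and V per layer, grows linearly with context."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * DTYPE_BYTES

# Hypothetical hybrid: 24 SSM layers plus 4 attention layers.
print(f"SSM state: {ssm_state_bytes(24, d_inner=4096, d_state=16) / 2**20:.1f} MiB")
print(f"KV cache @128k tokens: "
      f"{kv_cache_bytes(4, n_kv_heads=8, head_dim=128, context_len=128_000) / 2**30:.2f} GiB")
```

At these assumed sizes the SSM state stays near 3 MiB at any context length, while the KV cache of just four attention layers approaches 2 GiB at a 128k‑token context, so the attention layers dominate the SRAM sizing question.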
What the intern will work on:
The intern will model and analyze memory behavior during inference of hybrid SSM‑Transformer models, with a focus on avoiding off‑chip memory accesses. Key responsibilities include:
- Modeling data movement between SRAM and HBM/DRAM during inference
- Sweeping parameters such as:
  - SRAM capacity
  - Context length
  - Model dimensions
- Mapping the feasibility boundary where inference can be performed fully on chip (a sketch follows this list)
- Breaking down per‑layer memory working sets
- Identifying when and why memory spills occur
- Exploring tiling and scheduling strategies to extend the no‑spill region
- Validating analytical results through simulation
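Because the KV cache grows linearly while the weights and SSM state are fixed, the no‑spill boundary has a simple closed form in a first‑order model. The hedged sweep sketch below uses the same hypothetical configuration as the earlier example and deliberately ignores per‑layer activation working sets and tiling effects, which the internship's analysis would add:

```python
# Hedged sketch of a feasibility-boundary sweep. All model and chip
# parameters below are illustrative assumptions, not real targets.

DTYPE_BYTES = 2  # assume 16-bit storage throughout

def max_on_chip_context(sram_bytes: int, weights_bytes: int, cfg: dict) -> int:
    """Largest context length whose full working set fits in SRAM.

    The KV cache grows linearly with context while weights and SSM
    state are fixed, so the boundary has a closed form.
    """
    # Context-independent residents: weights plus SSM recurrent state.
    fixed = weights_bytes + (cfg["n_ssm_layers"] * cfg["d_inner"]
                             * cfg["d_state"] * DTYPE_BYTES)
    # KV-cache bytes added per token of context (K and V, per attention layer).
    per_token = (2 * cfg["n_attn_layers"] * cfg["n_kv_heads"]
                 * cfg["head_dim"] * DTYPE_BYTES)
    return max(0, (sram_bytes - fixed) // per_token)

# Hypothetical hybrid config: mostly SSM layers, a few attention layers.
cfg = dict(n_ssm_layers=24, n_attn_layers=4, n_kv_heads=8,
           head_dim=128, d_inner=4096, d_state=16)
weights_bytes = 1_300_000_000 * DTYPE_BYTES  # e.g. a 1.3B-parameter model

for sram_mib in (512, 1024, 2048, 4096):
    ctx = max_on_chip_context(sram_mib * 2**20, weights_bytes, cfg)
    print(f"{sram_mib:>5} MiB SRAM -> max on-chip context: {ctx} tokens")
```

In this toy configuration the smaller SRAM sizes cannot even hold the weights (maximum context of 0 tokens), and the boundary only opens up once capacity exceeds the fixed residents; a fuller analytical model would also count per‑layer activation working sets and tiling and scheduling effects, then validate the predicted boundary through simulation.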

