Staff Machine Learning Infrastructure Engineer
- Bachelor's degree or higher in Computer Science or a related field.
- At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role.
- Proven experience with high-performance computing environments and distributed systems.
- Demonstrated ability to scale ML training systems and optimize resource utilization.
- Hands-on experience with job scheduling systems and managing cloud GPU environments (e.g., GCP, AWS).
- Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing.
- Hands-on experience tuning ML models for performance.
- Experience with common ML training and inference tools such as PyTorch, TensorRT, Triton, and Accelerate.
- Strong analytical and problem-solving skills, with the ability to troubleshoot complex system issues.
- Excellent communication skills and the ability to collaborate effectively with cross-functional teams.
- Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks.

