Senior ML Software Engineer - Quantization & Numerics

11/25/2025

Drive software development and model optimization tooling proof-of-concept efforts to streamline deployment of quantized models. Analyze performance bottlenecks in quantized state-of-the-art LLM architectures and drive performance improvements.

Working Hours

40 hours/week

Company Size

10,001+ employees

Language

English

Visa Sponsorship

About The Company

Every company has a mission. What's ours? To empower every person and every organization to achieve more. We believe technology can and should be a force for good and that meaningful innovation contributes to a brighter world in the future and today. Our culture doesn’t just encourage curiosity; it embraces it. Each day we make progress together by showing up as our authentic selves. We show up with a learn-it-all mentality. We show up cheering on others, knowing their success doesn't diminish our own. We show up every day open to learning our own biases, changing our behavior, and inviting in differences. Because impact matters. Microsoft operates in 190 countries and is made up of approximately 228,000 passionate employees worldwide.

About the Role

Drive software development and model optimization tooling proof-of-concept efforts to streamline deployment of quantized models. Analyze performance bottlenecks in quantized state-of-the-art LLM architectures and drive performance improvements. Prototype and evaluate emerging low-precision data formats through proof-of-concept implementations on a novel hardware accelerator SDK. Co-design model architectures optimized for low-precision deployment in close collaboration with companywide AI/ML teams. Work cross-functionally with data scientists and ML researchers/engineers across organizations to align on model accuracy and performance goals. Partner with hardware architecture and AI software framework teams to ensure end-to-end system efficiency.

Bachelor's Degree in Computer Science, Electrical or Computer Engineering, or related field AND 4+ years of industry experience in high-performance ML systems, GPU kernel development, or ML runtime/infrastructure development; OR Master's Degree in the same fields AND 3+ years of such experience; OR Doctorate in the same fields AND 1+ year(s) of such experience. Demonstrated experience delivering production-grade software in areas such as model compression, low-precision numerics (FP8, INT8/4, NVFP4, MX formats, etc.), low-level kernel development, and performance optimization. Proficiency with modern deep learning frameworks and runtimes, including PyTorch, TensorFlow, TensorRT, and ONNX Runtime. Expertise in GPU/NPU kernel development using CUDA, Triton, ROCm, or comparable frameworks, and in fast model bring-up on a new stack. Strong understanding of Transformer and LLM architectures, with hands-on experience in optimization techniques such as quantization, pruning, tensor/parameter sharding, model parallelism, KV-cache optimization, and Flash Attention. Practical experience with large-scale model evaluation, including benchmarking state-of-the-art LLMs and fine-tuning (SFT or RL) large models. Solid programming skills in Python, C, and C++. Excellent communication abilities and a proven capacity to collaborate effectively in hybrid, team-oriented environments.

Hands-on experience implementing and optimizing low-level linear algebra routines, including custom BLAS kernels, would be a plus. Deep knowledge of mixed-precision arithmetic units, including numerical formats and microarchitecture, is highly desirable.

Key Skills

Model Optimization, Quantization, Performance Optimization, Deep Learning Frameworks, GPU Kernel Development, Low-Precision Numerics, Transformer Architectures, Large-Scale Model Evaluation, Python, C, C++, CUDA, Triton, ROCm, Model Compression, BLAS Kernels