Job Description
**Summary:**
Meta is building some of the world's largest AI and high-performance computing infrastructure to power next-generation AI research and products. As an AI/HPC System Performance Engineer on the Network Infrastructure Engineering team, you will drive end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. In this role, you will work at the intersection of network fabric design, distributed computing, and AI workload behavior to ensure Meta's HPC systems deliver maximum throughput and efficiency for frontier model development.
**Required Skills:**
AI/HPC System Performance Engineer Responsibilities:
1. Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
2. Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidt...