This position sits within a company pioneering a new era of biomedicine.
Role Overview:
- GPU Cluster Management: Architect, deploy, and maintain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Manage cluster resources to maximize efficiency and utilization.
- Distributed/Parallel Training: Apply distributed computing techniques to parallelize training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and shorter training times (a minimal sketch of this pattern follows this list).
- Performance Optimization: Tune GPU clusters and deep learning frameworks for peak performance on specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
- Deep Learning Framework Integration: Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
- Scalability and Resource Management: Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
- Troubleshooting and Support: Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve their issues efficiently.
- Documentation: Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.
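As a rough illustration of the distributed-training pattern described above, below is a minimal PyTorch DistributedDataParallel sketch. The toy linear model, synthetic dataset, and hyperparameters are illustrative placeholders, not a prescribed stack:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data stand in for a real workload.
    model = DDP(nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

    # DistributedSampler shards the dataset so each rank sees a unique slice.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards between epochs
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()  # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 train.py`; DDP averages gradients across ranks during the backward pass, so every rank applies the same parameter update.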
Qualifications:
- Master's or Ph.D. in Computer Science or a related field, with a focus on high-performance computing, distributed systems, or deep learning.
- Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.
- Strong expertise in distributed deep learning and parallel training techniques.
- Proficiency in popular deep learning frameworks and distributed-training libraries such as PyTorch, Megatron-LM, and DeepSpeed.
- Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
- Knowledge of performance profiling and optimization tools for HPC and deep learning (a short profiling sketch follows this list).
- Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
- Solid background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
- Currently or previously holding a Staff-level (or equivalent) title, or currently holding a Senior-level title with 3+ years at that level.
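For the profiling toolset mentioned above, here is a minimal sketch using torch.profiler; the model, input shapes, and iteration count are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Illustrative model and input; any real workload can be dropped in here.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record CPU and CUDA activity over a few forward passes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Rank operators by GPU time to locate bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```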