Architecture
Conference, 50 min
BEGINNER

When Size Matters: The Cool Kids' Guide to High-Performance Computing in the Cloud.

This case study explores overcoming vertical scaling limits by scaling horizontally across multiple high-performance P5 GPU servers. The servers are connected by 32 fiber-optic links delivering 3.2 Tb/s of throughput, and each node has 8 H100 GPUs. The resulting cluster (1.8 TB of GPU memory, 576 CPUs, and 6 TB of RAM) is presented as a starting point for future expansion.

Jacek Marmuszewski, Let's Go DevOps

When and where

Thursday, June 18, 11:30-12:20
Room 4A

Description
When facing performance issues, it’s tempting to choose a “bigger boat” (vertical scaling). But what do you do when you’ve already reached the limits of the largest available option? At that point you need to consider horizontal scaling, which may not be as straightforward as it seems.
This case study focuses on a project where we needed to connect several P5 instances, the highest-performance GPU-based servers available. We used 32 fiber-optic network cards to reach a throughput of 3.2 Tb/s between servers. Each server houses 8 NVIDIA H100 graphics cards with a combined 600 GB of GPU memory. The final cluster is relatively small, totaling 1.8 TB of GPU memory, 576 CPUs, and 6 TB of RAM, but it's just the beginning of what we aim to achieve.
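
As a quick sanity check on the figures above, the cluster shape can be worked out from the quoted totals. The sketch below is only illustrative: the server count is inferred rather than stated in the abstract, decimal units (1 TB = 1000 GB) are assumed, the 32 links and 3.2 Tb/s are assumed to be per-server figures, and the function and constant names are hypothetical.

```python
# Figures quoted in the abstract (decimal units assumed: 1 TB = 1000 GB).
GPU_MEMORY_PER_SERVER_GB = 600
GPUS_PER_SERVER = 8
CLUSTER_GPU_MEMORY_GB = 1800
CLUSTER_CPUS = 576
CLUSTER_RAM_GB = 6000
# Assumption: both network figures describe a single server.
LINKS_PER_SERVER = 32
THROUGHPUT_PER_SERVER_GBPS = 3200


def infer_cluster_shape() -> dict:
    """Derive per-server figures from the cluster totals in the abstract."""
    # The server count is not stated in the abstract; it is inferred here
    # from total GPU memory divided by GPU memory per server.
    servers, remainder = divmod(CLUSTER_GPU_MEMORY_GB, GPU_MEMORY_PER_SERVER_GB)
    if remainder:
        raise ValueError("cluster GPU memory is not a whole number of servers")
    return {
        "servers": servers,
        "gpus_total": servers * GPUS_PER_SERVER,
        "cpus_per_server": CLUSTER_CPUS // servers,
        "ram_per_server_gb": CLUSTER_RAM_GB // servers,
        "gbps_per_link": THROUGHPUT_PER_SERVER_GBPS // LINKS_PER_SERVER,
    }


if __name__ == "__main__":
    for name, value in infer_cluster_shape().items():
        print(f"{name}: {value}")
```

Under these assumptions the totals are consistent with a three-server cluster, and each fiber-optic link would carry about 100 Gb/s.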

gpu
cluster
scaling
performance
Speakers
Jacek Marmuszewski

Let's Go DevOps

Poland

Jacek Marmuszewski is a DevSecOps engineer with over ten years of experience building and managing cloud infrastructure. He has worked on mission-critical systems for companies like Sabre and Oracle. He has also spent time at startups, where, as an early joiner, he promoted DevOps culture and advocated cloud-native architecture.
Recently, he co-founded Let’s Go DevOps, a company that helps others design, build, and maintain cloud-native applications and infrastructure. He’s a big fan of cloud transformation and helps others leverage its full potential by choosing the right components for the job.