DSI Computing Resources


DSI GPU Clusters

The Data Science Institute owns and operates GPU clusters that can be used for GPU-intensive applications such as training and tuning AI models. Each of DSI’s GPU clusters has its own specific strengths.

  • Olvi GPU Cluster
    Olvi is a machine with two servers, each having 8 NVIDIA A100 GPUs (40 GB).
  • L40 GPU Cluster
    The L40 cluster is a machine with two servers and 8 48 GB NVIDIA L40 GPUs, operated by the Center for High Throughput Computing (CHTC).
  • H100 GPU Cluster
    The H100 cluster consists of three servers with 8 80 GB NVIDIA H100 GPUs, also operated by CHTC.

Who Can Use These GPU Clusters?

The DSI GPU clusters are available to DSI Affiliates and their group members.

How do I Get Access?

For OLVI access:

  • Email Abe Megahed: amegahed@wisc.edu. The email must either be sent by a DSI Affiliate or, if sent by a group member, copy their DSI Affiliate.

For L40 or H100 priority queue access:

  • Obtain a CHTC account
  • Email Jason Lo: jason.lo@wisc.edu. The email must either be sent by a DSI Affiliate or, if sent by a group member, copy their DSI Affiliate.

Which GPU Cluster Should I Use?

There are a few key factors that determine which GPU cluster will best suit your needs:

  • Are you prototyping or experimenting? Use Olvi.

    If you are working on a small prototype or experimental application that doesn’t require large amounts of resources but benefits from more immediate feedback in the edit-run-debug cycle, then Olvi is the better choice. Olvi lets you run tasks directly from the command line and get immediate feedback, rather than waiting for your job to be queued and run by HTCondor on the L40 or H100 clusters.

  • Are you unfamiliar with HTCondor? Use Olvi.

    The Olvi system provides direct access to the hardware using just the command shell. By contrast, the L40 and H100 systems use a job scheduler called HTCondor, which gives better control over resources but requires that you know how to write and submit HTCondor jobs (see the sketch after this list).

  • Do you require large amounts of storage? Use L40 or H100.

    If you require large amounts of storage (more than a few hundred GB), then you should use either the L40 or H100 GPU cluster. The Olvi machines have only 15 TB of storage each, which is divided among many users, so storage space on Olvi is quite limited.

  • Do you require large amounts of GPU memory? Use H100.

    The H100 GPUs have twice as much memory (80 GB) as Olvi’s A100 GPUs (40 GB) and roughly two-thirds more than the L40s (48 GB). So, if your application requires a large amount of GPU memory, you may be better off using the H100 GPU cluster. (The short Python snippet at the end of this page shows how to check the GPUs your job actually gets.)

  • Do you require high performance double-precision binary floating-point? Use H100.

    If your application requires exceptionally high double-precision performance for scientific computing, then you should use the H100 GPU cluster. The L40s have essentially no FP64 double-precision capability, so while they are good for general-purpose computing, scientific applications that depend on double precision are better served by the H100 systems.
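
To give a sense of what using HTCondor involves, here is a minimal sketch of a submit file for a single-GPU job. The file name, script, and resource amounts are placeholders, and CHTC may expect additional attributes (for example, to target a particular GPU server or priority queue), so consult the CHTC documentation for the exact requirements.

    # gpu-job.sub -- hypothetical example; names and resource amounts are placeholders
    universe       = vanilla
    executable     = run_training.sh
    log            = job.log
    output         = job.out
    error          = job.err

    request_gpus   = 1
    request_cpus   = 4
    request_memory = 16GB
    request_disk   = 50GB

    queue

Running condor_submit gpu-job.sub places the job in the queue, and condor_q reports its status while it waits to be matched with resources.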

 

                                    Olvi    L40    H100
  Good for Experimenting            yes     no     no
  Uses HTCondor                     no      yes    yes
  Large Storage Capacity            no      yes    yes
  GPU Memory (GB)                   40      48     80
  Double Precision Floating Point   no      no     yes

Key factors in deciding which GPU cluster to use
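
Checking Available GPU Memory

If you want to confirm how much GPU memory is actually available where your code runs, a few lines of Python will report it. This sketch assumes PyTorch is installed in your environment, which is common for AI workloads but is an assumption here rather than something the clusters necessarily provide by default.

    # check_gpus.py -- print the name and total memory of each visible GPU
    # (assumes PyTorch is available; purely illustrative)
    import torch

    if not torch.cuda.is_available():
        print("No CUDA-capable GPU is visible to this process.")
    else:
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")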