The Olvi GPU Cluster

NVIDIA A100 chip

What is Olvi?

Olvi is a machine with two servers, each having 8 Nvidia A100 GPUs, that can be used for GPU-intensive tasks. Olvi is named in honor of Dr. Olvi Mangasarian, a UW professor who was an expert in optimzation and numerical analysis.

Specifications:

Olvi-1 and Olvi-2 are identical machines, each of which have the following specifications:

8x NVIDIA A100 SXM2 40GB HBM2 NV-LINK
2x Intel® Xeon Cascade Lake 5218 (2.3GHz) Processor (24-Core) = 48 Cores
16x 32 GB ECC REG DDR4-2933 = 512 GB Total
4 TB Enterprise SSD (2.5″)
15.36 TB 2.5″ CD6-R NVMe PCIe 4.0 SSD

Who can use Olvi?

Olvi is available for DSI Affilates and their partners, collaborators, and students.

How do I get access to Olvi?

If you qualify for an Olvi account and would like to use Olvi, please contact:

Abe Megahed: amegahed@wisc.edu
Jason Lo: jason.lo@wisc.edu

Olvi Codes of Conduct

Olvi is currently an “unmanaged” machine, which means that you do not have explicit storage and GPU quotas to limit your use of resources. This makes Olvi easier to use, since you do not have to use a job queuing system like HTCondor or Slurm to submit your jobs. However, in order for this to work, Olvi users must be mindful that they are using a shared resource and be courteous of others. For this reason, we ask that you observe the following rules:

Olvi Storage Policy

Olvi has the following storage space available:
- One small (3.4 TB) SSD Drive shared between Olvi-1 and Olvi-2
- Two large (15 TB) SSD Drives, one each mounted on Olvi-1 and Olvi-2. These large drives are not shared between the two machines.
If you have a small project (< a few hundred MB), it’s ok to use your home directory, but if you have larger projects, it is important that you make a folder in /data for your project data. The /data directory is mounted on the large 15 TB drives. The large 15 TB drives are not shared between Olvi-1 and Olvi-2, which is a slight inconvenience, but they have significantly more space available.
Olvi GPU Use Policy

Olvi has 8 A100s per machine that must be shared between users. We ask that you observe the following limitations:
- Do not simultaneously use a total of more than 4 GPUs across both Olvi servers (Olvi-1 and Olvi-2). If you are using 4 GPUs on one server, then you may not use additional GPUs on the other server. If you require more than 4 GPUs, we ask that you make a request to system administrator Abe Megahed (amegahed@wisc.edu) to reserve a time for you.
- If you have a long running job (more than a few minutes), then keep at least one GPU free as a spare for others who may have small jobs. If a spare GPU is not available, you may have to wait.

We monitor Olvi GPU and storage use to ensure that these policies are followed. If you fail to abide by these policies, we will send you a warning email. Continued infractions of the terms of the Codes of Conduct may result in your account being suspended or terminated.

Olvi Communication

A Microsoft Teams group has been set up for asking questions about Olvi and for coordinating use of Olvi resources. If you have a long-running, GPU intensive job, we ask that you post a note several hours in advance of running this job. This will give others a chance to respond if they have a project that has a deadline or more immediate GPU needs.

Olvi Users Team

Olvi Tips

GPU Management

When creating your training script, it’s better to use only one GPU. This allows other users to use the remaining GPUs.

To see which GPUs are available:
```
watch -n1 nvidia-smi
```
To add the available GPU index number at the start of your Python script / notebook:
```
import os 
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  
```

Using VSCode

In order to use VSCode on OLVI, you will need to:

Download the Remote.SSH vscode extension
In VSCode, set the following:
- enabled: Remote.SSH: Lockfiles in Tmp
- disabled: Remote.SSH: Use Flock

Using HuggingFace

HuggingFace models or data often need a lot of space. To move the .cache folder to /data, you can:

export TRANSFORMERS_CACHE=/data/clo36/huggingface/models
export HF_DATASETS_CACHE=/data/clo36/huggingface/datasets
export HF_MODULES_CACHE=/data/clo36/huggingface/modules
export HF_METRICS_CACHE=/data/clo36/huggingface/metrics

Tip: Add the above code to ~/.bashrc so it stays even after you log out and log back in.

Using Docker

Some setup is required—see details. For rootless mode docker, see here.

Docker Quick Guide

Install rootlesskit

/usr/bin/dockerd-rootless-setuptool.sh install

Start the docker daemon in user mode.
```
systemctl --user start docker.service
```

Specify $DOCKER_HOST

export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

(Optional) Set docker daemon to run at login.

loginctl enable-linger
systemctl --user enable docker.service

Using devcontainer

This is my sample devcontainer definition for LLM development on OLVI. See the docs for details.

{
 "name": "project_name",
 "image": "huggingface/transformers-pytorch-deepspeed-latest-gpu",
 "runArgs": ["--security-opt", "seccomp=unconfined", "--gpus", "all", "--ipc=host"],
 "mounts": ["source=/data,target=/data,type=bind,consistency=cached"],
 "workspaceMount": "source=${localWorkspaceFolder},target=/project_name,type=bind,consistency=cached",
 "workspaceFolder": "/project_name",
 "remoteUser": "root",
 "customizations": {
  "features": {
   "git-lfs": "latest",
   "ghcr.io/devcontainers/features/docker-outside-of-docker:1": {}
  },
  }
 },
 "postCreateCommand": "bash .devcontainer/post-create.sh"
}

Tip: To keep the above json clean, I have not include any vscode extension; you may want to add it to avoid a manual installation.

sample post_create.sh:

#!bin/bash

apt-get update
apt-get install -y locales
locale-gen en_US.UTF-8
pip install -r .devcontainer/requirements.txt