What is Olvi?
Olvi is a machine consisting of two servers, each with 8 NVIDIA A100 GPUs, that can be used for GPU-intensive tasks. Olvi is named in honor of Dr. Olvi Mangasarian, a UW professor who was an expert in optimization and numerical analysis.
Specifications:
Olvi-1 and Olvi-2 are identical machines, each of which has the following specifications:
- 8x NVIDIA A100 SXM2 40GB HBM2 NV-LINK
- 2x Intel® Xeon Cascade Lake 5218 (2.3GHz) Processor (24-Core) = 48 Cores
- 16x 32 GB ECC REG DDR4-2933 = 512 GB Total
- 4 TB Enterprise SSD (2.5″)
- 15.36 TB 2.5″ CD6-R NVMe PCIe 4.0 SSD
Who can use Olvi?
Olvi is available for DSI Affiliates and their partners, collaborators, and students.
How do I get access to Olvi?
If you qualify for an Olvi account and would like to use Olvi, please contact:
- Abe Megahed: amegahed@wisc.edu
- Jason Lo: jason.lo@wisc.edu
Olvi Codes of Conduct
Olvi is currently an “unmanaged” machine, which means that you do not have explicit storage and GPU quotas to limit your use of resources. This makes Olvi easier to use, since you do not have to use a job queuing system like HTCondor or Slurm to submit your jobs. However, for this to work, Olvi users must be mindful that they are using a shared resource and be courteous to others. For this reason, we ask that you observe the rules described in the sections below.
Olvi Storage Policy
Olvi has the following storage space available:
- One small (3.4 TB) SSD Drive shared between Olvi-1 and Olvi-2
- Two large (15 TB) SSD Drives, one each mounted on Olvi-1 and Olvi-2. These large drives are not shared between the two machines.
If you have a small project (less than a few hundred MB), it is fine to use your home directory, but for larger projects it is important that you make a folder in /data for your project data. The /data directory is mounted on the large 15 TB drives, which are not shared between Olvi-1 and Olvi-2 (a slight inconvenience), but they have significantly more space available.
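For example, a minimal sketch of setting up a project folder on the large drive (the folder name is just an illustration):
df -h /data                        # check how much space is free on the large drive
mkdir -p /data/$USER/my-project    # keep large project data out of your home directory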
Olvi GPU Use Policy
Olvi has 8 A100s per machine that must be shared between users. We ask that you observe the following limitations:
- Do not use more than 4 GPUs simultaneously. If you require more than 4 GPUs, we ask that you make a request to the system administrator to reserve a time for you.
- If you have a long-running job (more than a few minutes), keep at least one GPU free as a spare for others who may have small jobs. If a spare GPU is not available, you may have to wait.
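For example, a long-running job can be pinned to a subset of the GPUs from the shell so that it stays within the four-GPU limit and leaves the rest free for others (train.py is a hypothetical script name):
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py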
Olvi Communication
A Microsoft Teams group has been set up for asking questions about Olvi and for coordinating use of Olvi resources. If you have a long-running, GPU-intensive job, we ask that you post a note several hours before running it. This gives others a chance to respond if they have a project with a deadline or more immediate GPU needs.
Olvi Tips
GPU Management
When writing your training script, it is best to use only one GPU so that other users can use the remaining GPUs.
- To see which GPUs are available:
watch -n1 nvidia-smi
- To select an available GPU by its index number at the start of your Python script / notebook:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
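Note that CUDA_VISIBLE_DEVICES must be set before the framework initializes CUDA (setting it at the very top of the script, before importing torch, is the safe pattern). To confirm that only the selected GPU is visible, a quick check from the shell (assumes PyTorch is installed in your environment):
CUDA_VISIBLE_DEVICES=3 python -c "import torch; print(torch.cuda.device_count())"   # should print 1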
Using VSCode
In order to use VSCode on OLVI, you will need to:
- Download the Remote.SSH VS Code extension.
- In VSCode, set the following:
  - enabled: Remote.SSH: Lockfiles in Tmp
  - disabled: Remote.SSH: Use Flock
Using HuggingFace
HuggingFace models and datasets often need a lot of space. To move the cache folder from your home directory to /data, you can set the following environment variables (replace clo36 with your own folder under /data):
export TRANSFORMERS_CACHE=/data/clo36/huggingface/models
export HF_DATASETS_CACHE=/data/clo36/huggingface/datasets
export HF_MODULES_CACHE=/data/clo36/huggingface/modules
export HF_METRICS_CACHE=/data/clo36/huggingface/metrics
Tip: Add the above lines to ~/.bashrc so they persist after you log out and log back in.
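Newer releases of the Hugging Face libraries also honor a single HF_HOME variable that relocates the whole cache at once; a minimal sketch, assuming a recent huggingface_hub version (again, replace clo36 with your own folder):
export HF_HOME=/data/clo36/huggingface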
Using Docker
Some setup is required; see the quick guide below. For rootless-mode Docker, see the Docker documentation on rootless mode.
Docker Quick Guide
- Install rootlesskit:
/usr/bin/dockerd-rootless-setuptool.sh install
- Start the docker daemon in user mode.
systemctl --user start docker.service
- Specify $DOCKER_HOST:
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
- (Optional) Set docker daemon to run at login.
loginctl enable-linger
systemctl --user enable docker.service
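To confirm that the rootless daemon is working, a quick smoke test (assumes the machine can pull images from Docker Hub):
docker info | grep -i rootless    # rootless mode shows up under the security options
docker run --rm hello-world       # pulls and runs a tiny test container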
Using devcontainer
This is my sample devcontainer definition for LLM development on OLVI. See the docs for details.
{ "name": "project_name", "image": "huggingface/transformers-pytorch-deepspeed-latest-gpu", "runArgs": ["--security-opt", "seccomp=unconfined", "--gpus", "all", "--ipc=host"], "mounts": ["source=/data,target=/data,type=bind,consistency=cached"], "workspaceMount": "source=${localWorkspaceFolder},target=/project_name,type=bind,consistency=cached", "workspaceFolder": "/project_name", "remoteUser": "root", "customizations": { "features": { "git-lfs": "latest", "ghcr.io/devcontainers/features/docker-outside-of-docker:1": {} }, } }, "postCreateCommand": "bash .devcontainer/post-create.sh" }
Tip: To keep the above JSON clean, I have not included any VS Code extensions; you may want to add them to avoid a manual installation.
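If you prefer the command line to the VS Code UI, the devcontainer CLI can build and start a container from this definition; a minimal sketch, assuming Node.js/npm is available to install the CLI:
npm install -g @devcontainers/cli
devcontainer up --workspace-folder .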
Sample post-create.sh:
#!/bin/bash
apt-get update
apt-get install -y locales
locale-gen en_US.UTF-8
pip install -r .devcontainer/requirements.txt