Earlier this year, Data Science Institute (DSI) Research Scientist Matthew Feickert ran a national workshop at UW–Madison aimed at making reproducible machine learning workflows accessible to scientists. The workshop, supported through Feickert’s U.S. Research Software Sustainability Institute (URSSI) Early-Career Fellowship, brought 44 participants from 11 universities, national laboratories, research organizations, and companies across the United States to campus to learn how to use new open source tools and technologies.

“The goal of this workshop was to teach people methods and tools to solve their immediate research problems and collaborate with their colleagues better,” says Feickert.
Feickert’s work and the materials taught at the workshop are all publicly available and open source. Participants used Pixi, a declarative package management tool, to define software environments that leverage NVIDIA’s CUDA stack for hardware-accelerated AI. Once defined, these environments are automatically and indefinitely reproducible, byte for byte, across different machines and computing resources, from GPUs on the OSPool to commercial cloud instances.
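For readers unfamiliar with Pixi, the sketch below shows what such a declarative environment definition can look like. It is a minimal, hypothetical pixi.toml manifest; the project name, package versions, and training task are illustrative examples, not taken from the workshop materials.

    # Minimal, illustrative pixi.toml manifest (names and versions are examples)
    [project]
    name = "gpu-ml-example"        # hypothetical project name
    channels = ["conda-forge"]     # resolve packages from conda-forge
    platforms = ["linux-64"]       # target platform(s)

    # Declare that the machine must provide a CUDA-capable driver
    [system-requirements]
    cuda = "12"

    [dependencies]
    python = "3.12.*"
    pytorch-gpu = "*"              # CUDA-enabled PyTorch build from conda-forge

    [tasks]
    train = "python train.py"      # hypothetical training script

With a manifest like this, running "pixi run train" solves the environment, records the exact versions of every package in a pixi.lock lockfile, and executes the task; sharing that lockfile is what lets collaborators on other machines, such as OSPool GPU nodes or cloud instances, recreate the same environment.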
“Historically, this would have been considered too technically difficult to achieve for anyone but software and computing experts. Now, this has become a simple planning step for all researchers.” —Matthew Feickert, Workshop Organizer
Feickert partnered with the UW–Madison Data Science Hub, AI researchers, and staff from the Center for High Throughput Computing to teach this Carpentries-style workshop and to help participants apply the methods and workflows to real research problems in their fields.
“The diversity of participants’ backgrounds and experience made this workshop fantastic,” says Feickert. “We had people from bioinformatics, linguistics, computer science, AI, chemical engineering, industry, and more, who ranged from novices to practicing data scientists and AI researchers. It was exciting to have everyone unlock solutions to a whole range of challenges in their work, and to see people new to machine learning easily train models on remote GPUs using environments they crafted themselves a few minutes earlier.”

Feickert says he hopes the largest impact of the workshop is yet to come. For many scientists, securing the reproducibility of computing environments is a task left until the very end of a study and treated as a “best effort” attempt, given the complexity of use cases and tools and the pressure of deadlines. Full and automatic environment reproducibility at every step of the scientific process now comes down to a tooling choice, even for hardware-accelerated science.
“Historically, this would have been considered too technically difficult to achieve for anyone but software and computing experts,” says Feickert. “Now, this has become a simple planning step that researchers learned at the workshop, which they can share with their research groups and colleagues.”