Condo Cluster Service

Summary

BRC manages Savio, the high-performance computational cluster for research computing. Designed as a turnkey computing resource, it features flexible usage and business models, and professional system administration. Unlike traditional clusters, Savio is a collaborative system wherein the majority of nodes are purchased and shared by the cluster users, known as condo owners.

The model for sustaining computing resources is premised on faculty and principal investigators (PIs) purchasing compute nodes (individual servers) from their grants or other available funds, which are then added to the cluster. This allows PI-owned nodes to take advantage of Savio's high-speed InfiniBand interconnect and high-performance Lustre parallel filesystem storage. Operating costs for managing and housing PI-owned compute nodes are waived in exchange for letting other users make use of any idle compute cycles on those nodes. PI owners have priority access to computing resources equivalent to those purchased with their funds, but can access more nodes for their research if needed. This gives the PI much greater flexibility than owning a standalone cluster.

Program Details

Compute node equipment is purchased and maintained on a 5-year lifecycle. PIs owning nodes will be notified during year 4 that the nodes must be upgraded before the end of year 5. If the hardware is not upgraded by the end of 5 years, the PI may donate the equipment to Savio or take possession of it (removal of the equipment from Savio and transfer to another location is at the PI's expense); nodes left in the cluster after five years may be removed and disposed of at the discretion of the BRC program manager.

Once a PI has decided to participate, the PI or their designate works with the HPC Services manager and IST teams to procure the desired number of compute nodes and allocate the needed storage. There is a 4-node minimum buy-in for any given compute pool, and all 4 nodes must be of the same type (Standard, HTC, Bigmem, or GPU). Because GPU nodes are the most expensive, a group that has already purchased the 4-node minimum of any other node type may purchase and add single GPU nodes to its Condo. Procurement generally takes about three months from start to finish. In the interim, a test condo queue with a small allocation will be set up for the PI's users in anticipation of the new equipment. Users may also submit jobs to the general queues on the cluster using their Faculty Computing Allowance; such jobs are subject to general queue limitations, and guaranteed access to contributed cores is not provided until the purchased nodes are provisioned.

All group members have equal access to the condo resources, via a condo-specific Slurm QoS (the 'floating reservation' described below). The expectation is that the research group will collectively manage use of the resources by individual members.

Hardware Information

Warranty

Each system carries a 5-year warranty.

Basic specifications for the systems are listed below:

General Computing Node (256 GB RAM)
Processors: Dual-socket, 28-core, 2.1 GHz Intel Xeon Gold 6330 (56 cores/node)
Memory: 256 GB (16 x 16 GB) 2666 MHz DDR4 RDIMMs
Interconnect: 100 Gb/s Mellanox ConnectX-6 HDR-100 InfiniBand
Hard Drive: 1.92 TB NVMe SSD (local swap and log files)
Notes: These come in sets of 4, and the minimum buy-in is 4 nodes
Current Approximate Price (with tax): ~$42,500 for a Dell C6400 chassis with 4 nodes + 4 EDR 2M cables
Big Memory or HTC Computing Node (512 GB RAM)
Processors: Dual-socket, 28-core, 2.0 GHz Intel Xeon Gold 6330 (56 cores/node)
Memory: 512 GB (16 x 32 GB) 3200 MHz DDR4 ECC REG
Interconnect: 100 Gb/s Mellanox ConnectX-6 EDR InfiniBand
Hard Drive: 1.92 TB NVMe SSD (local swap and log files)
Notes: These come in sets of 4, and the minimum buy-in is 4 nodes
Current Approximate Price (with tax): ~$42,000 for 4 nodes + 4 2M cables
Very Large Memory Computing Node (1.5 TB RAM)
Processors: Dual-socket, 26-core, 2.1 GHz Intel Xeon Gold 6230 (52 cores/node)
Memory: 1.5 TB (24 x 64 GB) DDR4
Interconnect: 100 Gb/s Mellanox ConnectX-5 EDR InfiniBand
Hard Drive: 1 TB 7.2K RPM HDD (local swap and log files)
Notes: These can be purchased individually, but the minimum buy-in is 2 nodes
Current Approximate Price (with tax): $16,500 per node + $100 per EDR 2M cable
GPU Computing Node

We currently support the 8-way L40S server ($71,000) and the 8-way H100 Dell server ($250,000) as standard node offerings for the savio4_gpu partition. Please contact us (see below) if you are interested in purchasing GPUs for a condo.

Hardware Purchasing

Prospective condo owners should contact us for current pricing, and before purchasing any equipment, to ensure compatibility. If you are interested in other hardware configurations (e.g., HTC/Serial nodes), please contact us. BRC will assist with entering a compute node purchase requisition on behalf of UC Berkeley faculty.

Software

Prospective Condo owners should review the System Software section of the System Overview page to confirm that their applications are compatible with Savio's operating system, job scheduler and operating environment.

Storage

All institutional and condo users have a 30 GB home directory with backups. In addition, each research group is eligible to receive up to 200 GB of shared project space (30 GB for Faculty Computing Allowance accounts, 200 GB for Condo accounts) to hold research-specific application software shared among the users of a research group. All users have access to the Savio high-performance scratch filesystem for non-persistent data. Users or projects needing more space for persistent data can purchase additional performance-tier storage from IST at the current rate. For even larger needs, Condo partners may also take advantage of the Condo Storage service, which provides low-cost storage for very large data needs (minimum 25 TB).

Network

One 36-port unmanaged Mellanox InfiniBand leaf switch serves every 24 compute nodes.

Job scheduling

We will set up a floating reservation equivalent to the number of nodes you contribute to the Condo, giving you and your users priority access. You determine the run-time limits for your reservation. When you are not using your reservation, other users are allowed to run jobs on the unused nodes. If you submit a job when all nodes are busy, it is given priority over all other waiting jobs, but it must wait until nodes become free in order to run. We do not use preemptive scheduling, in which running jobs are killed to give immediate access to priority jobs.
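As a sketch, condo jobs are typically submitted through a standard Slurm batch script that names the group's condo account and QoS (the floating reservation described above). The account, partition, and QoS names below are placeholders for illustration, not actual Savio values; your group's names are assigned when the condo is provisioned.

```shell
#!/bin/bash
# Illustrative Slurm batch script for a condo job.
# Account, partition, and QoS names are hypothetical placeholders;
# use the values assigned to your group when your condo is set up.
#SBATCH --job-name=condo_example
#SBATCH --account=co_mygroup          # hypothetical condo account
#SBATCH --partition=savio2            # partition containing the condo's node type
#SBATCH --qos=mygroup_savio2_normal   # hypothetical condo QoS (the floating reservation)
#SBATCH --nodes=2
#SBATCH --time=24:00:00               # within the run-time limit the condo owner set

srun ./my_application
```

If the condo's nodes are busy, a job submitted this way waits in the queue with priority over non-condo jobs; running jobs are never preempted to make room for it.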

Note that the configuration above means that Condos do not have dedicated/reserved nodes. The basic premise of Condo participation is to facilitate the sharing of unused resources. Dedicating or reserving compute resources works counter to sharing, so this is not possible in the Condo model. As an alternative, PIs can purchase nodes and set them up as a Private Pool in the Condo environment, which will allow a researcher to tailor the access and job queues to meet their specific needs. Private Pool compute nodes will share the HPC infrastructure along with the Condo cluster; however, researchers will have to cover the support costs for BRC staff to manage their compute nodes. Please contact us for rates for Private Pool compute nodes.

Charter Condo Contributors

The following is a list of those who initially contributed Charter nodes to the Savio Condo, thus helping launch the Savio cluster:

Contributor Affiliation
Eliot Quataert Theoretical Astrophysics Center, Astronomy Department
Eugene Chiang Astronomy Department
Chris McKee Astronomy Department
Richard Klein Astronomy Department
Uros Seljak Physics Department
Jon Arons Astronomy Department
Ron Cohen Department of Chemistry, Department of Earth and Planetary Science
John Chiang Department of Geography and Berkeley Atmospheric Sciences Center
Fotini Katopodes Chow Department of Civil and Environmental Engineering
Jasmina Vujic Department of Nuclear Engineering
Jasjeet Sekhon Department of Political Science and Statistics
Rachel Slaybaugh Nuclear Engineering
Massimiliano Fratoni Nuclear Engineering
Hiroshi Nikaido Molecular and Cell Biology
Donna Hendrix Computation Genomics Research Lab
Justin McCrary Director, D-Lab
Alan Hubbard Biostatistics, School of Public Health
Mark van der Laan Biostatistics and Statistics, School of Public Health
Michael Manga, Department of Earth and Planetary Sciences
Sol Shiang Goldman School of Public Policy
Jeff Neaton Physics
Eric Neuscamman College of Chemistry
M. Alam Reza Mechanical Engineering
Elaine Tseng UCSF School of Medicine
Julius Guccione UCSF Department of Surgery
Ryan Lovett Statistical Computing Facility
David Limmer College of Chemistry
Doris Bachtrog Integrative Biology
Kranthi Mandadapu College of Chemistry
Kristin Persson Department of Materials Science and Engineering
Daryl Chrzan Department of Materials Science and Engineering
William Boos Earth and Planetary Science
Daniel Weisz Department of Astronomy
Peter Sudmant Integrative Biology
Priya Moorjani Molecular and Cell Biology

Faculty Perspectives

UC Berkeley Professor of Astrophysics Eliot Quataert speaks at the BRC Program Launch (22 May 2014) on the need for local high performance computing (HPC) clusters, distinct from national resources such as NSF, DOE (NERSC), and NASA.

UC Berkeley Professor of Integrative Biology Rasmus Nielsen speaks at the BRC Program Launch (22 May 2014) about the transformative effect of using HPC in genomics research.