OpenStack and HPC @ SCITAS
[email protected] (EPFL - SCITAS)
June 29, 2015

Why virtualization?

Use Cases

Most of our infrastructure is optimized for multi-node and single-node multi-core computations, and is highly sensitive to the performance of individual system components: CPU, I/O, interconnects. But some of our users are limited by the software or algorithms they use in one or more dimensions:
- single-core (or single-node) jobs with very long run times
- odd (sometimes pre-packaged) software environments


Why virtualization?

Our typical cluster

- 2-socket nodes, 32/64 GB RAM
- local disk (~1 TB)
- GPFS /scratch (~100 TB)
- 10 GbE
- InfiniBand QDR: 40 Gb/s, low-latency fat tree (in the clusters optimized for parallel workloads)
- 2x 10 GbE links to central GPFS (/home and /work)
- 10 GbE uplink to EPNET

Today we have >1000 nodes and >19600 cores. And a BlueGene/Q.


Why virtualization?

Objectives

a.k.a. things that are difficult to do now:
- transparent checkpointing and migration of long-running workloads
- support for different software environments: OS, custom software stacks
- on-demand, short-lived analysis runs of specific pre-packaged software stacks (Big? data)

In addition to catering for corner cases within our existing systems, this setup will enable us to be more responsive to changing requirements.


What?

Hybrid setup

Integration with the current batch system: we want to convert a fraction of our sequential-workloads cluster into an OpenStack instance, while maximizing re-use of the existing supporting infrastructure.

For the first two use cases we want the introduction of virtualization to be transparent to the users: resources will be available through the batch scheduler (SLURM). This will be expanded to other use cases, available on demand through different interfaces (CLI, web portal), as sketched below.

By choosing OpenStack, and integrating it into our existing systems using the API, we can transparently integrate with other instances (at EPFL, SWITCH, ...). And vice versa.
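As a rough illustration of the on-demand path, a minimal sketch using the era's standard OpenStack CLI; the flavor, image, and network names are hypothetical placeholders, not our actual configuration:

    # Authenticate against the OpenStack API (credentials file is a placeholder)
    source ~/openrc.sh

    # Boot a short-lived VM on the flat network; all names are hypothetical,
    # NET_UUID stands for the pre-registered flat network's UUID
    nova boot --flavor m1.small \
              --image centos-7-analysis \
              --nic net-id=NET_UUID \
              analysis-vm-01

    # Tear it down once the analysis run is finished
    nova delete analysis-vm-01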


How?

Initial VM Infrastructure

A few nodes are being drained to be repurposed as our OpenStack instance.

Volume storage
- shared storage for easy migration
- existing GPFS instance

Image storage
- existing GPFS instance (configuration sketch below)

Network
- flat, with pre-registered addresses
- management nodes and hypervisors in a separate VLAN
- VMs in the same network as regular SLURM nodes
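For the GPFS-backed image and volume stores, a minimal configuration sketch; the mount points are hypothetical, though the filesystem store for Glance and the IBM GPFS driver for Cinder both ship with contemporary OpenStack releases:

    # glance-api.conf: keep images on the existing GPFS instance
    [glance_store]
    default_store = file
    filesystem_store_datadir = /gpfs/openstack/images    # hypothetical path

    # cinder.conf: GPFS-backed volumes via the in-tree IBM driver
    [DEFAULT]
    volume_driver = cinder.volume.drivers.ibm.gpfs.GPFSDriver
    gpfs_mount_point_base = /gpfs/openstack/volumes      # hypothetical path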


How?

Unknowns

Unknowns: what can go wrong?

Integration with the SLURM batch scheduler
- basic provisions exist within SLURM to use CLOUD nodes (see the sketch after this list)
- how much development effort will it require?
- will scheduler performance be impacted by disappearing nodes?

Storage
- will GPFS perform well for this use case?

What infrastructure changes (network in particular) are needed to have a hybrid HPC/Cloud cluster?
- we currently use NAT but are working with the Network team to change this
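For reference, a minimal sketch of the SLURM CLOUD provisions mentioned above, using SLURM's elastic computing / power saving hooks; node names, sizes, and script paths are hypothetical:

    # slurm.conf: virtual nodes instantiated and released on demand
    SuspendProgram=/usr/local/sbin/slurm_suspend   # would call 'nova delete'
    ResumeProgram=/usr/local/sbin/slurm_resume     # would call 'nova boot'
    SuspendTime=600        # idle seconds before a node is released
    ResumeTimeout=300      # seconds allowed for a VM to boot and register

    # Hypothetical pool of cloud nodes, hidden until they are resumed
    NodeName=vm[001-064] CPUs=16 RealMemory=60000 State=CLOUD
    PartitionName=serial Nodes=vm[001-064] MaxTime=INFINITE State=UP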


How?


SLURM is PaaS

A batch system is already PaaS (provided your service can run in user space): do we need a cloud to deploy applications? Why not use our clusters in "native mode" and get bare-metal performance and full use of the fast interconnects:
- https://github.com/chu11/magpie
- http://hibd.cse.ohio-state.edu/
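To illustrate the point, a minimal sketch of a user-space service deployed as an ordinary SLURM job; the service binary and resource figures are hypothetical (frameworks such as Magpie automate this pattern for Hadoop/Spark):

    #!/bin/bash
    #SBATCH --job-name=user-space-service
    #SBATCH --nodes=4
    #SBATCH --time=12:00:00

    # One hypothetical service instance per node: no hypervisor involved,
    # bare-metal performance and the fast interconnect remain available
    srun --ntasks-per-node=1 ./my_service --nodes "$SLURM_JOB_NODELIST"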


Future

Looking ahead

Using the inherent flexibility of batch (SLURM) workloads to provide the illusion of infinite resources for the cloud part

Virtualization for parallel workloads
- SR-IOV: IB passthrough

Use of accelerators in VMs
- GPGPUs: rCUDA (hedged sketch below)

SWIFT Object Storage (on top of GPFS)
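As a rough illustration of the rCUDA idea (remote GPUs made visible inside a VM over the network), a hedged sketch based on the rCUDA client documentation; the host name and install path are hypothetical, and the exact variable names should be checked against the rCUDA release in use:

    # Point the rCUDA client at one remote GPU server (hypothetical host)
    export RCUDA_DEVICE_COUNT=1
    export RCUDA_DEVICE_0=gpu-server01.example.epfl.ch

    # Run the unmodified CUDA application against rCUDA's runtime library
    export LD_LIBRARY_PATH=/opt/rcuda/lib:$LD_LIBRARY_PATH
    ./my_cuda_app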


We as users of OpenStack


We could really profit from a self-service virtualization service with a standard API:
- monitoring and other ancillary services
- software testing (continuous integration builds; sketch below)
- general service development
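A hedged sketch of the continuous-integration use case, one throwaway build VM per run via the same CLI as earlier; image, flavor, key name, and the Jenkins-style BUILD_ID variable are all hypothetical:

    # Boot a throwaway build VM and wait for it to become active
    nova boot --flavor m1.medium --image ci-builder \
              --key-name jenkins --poll "build-$BUILD_ID"

    # ... run the build and tests over ssh ...

    # Discard the VM so every build starts from a clean image
    nova delete "build-$BUILD_ID"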


End

Thank you! Questions?
