Linux Open Source Distributed Filesystem Ceph at SURFsara
Remco van Vugt
July 2, 2013
Agenda
- Ceph internal workings
  - Ceph components
  - CephFS
  - Ceph OSD
- Research project results
  - Stability
  - Performance
  - Scalability
  - Maintenance
  - Conclusion
- Questions
Ceph components
CephFS
- Fairly new, under heavy development
- POSIX compliant
- Can be mounted through FUSE in userspace, or by the kernel driver (a mount sketch follows below)
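As an illustration of the two mount paths, here is a minimal sketch that simply shells out to the standard tools; the monitor address, mount point and secret file are hypothetical placeholders, not values from this research.

```python
import subprocess

MON = "mon1.example.com:6789"   # hypothetical monitor address
MOUNTPOINT = "/mnt/cephfs"      # hypothetical mount point

# Kernel client: uses the in-kernel ceph filesystem driver.
subprocess.check_call([
    "mount", "-t", "ceph", MON + ":/", MOUNTPOINT,
    "-o", "name=admin,secretfile=/etc/ceph/admin.secret",
])

# Userspace alternative: the FUSE client.
# subprocess.check_call(["ceph-fuse", "-m", MON, MOUNTPOINT])
```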
CephFS (2)
Figure: Ceph state of development
CephFS (3)
Figure: Dynamic subtree partitioning
Ceph OSD
- Stores object data in flat files in the underlying filesystem (XFS, BTRFS)
- Multiple OSDs on a single node (usually one per disk)
- 'Intelligent daemon': handles replication, redundancy and consistency
CRUSH
- Cluster map
- Object placement is calculated, instead of indexed (a simplified placement sketch follows below)
- Objects are grouped into Placement Groups (PGs)
- Clients interact directly with OSDs
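To make "calculated instead of indexed" concrete, here is a deliberately simplified Python sketch of the idea: the object name is hashed onto a placement group, and the PG is then mapped onto a set of OSDs. Real Ceph uses its own rjenkins hash and the full CRUSH algorithm over the cluster map and failure domains; the hash and the PG-to-OSD mapping below are stand-ins for illustration only.

```python
import hashlib

PG_NUM = 128            # placement groups in the pool (illustrative value)
REPLICAS = 3            # copies kept of every object
OSDS = list(range(12))  # pretend cluster with 12 OSDs

def object_to_pg(object_name):
    # Hash the object name and take it modulo the PG count.
    # Ceph uses rjenkins here; md5 is only a stand-in.
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pg):
    # Stand-in for CRUSH: derive REPLICAS distinct OSDs from the PG id.
    # Real CRUSH walks the cluster map and respects failure domains.
    return [(pg + i * 7) % len(OSDS) for i in range(REPLICAS)]

pg = object_to_pg("my-object")
print("object 'my-object' -> PG %d -> OSDs %s" % (pg, pg_to_osds(pg)))
```

Because every client can run this calculation itself, there is no central lookup table to query or keep consistent, which is why clients can talk to OSDs directly.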
Placement group
Figure: Placement groups
Failure domains
Figure: CRUSH algorithm
Replication
Figure: Replication
Monitoring
- OSDs use peering, and report on each other
- An OSD is either up or down
- An OSD is either in or out of the cluster
- The MONs keep the overview and distribute cluster map changes (a status sketch follows below)
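For scripted monitoring, the up/down and in/out state can be read back from the monitors. The sketch below uses the python-rados bindings and assumes a build that exposes mon_command and a readable /etc/ceph/ceph.conf; the JSON field names follow the output of 'ceph osd dump --format=json', so treat them as an assumption to verify against the running version.

```python
import json
import rados

# Connect as client.admin using the local cluster configuration.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

# Ask the monitors for the OSD map; every OSD is reported up/down and in/out.
cmd = json.dumps({"prefix": "osd dump", "format": "json"})
ret, out, errs = cluster.mon_command(cmd, b"")
osdmap = json.loads(out.decode("utf-8"))

for osd in osdmap["osds"]:
    state = "up" if osd["up"] else "down"
    membership = "in" if osd["in"] else "out"
    print("osd.%d: %s / %s" % (osd["osd"], state, membership))

cluster.shutdown()
```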
OSD fault recovery
- OSD down: I/O continues to the secondary (or tertiary) OSD assigned to the PG (active+degraded)
- OSD down longer than the configured timeout: the OSD is marked down and out (kicked out of the cluster)
- PG data is remapped to other OSDs and re-replicated in the background
- PGs can be down if all copies are down
Rebalancing
Research
Research questions
Research question
- Is the current version of CephFS (0.61.3) production-ready for use as a distributed filesystem in a multi-petabyte environment, in terms of stability, scalability, performance and manageability?
Sub questions
- Is Ceph, and in particular the CephFS component, stable enough for production use at SURFsara?
- What are the scaling limits in CephFS, in terms of capacity and performance?
- Does Ceph(FS) meet the maintenance requirements for the environment at SURFsara?
Stability
- Various tests performed, including:
  - Cut power from OSD, MON and MDS nodes
  - Pulled disks from OSD nodes (within the failure domain)
  - Corrupted underlying storage files on an OSD
  - Killed daemon processes
- No serious problems encountered, except for multi-MDS
- Never encountered data loss
Performance
- Benchmarked RADOS and CephFS:
  - Bonnie++
  - RADOS bench (a simplified stand-in is sketched below)
- Tested under various conditions:
  - Normal
  - Degraded
  - Rebuilding
  - Rebalancing
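As a reference for what a RADOS-bench-style write test does, here is a much-reduced stand-in using the python-rados bindings; the pool name, object size and object count are placeholders, and the measurements reported in this research came from Bonnie++ and the rados bench tool themselves.

```python
import time
import rados

POOL = "bench"                  # hypothetical test pool
OBJECT_SIZE = 4 * 1024 * 1024   # 4 MiB objects, the rados bench default size
COUNT = 256                     # number of objects to write (~1 GiB in total)

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

payload = b"\x00" * OBJECT_SIZE
start = time.time()
for i in range(COUNT):
    # write_full replaces the whole object with the payload in one call
    ioctx.write_full("bench_obj_%d" % i, payload)
elapsed = time.time() - start

written_mb = COUNT * OBJECT_SIZE / 1e6
print("wrote %.0f MB in %.1f s -> %.1f MB/s" % (written_mb, elapsed, written_mb / elapsed))

ioctx.close()
cluster.shutdown()
```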
RADOS Performance
CephFS Performance
CephFS MDS Scalability
- Tested metadata performance using mdtest
- Various POSIX operations, using 1000, 2000, 4000, 8000 and 16000 files per directory
- Tested 1 and 3 MDS setups
- Tested single and multiple directories (a simplified metadata microbenchmark is sketched below)
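To illustrate what such a metadata test exercises, here is a much-reduced Python stand-in: it creates, stats and unlinks N files in a single directory on a mounted CephFS and reports operations per second. The path is a hypothetical mount point and the file count is just one of the sizes listed above; the actual measurements were taken with mdtest.

```python
import os
import time

TARGET_DIR = "/mnt/cephfs/mdtest"   # hypothetical CephFS mount point
NUM_FILES = 4000                    # one of the per-directory file counts tested

def timed(label, func):
    start = time.time()
    func()
    print("%s: %.0f ops/s" % (label, NUM_FILES / (time.time() - start)))

os.makedirs(TARGET_DIR)
names = [os.path.join(TARGET_DIR, "f%05d" % i) for i in range(NUM_FILES)]

# Each phase stresses the MDS with a different metadata operation.
timed("create", lambda: [open(n, "w").close() for n in names])
timed("stat",   lambda: [os.stat(n) for n in names])
timed("unlink", lambda: [os.unlink(n) for n in names])
os.rmdir(TARGET_DIR)
```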
CephFS MDS Scalability (2)
- Results:
  - Did not multi-thread properly
  - Scaled over multiple MDSs
  - Scaled over multiple directories
  - However...
CephFS MDS Scalability (3)
Ceph OSD Scalability
- Two options for scaling:
  - Horizontal: adding more OSD nodes
  - Vertical: adding more disks to OSD nodes
- But how far can we scale?
Scaling horizontal
Number of OSDs   PGs    MB/sec   Max (MB/sec)   Overhead %
24               1200   586      768            24
36               1800   908      1152           22
48               2400   1267     1500           16
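The overhead column relates the measured aggregate throughput to the listed maximum. A quick check of the figures above (the middle row comes out at 21% here rather than the reported 22%, presumably due to rounding in the underlying measurements):

```python
# Measured aggregate throughput versus the listed maximum, per OSD count (MB/s).
results = {24: (586, 768), 36: (908, 1152), 48: (1267, 1500)}

for osds, (measured, maximum) in sorted(results.items()):
    overhead = (maximum - measured) / float(maximum) * 100
    print("%d OSDs: %d of %d MB/s -> %.0f%% overhead" % (osds, measured, maximum, overhead))
```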
Scaling vertical
- OSD scaling:
  - Add more disks, possibly using external SAS enclosures
  - But each disk adds overhead (CPU, I/O subsystem)
Scaling vertical (2)
Scaling vertical (3)
Scaling OSDs
- Scaling horizontally seems to be no problem
- Scaling vertically has its limits:
  - Possibly tunable
  - Jumbo frames?
Maintenance
- Built-in tools are sufficient
- Deployment:
  - Crowbar
  - Chef
  - ceph-deploy
- Configuration:
  - Puppet
Research (2)
Research question
- Is the current version of CephFS (0.61.3) production-ready for use as a distributed filesystem in a multi-petabyte environment, in terms of stability, scalability, performance and manageability?
Sub questions
- Is Ceph, and in particular the CephFS component, stable enough for production use at SURFsara?
- What are the scaling limits in CephFS, in terms of capacity and performance?
- Does Ceph(FS) meet the maintenance requirements for the environment at SURFsara?
Conclusion
- Ceph is stable and scalable
  - RADOS storage backend
  - Possibly also RBD and object storage, but outside the scope of this research
- However: CephFS is not yet production ready
  - Scaling is a problem
  - MDS failover was not smooth
  - Multi-MDS not yet stable (let alone directory sharding)
- However: developer attention is back on CephFS
Conclusion (2)
- Maintenance:
  - Extensive tooling available
  - Integration into existing toolset possible
  - Self-healing, low maintenance possible
Questions?