Storage Becomes First Class Memory

2016 ICSEE International Conference on the Science of Electrical Engineering

Netanel Katzburg
School of Electrical Engineering, Tel-Aviv University

Amit Golander
Plexistor

Shlomo Weiss
School of Electrical Engineering, Tel-Aviv University

Abstract—Computer architectures have always addressed memory and storage differently. The memory subsystem is an integral part of any processor design, while storage is placed on the I/O subsystem and accessed via several software layers. Emerging storage systems, however, are challenging this fundamental and decades-old assumption. First class memory is an entity that supports all the operations generally available to main memory. This article describes how storage is becoming a first class memory. We explore the benefits of novel hardware and software technologies, demonstrating a 280x speedup at the storage layer over modern flash and file systems, which translated to a 3.8x speedup at the application layer when measuring SQL transactions on the PostgreSQL database. We then show that traditional data access tradeoffs become irrelevant and, as a result, application programming is significantly simplified.

I. INTRODUCTION

Storage has become the performance bottleneck for many applications and middleware layers, driving R&D investments to optimize software to match the characteristics of the underlying storage devices. The two common storage devices, solid-state drives (SSDs) and hard-disk drives (HDDs), are both block-based devices, are 1,000 to 1,000,000 times slower than memory, and provide better performance for sequential rather than random access patterns.

Recently, a new generation of storage devices has been emerging [6]. These devices are much closer in characteristics to memory than to SSDs, and are called non-volatile DIMMs (NVDIMM) or Persistent Memory (PM). They are random-access, cache-line-addressable devices that can be placed on the memory interconnect (i.e. the DDR4 DIMM channel) and respond just as quickly as DRAM (e.g. STT-MRAM [7]) or a bit slower (e.g. RRAM [16]).

The hardware and software ecosystem around the emerging PM devices is also maturing [12]. JEDEC added the required signals to the memory interconnect. ACPI standardized the NVDIMM firmware interface table [9]. Open source contributors added a Linux driver for PM range registration (pmem) and direct access (DAX) enabled file systems [14]. Microsoft is adding similar support in Windows Server 2016 [2].

Unlike traditional file systems, which packetize data and metadata into blocks and access much-slower-than-memory media via a block software layer, PM-based file systems directly access byte-granular data on fast media. PM-based file systems converge memory and storage, so that applications can access memory as if it were storage (fast access) and storage as if it were memory (DAX). Figure 1 shows that a PM-based file system was measured to boost performance by a factor of 280 compared to running the same disk-bound benchmark on a traditional FS and a modern enterprise-grade flash SSD.

Fig. 1. Performance of traditional and PM-based file systems under a random mixed load, in IOPS (logarithmic scale). The FIO benchmark uses many threads to randomly access 4KB (block-aligned) offsets in files; 70% of the accesses are reads and the remaining 30% are writes.

This paper makes the following contributions:
1) It demonstrates that the performance bottleneck of traditionally disk-bound applications is likely to move to the application code itself.
2) It shows that the traditional tradeoffs between POSIX Direct I/O, direct memory map, and indirect synchronous versus asynchronous I/O are far less significant, which may lead to simpler and better programs.
3) It analyzes under which conditions the traditional latency assumptions (read versus write access, sequential versus random access, and first versus repeated access) are changing.

II. EXPERIMENTAL METHODOLOGY

For the remainder of this paper we measure the performance of well-known benchmarks on the same hardware and platform using two file systems. The baseline is a modern block-based file system (FS) using a modern flash SSD, with volatile memory for software (SW) caching; it is compared against the first commercial PM-based file system (M1FS). Table I lists the hardware and platform used for both the traditional and the PM-based file systems: a modern dual-socket commodity server, a BIOS that disables all throttling and power management features, and a recent Linux kernel version. The baseline block-based file system is XFS, which is the default for Red Hat Enterprise Linux 7 (RHEL7).


TABLE I
HARDWARE AND SOFTWARE CONFIGURATION.

                   Traditional (block-based FS)          Emerging (PM-based FS)
Operating system   CentOS Linux release 7.2.1511, Linux kernel 4.5 (both)
CPU                Dual-socket Intel Xeon E5-2650v3 (Haswell), 20 HW threads per socket, 2.3GHz (both)
Volatile memory    64GB Micron DDR4 DIMMs at 2133MHz (both)
File system        XFS v1.0.2 (the default FS)           M1FS v1.7.2 (Plexistor PM-based FS)
Storage            960GB SanDisk CloudSpeed 1000 SSD     64GB Micron DDR4 NVDIMMs at 2133MHz
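For concreteness, the following minimal C sketch mimics the access pattern of the FIO workload behind Figure 1 (random, 4KB-aligned offsets, roughly 70% reads and 30% writes). It is an illustration only: it is single-threaded, and the file path, file size and operation count are arbitrary assumptions rather than the parameters used in our measurements.

    /* Minimal sketch of a random 4KB, 70% read / 30% write access loop,
     * loosely mimicking the FIO workload of Figure 1 (single thread only).
     * Assumption: the file at DATA_PATH already exists and is FILE_SIZE bytes. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define DATA_PATH "/mnt/testfs/testfile"  /* hypothetical mount point */
    #define FILE_SIZE (1024L * 1024 * 1024)   /* 1GiB test file           */
    #define BLOCK     4096L                   /* 4KB, block aligned       */
    #define OPS       100000

    int main(void)
    {
        char buf[BLOCK];
        int fd = open(DATA_PATH, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        long nblocks = FILE_SIZE / BLOCK;
        srand(42);
        for (long i = 0; i < OPS; i++) {
            off_t off = (rand() % nblocks) * BLOCK;  /* random aligned offset */
            if (rand() % 100 < 70)                   /* ~70% reads            */
                pread(fd, buf, BLOCK, off);
            else                                     /* ~30% writes           */
                pwrite(fd, buf, BLOCK, off);
        }
        close(fd);
        return 0;
    }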

Performance is comprised of latency and throughput (i.e. I/O operations per second, IOPS). Both were measured on the same test (i.e. latency was measured on a fully loaded system) and averaged across three test iterations. We distinguish IOPS and latency results using vertical and horizontal bars, respectively. The main benchmarks are the widely used synthetic FIO benchmark (v2.1.2) and DBT-2 [15], an open source OLTP transactional performance test. DBT-2 is a fair-usage implementation of the TPC's TPC-C benchmark specification. For specific needs, SPEC SFS 2014 was used. SPEC SFS is a standard performance evaluation suite that measures representative file server workloads for comparing performance across different vendor platforms. We used the database flavor and focused on overall response time.

III. PM-AWARE FS AS THE FOUNDATION FOR NEXT-GENERATION IN-MEMORY COMPUTE

Real-time analytics requires fast access to data. Internet-of-things and social media drive high demand for sub-block write granularity. These demands cannot be met by traditional file systems and storage, which led to large investments in developing ad-hoc in-memory databases (IMDB) and applications [11], such as Redis, SAP HANA, and Spark.

A. Small Write Accesses

Traditional file systems are block based and achieve their best write performance when the access request is aligned in offset and size to the block size they use. A partial write degrades performance because it involves twice the I/O of a full block write, due to the read-modify-write nature of a small sub-block write. Measurements show that an application performing small 0.25KB write accesses achieves 65% of the IOPS it would have achieved with larger, block-aligned write accesses. For this reason, optimized applications have traditionally tuned their data structures to align with the required block size. MySQL InnoDB, for example, uses 16KB pages for standard records. In Microsoft SQL Server the page size is 8KB, and the recommended storage block size is 8KB or 4KB in order to avoid significant performance degradation.

PM, on the other hand, is byte addressable and is best accessed at cache-line granularity (e.g. 64B). Measurements show that for PM-based file systems smaller accesses indeed yield higher IOPS, by 25%. From a throughput perspective it is still better to perform fewer, larger write system calls, but the throughput is still two orders of magnitude higher than with a traditional file system. With PM-aware file systems, databases and their deployments can be made simpler, because they do not have to align to the implementation details of the underlying software layer.
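To make the alignment discussion concrete, the following illustrative sketch (the file path and offsets are arbitrary assumptions) issues a small 256-byte write at an offset inside a 4KB block and, for comparison, a full block-aligned 4KB write. On a block-based FS the former triggers a read-modify-write of the containing block, while a PM-aware FS can absorb it at cache-line granularity.

    /* Sub-block vs. block-aligned writes (illustrative sketch).
     * On a block-based FS, the 256B write at an unaligned offset forces the FS
     * to read, modify and re-write the surrounding 4KB block; a PM-aware FS
     * can persist it at cache-line granularity instead. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char small[256], block[4096];
        memset(small, 'a', sizeof(small));
        memset(block, 'b', sizeof(block));

        int fd = open("/mnt/testfs/records.dat", O_RDWR | O_CREAT, 0644); /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        /* 0.25KB record landing in the middle of a 4KB block: sub-block write. */
        pwrite(fd, small, sizeof(small), 4096 + 1000);

        /* Same payload size as the FS block, aligned to a 4KB boundary. */
        pwrite(fd, block, sizeof(block), 2 * 4096);

        fsync(fd);
        close(fd);
        return 0;
    }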

Fig. 2. PM-aware file system performance: latency (average, 99.90th and 99.99th percentile, in μsec) as a function of random 4KB write IOPS. One can observe the low and consistent latency, with very small variance.

B. Enable New Application Level Functionality

New application-level functionality, such as stream, real-time and predictive analytics, often generates many small, write-intensive accesses. These drove in-memory frameworks (e.g. Apache Spark), but as just shown they are no longer a problem with PM-aware file systems. Relational databases will also benefit from the large performance improvement, especially when expensive operations or queries are executed, for example when generating comprehensive reports. Full table scans are a fundamental requirement in a column-store database and need to be fast [8]. Execution time can occasionally be traded for pre-processing and additional capacity. Alternatively, an IMDB can be added as part of the solution.

Most IMDB implementations (e.g. Redis) assume that the entire data set has to reside in memory, but memory capacity is both expensive and limited. In one type of deployment, IMDBs are used as a caching layer that serves read queries, while another, slower but transaction-safe database holds the data, which yields complex and expensive end-to-end solutions. In another type of deployment, IMDBs handle a stage within a data pipeline alongside other in-memory frameworks and traditional storage solutions. Such deployments are hard to manage, because each framework implements its own data services, such as flushing to persistent storage or taking a snapshot, at the level of the isolated in-memory framework. Now that generic storage infrastructure is catching up to memory-like performance, there is an opportunity to achieve the same or better performance in a way that makes data management safer and easier.

The average database response time of the SPEC SFS 2014 benchmark was measured at 0.01 milliseconds, as opposed to 0.18 on the traditional FS. Using the traditional POSIX APIs on a PM-based FS was enough to achieve a result that is lower than the measurement granularity a modern benchmark can report. Furthermore, as shown in Figure 2, the PM-based file system consistently offers very low latency with small variance, implying that performance is steady and uniform, especially for realistic access loads (e.g. < 1,000,000 IOPS). Having a common foundation across several in-memory databases, as well as traditional databases and other applications, will allow data sharing, improve quality and simplify in-memory database design, leading to increased focus on user-driven functionality rather than infrastructure challenges.

IV. PERFORMANCE BOTTLENECK RETURNS TO THE APPLICATION

To keep modern multi-core server processors utilized, a large amount of data needs to be accessed quickly. The growth in compute power drove many applications to become memory and storage bound over the years. PM hardware responds in memory-like latencies, and PM-aware FSs discourage the use of SW caching and the need to make frequent data copies. Both characteristics imply that despite the performance speedup seen at the application level, some of it will be blocked by inefficiencies in the application software itself, inefficiencies that were thus far hidden behind long storage latencies.

A. Storage Latency is No Longer the Bottleneck

File system access latency is mainly comprised of the storage device, interconnect, storage software and operating system (OS) context switch latencies. Measurements show that latencies in idle conditions are lower for both systems. In both fully utilized and idle conditions, the time an application has to wait for an access request is over two orders of magnitude shorter when the application runs on a PM-based file system. In real-life workloads, access requests on a PM-based file system are no longer orders of magnitude slower than memory and lock latencies. Bottlenecks can easily shift to any element of the system that communicates or responds at a different scale. New hardware, such as hybrid memory cube interconnects and 100GE networking, may further expose inefficiencies in the application code itself.

B. CPU Frees up to Perform User Tasks

Traditionally, storage-bound applications waste most of their CPU resources. Figure 3 shows that when the user-space database ran on a traditional file system, useful work (i.e. user space) was only able to leverage 13.8% of the CPU time. Running the same workload on a PM-based file system increases the useful portion of work to over half of the time (58%). Clearly, there is little to be gained from additional performance enhancements in the storage or system code; further performance enhancement would have to come from improving the way applications are written.

Fig. 3. PostgreSQL server CPU utilization breakdown (user, system, iowait, soft and idle time) on the traditional FS and on the PM-based FS. The client runs 10 concurrent DBT-2 processes. CPU usage is split between Linux user space (i.e. the benchmark) and system time (including the execution of storage system calls as well as other system calls).
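As a hedged illustration of how such access latencies can be observed from user space (a minimal sketch, not the measurement harness used for this paper; the file path is an assumption), a single 4KB read request can be timed as follows.

    /* Timing a single 4KB read request (illustrative only).
     * The measured interval covers storage software, device and
     * context-switch latency as seen by the application. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/mnt/testfs/testfile", O_RDONLY);  /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t n = pread(fd, buf, sizeof(buf), 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double usec = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("read of %zd bytes took %.2f usec\n", n, usec);

        close(fd);
        return 0;
    }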

V. TRADITIONAL POSIX TRADEOFFS ARE NO LONGER CORRECT

Traditional storage is a scarce resource, and the POSIX API was designed to reflect that. Table II presents the main approaches for performing an access and the pros and cons associated with each approach. Application developers are expected to select, per file and per access, the approach that reflects the right tradeoff. This is obviously a burden that slows down the development process and is error prone.


TABLE II
THE MAIN POSIX APPROACHES FOR ACCESSING FILE SYSTEMS, AND THE PROS AND CONS AS ASSUMED BY USERS OF TRADITIONAL STORAGE.

Direct I/O
  Description: accesses the media while bypassing the software cache.
  Traditional pros: high write throughput; also efficient for applications that perform their own SW caching.
  Traditional cons: slow read accesses.

Sync
  Description: accesses the software cache and then the media (for writes and read misses).
  Traditional pros: safe; the request is acknowledged only after it is written to media.
  Traditional cons: slow write accesses.

Async
  Description: accesses the software cache and writes to the media at a later time.
  Traditional pros: relatively low latency.
  Traditional cons: unsafe (pre-fsync); writes are acknowledged before the data is persistent.

mmap
  Description: maps the storage to memory (typically the software cache) and allows direct application access via machine-level instructions such as load and store.
  Traditional pros: saves memory allocations (capacity and latency); also convenient when storing structures and selectively accessing fields.
  Traditional cons: unsafe (pre-msync); writes are acknowledged before the data is persistent.
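To make the Sync and Async rows of Table II concrete, the following minimal sketch (file path and sizes are assumptions) contrasts a write that is acknowledged only after fsync() has pushed it to durable media with a buffered write whose durability is deferred.

    /* "Sync" vs "Async" in the sense of Table II (illustrative sketch).
     * Sync: the caller waits until the data has reached durable media.
     * Async: the write is acknowledged once it reaches the software cache;
     *        durability arrives later (here, at the explicit fsync). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void sync_write(int fd, const char *buf, size_t len, off_t off)
    {
        pwrite(fd, buf, len, off);
        fsync(fd);                 /* blocks until data and metadata are durable */
    }

    static void async_write(int fd, const char *buf, size_t len, off_t off)
    {
        pwrite(fd, buf, len, off); /* returns after copying into the page cache  */
        /* data is only eventually persistent; an fsync/fdatasync comes later    */
    }

    int main(void)
    {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open("/mnt/testfs/log.dat", O_RDWR | O_CREAT, 0644); /* hypothetical */
        if (fd < 0) { perror("open"); return 1; }

        sync_write(fd, buf, sizeof(buf), 0);      /* safe, traditionally slow  */
        async_write(fd, buf, sizeof(buf), 4096);  /* fast, unsafe until fsync  */

        fsync(fd);  /* eventually make the async write durable as well */
        close(fd);
        return 0;
    }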

PM-aware file systems do not typically employ software caching, because PM is as fast as, or only a bit slower than, volatile memory. In the remainder of this section we show why the traditional pros and cons are no longer correct, and claim that most of the differences are insignificant; eliminating these constraints simplifies the task of application development.

A. Software Caching and Direct I/O

Software caching (e.g. the Linux page cache) keeps a copy of the data in volatile memory in an attempt to improve the latency of recurring read accesses. A Direct I/O access bypasses the software cache, meaning that the access is made directly to the persistent storage media, which saves one copy of the data during every write access.
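As a sketch of what the Direct I/O path looks like in practice on Linux (the O_DIRECT flag requires suitably aligned buffers; the file path and 4KB alignment are assumptions), a read that bypasses the software cache can be issued as follows.

    /* Bypassing the software cache with O_DIRECT (Linux-specific flag).
     * O_DIRECT transfers go straight to the storage media, so the user buffer
     * must be aligned (here to 4KB, which satisfies typical device requirements). */
    #define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {  /* aligned 4KB buffer */
            perror("posix_memalign");
            return 1;
        }

        int fd = open("/mnt/testfs/testfile", O_RDONLY | O_DIRECT); /* hypothetical path */
        if (fd < 0) { perror("open"); free(buf); return 1; }

        /* This read bypasses the page cache and hits the media directly. */
        ssize_t n = pread(fd, buf, 4096, 0);
        printf("direct read returned %zd bytes\n", n);

        close(fd);
        free(buf);
        return 0;
    }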



Fig. 4. Average read latencies (in μsec) with and without the Direct I/O flag. The benchmark comprises 10 repeated accesses per datum. Shorter bars are better.

The average access latencies of a highly repetitive read benchmark, configured to either use or bypass the SW cache, are presented in Figure 4. Traditional file systems use the software cache by default, so forcing them to operate in Direct I/O mode degraded the performance of this highly repetitive benchmark by a factor of nearly nine, which is almost linear in the level of repetition in the benchmark. The same benchmark running on a PM-based file system shows better results and no performance sensitivity to the application configuration. The reason is that software caching is not typically used anyway, because volatile memory is not significantly faster than persistent memory.

B. Synchronous and Asynchronous Accesses

Writing data to traditional storage consumes time, because the data and its describing metadata have to be copied to relatively slow durable storage media before the write can be committed. For this reason, blocking (Sync) and non-blocking (Async) access approaches were developed. The Sync approach is serial and slow but transaction safe. The Async approach is fast but only eventually persistent. The measured opportunity to boost performance by a factor of ten, weighed against the risk of losing data, explains why application developers had to make these tradeoffs with traditional storage.

Comparing the performance of the Async and Sync approaches on a PM-based file system yields a different conclusion. Measurements reveal insignificant performance differences between the two approaches. Async results can be slightly worse due to context switch overhead, or slightly better thanks to relaxed mirroring and hardware cache flushing requirements. In any case, the difference is no longer as significant as with traditional storage, meaning that most application developers can benefit from design simplicity and settle for the Sync approach across all accesses.

C. Storage and Memory Accesses

Thus far we have evaluated approaches that access storage via read and write system calls. The last access approach depicted in Table II is mmap. Memory mapping allows the application developer to access the data via memory pointers and machine-level instructions such as load and store. In traditional storage, mmap means copying the data from the device to volatile memory, which explains the performance boost of using mmap, provided that there are recurring write accesses to the same file offset. Using mmap does not boost performance if there are no recurring accesses, because of the overhead of updating the OS page table.

PM-based file systems present a different tradeoff. First, no copy of the data is required in order to allow memory access, because the device itself can be directly accessed as memory (DAX). Second, read and write system calls return very quickly, on par with an OS page table update. Preliminary measurements showed poor performance numbers for mmap due to having to call the inefficient clflush instruction; this comparison is left to future work, as Intel has announced support for an optimized version of that instruction (clflushopt).
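A minimal sketch of the mmap access path discussed above (the file path and mapping size are assumptions): the file is mapped, updated with ordinary store instructions, and msync() is used to request persistence. On a DAX-enabled PM-based file system the same mapping references the media directly, without a page-cache copy.

    /* Memory-mapped access to a file (illustrative sketch).
     * On a traditional FS the mapping is backed by the page cache; on a
     * DAX-enabled PM-based FS the load/store instructions can reach the
     * persistent media directly, with no intermediate copy. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4096;
        int fd = open("/mnt/pmfs/state.dat", O_RDWR | O_CREAT, 0644); /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Update the data with plain store instructions, no read/write syscalls. */
        strcpy(p, "updated via a store instruction");

        /* Ask the kernel to make the mapped range persistent
         * (cf. Table II: mmap is unsafe pre-msync). */
        msync(p, len, MS_SYNC);

        munmap(p, len);
        close(fd);
        return 0;
    }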


Fig. 5. PostgreSQL server performance: throughput (TPS) and latency as a function of the number of concurrent DBT-2 client processes. PostgreSQL is configured with a 10GB shared buffer; DBT-2 uses a 200-warehouse scale and 100 connections.

VI. LEGACY APPLICATION SHOWCASE

Following our conclusions, we started research on PostgreSQL, an open source SQL database (DB). If our conclusions are correct, the superior storage-level performance, alongside the lower footprint on CPU resources, would translate to higher database performance even before further optimizing its code. The OLTP speedup achieved "out of the box", just by switching the underlying file system, is presented in Figure 5. It shows that with the PM-based FS, the average and 90th-percentile query response times improved by factors of 3.80 and 6.45 respectively, while serving 3.82 times more transactions per second. As shown by the CPU utilization breakdown in Figure 3, the performance bottleneck likely shifted to the database itself. We intend to further analyze and optimize PostgreSQL bottlenecks, to determine to what extent they may still be attributed to storage and, if so, whether they can be resolved by rewriting PostgreSQL to leverage the DAX mechanism.

VII. RELATED WORK

A lot of research effort has been invested in the new generation of storage hardware, both at the cell and the device level [1]. However, only a few papers focus on the software layers and on a methodology that could ease the task of application development. The first set of these papers focuses on the challenge of getting NVM hardware to work. Specifically, previous research considers how to flush data that should be persistent beyond the on-chip SRAM caches [13], how to implement persistent variables and data structures, and how to manage the non-volatile address space [3, 10].

In an effort to create a common framework for these and similar research efforts, two open-source projects have emerged, led by Oracle and Intel. Oracle is focusing on C language and compiler extensions for PM. Intel is developing a user-space direct-access library. Both approaches are similar in the sense that they offer additional programming features at the cost of additional complexity. Both require application developers to rewrite applications, and both are built on top of the POSIX mmap API, assuming an underlying DAX-enabled or PM-based file system. POSIX-compliant PM-based file systems can also accelerate legacy applications. Several PM-based file systems have been proposed to date, including BPFS [4] and PMFS [5]. None of these matured beyond a proof of concept.

The above-mentioned papers focus on the opportunity to bring extreme performance to legacy or newly developed applications. We agree that boosting performance is an important goal, but there is also an opportunity for ease of programmability, because most applications will no longer be storage bound even if programmers use storage in a naive way (naive usage implies not paying attention to block sizes, Async usage, or sequential access patterns).

VIII. CONCLUSION

Most storage solutions involve some compromise between performance and consistency of data, requiring application developers to make decisions involving sometimes complex tradeoffs. In this paper, we show that PM and PM-based FSs can be game-changing technologies for both users and application developers. PM-based file systems offer outstanding performance without requiring complex integration at the system and application level. Traditional tradeoffs between POSIX Direct I/O, direct memory map, and indirect synchronous versus asynchronous I/O are far less significant, potentially leading to simpler, more efficient and more robust programs. We further illustrate how the performance bottleneck shifts from storage to other parts of the system, including the application itself, and call on application developers, such as traditional database developers, to resolve the thus-far-hidden inefficiencies. We further call on IMDB developers to return to using standard storage infrastructure, as it can now easily handle the small and write-intensive accesses that it previously could not.

REFERENCES

[1] A. M. Caulfield et al. "Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing". In: Proc. of the ACM/IEEE Int'l Conf. for High Performance Computing, Networking, Storage and Analysis. 2010, pp. 1–11.

[2] N. Christiansen. Storage Class Memory in Windows. 2016. URL: http://www.snia.org/sites/default/files/NVM/2016/presentations/Neal%20Christiansen SCM in Windows NVM Summit.pdf.
[3] J. Coburn et al. "NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories". In: ACM SIGARCH Computer Architecture News 39.1 (2011), pp. 105–118.
[4] J. Condit et al. "Better I/O through byte-addressable, persistent memory". In: Proc. of the 22nd ACM Symp. on Operating Systems Principles. 2009, pp. 133–146.
[5] S. R. Dulloor et al. "System software for persistent memory". In: Proc. of the 9th European Conf. on Computer Systems. 2014, pp. 1–15.
[6] Intel. A Revolutionary Breakthrough in Memory Technology. 3D XPoint launch keynote. 2015. URL: http://www.intel.com/newsroom/kits/nvm/3dxpoint/pdfs/Launch Keynote.pdf.
[7] Y. Kim et al. "Multilevel Spin-Orbit Torque MRAMs". IEEE Transactions on Electron Devices 62.2 (2015), pp. 561–568.
[8] W. Lai, Y. Fan, and X. Meng. "Scan and join optimization by exploiting internal parallelism of flash-based solid state drives". In: Web-Age Information Management. 2013, pp. 381–392.
[9] M. Larabel. ACPI 6 Non-Volatile Memory Device Support / NFIT / LIBND For Linux. 2015. URL: https://www.phoronix.com/scan.php?page=news_item&px=ACPI-6.0-Libnd-NVDIMM-Moving.
[10] I. Moraru et al. "Consistent, durable, and safe memory management for byte-addressable non-volatile main memory". In: Conf. on Timely Results in Operating Systems (TRIOS 13). 2013.
[11] H. Plattner and A. Zeier. In-Memory Data Management: Technology and Applications. Springer Science & Business Media, 2012.
[12] SNIA. The 3rd SNIA NVM Summit. 2016. URL: http://www.snia.org/events/non-volatile-memory-nvm-summit.
[13] H. Volos, A. J. Tack, and M. M. Swift. "Mnemosyne: Lightweight persistent memory". In: ACM SIGARCH Computer Architecture News 39.1 (2011), pp. 91–104.
[14] M. Wilcox. Add support for NV-DIMMs to ext4. 2014. URL: http://lwn.net/Articles/613384.
[15] M. Wong. DBT-2: Open Source Development Labs Database Test 2. 2014. URL: https://sourceforge.net/p/osdldbt/dbt2/ci/master/tree/.
[16] Z. Zhang et al. "All-Metal-Nitride RRAM Devices". IEEE Electron Device Letters 36.1 (2015), pp. 29–31.