A Linux Filesystem Tracing Method Using the Kprobes Linux Dynamic Instrumentation System

International Journal of Software Engineering and Its Applications Vol. 9, No. 3 (2015), pp. 95-106 http://dx.doi.org/10.14257/ijseia.2015.9.3.10

Sang-Young Cho and Hyunsu Ha
Department of Computer Science and Engineering, Hankuk University of Foreign Studies, 81, Oidae-ro, Mohyen-myen, Cheoin, Yongin, Gyeonggi, Korea, 449-791
[email protected] and [email protected]

Abstract

A storage performance analysis tool is crucial for finding performance bottlenecks in I/O storage systems and for developing efficient storage system architectures and algorithms. Because many Linux projects have evolved in parallel, many trace tools have been proposed by different authors. This work presents an integrated performance analysis tool for Linux file systems. The tool provides actual timing information for Linux file system functions. In contrast to other existing tools, it provides a filtering mechanism, a graphical interface, and system-level analysis information without imposing a heavy measurement load. It can be used by Linux developers or end-users to analyze file system layers or to measure software performance and find bottlenecks in Linux file systems.

Keywords: Linux Kernel, File System, Tracing, Performance Analysis

1. Introduction

Android is an open-source Linux-based operating platform mainly for mobile devices. Recently, the use of this platform has expanded beyond smart phones to other smart appliances such as tablets, smart TVs, set-top boxes, and even interactive consumer electronics. Although the application performance of an Android-based device can be affected by many factors, the most influential factor is the performance of the underlying I/O storage system [1]. The storage system is a collection of software layers such as SQLite, EXT4, a block device driver, and a storage media driver. Each software module of the I/O storage system has traditionally been used with hard-disk storage media. Several performance issues arose when this integrated I/O storage system was applied to battery-powered mobile devices that use NAND-based media. Therefore, much research has been devoted to improving the performance of I/O storage systems.

There are existing benchmarking tools for measuring storage system performance in various ways. The most well-known tool is IOzone [2], which provides numerous testing environments and data views with read/write speed and I/O turnaround time. IOzone shows performance results as statistical values, but it is difficult to use for Android-based systems. AndroBench [3] was developed to benchmark the storage performance of Android-based mobile devices at the application layer. These tools can show the overall performance behavior at the system level, but they lack the ability to report detailed inter- or intra-function behaviors that are useful for analyzing exact timing.

There are many function-level performance analysis tools, also known as function tracers [4-8], such as ftrace, perf, strace, dtrace, and blktrace. These tools allow users to trace almost all function-level events within the Linux kernel, including file system functions. However, they do not provide convenient interfaces to interpret or to filter the trace results, so users find it difficult to translate the trace results into useful analysis information.

The tools cannot focus only on the trace events that users are concerned with; because of the kernel logging system, they usually gather all kernel events, including dmesg events. Furthermore, each tool is specialized for a sub-domain of the kernel. To obtain system-level performance information, several tools must be used at the same time and each tool's trace information must be integrated to generate a holistic view of system-level performance data.

A storage performance analysis tool is crucial for finding performance bottlenecks in I/O storage systems and for developing efficient storage system architectures or algorithms. In this paper, we describe the design and implementation of a powerful Linux file system analysis tool that provides filtering, user access through a graphical interface, and system-level analysis information without heavy measurement overhead.

This paper is organized as follows. In Section 2, we introduce the Linux file system layers that are the target of tracing, as well as the Kprobes Linux dynamic instrumentation system. Section 3 elaborates on the design of the Linux tracing tool for analyzing the performance of Linux file system operations. Its implementation issues are discussed in Section 4. The paper is concluded in Section 5.

2. Related Studies

2.1. Linux Filesystem Layer and Write System Call Operation

Figure 1. Linux File System Layers

The Linux file system can be analyzed by investigating the function call sequence of the kernel when a read or write system call is invoked by a user process. The complete performance analysis information is acquired by collecting the timing data of every invoked function's call and return, which is then processed to draw a complete picture of the read/write process. Although the read and write system calls are both frequently invoked file-related functions in the Linux file system, write has the greatest influence on the entire storage system's performance when the file system is built on NAND-based storage media. This is because the write operation of NAND-based media is more time-consuming than the read operation, and the journaling operation is performed during write operations. Therefore, we focus on analyzing the Linux file system's write operation.

Figure 1 shows an example of the Linux file system's layers. The Linux file system has a layered structure that is divided into the VFS (Virtual File System), a concrete file system such as EXT4, the generic block layer, the I/O scheduler layer and block device driver, and the physical storage media. Various file systems can co-exist in a Linux system through the VFS interface. The generic block layer supports block-based device I/O. Each block device has its own request queue, and I/O requests are inserted into it at the I/O scheduler layer. When the storage media are eMMC or UFS devices, the MMC block device driver, MMC core, and MMC host driver are in charge of media accesses for storage I/O.

Figure 2. Write System Call Operation

Figure 2 shows that a write system call on Linux 3.11.10 invokes a series of function calls through multiple layers and finally causes the storage media to be accessed. To trace the functions of concern across these layers, actual function names are required, which can be burdensome for the developer, who must remember the precise names of all of these functions. Each layer consists of functions that carry out its part of the file operations. For instance, the ext4_writepages() function passes the divided page information of a write system call down to the I/O scheduler layer. Given how the layers are structured, it is more desirable to provide users with function lists and let them choose the functions to trace.

As shown in Figure 2, a simple write system call invokes many functions in each layer, and it is not easy to gather time information for each invoked function while simultaneously analyzing performance bottlenecks among the series of function calls. Existing tracing or analysis tools for file system behavior usually focus on specific functions of a certain layer instead of providing overall performance analysis information for file system operations.

2.2. The Kprobes Linux Dynamic Instrumentation System

A static or dynamic Linux kernel tracepoint is a kind of breakpoint that provides a hook to call a function (probe) to extract kernel behavior information [9]. Static tracepoints cannot be inserted at run-time, but they can be located at any point in the program. Dynamic tracepoints can easily be inserted or enabled during run-time, but they incur some overhead and are usually placed at function entry and exit points.

Kprobes enables us to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. The Kprobes instrumentation is built as a kernel module. Thus, rather than having to recompile and reboot the system with an instrumented kernel, a kprobe instrumentation module can be written, compiled, and loaded on the running system; there is no need to reboot. Once the instrumentation module has served its purpose, it can be unloaded and the kernel returned to its normal operation.

There are currently three types of probes: kprobes, jprobes, and kretprobes (also called return probes). A kprobe can be inserted on virtually any instruction in the kernel. A jprobe is inserted at the entry point of a kernel function and provides convenient access to the function's arguments. A return probe fires when a specified function returns. Kretprobes also provides an optional user-specified handler that runs on function entry.

The performance analysis tool of this paper relies on dynamic tracepoints. Probing overhead is not significant because our main concern lies not in debugging information such as local variable inspection, but in the functions' timing information. The main focus of this work is the execution time of each function involved in file-related operations. Therefore, we use "kretprobe" to invoke entry and return handlers for each traced kernel function. Kretprobe can be applied within a kernel module using the following function forms:

    int register_kretprobe(struct kretprobe *p);
    void unregister_kretprobe(struct kretprobe *p);

The "struct kretprobe *p" parameter refers to the information for a tracepoint given by the user. Its structure is shown in Figure 3.

    struct kretprobe {
        struct kprobe kp;
        kretprobe_handler_t handler;
        kretprobe_handler_t entry_handler;
        int maxactive;
        int nmissed;
        size_t data_size;
        struct hlist_head free_instances;
        raw_spinlock_t lock;
    };

Figure 3. Definition of Struct Kretprobe

The "handler" field refers to the custom tracepoint handler called at a registered function's return point, and "entry_handler" refers to the handler called at the function's entry point. Because the processor has to wait until each handler returns, which causes a probe effect on program execution, there are fields for controlling resource usage. To limit resource usage, "maxactive" is set to the maximum number of kretprobe instances that may be processed at the same time. If the number of simultaneous kretprobe instances exceeds the value of "maxactive", the value of "nmissed" is increased by one for every missed handler. "data_size" indicates the size of the data passed from an entry handler to the corresponding return handler, and "free_instances" holds the list of free kretprobe instances.

When a function registered through the kretprobe interface is called, a breakpoint exception occurs and the "pre_handler_kretprobe" function is executed.

This function allocates a kretprobe instance from "free_instances", calls the entry handler if one is defined, and finally calls the "arch_prepare_kretprobe" function to prepare the kretprobe process, including registering the instance in "kretprobe_inst_table". The "arch_prepare_kretprobe" function is an architecture-dependent function that replaces the return address on the stack with the address of "kretprobe_trampoline". After returning from the entry handler, the original function executes its own code until it reaches its return point. At the return point, the function jumps to the address of "kretprobe_trampoline" set as above. "kretprobe_trampoline" is an inline assembly routine that saves and restores register values in order to call "trampoline_handler", which looks up a kretprobe instance in "kretprobe_inst_table" and executes the custom return handler.
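As an illustration of this interface, the following is a minimal sketch of a kernel module that registers a kretprobe on a single function and measures its execution time with entry and return handlers. The probed symbol name (vfs_write), the per-instance data structure, and the use of pr_info for output are illustrative choices only; they are not the exact code of the tool described in this paper, which records its measurements in memory instead (see Section 4.1).

    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <linux/ktime.h>

    /* Per-invocation data passed from the entry handler to the return handler. */
    struct probe_data {
        ktime_t entry_time;
    };

    static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
        struct probe_data *d = (struct probe_data *)ri->data;

        d->entry_time = ktime_get();    /* timestamp at function entry */
        return 0;                       /* 0: arm the return probe for this call */
    }

    static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
        struct probe_data *d = (struct probe_data *)ri->data;
        s64 delta_ns = ktime_to_ns(ktime_sub(ktime_get(), d->entry_time));

        pr_info("%s took %lld ns\n", ri->rp->kp.symbol_name, delta_ns);
        return 0;
    }

    static struct kretprobe my_kretprobe = {
        .kp.symbol_name = "vfs_write",   /* example target function */
        .entry_handler  = entry_handler,
        .handler        = ret_handler,
        .data_size      = sizeof(struct probe_data),
        .maxactive      = 20,            /* concurrent instances before nmissed grows */
    };

    static int __init probe_init(void)
    {
        return register_kretprobe(&my_kretprobe);
    }

    static void __exit probe_exit(void)
    {
        unregister_kretprobe(&my_kretprobe);
        pr_info("missed probes: %d\n", my_kretprobe.nmissed);
    }

    module_init(probe_init);
    module_exit(probe_exit);
    MODULE_LICENSE("GPL");

Loading such a module arms the probe immediately, and unloading it restores the kernel to normal operation, as described above.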

3. Software Structure and Operation Scenario

Figure 4 shows the overall software structure of the Linux file system tracing environment.

Figure 4. The Overall Structure of the Tracing Environment

The software is divided into two programs: the host control program and the target system program. The host control program enables the user to configure a trace environment, including the Linux version of the target system, the kernel functions to be traced, and the test mode, which specifies which file system access functions are executed and how they are executed on the target Linux machine. The tracing environment settings are written as a text file and transmitted to the trace control program of the target system via serial communication. The host control program performs the tracing-information analysis after receiving a tracing result file from the trace control program of the target system. The viewer of the host control program presents the file system performance analysis information using visualized interfaces such as graph, grid, and tree views.

The trace control program receives the input text file and configures the corresponding trace environment based on the information in the file. The program then sequentially installs the kernel trace module, flushes the kernel page cache, executes a file access program, and removes the kernel trace module after the end of tracing.

The tracing configuration is passed to both the kernel trace module and the file access program so that the two programs work together using the same tracing information.

The kernel trace module controls the whole trace operation. The first role of the module is to initialize a global list that stores the tracing log in main memory during the tracing of file accesses. The recorded tracing information is later written to a result file to be transmitted to the host control program. The module then registers the specified kernel functions to be traced, together with handlers for entry and return, with the kernel kretprobe system using the register_kretprobe() function. The entry and return handlers are called when a registered kernel function is invoked or returns. The handlers filter out functions that are called from kernel threads or from processes other than the file access program, so the module records only the tracing information related to the file access program. The recorded log contains the invoked function name and the time information of entry and return. The collected tracing data is written to a result file after the file access program, which performs file access operations according to the tracing configuration, has finished executing. The result file is transferred to the host control program to be analyzed, and the viewer of the host control program displays the analyzed tracing information.
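The paper does not list the file access program itself. Purely as an illustration, the traced workload could be as simple as the following program, which writes a buffer to a file and forces it to storage so that the write path down to the media driver is exercised; the file name and buffer size here are arbitrary assumptions.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t size = 4096 * 256;    /* 1 MiB write, arbitrary */
        char *buf = malloc(size);
        int fd;

        if (!buf)
            return 1;
        memset(buf, 'a', size);

        fd = open("/mnt/test/trace_target.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        if (write(fd, buf, size) != (ssize_t)size)   /* enters vfs_write in the kernel */
            perror("write");
        fsync(fd);                                   /* push the data through the block layer */

        close(fd);
        free(buf);
        return 0;
    }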

4. Implementation of the Tracing Software

Two main programs were implemented: a kernel trace module operating at the Linux kernel level and a host control program running on a host PC. The kernel trace module was implemented on Fedora with Linux kernel 3.13, using the kretprobe interface to store the tracing data of the Linux file system in main memory. The host control program was developed on the .NET Framework 4 using C#.

4.1. Kernel Trace Module

The kernel trace module consists of four major parts: initializing a global list for storing tracing data; registering all the functions given in the input text file along with handler routines for measuring time at function entry and return points; preparing logging messages and linking the message data to the global linked list; and finally, unregistering the functions to free the allocated memory and record the data of the global linked list in a result text file. The main part of the module is the handler routine. The pseudo code of the return handler is shown in Figure 5.

    // function at return point
    int ret_handler(kretprobe_instance *ri)
    {
        if (calling_thread == kernel_thread)
            return 1;
        else if (calling_thread == file_access_program) {
            get_log_memory();
            get_time_ns();
            record_log_with_name_and_time();
        }
        return 0;
    }

Figure 5. Implemented Return Handler
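For concreteness, the following sketch shows one way such a return handler could append a record to the global in-memory list described in Section 3, filtering by the pid of the file access program. The helper names, record layout, pid-based filter, and per-record kmalloc(GFP_ATOMIC) allocation are assumptions made for illustration, not the paper's actual implementation (which may, for instance, pre-allocate its log memory).

    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <linux/ktime.h>
    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>
    #include <linux/string.h>

    #define FUNC_NAME_LEN 64

    /* One log record: function name, entry/return mark, timestamp in ns. */
    struct trace_log {
        char func_name[FUNC_NAME_LEN];
        char mark;                    /* 'E' for entry, 'R' for return */
        u64 time_ns;
        struct list_head list;
    };

    static LIST_HEAD(trace_log_list);          /* global list holding all records  */
    static DEFINE_SPINLOCK(trace_log_lock);    /* handlers may run on several CPUs */

    static int traced_pid;                     /* pid of the file access program,  */
    module_param(traced_pid, int, 0644);       /* set when the module is loaded    */

    static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
        struct trace_log *log;

        if (current->pid != traced_pid)        /* skip kernel threads and other    */
            return 0;                          /* processes                        */

        log = kmalloc(sizeof(*log), GFP_ATOMIC);   /* no sleeping in probe context */
        if (!log)
            return 0;

        strncpy(log->func_name, ri->rp->kp.symbol_name, FUNC_NAME_LEN - 1);
        log->func_name[FUNC_NAME_LEN - 1] = '\0';
        log->mark = 'R';
        log->time_ns = ktime_to_ns(ktime_get());

        spin_lock(&trace_log_lock);
        list_add_tail(&log->list, &trace_log_list);
        spin_unlock(&trace_log_lock);
        return 0;
    }

This handler would be registered as the "handler" member of a kretprobe, as in the earlier sketch; a corresponding entry handler would record the same information with an 'E' mark.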

Entry and return handlers are both implemented because the start and return times of a function are needed to calculate its execution time. The two handlers are identical except for their names and marks; the marks are recorded in the log to distinguish between log sources. Each handler checks whether it was called in the context of the file access program. If the context is not the file access program being tested, the handler ignores the function invocation and skips the trace logging, because the invocation is not related to the file access program. The kernel trace module is installed before the execution of the file access program and removed after the program ends. When the module is removed, the log data in the global linked list is written to a result file that is then transmitted to the host control program.

Any tracing incurs a probe effect that disturbs the original execution of the program, so exact timing cannot be obtained. In order to minimize the probe effect, the overheads due to tracing should be minimized; otherwise, they have a considerable influence on the measured result of the read/write process. The kernel trace module is based on the kretprobe interface of the Kprobes mechanism, which is known to have negligible overhead. The main work of a handler is to read the current time and record it together with the function name. We experimented with logging methods using the printk, sprintf, and strncpy functions. The experiment shows that the average execution time of printk is 2.133 microseconds in our testing environment and that of sprintf is 1.468 microseconds, whereas the average execution time of strncpy, storing the log to memory, is 53.8 nanoseconds. These findings led us to devise a memory-based logging method using strncpy, because printk or sprintf can incur heavy overhead even though printk is the basic kernel log printing mechanism. The kernel trace module maintains a global linked list for gathering the time data of the selected functions. Only time measurement, string copying, and list linking are performed in the handler, so its work is kept to a minimum. The average execution time of a single handler is 196.2 nanoseconds, which is less than one tenth of printk's function call overhead.

4.2. Host Control Program

Figure 6. User Interface of Host Control Program

The host control program initiates file system tracing and shows the tracing results in graphical views. Figure 6 shows the main window and the pop-up function selection window of the host control program.

The Kprobes mechanism requires function names in order to call entry and return handlers, and these names must be registered using the Kprobes API. This means the user should know the exact names of the functions he/she wishes to trace. It is not easy to remember all of the function names, and function names may change from version to version of the Linux kernel. The host control program therefore provides the function names of four long-term-support versions of the Linux kernel, classified according to the file system layers. The user selects a Linux version and a specific layer, and the program lists the function names of the chosen layer so that the user can select the functions to trace. The selected function names and the trace environment settings are written to an input file and transferred to the trace control program of the target system.

One of the main parts of the host control program is the result viewer. The viewer is a user-level application implemented in C# on the .NET Framework 4. The kernel trace module of the target Linux system produces a result file that consists of sequential time data for function calls. The result file is analyzed by the host control program and the viewer shows the tracing results. The analyzed data includes a call graph, the cumulative execution time of each function, and the execution time of each function itself excluding the times of the functions called inside it. The viewer information is obtained from a function call tree structure that is generated from the result file. Figure 7 shows the tree generation process for a sample result file.

Figure 7. Tree Generating Process with a Result File

As mentioned above, a result file consists of the sequential time data of the function calls. Figure 7 shows an example result file in which the traced data of three functions is recorded: starting with the "vfs_write" function, the "submit_bio" function and the "blk_q_bio" function are both called from "vfs_write". Initially, a root node is generated and the process starts to read the result file line by line. Whenever the process encounters a function name, it generates a new child node and stores the time value of that line in it. When it encounters a function name marked as a return, it overwrites the time value of the current node with the difference between the line's time value and the time value previously recorded in the node, and then goes back to the parent node. The resulting time value of a node is the execution time of the corresponding function, which is called the "cumulative time" of the function. The generated tree structure becomes the function call graph of the file access program, and each node of the tree contains the execution time of one function call.

The viewer provides self-time and cumulative-time information. Self-time is defined as the execution time of the function itself, excluding the execution times of the functions called inside it. Self-time is calculated using a post-order tree traversal. If a node is a leaf node, its cumulative time and self-time are equal, since no function is called inside it. For a non-leaf node, the self-time can be calculated after all the child nodes' cumulative times are known: the self-time of a non-leaf node is obtained by subtracting the sum of all child nodes' cumulative-time values from the cumulative time of the node. An example of calculating the self-times of tree nodes is shown in Figure 8. Consider the value "sum" beside the "vfs_write" node. After the self-time values of all children of the "vfs_write" node, "submit_bio" and "blk_q_bio", have been evaluated, the value "sum" beside the "vfs_write" node contains the summation of the self-time values of all child nodes. The self-time value of the "vfs_write" node is 16, that is, 26 minus 10.
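The following sketch illustrates the self-time calculation just described on the three-node example of Figures 7 and 8. The node layout, the children's individual cumulative times (only their sum, 10, is given in the paper), and the use of C are assumptions for illustration; the actual viewer is implemented in C# on the .NET Framework.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CHILDREN 32

    /* One node of the call tree built from a result file. */
    struct call_node {
        char name[64];
        long long cumulative_ns;     /* time between entry and return          */
        long long self_ns;           /* cumulative minus the children's times  */
        struct call_node *children[MAX_CHILDREN];
        int child_count;
    };

    static struct call_node *new_node(const char *name, long long cumulative_ns)
    {
        struct call_node *n = calloc(1, sizeof(*n));
        if (!n)
            exit(1);
        strncpy(n->name, name, sizeof(n->name) - 1);
        n->cumulative_ns = cumulative_ns;
        return n;
    }

    /* Post-order traversal: children first, then the node itself. */
    static void compute_self_time(struct call_node *n)
    {
        long long child_sum = 0;
        for (int i = 0; i < n->child_count; i++) {
            compute_self_time(n->children[i]);
            child_sum += n->children[i]->cumulative_ns;
        }
        n->self_ns = n->cumulative_ns - child_sum;   /* leaf: self == cumulative */
    }

    int main(void)
    {
        /* Example of Figures 7 and 8; the children's individual times
         * (6 and 4) are made up, only their sum (10) is given in the paper. */
        struct call_node *root = new_node("vfs_write", 26);
        root->children[root->child_count++] = new_node("submit_bio", 6);
        root->children[root->child_count++] = new_node("blk_q_bio", 4);

        compute_self_time(root);
        printf("%s: cumulative=%lld self=%lld\n",
               root->name, root->cumulative_ns, root->self_ns);  /* self = 16 */
        return 0;
    }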

Figure 8. Self-time Calculation Process

The generated function tree is based on a result text file, so repetitive calls of the same function appear as individual nodes in the tree structure. However, it is more informative for users to check the cumulative execution time of every call made to the same function than to check the individual execution times of all calls of that function. When determining the overall performance of the read/write process, we usually compare the cumulative data of the same functions across different tests. As a simple example, the whole write process of the file system can be assessed by comparing the cumulative execution time of the "vfs_write" function, since it is the top-level function of the entire execution.

Figure 9 shows the process of accumulating the execution times of every function using a map data structure. The map data structure consists of a 'Key', which is a function name, and a 'Value', which contains function invocation information. The calculation of cumulative time is performed by traversing the generated tree in post-order, starting from the root node. When visiting a new node, the algorithm checks whether the corresponding key exists in the map data structure. If the key is found, the 'Value' of the key is updated with the time information of the node, as shown in Figure 9 (c). If the corresponding key is not found, the key is registered in the map data structure and a 'Value' object is generated whose content is the node's time information.
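The map update rule can be sketched as follows, called once per node during the post-order traversal of the call tree. The actual viewer presumably uses a C# dictionary; the fixed-size linear-search table and the example values are assumptions made only to illustrate the "register on first sight, update afterwards" behavior of Figure 9.

    #include <stdio.h>
    #include <string.h>

    #define MAX_FUNCS 128

    /* 'Value' part of the map: per-function accumulated statistics. */
    struct func_stat {
        char name[64];          /* 'Key': the function name               */
        long long total_ns;     /* sum of cumulative times over all calls */
        int call_count;
    };

    static struct func_stat stats[MAX_FUNCS];
    static int stat_count;

    /* Register the function name on first sight, otherwise update its entry. */
    static void accumulate(const char *name, long long cumulative_ns)
    {
        for (int i = 0; i < stat_count; i++) {
            if (strcmp(stats[i].name, name) == 0) {
                stats[i].total_ns += cumulative_ns;
                stats[i].call_count++;
                return;
            }
        }
        if (stat_count < MAX_FUNCS) {
            strncpy(stats[stat_count].name, name, sizeof(stats[0].name) - 1);
            stats[stat_count].total_ns = cumulative_ns;
            stats[stat_count].call_count = 1;
            stat_count++;
        }
    }

    int main(void)
    {
        /* Example (made-up values): two calls to submit_bio merge into one entry. */
        accumulate("vfs_write", 26);
        accumulate("submit_bio", 6);
        accumulate("submit_bio", 4);
        for (int i = 0; i < stat_count; i++)
            printf("%s: calls=%d total=%lld\n",
                   stats[i].name, stats[i].call_count, stats[i].total_ns);
        return 0;
    }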

Figure 9. Cumulative Time Calculation Process

The two data structures, the tree and the map, are obtained from the analysis of a result file. The tree is used to obtain call graph information and individual time information for each function call. The map holds the cumulative time data per called kernel function. Using these structures, graph, grid, and tree views are provided to users for analyzing the performance of file system accesses.

4.3. Viewer Program

It is very important for a tracing tool to provide an intuitive and informative interface. The tool presented here supports three types of viewing interfaces.

The graph view interface shown in Figure 10 (a) displays the cumulative time of each function of a result file in single-mode view. This view enables the user to find time-consuming functions at a glance. It also supports a multiple-mode view in which several test results can be compared to verify algorithmic performance enhancements or differences in the performance of various file access modes.

The grid view interface of Figure 10 (b) shows the map data of a result file analysis. A file access system call in the Linux Ext4 file system is performed by three threads: the main thread that makes system calls, the jbd2 thread for journaling, and the mmcqd thread for low-level I/O. The tool in this paper can trace all the functions called by the three threads, and the traced data is displayed in three separate grids. By providing the layer information of each function, the tool lets the user easily recognize the overhead of each layer during file system access.

The tree view interface in Figure 10 (c) displays the whole call graph of all the individual function calls using the tree data structure of a result file. The call graph display can be collapsed or expanded at a function so as to hide or show all functions called by that function. This helps the user see only the part of the call graph that is of interest. A pop-up window shows node time information when the mouse is placed over a specific function name.

Figure 10. User Interface of Viewer

5. Conclusion

A storage performance analysis tool is crucial for finding performance bottlenecks in I/O storage systems and for developing efficient storage system architectures and algorithms. The performance of a storage system can be measured in various ways, and there are many existing benchmarking tools. However, some tools provide statistical results at the top level but lack detailed reporting, while other tools allow the detailed behavior of inter- and intra-file-related kernel functions to be investigated, but only within restricted portions of the kernel and through inconvenient interfaces.

In this paper, we describe the design and implementation of a powerful Linux file system analysis tool that provides filtering services, a graphical interface, and system-level analysis information without incurring heavy penalties. The tool mainly consists of an in-kernel tracing module and a host control program that runs on a host PC. The kernel trace module has been implemented on Fedora with Linux kernel 3.13, using the kretprobe interface to store tracing data from the Linux file system in main memory. The host control program is developed on the .NET Framework 4 in C#. The kernel module has low probing overhead compared to other existing tools. The host control program sets up a performance tracing environment and provides three types of performance viewers after analyzing the result file generated by the kernel trace module.

Acknowledgements This work was supported by Hankuk University of Foreign Studies Research Fund of 2014.

References

[1] S. Jeong, K. Lee, S. Lee, S. Son and Y. Won, "I/O Stack Optimization for Smartphones", Proceedings of the 2013 USENIX Annual Technical Conference, San Jose, USA, (2013), pp. 309-320.
[2] IOzone, http://www.iozone.org/.
[3] J.-M. Kim and J.-S. Kim, "AndroBench: Benchmarking the Storage Performance of Android-Based Mobile Devices", Proceedings of the Frontiers in Computer Education, Springer, Heidelberg, vol. 133, (2012), pp. 667-674.
[4] A. Spear, M. Levy and M. Desnoyers, "Using Tracing to Solve the Multicore System Debug Problem", IEEE Computer, December, (2012), pp. 60-64.
[5] D. B. de Oliveira and R. S. de Oliveira, "Comparative Analysis of Trace Tools for Real-Time Linux", IEEE Latin America Transactions, vol. 12, no. 6, September, (2014), pp. 1134-1140.
[6] Kernel Trace System, http://elinux.org/Kernel_Trace_Systems/.
[7] T. Bird, "Measuring Function Duration with Ftrace", Proceedings of the Linux Symposium, Ottawa, Canada, (2009), pp. 47-54.
[8] LTTng Documentation, http://lttng.org/docs/.
[9] A. Mavinakayanahalli, P. Panchamukhi and J. Keniston, "Probing the Guts of Kprobes", Proceedings of the Linux Symposium, vol. 2, Ottawa, Canada, (2006), pp. 109-123.

Authors

Sang-Young Cho received his B.S. degree in Control and Instrumentation from Seoul National University, Korea, in 1988, and M.S. and Ph.D. degrees in Electrical Engineering from KAIST, Daejeon, Korea, in 1990 and 1994, respectively. He is currently a full professor in the Department of Computer Science and Engineering at Hankuk University of Foreign Studies. His research interests include embedded systems, computer architecture, and software optimization.

Hyunsu Ha received his Bachelor's degree in Computer Science and Engineering at Hankuk University of Foreign Studies. He participated in designing C#-based software for a digital visualizer and a remote lighting control system, and carried out Qt framework optimization for embedded devices. His research is currently associated with the Linux file system and its benchmarking tools, and his interests include embedded systems, software optimization, and software security.
