Query Optimization Using Column Statistics in Hive A presentation by
Anja Gruenheid Associated Professors: Edward Omiecinski & Leo Mark School of Computer Science Georgia Institute of Technology
September 21 2011
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Outline 1
Motivation
2
Apache Hive
3
Query Optimization
4
Query Optimization in Hive Statistics to Improve Query Performance Column Statistics in Apache Hive
5
Evaluation
6
Conclusion 2 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Motivation MapReduce MapReduce solutions have spread across industry providing a file-based computation platform Advantages Distributed solution Scales extremely well Aggregation methods are comparatively faster than in most DBMS Solution:
Disadvantages Provides no query functionality Selection methods are comparatively slower than in most DBMS
Framework on top of MapReduce solution to simulate data warehouse 3 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Motivation Apache Hive OLAP solution on top of the Hadoop MapReduce framework Originally developed for Facebook, but has spread to other companies It provides: SQL-like language called HiveSQL Aggregation, selection and projection functionality Possibility to store metadata in a connected RDBMS Drawbacks: Open-source project in development Query processing still relatively slow 4 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Apache Hive send query
Driver
execute
Execution Engine
send plan
get plan
UI
send metadata
Compiler
MetaStore get metadata
Metadata is stored in separate database Statistics are accessed during runtime to generate left-deep query execution plans Queries are executed on MapReduce engine 5 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Query Optimization What is Query Optimization? Improvement of performance or computational cost etc. of a query How and why the query is improved, is case and system-dependent Here: Improvement aims to increase query performance How can query performance be improved given the MapReduce environment? Use metadata statistics to further optimize query plan Use MapReduce specific cost plans 6 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Statistics to Improve Query Performance Column Statistics in Apache Hive
Statistics to Improve Query Performance Statistics for database tables are used to provide a better query execution plan. are updated either every time a value is updated or in intervals. Column Level Statistics: Advantages lead to good approximations of table sizes
Disadvantages have to be updated regularly or else not useful
can easily be stored in a database system 7 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Statistics to Improve Query Performance Column Statistics in Apache Hive
Column vs. Table Statistics Example: customer ./ orders ./ lineitem table customer orders lineitem
size 150 1,500 6,000
In table lineitem, there exists a column constraint that limits the number of tuples to 20.
Table Statistics: Proposed join order is customer ./ orders ./ lineitem Column Statistics: Proposed join order is lineitem ./ orders ./ customer 8 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Statistics to Improve Query Performance Column Statistics in Apache Hive
Column vs. Table Statistics Example: customer ./ orders ./ σcol=const lineitem table customer orders lineitem
size 150 1,500 6,000
In table lineitem, there exists a column constraint that limits the number of tuples to 20.
Table Statistics: Proposed join order is customer ./ orders ./ lineitem Column Statistics: Proposed join order is lineitem ./ orders ./ customer 9 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Statistics to Improve Query Performance Column Statistics in Apache Hive
Column Statistics in Apache Hive I Statistics in Apache Hive are only implemented on the table but not column level. Column level statistics to be collected: distinct count of tuples per column number of null values per column minimum value per column maximum value per column highest frequency of a value per column
10 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Statistics to Improve Query Performance Column Statistics in Apache Hive
Column Statistics in Apache Hive II Cost functions: Cost function ( as used in a relational database system: 0 if T is a leaf relation CCA (T ) = |T | + CCA (T1 ) + CCA (T2 ) if T = T1 ./ T2 Cost function adapted to the MapReduce framework: P CMRA (T ) = in−1 n Ti | = 2 |Ti−1 o Why is an adapted cost function necessary? Classic cost functions only consider costly I/O operations in case of a tablescan MapReduce framework leads to I/O operations between reduce & map phases 11 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Evaluation I: Experimental Setup Experimental query: orders o ./o.o_custkey =c.c_custkey customer c ./o.o_orderkey =l.l_orderkey (σl_shipdate=0 1995−01−010 lineitem) l ./c.c_nationkey =n.n_nationkey nation n Join Graph:
n
c
o
l
12 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Evaluation II: Join Order Join Graph:
n
c
o
l
Proposed Join Ordering: CCA :
nation n ./c.c_nationkey =n.n_nationkey customer c ./o.o_custkey =c.c_custkey orders o ./o.o_orderkey =l.l_orderkey (σl_shipdate=0 1995−01−010 lineitem) l
CMRA :
orders o ./o.o_orderkey =l.l_orderkey (σl_shipdate=0 1995−01−010 lineitem) l ./o.o_custkey =c.c_custkey customer c ./c.c_nationkey =n.n_nationkey nation n 13 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Evaluation III: Join Order Join Graph:
n
c
Value
Highest frequency
primary keys c_nationkey o_custkey l_orderkey l_shipdate
1 6161 41 7 2707
o
l Table
Cardinality
nation customer orders lineitem
25 150,000 1,500,000 6,001,215 2707
I/O operation estimations: n ./ c ./ o:
nc = |n ∗ c| = 25 ∗ 1 ∗ 6161 nco = nc + nc ∗ 1 ∗ 41 = 6.469 ∗ 106
l ./ o ./ c:
lo = |l ∗ o| = 2707 ∗ 1 ∗ 7 loc = lo + lo ∗ 41 ∗ 1 = 7.959 ∗ 105 14 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Evaluation IV: Experimental Results
average runtime (sec)
1000 800 Original Query CA Query MRA Query
600 400 200
1GB 10GB data set size
Orig. Query Opt. Query (CA) Opt. Query (MRA)
Cost (CA) 10999896000000 416761200000 1099590730556
Cost (MRA) 1502707 1650000 5414
15 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Conclusion Column Statistics in Hive. . . are useful because they improve query performance. need a special cost function because of the underlying MapReduce framework. Future Work: Implementation of the presented algorithm in Hive Evaluation whether histograms or other statistics could improve query performance even further
16 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
Thanks for your attention! Do you have any questions?
17 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
References Jeffrey Dean and Sanjay Ghemawat MapReduce: Simplified Data Processing on Large Clusters, Operating Systems Design and Implementation, 2004. The Apache Software Foundation Hive, http://hive.apache.org/, version 0.7.0 Ashish Thusoo and Zheng Shao and Suresh Anthony and Dhruba Borthakur and Namit Jain and Joydeep Sen Sarma and Raghotham Murthy and Hao Liu Data Warehousing and Analytics Infrastructure at Facebook, ACM SIGMOD, 2010
18 / 19
Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion
References Ashish Thusoo and Joydeep Sen Sarma and Namit Jain and Zheng Shao and Prasad Chakka and Suresh Anthony and Hao Liu and Pete Wyckoff and Raghotham Murthy Hive A Warehousing Solution Over a MapReduce Framework, VLDB, 2009 Ashish Thusoo and Joydeep Sen Sarma and Namit Jain and Zheng Shao and Prasad Chakka and Suresh Anthony and Hao Liu and Raghotham Murthy Hive - A Petabyte Scale Data Warehouse Using Hadoop, IEEE, 2010 Robert Stewart Performance and Programmability of High Level Data Parallel Processing Languages: Pig, Hive, JAQL & Java-MapReduce, Heriot-Watt University, 2010 19 / 19