Query Optimization Using Column Statistics in Hive

Query Optimization Using Column Statistics in Hive A presentation by Anja Gruenheid Associated Professors: Edward Omiecinski & Leo Mark School of Com...
19 downloads 0 Views 400KB Size
Query Optimization Using Column Statistics in Hive A presentation by

Anja Gruenheid Associated Professors: Edward Omiecinski & Leo Mark School of Computer Science Georgia Institute of Technology

September 21 2011

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Outline 1

Motivation

2

Apache Hive

3

Query Optimization

4

Query Optimization in Hive Statistics to Improve Query Performance Column Statistics in Apache Hive

5

Evaluation

6

Conclusion 2 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Motivation MapReduce MapReduce solutions have spread across industry providing a file-based computation platform Advantages Distributed solution Scales extremely well Aggregation methods are comparatively faster than in most DBMS Solution:

Disadvantages Provides no query functionality Selection methods are comparatively slower than in most DBMS

Framework on top of MapReduce solution to simulate data warehouse 3 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Motivation Apache Hive OLAP solution on top of the Hadoop MapReduce framework Originally developed for Facebook, but has spread to other companies It provides: SQL-like language called HiveSQL Aggregation, selection and projection functionality Possibility to store metadata in a connected RDBMS Drawbacks: Open-source project in development Query processing still relatively slow 4 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Apache Hive send query

Driver

execute

Execution Engine

send plan

get plan

UI

send metadata

Compiler

MetaStore get metadata

Metadata is stored in separate database Statistics are accessed during runtime to generate left-deep query execution plans Queries are executed on MapReduce engine 5 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Query Optimization What is Query Optimization? Improvement of performance or computational cost etc. of a query How and why the query is improved, is case and system-dependent Here: Improvement aims to increase query performance How can query performance be improved given the MapReduce environment? Use metadata statistics to further optimize query plan Use MapReduce specific cost plans 6 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Statistics to Improve Query Performance Column Statistics in Apache Hive

Statistics to Improve Query Performance Statistics for database tables are used to provide a better query execution plan. are updated either every time a value is updated or in intervals. Column Level Statistics: Advantages lead to good approximations of table sizes

Disadvantages have to be updated regularly or else not useful

can easily be stored in a database system 7 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Statistics to Improve Query Performance Column Statistics in Apache Hive

Column vs. Table Statistics Example: customer ./ orders ./ lineitem table customer orders lineitem

size 150 1,500 6,000

In table lineitem, there exists a column constraint that limits the number of tuples to 20.

Table Statistics: Proposed join order is customer ./ orders ./ lineitem Column Statistics: Proposed join order is lineitem ./ orders ./ customer 8 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Statistics to Improve Query Performance Column Statistics in Apache Hive

Column vs. Table Statistics Example: customer ./ orders ./ σcol=const lineitem table customer orders lineitem

size 150 1,500 6,000

In table lineitem, there exists a column constraint that limits the number of tuples to 20.

Table Statistics: Proposed join order is customer ./ orders ./ lineitem Column Statistics: Proposed join order is lineitem ./ orders ./ customer 9 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Statistics to Improve Query Performance Column Statistics in Apache Hive

Column Statistics in Apache Hive I Statistics in Apache Hive are only implemented on the table but not column level. Column level statistics to be collected: distinct count of tuples per column number of null values per column minimum value per column maximum value per column highest frequency of a value per column

10 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Statistics to Improve Query Performance Column Statistics in Apache Hive

Column Statistics in Apache Hive II Cost functions: Cost function ( as used in a relational database system: 0 if T is a leaf relation CCA (T ) = |T | + CCA (T1 ) + CCA (T2 ) if T = T1 ./ T2 Cost function adapted to the MapReduce framework: P CMRA (T ) = in−1 n Ti | = 2 |Ti−1 o Why is an adapted cost function necessary? Classic cost functions only consider costly I/O operations in case of a tablescan MapReduce framework leads to I/O operations between reduce & map phases 11 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Evaluation I: Experimental Setup Experimental query: orders o ./o.o_custkey =c.c_custkey customer c ./o.o_orderkey =l.l_orderkey (σl_shipdate=0 1995−01−010 lineitem) l ./c.c_nationkey =n.n_nationkey nation n Join Graph:

n

c

o

l

12 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Evaluation II: Join Order Join Graph:

n

c

o

l

Proposed Join Ordering: CCA :

nation n ./c.c_nationkey =n.n_nationkey customer c ./o.o_custkey =c.c_custkey orders o ./o.o_orderkey =l.l_orderkey (σl_shipdate=0 1995−01−010 lineitem) l

CMRA :

orders o ./o.o_orderkey =l.l_orderkey (σl_shipdate=0 1995−01−010 lineitem) l ./o.o_custkey =c.c_custkey customer c ./c.c_nationkey =n.n_nationkey nation n 13 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Evaluation III: Join Order Join Graph:

n

c

Value

Highest frequency

primary keys c_nationkey o_custkey l_orderkey l_shipdate

1 6161 41 7 2707

o

l Table

Cardinality

nation customer orders lineitem

25 150,000 1,500,000 6,001,215 2707

I/O operation estimations: n ./ c ./ o:

nc = |n ∗ c| = 25 ∗ 1 ∗ 6161 nco = nc + nc ∗ 1 ∗ 41 = 6.469 ∗ 106

l ./ o ./ c:

lo = |l ∗ o| = 2707 ∗ 1 ∗ 7 loc = lo + lo ∗ 41 ∗ 1 = 7.959 ∗ 105 14 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Evaluation IV: Experimental Results

average runtime (sec)

1000 800 Original Query CA Query MRA Query

600 400 200

1GB 10GB data set size

Orig. Query Opt. Query (CA) Opt. Query (MRA)

Cost (CA) 10999896000000 416761200000 1099590730556

Cost (MRA) 1502707 1650000 5414

15 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Conclusion Column Statistics in Hive. . . are useful because they improve query performance. need a special cost function because of the underlying MapReduce framework. Future Work: Implementation of the presented algorithm in Hive Evaluation whether histograms or other statistics could improve query performance even further

16 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

Thanks for your attention! Do you have any questions?

17 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

References Jeffrey Dean and Sanjay Ghemawat MapReduce: Simplified Data Processing on Large Clusters, Operating Systems Design and Implementation, 2004. The Apache Software Foundation Hive, http://hive.apache.org/, version 0.7.0 Ashish Thusoo and Zheng Shao and Suresh Anthony and Dhruba Borthakur and Namit Jain and Joydeep Sen Sarma and Raghotham Murthy and Hao Liu Data Warehousing and Analytics Infrastructure at Facebook, ACM SIGMOD, 2010

18 / 19

Motivation Apache Hive Query Optimization Query Optimization in Hive Evaluation Conclusion

References Ashish Thusoo and Joydeep Sen Sarma and Namit Jain and Zheng Shao and Prasad Chakka and Suresh Anthony and Hao Liu and Pete Wyckoff and Raghotham Murthy Hive A Warehousing Solution Over a MapReduce Framework, VLDB, 2009 Ashish Thusoo and Joydeep Sen Sarma and Namit Jain and Zheng Shao and Prasad Chakka and Suresh Anthony and Hao Liu and Raghotham Murthy Hive - A Petabyte Scale Data Warehouse Using Hadoop, IEEE, 2010 Robert Stewart Performance and Programmability of High Level Data Parallel Processing Languages: Pig, Hive, JAQL & Java-MapReduce, Heriot-Watt University, 2010 19 / 19

Suggest Documents