Build ETL efficiently (10x) with Minimal Logging Simon Cho Blog : Simonsql.com
[email protected]
Agenda Want to discuss first – Quick review
SARG Index access methods Tipping Point Recovery Model
Case 1 What’s the best way to pull this table?
Introduce for Minimal Logging What is minimal logging? How does it work? Condition
Question1 – SARG How is it difference? --Query1 DECLARE @BaseAccountID BIGINT = 1 SELECT * FROM DBO.[Accounts] WHERE AccountID = @BaseAccountID + 1000 GO
--Query2 DECLARE @BaseAccountID BIGINT = 1 SELECT * FROM DBO.[Accounts] WHERE AccountID - 1000 = @BaseAccountID GO
Question1 – SARG How is it difference?
Question2 – SARG How is it difference? --Query1 SELECT AccountID, AccountName FROM DBO.[Accounts] WHERE AccountID = 1000 SELECT AccountID, DCID FROM DBO.[Accounts] WHERE AccountID = '1000' GO --Query2 SELECT AccountID, AccountName FROM DBO.[Accounts] WHERE AccountName = 1000 SELECT AccountID, AccountName FROM DBO.[Accounts] WHERE AccountName = '1000' GO
AccontID : Bigint AccountName : Varchar
Question2 – SARG How is it difference?
SARG - Search Arguments Sargability: Why %string% Is Slow ‘string%’ is it SARG arguments? https://www.brentozar.com/archive/2010/06/sarga ble-why-string-is-slow/
CAST and CONVERT (Transact-SQL) https://msdn.microsoft.com/enus/library/ms187928.aspx
Data Type Precedence (Transact-SQL) https://msdn.microsoft.com/enus/library/ms190309.aspx
One more question --Query1 SELECT * FROM DBO.[Accounts] WHERE Registered_Datetime > '2016-02-11' --Query2 DECLARE @Registered_datetime DATETIME = '2016-02-11' SELECT * FROM DBO.[Accounts] WHERE Registered_Datetime > @Registered_datetime GO
One more question - Cont
Let’s talk about Index access methods There are only 3 methods regardless Clustered or Nonclustered (1) Index Scan Execution Plan shows : Table Scan, Index Scan
Index Seek Execution Plan shows : Index Seek (2) Individual key look up. (3) Ordered Partial Scan
(1) Table Scan/Index Scan SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.Orders;
(2) Index Seek : Individual key lookup SELECT * FROM [Account].[Accounts] where AccountID in (100,200,300)
(3) Index Seek : Ordered Partial Scan
SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.Orders WHERE orderdate = '20060212';
Combination Partial Scan + Index Seek
SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.Orders WHERE orderid BETWEEN 101 AND 120;
Let’s talk about Index access methods What’s the most efficient method for ETL? (1) Index Scan (2) Index Seek - Individual key look up. (3) Index Seek - Ordered Partial Scan
Book: Inside Microsoft® SQL Server™ 2005 T-SQL Querying By Itzik Ben-Gan “Chapter 3” – Must read.
Let’s talk about Index access methods What’s the most efficient method for ETL? (1) Index Scan (2) Index Seek - Individual key look up. (3) Index Seek - Ordered Partial Scan
Book: Inside Microsoft® SQL Server™ 2005 T-SQL Querying By Itzik Ben-Gan “Chapter 3” – Must read.
What’s the problem of Index Seek? TRUNCATE TABLE #tmp CREATE TABLE #tmp (AccountID BIGINT) GO INSERT INTO #tmp (AccountID) SELECT 1 GO 1000
SELECT a.* FROM [dbo].[Accounts] a JOIN #tmp b ON a.accountid = b.accountid
It can read same page multiple times!!
What’s the problem of Index Seek? SELECT a.* FROM dbo.[Accounts] a JOIN #tmp b ON a.accountid = b.accountid SELECT FROM JOIN WHERE
a.* dbo.[Accounts] a dbo.Num b ON n BETWEEN 1 AND 1000 AccountID=1
(1000 row(s) affected) Table 'Accounts'. Scan count 0, logical reads 3071, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#tmp______________________________________________________________________________________________________________ __00000000157D'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (1 row(s) affected)
(1000 row(s) affected) Table 'Num'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Accounts'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (1 row(s) affected)
What’s the problem of Index Seek? SELECT a.* FROM [dbo].[Accounts] a JOIN #tmp b ON a.accountid = b.accountid
SELECT FROM JOIN WHERE
a.* [dbo].[Accounts] a dbo.Num b ON n BETWEEN 1 AND 1000 a.AccountID=1
So, what’s the Best case for Data Extraction? Using Partial scan instead of individual record seek.
What we need for this? Covered + Ordered Index
If it’s not an option, what should we do? Unordered Scan(Table scan) or Individual seek?
Tipping Point That’s we call Tipping Point. Ref) Kimberly – SQL Pass 2010 : Indexing Strategies It's the point where the number of rows returned is "no longer selective enough". SQL Server chooses NOT to use the nonclustered index to look up the corresponding data rows and instead performs a table scan.
What % of data do you guess?
Tipping Point Tipping Point Query #1 Table with 1 million rows over 50,000 pages (20 r/p) 12,500 – 16,666 pages ஃ rows = 1.25-1.66%
Tipping Point Query #2 Table with 1 million rows over 10,000 pages (100 r/p) 2,500 – 3,333 pages ஃ rows = .25-.33%
Tipping Point Query #3 Table with 1 million rows over 100,000 pages (10 r/p) 25,000 – 33,333 pages ஃ rows = 12.5-16.66%
Note : This is SQL 2005 testing. And this is not fragmented index. It depend on a lot of things. System, SQL version, Fragmetation and e.t.c. So, No exact number for determine. In my experience, generally greater than 1% of data, Table Scan is better. http://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/
Tipping Point What’s exactly it’s affected? Directly affected Index Seek vs Scan for data driven. Book mark lookup or Cluster scan.
Indirectly related for ETL pull whole data vs just Delta data. Need to consider write as well. We can utilize minimal logging.
Update bulk records or Create new table. Generally 5-10% data change is the Tipping point for ETL. About 20% data change, rebuild is better most likely.
What is Recovery Mode? Full Log backup required Full logging everything Note : Starting SQL 2008, some statement can be small logged even in full recovery mode.
Simple Automatically reclaims log space. No log backup Log chain is broken – Not able to restore Point-in-time
Fully Support minimal logging operation
Bulk_Logged Log backup required Note : The log size pretty much same as full recovery mode even the minimal logging operation.
Fully Support minimal logging operation Log chain is NOT broken. Purpose : Temporary change recovery mode for Minimal logging STMT in Full logging Database. Note : Unfortunately, Mirrored database can’t changed
Case 1 Source Table(Source_A) 1.5 GB table 20 M records
Target Table(Target_B) 1 Clustered Index + 3 NonClustered Indexes Daily Base
Goal Daily base refresh Target Table. Pull data using Linked Server
Case 1 Which method? • Method 1 – – – –
Drop clustered index Drop all nonclustered indexes Truncate table Target_B Insert Target_B select * from Source_A – Create clustered index – Create all nonclustered indexes
• Method 2 – – – – –
Drop Table Target_B Create table Target_B Create clustered index Create all nonclustered index Insert Target_B select * from Source_A
Method 3 – Simple is the best! Truncate table Target_B Insert Target_B select * from Source_A
• Or I have better one(?)
Case 1 Which method? • Method 4 – – – –
Truncate table Drop all Nonclustered Indexes Drop Clustered Index Insert Target_B with(TabLock) Select * from Source_A – Create Clustered Index – Create all Nonclustered Indexes
• Method 5 – Truncate table – Drop all Nonclustered Indexes. Cluster index remains – Insert Target_B with(TabLock) Select * from Source_A – Create all Nonclustered Indexes
• Method 6 – Drop Table Target_B – Select * into Target_B from Source_A – Create Clustered Index – Create all Nonclustered Indexes
Scenario 1 Summary 1 Million = 80 MB, SSD, Linked Server Recovery
Method
Seconds
Log Count
Log Length(Bytes)
Compare
Simple
1
23
1121768
183817760
80 Times
Simple
2,3
27
4346543
567734748
250 Times
Simple
4,5,6
11
32823
2304100
1
Bulk
1
22
1126958
184106236
80 Times
Bulk
2,3
27
4346495
567733504
250 Times
Bulk
4,5,6
11
37521
2573788
1
Full
1
23
1174056
523121874
230 Times
Full
2,3
29
4372115
569862522
250 Times
Full
4,5,6
11
69618
308086556
130 Times
Note : How to check Log count and Log Length 1. CHECKPOINT or Log backup depends on Recovery mode 2. Execute below statement. SELECT COUNT(*) AS LogRecordCount, SUM([Log Record Length]) FROM sys.fn_dblog(NULL,NULL)
Maybe someone say like this!
Somebody say my test is wrong. SSIS is super fast!!. Do not use linked server! So, I tested it method 3 with SSIS package. Recovery
Metho d
Seconds
Log Count
Log Length(Bytes)
Compare
Simple
3
24
4273104
557659212
250 Times
Bulk
3
24
4315214
563119236
250 Times
Full
3
24
4340768
567257152
250 Times
Nothing Difference!!
Somebody say Duration Doesn’t differ
So, we are ok to use old method. Yes, I’m quite impressed with SSD. So that, I tested with USB external disk. 20 Million, 1.5 GB. Only for slowest(Method3) and fastest(Method4)
Recovery
Method
Simple
3
Simple
4
Secon ds
Log Count
1705 86924396
610
1100271
Log Length(Bytes)
Compare
11354768774
130 Times
84231150
1
Still impressive. Duration doesn’t that much differ. (I expected about 20 times difference) But, I’m pretty sure, if you dealing with bigger size of data, You will see a lot difference. Ex) In my experience, 24 hours big ETL job, only takes 45 Min.
Let’s look at detail number(20 M, 1.5 GB) 1,005,690 writes(Physical write) 1005690 * 8Kb/1024/1024 = 7.9 GB
727303 reads(Physical read)
Just for 1.5 GB Table!!
727303 * 8Kb/1024/1024 = 5.5 GB
1,298,861 CPU_Time 1298861/1000 = 1,298 Sec
554,488,821 Logical_reads(Buffer pool read) 554488821 * 8Kb/1024/1024 = 4230 GB
8 Times more log created on Disk
VS
6 Times CPU, 1.5 times reads, 2 times writes, 250 times Buffer pool read
I know you feel like this
Minimal Logging Operation
Full Logging Operation
Everything logged by each row level.
Minimal Logging Operation
Someone call No-Logging. Technically not true. Do not log every individual row change. Only logging enough rollback information. Only logs extent allocations each time a new extent is allocated to the table.
Minimal Logging Prerequisites http://technet.microsoft.com/en-us/library/ms190422%28v=sql.100%29.aspx
For a database under the full recovery model, all row-insert operations that are performed by bulk import are fully logged in the transaction log. Note : After SQL2008 it’s improved with TabLock
After 2008 and Above in Full recovery mode
It’s logging by Page allocation level.
CDC or Replication - Always fully logged.
What can be minimally logged With Prerequisites Below operation can be minimally logged
Select Into Bulk Import, Bulk Insert, Bcp Create/Alter/Drop Index Insert into …Table with (TabLock) select …
Do you concern with Table Lock? Minimal logging is required Table Lock during inserting data. Table Lock Basically, it’ll blocking other update/insert operation. Read committed can’t read data either. Dirty read(Nolock) is ok to read. So, not a many developer like blocking. * Select * into : Schema Modification lock is occurred. Do not allow dirty read either.
Do you concern with TabLock? This is SSIS Package is default setting. Data Flow – Destination
Prerequisites of Minimal logging
# Table Requirements for Minimally Logging Bulk-Import Operations
The table is not being replicated. Table locking is specified (using TABLOCK) Database Recovery : Bulk-logged or Simple Note: The bulk-logged recovery model is designed to temporarily replace the full recovery model during large bulk operations
No indexes (Heap). Don’t care empty or not : data pages are minimally logged.
Nonclustered indexes only : data pages are minimally logged. If the table is empty, index pages are minimally logged as well If table is non-empty, index pages are fully logged. Ex) If you start with an empty table and bulk import the data in multiple batches , both index and data pages are minimally logged for the first batch , but beginning with the second batch , only data pages are minimally logged.
Clustered Index, No Nonclustered index, and empty table :
both data and index pages are minimally logged. If a table is non-empty, data pages and index pages are both fully logged. EX) If you start with an empty table and bulk import the data in batches , both index and data pages are minimally logged for the first batch , but from the second batch onwards, only data pages are bulk logged.
Clustered Index and NonClustered index : Fully logged. https://technet.microsoft.com/en-us/library/ms190422(v=sql.120).aspx
Minimal Logging Operation Indexes
Rows in table
Hints
Logging
Heap
Any
TABLOCK
Minimal
Heap
Any
None
Full
Heap + Index
Any
TABLOCK
Full
Cluster
Empty
TABLOCK, ORDER (1)
Minimal
Cluster
Empty
None
Full
Cluster
Any
None
Full
Cluster
Any
TABLOCK
Full
Cluster + Index
Any
None
Full
Cluster + Index
Any
TABLOCK
Full
http://technet.microsoft.com/en-us/library/dd425070%28v=sql.100%29.aspx (1) If you are using the INSERT … SELECT method, the ORDER hint does not have to be specified, but the rows must be in the same order as the clustered index. If using BULK INSERT the order hint must be used. (2) Concurrent loads only possible under certain conditions. See “Bulk Loading with the Indexes in Place”. Also, only rows written to newly allocated pages are minimally logged. (3) Depending on the plan chosen by the optimizer, the nonclustered index on the table may either be fully- or minimally logged.
Bonus - Introduce about Trace Flag 610 SQL 2008 and above When the bulk load operation causes a new page to be allocated, all of the rows sequentially filling that new page are minimally logged. https://simonsql.com/2011/12/05/minimallogging-operation-with-traceflag-610/
Minimal Logging Operation-TF610 Table Indexes
Rows in table
Hints
Without TF 610
With TF 610
Concurrent possible
Heap
Any
TABLOCK
Minimal
Minimal
Yes
Heap
Any
None
Full
Full
Yes
Heap + Index
Any
TABLOCK
Full
Depends (3)
No
Cluster
Empty
TABLOCK, ORDER (1)
Minimal
Minimal
No
Cluster
Empty
None
Full
Minimal
Yes (2)
Cluster
Any
None
Full
Minimal
Yes (2)
Cluster
Any
TABLOCK
Full
Minimal
No
Cluster + Index
Any
None
Full
Depends (3)
Yes (2)
Cluster + Index
Any
TABLOCK
Full
Depends (3)
No
https://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
Reference Minimal Logging Reference http://simonsql.com/2011/12/05/minimal-logging-operationwith-traceflag-610/ http://technet.microsoft.com/enus/library/ms191244(v=sql.100).aspx http://msdn.microsoft.com/en-us/library/dd425070.aspx http://www.sqlservercentral.com/articles/Administration/1008 56/
Considerations for Switching from the Full or BulkLogged Recovery Model http://technet.microsoft.com/enus/library/ms190203(v=sql.105).aspx
Q &A Simon Cho Blog : Simonsql.com
[email protected]