Build ETL efficiently (10x) with Minimal Logging

Build ETL efficiently (10x) with Minimal Logging Simon Cho Blog : Simonsql.com [email protected]

Agenda

- Quick review
  - SARG
  - Index access methods
  - Tipping Point
  - Recovery Model
- Case 1: What's the best way to pull this table?
- Introduction to minimal logging
  - What is minimal logging?
  - How does it work?
  - Conditions

Question 1 – SARG: How do they differ?

--Query1: the expression is on the variable side, so the predicate is sargable
DECLARE @BaseAccountID BIGINT = 1
SELECT * FROM DBO.[Accounts] WHERE AccountID = @BaseAccountID + 1000
GO

--Query2: the expression is applied to the column, so the predicate is NOT sargable
DECLARE @BaseAccountID BIGINT = 1
SELECT * FROM DBO.[Accounts] WHERE AccountID - 1000 = @BaseAccountID
GO

Query1 can seek on the index; Query2 must scan, because once the column is wrapped in an expression, SQL Server cannot use the index on AccountID.

Question 2 – SARG: How do they differ?

(AccountID: BIGINT, AccountName: VARCHAR)

--Query1: comparing a BIGINT column to a string literal
SELECT AccountID, AccountName FROM DBO.[Accounts] WHERE AccountID = 1000
SELECT AccountID, DCID FROM DBO.[Accounts] WHERE AccountID = '1000'
GO

--Query2: comparing a VARCHAR column to a numeric literal
SELECT AccountID, AccountName FROM DBO.[Accounts] WHERE AccountName = 1000
SELECT AccountID, AccountName FROM DBO.[Accounts] WHERE AccountName = '1000'
GO

By data type precedence, the lower-precedence side is converted. In Query1 the string '1000' is converted to BIGINT, so the predicate stays sargable. In Query2, comparing AccountName to the number 1000 forces the column itself to be converted, which is not sargable (and can even fail with a conversion error).

SARG - Search Arguments

- Sargability: Why %string% Is Slow
  - Is 'string%' a sargable argument?
  - https://www.brentozar.com/archive/2010/06/sargable-why-string-is-slow/
- CAST and CONVERT (Transact-SQL)
  - https://msdn.microsoft.com/en-us/library/ms187928.aspx
- Data Type Precedence (Transact-SQL)
  - https://msdn.microsoft.com/en-us/library/ms190309.aspx
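As a small sketch of the sargability rules referenced above (table and column names are hypothetical):

```sql
-- Sargable: the column is left bare, so an index on AccountName can seek
SELECT AccountID, AccountName
FROM dbo.Accounts
WHERE AccountName LIKE 'Smith%';    -- leading-prefix match, index seek possible

-- Not sargable: a leading wildcard means every value must be examined
SELECT AccountID, AccountName
FROM dbo.Accounts
WHERE AccountName LIKE '%Smith%';   -- index or table scan

-- Not sargable: wrapping the column in a function hides it from the optimizer
SELECT AccountID, AccountName
FROM dbo.Accounts
WHERE UPPER(AccountName) = 'SMITH'; -- scan; prefer a case-insensitive collation
```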

One more question

--Query1: a literal date, so the optimizer can use the statistics histogram for an accurate estimate
SELECT * FROM DBO.[Accounts] WHERE Registered_Datetime > '2016-02-11'

--Query2: a local variable, whose value is unknown at compile time, so the optimizer falls back to a fixed guess (roughly 30% of the rows for an inequality)
DECLARE @Registered_datetime DATETIME = '2016-02-11'
SELECT * FROM DBO.[Accounts] WHERE Registered_Datetime > @Registered_datetime
GO
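If the variable form is required, one hedged workaround is OPTION (RECOMPILE), which lets the optimizer see the variable's runtime value (a sketch, reusing the slide's table and column names):

```sql
DECLARE @Registered_datetime DATETIME = '2016-02-11';

-- With RECOMPILE, the statement is compiled after @Registered_datetime is
-- assigned, so the optimizer estimates from the histogram as if it were a
-- literal (at the cost of a compile on every execution).
SELECT *
FROM dbo.[Accounts]
WHERE Registered_Datetime > @Registered_datetime
OPTION (RECOMPILE);
```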

Let's talk about index access methods

- There are only 3 methods, regardless of clustered or nonclustered:
  - (1) Index Scan (execution plan shows: Table Scan or Index Scan)
  - (2) Index Seek: individual key lookup (execution plan shows: Index Seek)
  - (3) Index Seek: Ordered Partial Scan (execution plan shows: Index Seek)

(1) Table Scan / Index Scan

SELECT orderid, custid, empid, shipperid, orderdate
FROM dbo.Orders;

(2) Index Seek: individual key lookup

SELECT * FROM [Account].[Accounts]
WHERE AccountID IN (100, 200, 300)

(3) Index Seek: Ordered Partial Scan

SELECT orderid, custid, empid, shipperid, orderdate
FROM dbo.Orders
WHERE orderdate = '20060212';

Combination: Partial Scan + Index Seek

SELECT orderid, custid, empid, shipperid, orderdate
FROM dbo.Orders
WHERE orderid BETWEEN 101 AND 120;

Let's talk about index access methods

- What's the most efficient method for ETL?
  - (1) Index Scan
  - (2) Index Seek - individual key lookup
  - (3) Index Seek - Ordered Partial Scan
- Book: Inside Microsoft SQL Server 2005: T-SQL Querying, by Itzik Ben-Gan, Chapter 3 - a must-read.

What's the problem of Index Seek?

CREATE TABLE #tmp (AccountID BIGINT)
GO
INSERT INTO #tmp (AccountID) SELECT 1
GO 1000

SELECT a.*
FROM [dbo].[Accounts] a
JOIN #tmp b ON a.accountid = b.accountid

It can read the same page multiple times!

What's the problem of Index Seek?

--Query A: one seek per row in #tmp (1,000 seeks against the same key)
SELECT a.*
FROM dbo.[Accounts] a
JOIN #tmp b ON a.accountid = b.accountid

--Query B: a numbers table generates the 1,000 rows; Accounts is sought once
SELECT a.*
FROM dbo.[Accounts] a
JOIN dbo.Num b ON b.n BETWEEN 1 AND 1000
WHERE a.AccountID = 1

Query A statistics:
(1000 row(s) affected)
Table 'Accounts'. Scan count 0, logical reads 3071, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tmp'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Query B statistics:
(1000 row(s) affected)
Table 'Num'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So, what's the best case for data extraction?

- Use a partial scan instead of individual record seeks.
- What do we need for this? A covering, ordered index.
- If that's not an option, what should we do: an unordered scan (table scan) or individual seeks?
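A sketch of the "covering + ordered" idea (index and column names are hypothetical): a key that orders the rows the way the ETL reads them, plus included columns that cover the query, allows an ordered partial scan with no lookups.

```sql
-- Key on the range column used by the extract; INCLUDE covers the SELECT list
CREATE NONCLUSTERED INDEX IX_Accounts_Registered
ON dbo.Accounts (Registered_Datetime)
INCLUDE (AccountID, AccountName);

-- The daily delta pull can now be satisfied by an ordered partial scan
SELECT AccountID, AccountName, Registered_Datetime
FROM dbo.Accounts
WHERE Registered_Datetime >= '2016-02-11'
  AND Registered_Datetime <  '2016-02-12';
```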

Tipping Point  That’s we call Tipping Point.  Ref) Kimberly – SQL Pass 2010 : Indexing Strategies  It's the point where the number of rows returned is "no longer selective enough". SQL Server chooses NOT to use the nonclustered index to look up the corresponding data rows and instead performs a table scan.

 What % of data do you guess?

Tipping Point  Tipping Point Query #1  Table with 1 million rows over 50,000 pages (20 r/p)  12,500 – 16,666 pages ஃ rows = 1.25-1.66%

 Tipping Point Query #2  Table with 1 million rows over 10,000 pages (100 r/p)  2,500 – 3,333 pages ஃ rows = .25-.33%

 Tipping Point Query #3  Table with 1 million rows over 100,000 pages (10 r/p)  25,000 – 33,333 pages ஃ rows = 12.5-16.66%

 Note : This is SQL 2005 testing. And this is not fragmented index.  It depend on a lot of things. System, SQL version, Fragmetation and e.t.c.  So, No exact number for determine.  In my experience, generally greater than 1% of data, Table Scan is better. http://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/
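The 1/4-to-1/3-of-pages rule above can be estimated for any table from its page count (a sketch; the table name is a placeholder for your own):

```sql
-- Estimate the tipping-point row range for a table: roughly 1/4 to 1/3
-- of its in-row data-page count, expressed as a number of rows.
SELECT
    t.name                             AS table_name,
    SUM(ps.in_row_data_page_count)     AS data_pages,
    SUM(ps.in_row_data_page_count) / 4 AS tipping_low_rows,
    SUM(ps.in_row_data_page_count) / 3 AS tipping_high_rows
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables AS t ON t.object_id = ps.object_id
WHERE t.name = 'Accounts'      -- hypothetical table name
  AND ps.index_id IN (0, 1)    -- heap or clustered index only
GROUP BY t.name;
```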

Tipping Point  What’s exactly it’s affected?  Directly affected  Index Seek vs Scan for data driven.  Book mark lookup or Cluster scan.

 Indirectly related for ETL  pull whole data vs just Delta data.  Need to consider write as well. We can utilize minimal logging.

 Update bulk records or Create new table.  Generally 5-10% data change is the Tipping point for ETL.  About 20% data change, rebuild is better most likely.

What is the Recovery Model?

- Full
  - Log backups required.
  - Everything is fully logged.
  - Note: starting with SQL 2008, some statements can be logged more efficiently even in full recovery mode.
- Simple
  - Automatically reclaims log space; no log backups.
  - Log chain is broken - not able to restore point-in-time.
  - Fully supports minimally logged operations.
- Bulk_Logged
  - Log backups required.
  - Note: the log backup size is pretty much the same as in full recovery mode, even for minimally logged operations.
  - Fully supports minimally logged operations.
  - Log chain is NOT broken.
  - Purpose: temporarily switch the recovery model to run minimally logged statements in a full-recovery database.
  - Note: unfortunately, a mirrored database can't be switched.

Case 1  Source Table(Source_A)  1.5 GB table 20 M records

 Target Table(Target_B)  1 Clustered Index + 3 NonClustered Indexes  Daily Base

 Goal  Daily base refresh Target Table.  Pull data using Linked Server

Case 1 - Which method?

- Method 1
  - Drop the clustered index
  - Drop all nonclustered indexes
  - Truncate table Target_B
  - Insert Target_B select * from Source_A
  - Create the clustered index
  - Create all nonclustered indexes
- Method 2
  - Drop table Target_B
  - Create table Target_B
  - Create the clustered index
  - Create all nonclustered indexes
  - Insert Target_B select * from Source_A
- Method 3 - simple is best!
  - Truncate table Target_B
  - Insert Target_B select * from Source_A
- Or do I have a better one?

Case 1 - Which method?

- Method 4
  - Truncate table
  - Drop all nonclustered indexes
  - Drop the clustered index
  - Insert Target_B with (TABLOCK) select * from Source_A
  - Create the clustered index
  - Create all nonclustered indexes
- Method 5
  - Truncate table
  - Drop all nonclustered indexes (the clustered index remains)
  - Insert Target_B with (TABLOCK) select * from Source_A
  - Create all nonclustered indexes
- Method 6
  - Drop table Target_B
  - Select * into Target_B from Source_A
  - Create the clustered index
  - Create all nonclustered indexes
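Method 5 might be sketched like this (the index names, columns, and linked-server path are hypothetical placeholders for the real Target_B schema):

```sql
TRUNCATE TABLE dbo.Target_B;

-- Drop only the nonclustered indexes; the clustered index stays in place
DROP INDEX IX_Target_B_Col1 ON dbo.Target_B;
DROP INDEX IX_Target_B_Col2 ON dbo.Target_B;
DROP INDEX IX_Target_B_Col3 ON dbo.Target_B;

-- Empty table + clustered index only + TABLOCK:
-- both data and index pages can be minimally logged
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM LinkedSrv.SourceDB.dbo.Source_A;

-- Rebuild the nonclustered indexes (index builds are themselves minimally
-- logged under the simple or bulk-logged recovery model)
CREATE NONCLUSTERED INDEX IX_Target_B_Col1 ON dbo.Target_B (Col1);
CREATE NONCLUSTERED INDEX IX_Target_B_Col2 ON dbo.Target_B (Col2);
CREATE NONCLUSTERED INDEX IX_Target_B_Col3 ON dbo.Target_B (Col3);
```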

Scenario 1 Summary (1 million rows = 80 MB, SSD, linked server)

Recovery | Method | Seconds | Log Count | Log Length (bytes) | Compare
---------|--------|---------|-----------|--------------------|----------
Simple   | 1      | 23      | 1,121,768 | 183,817,760        | 80 times
Simple   | 2,3    | 27      | 4,346,543 | 567,734,748        | 250 times
Simple   | 4,5,6  | 11      | 32,823    | 2,304,100          | 1
Bulk     | 1      | 22      | 1,126,958 | 184,106,236        | 80 times
Bulk     | 2,3    | 27      | 4,346,495 | 567,733,504        | 250 times
Bulk     | 4,5,6  | 11      | 37,521    | 2,573,788          | 1
Full     | 1      | 23      | 1,174,056 | 523,121,874        | 230 times
Full     | 2,3    | 29      | 4,372,115 | 569,862,522        | 250 times
Full     | 4,5,6  | 11      | 69,618    | 308,086,556        | 130 times

Note: how to check the log count and log length:
1. CHECKPOINT (simple recovery) or a log backup (full/bulk_logged), to clear the previously active log.
2. Execute:
   SELECT COUNT(*) AS LogRecordCount, SUM([Log Record Length])
   FROM sys.fn_dblog(NULL, NULL)
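Putting that note together, a measurement harness under the simple recovery model might look like this (table names are hypothetical):

```sql
-- 1. Flush the log so fn_dblog shows mostly the operation under test
CHECKPOINT;

-- 2. Run the operation being measured
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;

-- 3. Count the log records and bytes the operation generated
SELECT COUNT(*)                 AS LogRecordCount,
       SUM([Log Record Length]) AS LogLengthBytes
FROM sys.fn_dblog(NULL, NULL);
```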

Maybe someone will say something like this!

Somebody says my test is wrong: "SSIS is super fast!! Do not use a linked server!" So, I tested Method 3 with an SSIS package.

Recovery | Method | Seconds | Log Count | Log Length (bytes) | Compare
---------|--------|---------|-----------|--------------------|----------
Simple   | 3      | 24      | 4,273,104 | 557,659,212        | 250 times
Bulk     | 3      | 24      | 4,315,214 | 563,119,236        | 250 times
Full     | 3      | 24      | 4,340,768 | 567,257,152        | 250 times

No difference!

Somebody says the duration doesn't differ

- So are we OK to use the old method?
- Yes, I'm quite impressed with SSD. So I tested with a USB external disk instead.
- 20 million rows, 1.5 GB; only the slowest (Method 3) and the fastest (Method 4).

Recovery | Method | Seconds | Log Count  | Log Length (bytes) | Compare
---------|--------|---------|------------|--------------------|----------
Simple   | 3      | 1,705   | 86,924,396 | 11,354,768,774     | 130 times
Simple   | 4      | 610     | 1,100,271  | 84,231,150         | 1

Still impressive: the duration doesn't differ as much as I expected (I expected about a 20x difference). But I'm pretty sure that if you're dealing with larger data, you will see a much bigger difference. E.g., in my experience, a big ETL job that took 24 hours came down to only 45 minutes.

Let's look at the detailed numbers (20 M rows, 1.5 GB) - Method 3:

- 1,005,690 physical writes: 1,005,690 * 8 KB / 1024 / 1024 = 7.9 GB
- 727,303 physical reads: 727,303 * 8 KB / 1024 / 1024 = 5.5 GB
- 1,298,861 ms CPU time: 1,298,861 / 1000 = 1,298 sec
- 554,488,821 logical reads (buffer pool): 554,488,821 * 8 KB / 1024 / 1024 = 4,230 GB

All that just for a 1.5 GB table! Compared with Method 4: 8 times more log created on disk, 6 times the CPU, 1.5 times the reads, 2 times the writes, and 250 times the buffer pool reads.

I know how you feel!

Full Logging Operation

- Everything is logged at the individual row level.

Minimal Logging Operation

- Some call it "no-logging" - technically not true.
- It does not log every individual row change; it logs only enough information to roll back.
- It only logs extent allocations, each time a new extent is allocated to the table.

Minimal Logging Prerequisites

http://technet.microsoft.com/en-us/library/ms190422%28v=sql.100%29.aspx

- For a database under the full recovery model, all row-insert operations performed by bulk import are fully logged in the transaction log.
  - Note: from SQL 2008 on, this is improved with TABLOCK - in full recovery mode, logging happens at the page-allocation level.
- CDC or replication: always fully logged.

What can be minimally logged

With the prerequisites met, these operations can be minimally logged:

- SELECT INTO
- Bulk import: BULK INSERT, bcp
- CREATE / ALTER / DROP INDEX
- INSERT INTO ... Table WITH (TABLOCK) SELECT ...
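For instance, the first and last items in that list might look like this sketch (table names are hypothetical):

```sql
-- SELECT INTO: creates the heap and loads it, minimally logged under the
-- simple or bulk-logged recovery model
SELECT *
INTO dbo.Target_B_staging
FROM dbo.Source_A;

-- INSERT ... WITH (TABLOCK): minimally logged into a heap (or an empty
-- clustered table) when the prerequisites on the next slide are met
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;
```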

Are you concerned about the table lock?

- Minimal logging requires a table lock while the data is being inserted.
- A table lock basically blocks other update/insert operations.
- READ COMMITTED readers can't read the data either; dirty reads (NOLOCK) are still OK.
- So, not many developers like the blocking.
- * SELECT * INTO takes a schema modification (Sch-M) lock, which does not allow even dirty reads.

Are you concerned about TABLOCK?

- This is the SSIS package's default setting (Data Flow - Destination, "Table lock" checked).

Prerequisites of minimal logging

# Table requirements for minimally logged bulk-import operations

- The table is not being replicated.
- Table locking is specified (using TABLOCK).
- Database recovery model: bulk-logged or simple.
  - Note: the bulk-logged recovery model is designed to temporarily replace the full recovery model during large bulk operations.
- No indexes (heap): data pages are minimally logged, whether or not the table is empty.
- Nonclustered indexes only: data pages are minimally logged.
  - If the table is empty, index pages are minimally logged as well.
  - If the table is non-empty, index pages are fully logged.
  - Ex) If you start with an empty table and bulk import the data in multiple batches, both index and data pages are minimally logged for the first batch, but beginning with the second batch, only data pages are minimally logged.
- Clustered index, no nonclustered indexes, empty table: both data and index pages are minimally logged.
  - If the table is non-empty, data pages and index pages are both fully logged.
  - Ex) Same pattern: minimal for the first batch into an empty table, then only data pages are bulk-logged from the second batch onwards.
- Clustered index and nonclustered indexes: fully logged.

https://technet.microsoft.com/en-us/library/ms190422(v=sql.120).aspx

Minimal Logging Operation

Indexes         | Rows in table | Hints              | Logging
----------------|---------------|--------------------|--------
Heap            | Any           | TABLOCK            | Minimal
Heap            | Any           | None               | Full
Heap + Index    | Any           | TABLOCK            | Full
Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal
Cluster         | Empty         | None               | Full
Cluster         | Any           | None               | Full
Cluster         | Any           | TABLOCK            | Full
Cluster + Index | Any           | None               | Full
Cluster + Index | Any           | TABLOCK            | Full

http://technet.microsoft.com/en-us/library/dd425070%28v=sql.100%29.aspx

(1) If you are using the INSERT ... SELECT method, the ORDER hint does not have to be specified, but the rows must be in the same order as the clustered index. If using BULK INSERT, the ORDER hint must be used.
(2) Concurrent loads are only possible under certain conditions; see "Bulk Loading with the Indexes in Place". Also, only rows written to newly allocated pages are minimally logged.
(3) Depending on the plan chosen by the optimizer, the nonclustered index on the table may be either fully or minimally logged.

Bonus - introducing Trace Flag 610

- SQL 2008 and above.
- When the bulk load operation causes a new page to be allocated, all of the rows sequentially filling that new page are minimally logged.
- https://simonsql.com/2011/12/05/minimal-logging-operation-with-traceflag-610/
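Enabling the trace flag might look like this sketch (shown instance-wide with -1; verify the logging behavior on your own workload before relying on it):

```sql
-- Enable trace flag 610 for the whole instance (-1); SQL 2008+
DBCC TRACEON (610, -1);

-- ... run the bulk load ...

-- Check which trace flags are currently active
DBCC TRACESTATUS (610);

-- Turn it back off when done
DBCC TRACEOFF (610, -1);
```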

Minimal Logging Operation - TF 610

Table Indexes   | Rows in table | Hints              | Without TF 610 | With TF 610 | Concurrent possible
----------------|---------------|--------------------|----------------|-------------|--------------------
Heap            | Any           | TABLOCK            | Minimal        | Minimal     | Yes
Heap            | Any           | None               | Full           | Full        | Yes
Heap + Index    | Any           | TABLOCK            | Full           | Depends (3) | No
Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal        | Minimal     | No
Cluster         | Empty         | None               | Full           | Minimal     | Yes (2)
Cluster         | Any           | None               | Full           | Minimal     | Yes (2)
Cluster         | Any           | TABLOCK            | Full           | Minimal     | No
Cluster + Index | Any           | None               | Full           | Depends (3) | Yes (2)
Cluster + Index | Any           | TABLOCK            | Full           | Depends (3) | No

https://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx

Reference  Minimal Logging Reference  http://simonsql.com/2011/12/05/minimal-logging-operationwith-traceflag-610/  http://technet.microsoft.com/enus/library/ms191244(v=sql.100).aspx  http://msdn.microsoft.com/en-us/library/dd425070.aspx  http://www.sqlservercentral.com/articles/Administration/1008 56/

 Considerations for Switching from the Full or BulkLogged Recovery Model  http://technet.microsoft.com/enus/library/ms190203(v=sql.105).aspx

Q & A

Simon Cho
Blog: Simonsql.com
[email protected]