Azure-consistent Object Storage in Microsoft Azure Stack
Ali Turkoglu, Principal Software Engineering Manager, Microsoft
Mallikarjun Chadalapaka, Principal Program Manager, Microsoft
2016 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
Agenda
Context, Solution Stack, Architecture, ARM & SRP
ACS Architecture Deep Dive
Blob Service Architecture & Design
Questions/Discussion
Azure-consistent storage
Cloud storage for Azure Stack
Azure-consistent blobs, tables, queues, and storage accounts
Administrator manageability
Enterprise-private clouds or hosted clouds from service providers
IaaS (page blobs) + PaaS (block blobs, append blobs, tables, queues)
Builds on & enhances WS2016 Software-Defined Storage (SDS) platform capabilities
Azure-consistent storage: Solution view
[Solution stack, top to bottom:]
- Application clients using the Azure Account, Blob, Table, and Queue APIs; Microsoft Azure Storage Explorer & tooling
- Administrator access via the Microsoft Azure Stack Portal, Azure Storage cmdlets, ACS cmdlets, Azure CLI, and client SDKs
- Tenant-facing storage cloud services
- Virtualized services: Resource Provider cluster, Data services cluster
- Infrastructure services: blob backends on Scale-Out File Server (SOFS) with Storage Spaces Direct (S2D)
Clustering Architecture
- Azure Service Fabric (ASF) guest clusters for cloud services
- Hyper-converged Windows Server Failover Cluster (WSFC) host fabric
- WSFC enhances ASF cluster resiliency:
  - HA VMs via Hyper-V host clustering
  - Anti-affinity policies on VMs so that all VMs never fail over to the same host
  - Application health monitoring on the Service Fabric service for timely detection of service hangs
- WSFC & CSVFS* provide the basis for the blob service HA model

*Cluster Shared Volume File System
Relating Azure Storage Concepts
Subscription
  Resource Group
    Storage Account
      Container: Block Blob, Append Blob, Page Blob
      Table
      Queue
ARM & Resource Providers
- Azure Resource Manager (ARM) in Azure Stack: Azure-consistent management; clients use REST, PowerShell, or the Portal
- A Resource Provider (RP) manages one type of infrastructure and plugs into ARM: Compute (CRP), Network (NRP), Storage (SRP), ...
- Users express desired state via templates; a template is a declarative statement, and ARM performs the necessary orchestration, issuing imperative directives to the RPs
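A minimal sketch of this flow: the user hands ARM a declarative template, and ARM turns it into imperative calls to the RPs. The endpoint host, api-versions, and resource names below are illustrative assumptions, not the exact values an Azure Stack deployment uses.

```python
# Illustrative sketch: a declarative template plus the single imperative REST
# request that submits it to ARM. Host name and api-versions are placeholders.

template = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            # Desired state: one storage account; ARM directs SRP to realize it.
            "type": "Microsoft.Storage/storageAccounts",
            "name": "mystorageacct",
            "apiVersion": "2015-06-15",
            "location": "local",
            "properties": {"accountType": "Standard_LRS"},
        }
    ],
}

def deployment_request(subscription, resource_group, name, template):
    """Build the ARM deployment PUT that carries the declarative template."""
    url = (
        f"https://management.local.azurestack.external/subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}/providers/Microsoft.Resources"
        f"/deployments/{name}?api-version=2015-11-01"
    )
    body = {"properties": {"mode": "Incremental", "template": template}}
    return "PUT", url, body

method, url, body = deployment_request("sub-id", "rg1", "deploy1", template)
```

ARM receives this one request, then computes and issues the individual directives (create account, etc.) to the relevant RPs.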
Azure-consistent storage Management Model
- Tenants: Microsoft Azure Stack Portal, Azure Storage cmdlets, Azure CLI, and client SDKs → Azure Resource Manager → Storage Resource Provider (tenant resources)
- Administrators: Microsoft Azure Stack Portal and ACS cmdlets → Azure Resource Manager → Storage Resource Provider (admin resources)
- The Storage Resource Provider drives the ACS data path services
IaaS VM Storage
- All VM storage in Azure Stack resides in the blob store
- Every OS or data disk is a page blob; a page blob is a ReFS file and starts in REST API access mode
- CRP and SRP collaborate behind the scenes: the page blob toggles to 'SMB-only' mode at VM run time, enabling the super-optimized Hyper-V-over-SMB I/O path
ACS Architecture Deep Dive
Key Requirements and Challenges for Object Storage on MAS
- Atomicity guarantees
- Data consistency guarantees
- Immutable blocks
- Snapshot-isolated copy for reads
- 512-byte page alignment for page blobs
- Distributed List Blob (enumeration)
- Durability: synchronous 3-copy replication
- Scalable to millions of objects
- 99.9% high availability for read/write
- Fault tolerance & crash consistency
- No performance regression relative to Hyper-V over SMB
- Adapt to smaller cloud scale
ACS Architecture
[Architecture diagram: a load balancer fronts the virtual services (WFE, SRP, WAC, Table Master, Table Server) over HTTP; the virtual services communicate over RPC and reach the physical services over SMB; the physical SSU nodes run the Blob Service and SOFS over CSV (ReFS), Storage Spaces & pools, and shared or S2D DAS disks, under HA clustering. Key interactions between the components are shown.]
- SRP (Storage Resource Provider): integrates with ARM and exposes resource management REST interfaces for the storage service overall
- FE (Front End): provides the REST front end, consistent with Azure
- WAC: storage account management, user request authentication, and container operations
- Blob Service: implements the blob service backend; stores block and page blob data in the file system/ChunkStore and metadata in an ESENT DB
- Table Master: maps user tables to database/TS instances
- Table Server: handles table queries & transactions in its databases
- Storage: SOFS exposes a unified view of all tiered storage to compute nodes as CA shares; provides fault tolerance & local replication
Blobs - Semantic Requirements: see MSDN
BLOCK BLOBS
- The client uploads individual immutable blocks with PUT-BLOCK for future inclusion in a block blob. Block size may be up to 4 MB.
- A blob can have up to 100,000 uncommitted blocks; the maximum size of the uncommitted block list is 400 GB.
- A PUT-BLOCK-LIST call then assembles the blob. The maximum size supported for a block blob is 200 GB and 50,000 committed blocks.
- Blocks must retain their identity to permit later PUT-BLOCK-LIST calls to re-arrange them.
- Unused blocks are lazily cleaned up after a PUT-BLOCK-LIST request. In the absence of PUT-BLOCK-LIST, uncommitted blocks are garbage collected after 7 days.
- Blob names are case-sensitive, at least one character long, and at most 1024 characters.
- All blob operations guarantee atomicity: an operation either happened as a whole or did not happen at all; there is no undetermined state at failure.
- For block blobs, each GET-BLOB request gets a snapshot-isolated copy of the blob data (or the request fails if this cannot be accommodated).
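The uncommitted/committed block contract above can be modeled in a few lines. This is an illustrative in-memory sketch (class and limit names are mine), showing how Put Block stages immutable blocks and Put Block List atomically swaps in a new composition:

```python
# Minimal in-memory model of the Put-Block / Put-Block-List semantics above.
# Sizes and counts come from the API limits; the class itself is illustrative.

MAX_BLOCK = 4 * 1024 * 1024          # block size up to 4 MB
MAX_UNCOMMITTED = 100_000            # uncommitted blocks per blob
MAX_COMMITTED = 50_000               # committed blocks per blob

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}        # block id -> immutable block data
        self.committed = []          # ordered list of (block id, data)

    def put_block(self, block_id: str, data: bytes):
        if len(data) > MAX_BLOCK:
            raise ValueError("block exceeds 4 MB")
        if len(self.uncommitted) >= MAX_UNCOMMITTED:
            raise ValueError("too many uncommitted blocks")
        self.uncommitted[block_id] = data   # blocks are immutable once written

    def put_block_list(self, block_ids):
        if len(block_ids) > MAX_COMMITTED:
            raise ValueError("too many committed blocks")
        # A block id may refer to an uncommitted block or an already-committed
        # one (re-arrangement). Commit is atomic: build the full new
        # composition first, then swap it in; on failure nothing changes.
        current = dict(self.committed)
        new = [(bid, self.uncommitted.get(bid, current.get(bid))) for bid in block_ids]
        if any(data is None for _, data in new):
            raise KeyError("unknown block id")
        self.committed = new
        self.uncommitted.clear()     # unused blocks are (lazily) cleaned up

    def get_blob(self) -> bytes:
        # GET-BLOB reads a snapshot-isolated composition
        return b"".join(data for _, data in self.committed)
```

Note how a failed Put Block List leaves both lists untouched, matching the "happened as a whole or not at all" guarantee.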
PAGE BLOBS
- The client creates a new empty page blob by calling PUT-BLOB. A page blob starts as sparse, its size can be up to 1 TB, and it supports random read/write access.
- The client then calls PUT-PAGE to add content: PUT-PAGE writes a range of pages to the page blob and must guarantee atomicity.
- Calling PUT-PAGE with the Update option performs an in-place write on the specified page blob; any content in the specified pages is overwritten.
- Calling PUT-PAGE with the Clear option releases the storage space used by the specified pages; cleared pages are no longer tracked as part of the page blob.
- Each range of pages submitted with PUT-PAGE for an update may be up to 4 MB in size, and the start and end of the range must align with 512-byte boundaries.
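The alignment and range rules are easy to capture precisely. A small sketch (names are illustrative) of the checks a Put Page implementation must make, with Update and Clear against a sparse page map:

```python
# Sketch of the Put-Page constraints above: 512-byte alignment, a 4 MB cap
# per update, Update writing in place, Clear untracking pages.

PAGE = 512
MAX_RANGE = 4 * 1024 * 1024

def validate_page_range(start: int, end_inclusive: int) -> int:
    """Reject ranges that are misaligned or over 4 MB; return the length."""
    length = end_inclusive - start + 1
    if start % PAGE != 0 or (end_inclusive + 1) % PAGE != 0:
        raise ValueError("range must align to 512-byte boundaries")
    if length > MAX_RANGE:
        raise ValueError("range exceeds 4 MB")
    return length

class PageBlob:
    def __init__(self, size: int):
        self.size = size
        self.pages = {}              # sparse: only written pages are tracked

    def put_page_update(self, start, end_inclusive, data: bytes):
        length = validate_page_range(start, end_inclusive)
        assert len(data) == length
        for off in range(0, length, PAGE):   # atomic in the real service
            self.pages[start + off] = data[off:off + PAGE]

    def put_page_clear(self, start, end_inclusive):
        length = validate_page_range(start, end_inclusive)
        for off in range(0, length, PAGE):
            self.pages.pop(start + off, None)  # cleared pages are untracked
```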
Azure Blobs Object API : See MSDN for details
Common Blob Operations Put Blob, Get Blob, Get/Set Blob Properties, Get/Set Blob Metadata, Lease Blob, Snapshot Blob, Copy Blob, Abort Copy Blob, Delete Blob
Operations on Block Blobs Put Block, Put Block List, Get Block List
Operations on Append Blobs Append Block
Operations on Page Blobs Put Page, Get Page Ranges
Block Blob Object Design Challenges on Traditional File System
Why not implement Block Blobs as file objects?
- Isolation/atomicity and unique composition requirements are the key offenders:
  - No acceptable form of atomicity support on NTFS; the rename/create path on a file system is prohibitively expensive
  - The API semantics do not map to files: immutable blocks vs. random access
- Enumeration & rich metadata operations require an index, a DB, and transactions
- The namespace should live in a database, not in the file system, to meet scale demands and other requirements
- Kernel-mode filter driver complexity
- Need for index & transaction support
ACS Service Design Principles
- Keep it simple; target achievable limits in V1
- More than API consistency
- Build on available technologies rather than reinvent the wheel:
  - Depend on SOFS (with ReFS, Storage Spaces Direct) and CSVFS
  - Use ESE/NT
  - Leverage the Dedup Chunk Store
  - Service Fabric/WSFC for high availability, scale-out, and load balancing
Key Design Decisions for Blob Service
- ReFS for page blobs: snapshots/extent cloning, 4 MB atomic writes
- Page blobs and block blobs share the same service and the same DB/metadata; block and page blobs live under the same container
- Page blobs as files: a highly optimized data path for Hyper-V over SMB
- Block blobs as ChunkStore objects: block blobs are not files but immutable chunks
- RPC as the data transport from WFE to the blob service; the WFE cannot write directly to the ChunkStore
Block Blob Service Design
- The global namespace exists in the DB, with user-mode-only access and storage: no file system access to block blobs, neither for namespace nor for data.
- The ChunkStore implementation stores committed and uncommitted blocks, and the stream maps.
- Designed for Azure parity from the get-go.
- Azure blob metadata is stored in an ESE/NT DB, to optimize metadata-only operations and to implement the Blob API GC semantics. A DB's scope is a set of containers and their blobs' metadata.
Why ChunkStore?
- Immutable file containers for chunks and stream maps, with an append-only chunk insertion design
- Part of the proven Deduplication feature
- Integrity checks for detecting page corruptions, with reads from a replica for in-place patching
- Shallow and deep garbage collection
- Data chunks shared by the root blob and snapshots (deduplication)
- Guaranteed crash-consistent file commit (a precise order of operations guarantees data integrity at each stage)
- Support for various chunk sizes (up to 4 MB)
- Efficient, self-contained chunk-id referencing
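The append-only container with integrity checks can be sketched compactly. This toy model is an assumption-laden illustration: the real ChunkStore's id format and checksum scheme differ, but the shape — immutable appends, a self-contained id, verify-on-read — is the same:

```python
# Toy sketch of an append-only chunk container: chunks are immutable, appended
# to a container file, and addressed by a self-contained id (modeled here as
# offset + length + checksum; the real ChunkStore id format differs).

import hashlib

class ChunkContainer:
    def __init__(self):
        self.data = bytearray()      # stand-in for the container file

    def insert(self, chunk: bytes) -> tuple:
        # Append-only: existing chunks are never overwritten.
        offset = len(self.data)
        self.data += chunk
        checksum = hashlib.sha256(chunk).hexdigest()
        return (offset, len(chunk), checksum)

    def read(self, chunk_id) -> bytes:
        offset, length, checksum = chunk_id
        chunk = bytes(self.data[offset:offset + length])
        if hashlib.sha256(chunk).hexdigest() != checksum:
            # In the real store: read from a replica and patch in place.
            raise IOError("chunk corruption detected")
        return chunk
```

Because ids are self-contained and chunks immutable, multiple stream maps (root blob and snapshots) can safely reference the same chunk, which is what makes the dedup-style sharing cheap.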
Why ESE/NT?
- ESE: the Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM (Indexed Sequential Access Method) data storage technology. It provides transacted data update and retrieval.
- Transactions & indexes: the ESE database engine provides Atomic, Consistent, Isolated, Durable (ACID) transactions.
- Logging and crash recovery: ESE has write-ahead logging and snapshot isolation for guaranteed crash recovery. Application-defined data consistency is honored even in the event of a system crash.
- Good backup/restore support: ESE supports online backup, where one or more databases are copied along with log files in a manner that does not affect database operations.
- Page corruption detection, with read-from-replica and patch-in-place (FSCTL_MARK_HANDLE / MARK_HANDLE_READ_COPY); automatic DB scan and corruption detection. The ESENT engine self-detects and auto-corrects checksum errors on a JET database stored on a Storage Spaces triple replica accessed remotely via SMB/CSVFS.
- Used at scale in Exchange workloads.
Page Blob Service Design
- The global namespace exists in the DB.
- Page blobs are files stored in a ReFS volume; a blob is a linear mapping to the file, and there is no stream map.
- Page blobs also use the DB to store Azure blob metadata.
- To enable exclusive in-place access (direct via SMB) for Hyper-V, check-out and check-in semantics are provided: there is no concurrent REST vs. file system access.
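The check-out/check-in exclusivity rule can be sketched as a small access gate. This is an illustrative model (class, method, and path names are assumptions), showing only the mutual exclusion between SMB and REST access:

```python
# Sketch of check-out/check-in semantics: once a page blob is checked out for
# compute (SMB) access, REST operations are rejected until check-in. All
# names and the share path format here are illustrative.

class PageBlobAccessGate:
    def __init__(self):
        self.checked_out = set()     # blob names currently owned by Hyper-V

    def check_out(self, blob: str) -> str:
        if blob in self.checked_out:
            raise RuntimeError("already checked out")
        self.checked_out.add(blob)
        return f"\\\\sofs\\share\\{blob}.vhd"   # direct SMB path for compute

    def check_in(self, blob: str):
        self.checked_out.discard(blob)           # REST access allowed again

    def rest_access(self, blob: str) -> str:
        if blob in self.checked_out:
            raise PermissionError("blob is in SMB-only mode")
        return "ok"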
Page Blobs - Compute vs. REST data access path
- Blob REST access goes through the blob service.
- Compute page blob access (Hyper-V accessing VHDs) is direct to the file over SMB, the same path as today, with RDMA etc. Once a blob is "checked out" for compute access, it is not accessible through REST.
[Diagram: Hyper-V compute accesses the page blob VHD in place over SMB directly on the SSU node; REST requests go through the WFE (CSU) to the blob service, which accesses blob files on ReFS via SOFS.]
Blob Service Design Data Flow & Representation
Page Blob Design on ReFS
Design notes:
- A single staging log file per volume; append-only writes to the staging log, with control
- ReFS duplicate extents; the final commit call updates metadata
- Crash consistency via the staging-log write plus ReFS extent duplication
- Checkpointing to manage space consumption and recovery

Write path for an HTTP Put Page Z to PageBlobC:
1. The front end receives HTTP Put Page Z for PageBlobC.
2. The front end sends Put Page Z over RPC to the blob service instance hosting PageBlobC.
3. The blob service backend looks up the filename for PageBlobC in the Blobs table of the metadata store (e.g., PageBlobC → FilePageBlobC, MetadataC).
4. It builds an in-memory buffer combining alignment data from FilePageBlobC with the unaligned RPC data, then appends the buffer to the shared staging file.
5. It duplicates the extent for Page Z from the shared staging file into FilePageBlobC.
6. It updates the metadata for PageBlobC.
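The staging-then-publish ordering above can be sketched as follows. This is an illustrative model: the "duplicate extent" step is shown as a plain copy (ReFS duplicates the extent without moving data), and all names are assumptions:

```python
# Sketch of the page blob write path: append to the shared per-volume staging
# log first, then publish into the blob file, then update metadata. A crash
# before the publish step leaves the blob file untouched.

class Volume:
    def __init__(self):
        self.staging = []            # append-only staging log: (blob, page_no, data)
        self.files = {}              # blob file name -> {page_no: data}
        self.metadata = {}           # blob name -> last committed page

def put_page(vol: Volume, blob: str, page_no: int, data: bytes):
    vol.files.setdefault(blob, {})
    # (1) append-only write to the shared staging log
    vol.staging.append((blob, page_no, data))
    # (2) "duplicate extent" from the staging file into the blob file
    vol.files[blob][page_no] = data
    # (3) final commit updates metadata
    vol.metadata[blob] = page_no

def checkpoint(vol: Volume):
    # Checkpointing trims the staging log to bound space consumption.
    vol.staging.clear()
```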
BLOCK BLOB - PUT BLOCK
Put Block does not change the composition of the blob; concurrent Put Block operations are allowed.
1. The WFE receives Put Block X for Blob-A.
2. The WFE sends Put Block (block id + data) to the blob service.
3. The blob service inserts the chunk into the ChunkStore block data container and gets back a chunk id.
4. It inserts a blob entry in the Blobs table (if needed).
5. It inserts the (block id, chunk id) entry into the uncommitted blocks table.

Design notes:
- No block sharing across different blobs
- Uncommitted blocks live in an LV column in the DB, mapping the Azure block id (64 bytes) to a chunk id
- This enables efficient search of uncommitted blocks at Put Block List time (or at recovery), and supports the GC policy required by the Blob API
BLOCK BLOB - PUT BLOCK LIST
Put Block List modifies the blob's metadata/composition: it allocates and inserts a stream map into the ChunkStore (snapshots clone and insert new stream maps), and the blob entry records the "current" stream map id. The committed block list is kept in an LV (long value) column.
1. The WFE receives Put Block List {X, Y} for Blob-A.
2. The WFE sends Put Block List (list + metadata) to the blob service.
3. The blob service looks up the chunks in the uncommitted blocks table.
4. It adds a new stream map to the ChunkStore stream map container.
5. It moves the remaining (unreferenced) chunks for GC.
6. It creates the committed blocks table entries (block id, chunk id).
7. It updates the Stream-Id and metadata on the blob entry.
Blob Service Crash Consistency
Example Crash Consistency Approach (Put Block)
Legend: D = data chunk state, S = stream map state, M = blob record state, GC = GC record state. Each entity is either in memory or on disk; a crash at any point leaves either a prior consistent state (cleaned up by on-demand full GC) or the successful terminal state.

1. Initial state: a valid metadata blob record that may or may not pre-exist (M0); valid stream map state, with a stream for the blob if it exists (S0); and a valid GC metadata table with no records related to this blob (GC0). A crash leaves us in this state. If the operation's preconditions succeed, insert the new data chunk into the chunk store, moving to state 2; otherwise remain here.
2. The data chunk is durably inserted (D1), but the metadata blob record (M0) does not yet reference it. A crash here leaves an orphaned chunk, reclaimed by on-demand full GC (see the full GC crash consistency section). If there is no uncommitted block with the same block id, begin a transaction to update the metadata store with the newly inserted chunk id, moving to state 3; otherwise (an overwrite) create a stream map for the overwritten block, moving to state 5.
3. The blob record update (M1), referencing the new chunk id (D1), exists only in memory. A crash is again handled by on-demand full GC. Committing the DB transaction moves to state 4.
4. The blob record (M1) durably references the chunk (D1). This is a successful terminal state; we remain here.
5. (Overwrite path) A stream map for the overwritten block has been inserted (S1), leaving its chunk orphaned. Begin a transaction that updates the blob with the new block id and updates the GC table with the stream map, moving to state 6.
6. The blob record and GC updates (M1, GC1) are in memory. Committing the transaction moves to state 7.
7. (Overwrite terminal state) D1, S1, M1, GC1 are durable; the orphaned chunk/stream is reclaimed by periodic or full GC.
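The invariant that makes the table above work is the ordering: insert the chunk first, update metadata transactionally second. A minimal sketch (names illustrative) demonstrating that a crash between the two steps only ever produces an orphaned chunk, never dangling metadata:

```python
# Sketch of the crash-consistent Put Block ordering: a crash after the chunk
# insert leaves an orphan for full GC to reclaim; metadata never references a
# chunk that does not exist.

class Store:
    def __init__(self):
        self.chunks = {}             # chunk id -> data          (D)
        self.blob_blocks = {}        # block id -> chunk id      (M, on disk)
        self.next_id = 0

def put_block(store: Store, block_id: str, data: bytes, crash_after_insert=False):
    # State 1 -> 2: insert the data chunk (durable, but unreferenced)
    chunk_id = store.next_id
    store.next_id += 1
    store.chunks[chunk_id] = data
    if crash_after_insert:
        return                       # crash: chunk orphaned, metadata intact
    # State 2 -> 3 -> 4: transactionally point metadata at the new chunk
    store.blob_blocks[block_id] = chunk_id

def full_gc(store: Store):
    # Reclaim chunks that no metadata record references (orphans from crashes).
    live = set(store.blob_blocks.values())
    store.chunks = {cid: d for cid, d in store.chunks.items() if cid in live}
```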
Blob Service HA Model
Blob Service HA model on WSFC & CSVFS
- Multiple instances, one on each physical machine.
- Blobs are stored in a cluster file system (CSVFS), and the blob namespace is partitioned among the blob service instances.
- Each blob service maintains a partition mapping table; CSV volumes can move between nodes to remain highly available.
- A blob client maintains a mapping of partition id to the node name hosting the partition; the mapping can change due to node failover and CSV failover.
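The client-side partition map can be sketched as a cache with stale-entry refresh. Class and method names below are illustrative assumptions, not the actual blob client API:

```python
# Sketch of the blob client's partition map: cache partition-id -> node name,
# and refresh an entry when it turns out to be stale (node or CSV failover).

class BlobClient:
    def __init__(self, lookup_owner):
        # lookup_owner(partition_id) -> node name, resolved via the service
        self.lookup_owner = lookup_owner
        self.cache = {}

    def node_for(self, partition_id: str) -> str:
        if partition_id not in self.cache:
            self.cache[partition_id] = self.lookup_owner(partition_id)
        return self.cache[partition_id]

    def on_wrong_owner(self, partition_id: str) -> str:
        # The cached mapping was stale; drop it, re-resolve, and retry.
        self.cache.pop(partition_id, None)
        return self.node_for(partition_id)
```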
Table Service
Azure Table Semantics: see MSDN for details
Data model:
- Each row (entity) contains up to 1 MB of schema-less data.
- Each entity contains up to 252 properties, plus 3 system properties.
- PartitionKey + RowKey form the primary key for data queries in tables.
- Filters, LINQ queries, and pagination are supported for retrieving table entities.

PartitionKey | RowKey | TimeStamp     | Status    | ...
A            | Alice  | May 29, 2016  | "Online"  | ...
B            | Brian  | June 29, 2016 | "Offline" | ...
- Atomic: supports both entity-level transactions and entity group transactions within one partition, with a maximum of 100 entities.
- Consistent: after a successful write, any subsequent read returns the latest value.
- Durable: three replicas are stored synchronously before success is reported.
- Scalable: millions of entities in a single table; millions of tables in the whole system.
- Highly available: 99.9% read/write with local replication.
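The group-transaction constraints above (single partition, at most 100 entities per batch) suggest a simple client-side batching helper. This sketch is illustrative; the function name and entity shape are assumptions:

```python
# Sketch of the entity-group-transaction limits: one atomic batch may target
# only a single PartitionKey and at most 100 entities.

MAX_BATCH = 100

def make_batches(entities):
    """Group entities into valid atomic batches: same partition, <= 100 each."""
    by_partition = {}
    for e in entities:
        by_partition.setdefault(e["PartitionKey"], []).append(e)
    batches = []
    for _, group in sorted(by_partition.items()):
        for i in range(0, len(group), MAX_BATCH):
            batches.append(group[i:i + MAX_BATCH])
    return batches
```

Each returned batch could then be submitted as one atomic group transaction; entities in different partitions necessarily land in different (independently atomic) batches.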
Azure Table API : See MSDN for details
Common Table Operations Query Tables, Create Table, Delete Table, Get Table ACL, Set Table ACL
Operations on Table Entities Query Entities, Insert Entity, Insert Or Merge Entity, Insert Or Replace Entity, Update Entity, Merge Entity, Delete Entity
Table Backend
- The Table Master creates and keeps the mapping from ranged table partitions to Table Server instances in an Azure Service Fabric reliable collection, which is replicated across the cluster.
- The mapping is also cached in the FE to expedite lookups: the WFE queries the TS mapping, gets the partition assignment, and then sends table commands directly to the owning TS instance.
- One Table Server instance serves one DB containing multiple user tables (e.g., a user table named "CCADB32B-3955-41B8-B28B-062EE8791EE9" with PartitionKey, RowKey, TimeStamp, and property-bag columns).
- Multiple TS instances share one process; TS instances acquire their assigned database/partition ranges from the Table Master upon start.
- WAC stores the table list & properties.
- TS instances fail over or are load-balanced across the HA VMs of the Service Fabric cluster; the databases (T1..T8) live on the storage layer (SOFS).
- Service Fabric is used to achieve high availability, scale-out, and load balancing.
Questions?
Appendix
Differences against Azure Storage
- No Standard_RAGRS or Standard_GRS
- Premium_LRS: no performance limits or guarantees
- No Azure Files yet
- Usage meter differences: no IaaS transactions counted in Storage Transactions; no internal vs. external traffic distinction in Data Transfer
- No account type changes, and no custom domains
Blob Service Metadata Persistence Model
Stream maps are being evaluated for a move into the DB, as a single table with mixed row/blob types.
ACS HA Model
- Service Fabric is used by the WFE, TM/TS, WAC, and SRP instances for failover, load balancing, deployment, and upgrade.
- All ACS service roles are co-locatable; SRP co-locates with other RPs in the same cluster.
- Compute-cluster VM health monitoring complements in-guest Service Fabric by providing VM-level recovery.
- The blob service runs as an HA service on the SSU failover cluster, directly on the physical nodes as a peer of SMB. It monitors CSV movements in the cluster and attaches to or detaches from them.
- The blob service does not fail over; it runs in a multiple-active configuration.
[Diagram: virtual scale units (CSUs) run Service Fabric clusters with VM monitoring across HA VMs hosting WFE, SRP, WAC, TM, and TS instances, exchanging metric and health data; the physical scale unit (SSU) is a failover cluster with an active blob service instance on every SSU node.]
Create Table Data Flow
1. The WFE authenticates with the account service (WAC) and sends the table-creation request to WAC.
2. WAC adds a new table entry in the WAC metadata DB.
3. WAC queries the Table Master for the corresponding Table Server instance.
4. WAC requests that Table Server instance to create the new table.
5. The TS instance creates the user table in its DB (on the storage layer, SOFS).
Insert Entity Data Flow
1. The WFE authenticates & authorizes the request with WAC (if not already cached by the WFE).
2. The WFE queries the Table Master for the TS instance serving the table (if not already cached by the WFE).
3. The WFE sends the Insert Entity request to that TS instance.
4. The TS instance updates the DB containing the corresponding table (on the storage layer, SOFS) and sends success/failure back to the WFE.