Replication and Consistency

Lehrstuhl für Informatik 4 Kommunikation und verteilte Systeme


Chapter 7: Replication and Consistency


Introduction to Replication

Replication is…
• Replication of data: the maintenance of copies of data on multiple computers
• Object replication: the maintenance of copies of whole server objects (i.e. their internal states) on multiple computers

Replication can provide…
• Performance enhancement, scalability
  - Remember caching: improve performance by storing data locally, but data are incomplete
  - Additionally: several web servers can have the same DNS name. The servers are selected by DNS in turn to share the load
  - Replication of read-only data is simple, but replication of frequently changing data causes overhead in providing up-to-date data


Introduction to Replication

Increased availability
• Sometimes needed: a service should be available nearly 100% of the time
• In case of server failures: simply contact another server with the same data items
• Network partitions and disconnected operations: availability of data if the connection to a server is lost. But: after re-establishing the connection, (conflicting) data updates have to be resolved

Fault-tolerant services
• Guaranteeing correct behaviour in spite of certain faults (can include timeliness)
• If f in a group of f+1 servers crash, then 1 remains to supply the service
• If f in a group of 2f+1 servers have byzantine faults, the group can supply a correct service


When to do Replication?

Replication at service initialisation
• Try to estimate the number of needed servers from a customer's specification regarding performance, availability, fault-tolerance, …
• Choose places to deposit data or objects
  - Example: root servers in DNS

Replication 'on demand'
• When failures or performance bottlenecks occur, make a new replica, possibly placed at a new location in the network
• Or: to improve local access operations, place a copy near a client
  - Example: (DNS) caching, disconnected operations


Why Object Replication?

• It is useful to consider objects in replication instead of only considering data
• Objects have the benefit of encapsulating data and operations on data. Thus, object-specific operation requests can be distributed
• But: now one has to consider internal object states! This topic is related to mobile agents, but it becomes more complicated here because the internal states of all replicas have to be kept consistent!


Requirements for Replication

For all of these application areas, one requirement holds: the client should not be aware of using a group of computers!

Replication transparency
• Clients do not work on multiple physical copies, they only see one logical object which they request to perform an action
• Clients expect only a single result, not one result per data copy
• Needed: propagation of updates

Consistency
• How to ensure that all data copies are consistent?
• Specific degrees depending on the application:
  - Temporary inconsistencies could be tolerated
  - When multiple clients are connected using different copies, they should get consistent results

Performance, availability, fault-tolerance ⇔ consistency, up-to-date information
Scalability?


Management of Replicated Data

Needed: replication transparency and consistency
• Client requests are handled by front ends. A front end provides replication transparency
• With front ends, clients see a service that gives them access to logical objects, which are in fact replicated at several replica managers which guarantee consistency
• Client request operations:
  - read-only requests: calls without making updates
  - update requests: calls executing write operations (but could also execute read operations)

[Figure: clients send requests to front ends, which forward them to the replica managers (RM) on the servers that together form the service]

System Model

• Each logical object is implemented by a collection of physical copies called replicas (the replicas are not necessarily consistent all the time; some may have received updates not yet delivered to the others)
• Replica managers
  - Contain replicas on a computer and access them directly
  - Replica managers apply operations to replicas recoverably, i.e. they do not leave inconsistent results if they crash
  - Static systems are based on a fixed set of replica managers
  - In a dynamic system, replica managers may join or leave (e.g. when they crash)
  - A replica manager can be a state machine, which has the following properties:
    a) Operations are applied atomically
    b) The current state is a deterministic function of the initial state and the applied operations
    c) All replicas start identically and carry out the same operations
    d) The operations must not be affected by clock readings etc.


Performing a Request

In general: five phases for performing a request on replicated data
• Issue request. The front end either
  - sends the request to a single replica manager that passes it on to the others, or
  - multicasts the request to all of the replica managers
• Coordination. For consistent execution, the replica managers decide
  - whether to apply the request (e.g. in presence of failures)
  - how to order the request relative to other requests (according to FIFO, causal or total ordering)
• Execution. The replica managers execute the request (sometimes tentatively)
• Agreement. The replica managers agree on the effect of the request, e.g. perform it 'lazily' or immediately
• Response. One or more replica managers reply to the front end, which combines the results
  - For high availability, give the first response to the client
  - To tolerate faults, take a vote


Replication Example

How to provide fault-tolerant services, i.e. a service that is provided correctly even if some processes fail?
• Simple replication system: e.g. two replica managers A and B, each managing replicas of two accounts x and y. Clients use the local replica manager if possible; after responding to a client, the update is transmitted to the other replica manager

Client 1:                         Client 2:
setBalanceB(x,1)
setBalanceA(y,2)
                                  getBalanceA(y) → 2
                                  getBalanceA(x) → 0

• Initial balance of x and y is $0
  - Client 1 first updates x at B (local). When updating y it finds B has failed, so it uses A for the next operation.
  - Client 2 reads the balances at A (local), but because B had failed, no update was propagated to A: x has amount 0.

Be careful when designing replication algorithms! You need a consistency model
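The anomaly above can be reproduced in a few lines. The following is a minimal sketch (not lecture code; the names ReplicaManager, set_balance and get_balance are illustrative): B applies client 1's update locally, crashes before propagating it, and client 2 subsequently reads a stale x at A.

```python
# Minimal sketch (assumption): two lazily synchronised replica managers A and B.

class ReplicaManager:
    def __init__(self, name):
        self.name = name
        self.balances = {"x": 0, "y": 0}
        self.failed = False

    def set_balance(self, key, value, peer=None, crash_before_propagate=False):
        if self.failed:
            raise ConnectionError(f"{self.name} unreachable")
        self.balances[key] = value              # respond to the client first ...
        if crash_before_propagate:
            self.failed = True                  # ... then crash: nothing propagated
            return
        if peer is not None and not peer.failed:
            peer.balances[key] = value          # ... then propagate lazily

    def get_balance(self, key):
        if self.failed:
            raise ConnectionError(f"{self.name} unreachable")
        return self.balances[key]

A, B = ReplicaManager("A"), ReplicaManager("B")

B.set_balance("x", 1, peer=A, crash_before_propagate=True)  # client 1 at B
A.set_balance("y", 2, peer=B)                               # client 1 falls back to A
print(A.get_balance("y"))   # 2
print(A.get_balance("x"))   # 0 -- stale: B's update was never propagated
```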


Strict Consistency

There are several correctness criteria for replication regarding consistency. Real world: strict consistency
• Any read on a data item x returns a value corresponding to the result of the most recent write on x
• Problem with strict consistency: it relies on absolute global time


Linearizability

The strictest criterion for a replication system is linearizability
• Consider a replicated service with two clients that perform read and update operations o1i and o2j, respectively
• Communication is synchronous, i.e. a client waits for one operation to complete before doing another
• Single server: serialize the operations by interleaving, e.g. o20, o21, o10, o22, o11, o12
• In replication: virtual interleaving; a replicated shared service is linearizable if for any execution there is some interleaving of the series of operations issued by all the clients such that
  - the interleaved sequence of operations meets the specification of a (single) correct copy of the objects
  - the order of operations in the interleaving is consistent with the real times at which they occurred in the actual execution
• Linearizability concerns only the interleaving of individual operations; it is not intended to be transactional


Sequential Consistency

Problem with linearizability: the real-time requirement is practically impossible to fulfil. Weaker correctness criterion: sequential consistency
• A replicated shared service is sequentially consistent if for any execution there is some interleaving of the series of operations issued by all the clients such that
  - the interleaved sequence of operations meets the specification of a (single) correct copy of the objects
  - the order of operations in the interleaving is consistent with the program order in which each individual client executed them
• Every linearizable service is sequentially consistent, but not vice versa:

Client 1:                         Client 2:
setBalanceB(x,1)
                                  getBalanceA(y) → 0
                                  getBalanceA(x) → 0
setBalanceA(y,2)

Possible under a naive replication strategy: the update at B has not yet been propagated to A when client 2 reads x. The real-time criterion for linearizability is not satisfied, because the read getBalanceA(x) → 0 occurs after the write setBalanceB(x,1). But both criteria for sequential consistency are satisfied with the ordering: getBalanceA(y) → 0; getBalanceA(x) → 0; setBalanceB(x,1); setBalanceA(y,2)


… more Consistency

Sequential consistency is widely used, but has poor performance. Thus, there are models relaxing the consistency guarantees:

Causal Consistency (weakening of sequential consistency)
• Distinction between events that are causally related and those that are not
• Example: read(x); write(y) in one process are causally related, because the value of y can depend on the value of x read before
• Consistency criterion: writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines
• Necessary: keeping track of which processes have seen which writes (vector timestamps), as sketched below

FIFO Consistency (weakening of causal consistency)
• Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes
• Easy to implement

Even weaker: Weak Consistency, Release Consistency, Entry Consistency
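As a small illustration of the vector-timestamp bookkeeping mentioned above, the following sketch (an assumption, not part of the lecture material) shows how causal precedence between writes can be decided by component-wise comparison.

```python
# Minimal sketch (assumption): vector timestamps for tracking causal order.

def increment(ts, i):
    """Process i increments its own entry before issuing a write."""
    ts = list(ts)
    ts[i] += 1
    return ts

def merge(a, b):
    """Component-wise maximum: the combined knowledge of both timestamps."""
    return [max(x, y) for x, y in zip(a, b)]

def happened_before(a, b):
    """a -> b iff a <= b component-wise and a != b (causal precedence)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Three processes: P0 writes, P1 reads that write and then writes itself.
t0 = increment([0, 0, 0], 0)              # P0's write:           [1, 0, 0]
t1 = increment(merge([0, 0, 0], t0), 1)   # P1's dependent write: [1, 1, 0]
print(happened_before(t0, t1))   # True  -> must be applied in this order everywhere
print(happened_before(t1, t0))   # False
```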


Consistency Models

Consistency       Description
Strict            Absolute time ordering of all shared accesses matters
Linearizability   All processes must see all shared accesses in the same order. Accesses are furthermore ordered according to a global timestamp
Sequential        All processes see all shared accesses in the same order. Accesses are not ordered in time
Causal            All processes see causally-related shared accesses in the same order
FIFO              All processes see writes from each other in the order they were done. Writes from different processes may not always be seen in that order


Replication Approaches

Distinction between mechanisms: replication…
• …for fault tolerance
• …for highly available services
• …in transactions


Fault Tolerance: Passive (Primary-Backup) Replication

Replication for fault tolerance: passive or primary-backup model
• There is at any time a single primary replica manager and one or more secondary replica managers (backups, slaves)

[Figure: front ends send requests to the primary RM, which propagates updates to the backup RMs]

• Front ends only communicate with the primary replica manager, which executes the operation and sends copies of the updated data to the backups
• If the primary fails, one of the backups is promoted to act as the primary
• This system implements linearizability, since the primary sequences all the operations on the shared objects
• Variation: clients can read from backups, which reduces the work load for the primary, but then only sequential consistency is achieved


Execution in Passive Replication

There are five phases in performing a client request:
1. Request:
   - A front end issues the request, containing a unique identifier, to the primary replica manager
2. Coordination:
   - The primary performs each request atomically, in the order in which it receives it
   - It checks the unique identifier to see whether it has already executed the request. If yes, it simply resends the response
3. Execution:
   - The primary executes the request and stores the response
4. Agreement:
   - If the request is an update, the primary sends the updated state, the response and the unique identifier to all the backups. The backups send an acknowledgement
5. Response:
   - The primary responds to the front end, which hands the response back to the client
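A minimal sketch of this request flow, assuming illustrative class and method names (Primary.handle, Backup.apply) that are not from the lecture: the primary deduplicates by request id, pushes the updated state to the backups, and only then answers the front end.

```python
# Minimal sketch (assumption) of the five passive-replication phases.

class Backup:
    def __init__(self):
        self.state = {}
        self.responses = {}

    def apply(self, req_id, state, response):
        self.state = state                    # take over the primary's new state
        self.responses[req_id] = response     # remember response; ack by returning

class Primary:
    def __init__(self, backups):
        self.state = {}
        self.responses = {}                   # request id -> stored response
        self.backups = backups

    def handle(self, req_id, key, value):
        # Coordination: resend the stored response for a repeated request
        if req_id in self.responses:
            return self.responses[req_id]
        # Execution
        self.state[key] = value
        response = ("ok", key, value)
        # Agreement: send updated state + response + id to every backup
        for b in self.backups:
            b.apply(req_id, dict(self.state), response)
        # Response
        self.responses[req_id] = response
        return response

backups = [Backup(), Backup()]
primary = Primary(backups)
print(primary.handle("r1", "x", 1))   # ('ok', 'x', 1)
print(primary.handle("r1", "x", 1))   # duplicate request: same stored response
```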


Properties of Passive Replication

+ Simple principle
+ To survive f process crashes, f+1 replica managers are required
+ The front end can be designed with little functionality
- But: relatively large overheads by using view-synchronous communication: several rounds of communication, latency after a primary crash because of the agreement on a new view

• Variation: clients can read from backups
  - Reduces the work load for the primary
  - Achieves sequential consistency but not linearizability
• Sun Network Information System (NIS) uses passive replication with weaker guarantees than sequential consistency:
  - Achieves high availability and good performance
  - The master receives updates and propagates them to slaves using one-to-one communication
  - Information retrieval can be made by using either the master or a slave server


Active Replication

The replica managers are state machines all playing the same role and are organised as a group:
• A front end multicasts each request to the group of replica managers
• All replica managers start in the same state and perform the same operations in the same order so that their state remains identical (notice: totally ordered reliable multicast is needed to guarantee the identical execution order!)
• If a replica manager crashes, it has no effect on the performance of the service because the others continue as normal
• Failures in a few replicas can be tolerated because the front end can collect and compare the replies it receives

[Figure: each front end multicasts client requests to all replica managers (RM) of the group]

Execution in Active Replication

There are the following phases in performing a client request:
1. Request
   - The front end attaches a unique identifier to a request and uses totally ordered reliable multicast to send it to all replica managers
2. Coordination
   - The group communication system delivers the request to all the replica managers in the same (total) order
3. Execution
   - Every replica manager executes the request. They are state machines and receive requests in the same order, so the effects for correct replica managers are identical
(Agreement)
   - No agreement is required, because all managers execute the same operations in the same order due to the properties of the totally ordered multicast
4. Response
   - The front end collects responses from the managers (and combines them, if necessary)


Properties of Active Replication

• As replica managers are state machines, we have sequential consistency but do not achieve linearizability (because we do not have perfectly accurate synchronisation algorithms)
• Possible to deal with byzantine failures: use 2f+1 replica managers to overrule up to f faulty ones. For this, a front end simply has to wait for f+1 identical responses from replica managers. Replica managers have to digitally sign their responses!
• Variations:
  - Allow a sequence of read operations to be executed by different replica managers in any order
  - Or: allow the front end to send a read-only request to only one replica manager (if this manager fails, the front end contacts the next one)
  - Allow write operations on different data items to be executed in any order
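The front-end voting rule for byzantine failures can be sketched as follows (illustrative code, assuming responses have already been signature-checked): with 2f+1 managers, the first value reported identically by f+1 of them is accepted.

```python
# Minimal sketch (assumption): front end tolerating up to f byzantine managers.

from collections import Counter

def collect_result(responses, f):
    """responses: iterable of (manager_id, value) pairs, already
    signature-checked. Returns the first value reported identically by at
    least f+1 managers, or None if no such value exists."""
    counts = Counter()
    for _manager, value in responses:
        counts[value] += 1
        if counts[value] >= f + 1:
            return value
    return None

# 2f+1 = 5 managers, up to f = 2 may be faulty; three agree on 42
replies = [("rm1", 42), ("rm2", 7), ("rm3", 42), ("rm4", 99), ("rm5", 42)]
print(collect_result(replies, f=2))   # 42
```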


Highly Available Services

How to provide services with high availability, i.e. how to use replication to achieve an availability of (ideally) 100%?
• Reasonable response times for as much of the time as possible
• Even if some results do not conform to sequential consistency
• E.g. a disconnected user may accept temporarily inconsistent results if he can continue to work and fix inconsistencies later

Difference to fault-tolerant systems
• For fault tolerance, all replica managers have to agree on a result - 'eager' evaluation, not acceptable for reaching reasonable response times
• Instead, for high availability:
  - Only contact the minimum number of replica managers necessary to reach an acceptable level of service
  - Clients should be tied up for a minimum time while managers coordinate their actions
  - Weaker consistency generally requires less agreement and makes data more available. Updates are propagated 'lazily'


The Gossip Architecture

• The gossip architecture is a framework for implementing highly available services
  - Data is replicated close to the location of clients
  - Replica managers periodically exchange 'gossip' messages containing updates they have received from clients
• Two basic types of operations are provided:
  - queries: read-only operations
  - updates: modify (but do not read) the state
• Front ends choose any replica manager to send queries and updates to, selected by availability and response times
• Two guarantees are made (even if managers are temporarily unable to communicate with one another):
  - Each client gets a consistent service over time (i.e. the data reflects at least the updates seen by the client so far, even if it uses different replica managers). Vector timestamps are used, with one entry per manager
  - Relaxed consistency between replicas: all replica managers eventually receive all updates and apply them with ordering guarantees that suit the needs of the application (generally causal ordering). A client may observe stale data


Query & Update Operations

• The service consists of a collection of replica managers that exchange gossip messages
• Queries and updates are sent by a client via a front end to any replica manager
• The front end converts an operation that contains both reading and writing into two separate calls
• For ordering operations, the front end sends a timestamp prev with each request to denote its latest state
• A new timestamp new is passed back in a read operation to mark the data state the client has seen last

[Figure: a client C sends a query with prev via its front end (FE) to a replica manager and receives a value val and timestamp new; another client's update with prev goes via its FE to a replica manager, which returns an update id; the replica managers (RM) of the service exchange gossip messages]


Timestamps

• Each front end keeps a vector timestamp that reflects the latest data value seen by the front end (prev)
• Clients can communicate with each other. This can lead to causal relationships between client operations, which have to be considered in the replicated system. Thus, communication between clients is made via the front ends, including an exchange of vector timestamps, allowing the front ends to consider causal ordering in their timestamps

[Figure: clients C communicate via their front ends (FE), which exchange vector timestamps; the FEs talk to the replica managers (RM) of the service, which exchange gossip messages]


Gossip Processing of Queries and Updates

Phases in performing a client request:
1. Request
   - Front ends normally use the same replica manager for all operations
   - Front ends may be blocked on queries
2. Update response - the replica manager replies as soon as it has received the update
3. Coordination - the replica manager receiving a request waits to process it until the ordering constraints are met. This may involve receiving updates from other replica managers in gossip messages
4. Execution - the replica manager executes the request
5. Query response - if the request is a query, the replica manager now replies
6. Agreement - replica managers update one another by exchanging gossip messages containing the most recently received updates. This does not have to be done for each update separately


Gossip Replica Manager

[Figure: internal structure of a gossip replica manager. Gossip messages exchanged with other replica managers feed the update log and the replica timestamp. The manager's state consists of the value, the value timestamp, the update log (stable updates are applied to the value), the replica timestamp, the executed operation table and a timestamp table with one entry per other replica manager. Updates arrive from front ends (FE) as ⟨operation id, update, prev⟩]


Replica Manager State

Main state components of a replica manager:
• Value: reflects the application state as maintained by the replica manager (each manager is a state machine). It begins with a defined initial value, and update operations are applied to it
• Value timestamp: the vector timestamp that reflects the updates applied to get the saved value
• Executed operation table: prevents an operation from being applied twice, e.g. if it is received from other replica managers as well as from the front end
• Timestamp table: contains a vector timestamp for each other replica manager, extracted from gossip messages
• Update log: all update operations are recorded immediately when they are received. An operation is held back until the ordering allows it to be applied
• Replica timestamp: a vector timestamp indicating the updates accepted by the manager, i.e. placed in the log (different from the value timestamp if some updates are not yet stable)


Query & Update Operations

• A query includes a timestamp. The operation is marked as pending until the manager's timestamp is larger than the operation's timestamp
• Update operations are processed in causal order. A front end can send an update operation containing a timestamp prev and an identifier id to one or several replica managers
  - When replica manager i receives an update request, it checks whether it is new, by looking for the id in its executed operation table and in its log
  - If it is new, the replica manager
    - increments the i-th element of its replica timestamp by 1,
    - assigns the result as the new timestamp ts of the update, and
    - stores the update in its log
  - The replica manager returns ts to the front end, which merges it with its vector timestamp (note: by sending a request to several managers the front end gets back several timestamps, which have to be merged)
  - Depending on the timestamp of an operation, it may be delayed until gossip messages from other replica managers arrive. When the timestamp allows, the replica manager applies the operation and makes an entry in the executed operation table
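A condensed sketch of this update path, under the assumption of simplified data structures (the class and method names are illustrative, not the lecture's): the manager deduplicates by id, advances its own entry of the replica timestamp, logs the update and applies it once all causal predecessors are reflected in the value timestamp.

```python
# Minimal sketch (assumption): update acceptance at gossip replica manager i.

class GossipReplicaManager:
    def __init__(self, i, n):
        self.i = i                       # index of this manager
        self.value = {}                  # application state
        self.value_ts = [0] * n          # updates reflected in value
        self.replica_ts = [0] * n        # updates accepted into the log
        self.log = []                    # entries: (ts, prev, id, operation)
        self.executed = set()            # ids of applied updates

    def receive_update(self, uid, prev, operation):
        if uid in self.executed or any(e[2] == uid for e in self.log):
            return None                  # already known: ignore duplicate
        self.replica_ts[self.i] += 1     # accept the update
        ts = list(prev)
        ts[self.i] = self.replica_ts[self.i]
        self.log.append((ts, prev, uid, operation))
        self.apply_stable()
        return ts                        # returned to the front end

    def apply_stable(self):
        # an update is stable once prev <= value_ts (all causal predecessors applied)
        progress = True
        while progress:
            progress = False
            for entry in list(self.log):
                ts, prev, uid, operation = entry
                if all(p <= v for p, v in zip(prev, self.value_ts)):
                    operation(self.value)                                   # execute
                    self.value_ts = [max(a, b) for a, b in zip(self.value_ts, ts)]
                    self.executed.add(uid)
                    self.log.remove(entry)
                    progress = True

rm = GossipReplicaManager(i=0, n=3)
print(rm.receive_update("u1", [0, 0, 0], lambda v: v.update(x=1)))  # [1, 0, 0]
print(rm.value, rm.value_ts)                                        # {'x': 1} [1, 0, 0]
```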


Gossip Messages

• The timestamp table contains a vector timestamp for each other replica manager, collected from gossip messages
• A replica manager uses the entries in this timestamp table to estimate which updates another manager has not yet received. This information is sent in a gossip message
• A gossip message m contains the log m.log and the replica timestamp m.ts
• A manager receiving gossip message m has the following main tasks (sketched below):
  - Merge the arriving log with its own
  - Apply in causal order updates that are new and have become stable
  - Remove redundant entries from the log and the executed operation table when it is known that they have been applied by all replica managers
  - Merge its replica timestamp with m.ts, so that it corresponds to the additions in the log
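The merge of an incoming gossip message can be sketched as follows (again an assumption with illustrative names, omitting the application of newly stable updates and the purging of fully propagated entries): the logs are united by update id and the replica timestamps are merged component-wise.

```python
# Minimal sketch (assumption): merging an incoming gossip message.

def merge_timestamps(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def receive_gossip(log, replica_ts, m_log, m_ts):
    """log: list of (ts, prev, uid, op) entries already held locally;
    replica_ts: local replica timestamp; m_log / m_ts: the gossip message.
    Returns the merged log and replica timestamp."""
    known = {uid for (_ts, _prev, uid, _op) in log}
    for entry in m_log:
        ts, prev, uid, op = entry
        if uid not in known:             # keep only records not already logged
            log.append(entry)
            known.add(uid)
    # the replica timestamp now covers everything recorded in the merged log
    return log, merge_timestamps(replica_ts, m_ts)

local_log = [([1, 0, 0], [0, 0, 0], "u1", "op1")]
merged_log, merged_ts = receive_gossip(local_log, [1, 0, 0],
                                       m_log=[([1, 1, 0], [1, 0, 0], "u2", "op2")],
                                       m_ts=[1, 1, 0])
print(merged_ts)                      # [1, 1, 0]
print([e[2] for e in merged_log])     # ['u1', 'u2']
```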


Update Propagation

The architecture above does not specify when to exchange gossip messages. To design a robust system in which each update is propagated in reasonable time, an exchange strategy is needed. The time required for all replica managers to receive a given update depends on three factors:
1. The frequency and duration of network partitions
   - This is beyond the system's control
2. The frequency with which replica managers send gossip messages
   - This may be tuned to the application
3. The policy for choosing a partner with which to exchange gossip messages
   - Random policies choose a partner randomly, but with weighted probabilities so as to favour some partners over others
   - Deterministic policies give fixed communication partners
   - Topological policies arrange the replica managers into a fixed graph (mesh, circle, tree, …) and messages are passed to the neighbours
   - Other strategies: consider transmission latencies, fault probabilities, …


Properties of the Gossip Architecture

• The gossip architecture is designed to provide a highly available service
+ Clients with access to a single replica manager can work even when other managers are inaccessible
- But this is not suitable for data such as bank accounts
- It is inappropriate for updating replicas in real time
• Scalability
  - As the number of replica managers grows, so does the number of gossip messages
  - For R managers collecting G updates into a gossip message, the number of messages per request is 2 + (R-1)/G (e.g. with R = 5 managers and G = 10 updates per gossip message: 2 + 4/10 = 2.4 messages per request)
  - Variation: increase G to reduce the number of gossip messages, but this makes latency worse
  - For applications where queries are more frequent than updates, use some read-only replicas placed near the clients, which are updated only by gossip messages


Operational Transformation Approach

• The so-called Bayou system provides data replication for high availability with weaker guarantees than sequential consistency
• Bayou replica managers cope with variable connectivity by exchanging updates in pairs (as in the gossip architecture), but Bayou adopts a markedly different approach in that it enables domain-specific conflict detection and resolution to take place
• All updates are applied and recorded at whatever replica manager they reach
• Replica managers detect and resolve conflicts when exchanging information by using domain-specific policies. The effect of undoing or altering conflicting operations to resolve them is called operational transformation
• A Bayou update is a special case of a transaction. It is carried out with the ACID guarantees. Bayou may undo and redo updates to the database as execution proceeds
• The Bayou guarantee is that, eventually, every replica manager receives the same set of updates and applies those updates in such a way that the replica managers' databases are identical
• In practice, there may be a continuous stream of updates, and the databases may never become identical; but they would become identical if the updates ceased


Committed and Tentative Updates

• Updates are marked as tentative when they are first applied to the database. While they are in this state, they can be undone and re-applied if necessary
• Tentative updates are eventually placed in a canonical order and marked as committed
• The committed order can be achieved by designating some replica manager as the primary replica manager, which decides the order according to the times at which it receives the updates
• A tentative update ti becomes the next committed update and is inserted after the last committed update cN; all tentative updates t0 to ti-1 are reapplied after it:

  Committed: c0, c1, c2, …, cN   |   Tentative: t0, t1, t2, …, ti, ti+1, …
  (ti moves to the position after cN; t0 … ti-1 are reapplied after it)


Dependency Checks and Merge Procedures

Every Bayou update contains a dependency check and a merge procedure in addition to the operation's specification, because of the possibility that an update may conflict with some other operation that has already been applied:
• A replica manager calls the dependency check procedure before applying the operation
• It checks whether a conflict would occur if the update was applied, and it may examine any part of the database to do that
• If the dependency check indicates a conflict, then Bayou invokes the operation's merge procedure
• That procedure alters the operation that will be applied so that it achieves something similar to the intended effect but avoids the conflict
• The merge procedure may fail to find a suitable alteration of the operation, in which case the system indicates an error
• The effect of a merge procedure must be deterministic, however it is implemented: Bayou replica managers are state machines
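To make the idea concrete, here is an illustrative sketch (not Bayou's actual interface) of an update for a room-booking application: the dependency check detects an already-booked slot, and the merge procedure deterministically alters the operation to book the first free alternative slot instead.

```python
# Illustrative sketch (assumption): a Bayou-style update = operation +
# dependency check + merge procedure, for booking a room slot.

def make_booking_update(room, slot, alternatives):
    def dependency_check(db):
        return slot in db.get(room, {})          # conflict: slot already taken

    def operation(db, chosen=slot):
        db.setdefault(room, {})[chosen] = "booked"

    def merge(db):
        for alt in alternatives:                 # deterministic order!
            if alt not in db.get(room, {}):
                operation(db, chosen=alt)        # altered but similar effect
                return True
        return False                             # no suitable alteration found

    return dependency_check, operation, merge

def apply_update(db, update):
    dependency_check, operation, merge = update
    if not dependency_check(db):
        operation(db)
    elif not merge(db):
        raise RuntimeError("conflict could not be resolved")

db = {}
apply_update(db, make_booking_update("R1", "10:00", ["11:00", "14:00"]))
apply_update(db, make_booking_update("R1", "10:00", ["11:00", "14:00"]))
print(db)   # {'R1': {'10:00': 'booked', '11:00': 'booked'}}
```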


Problem with Bayou

Bayou is different from other approaches because it makes replication non-transparent to the application
• Increased complexity for the application programmer: dependency checks and merge procedures have to be provided
• Increased complexity for the user: tentative data may be seen, and user-specified operations may be altered
• The operational transformation approach used by Bayou appears in systems for computer-supported cooperative work (CSCW)
• This approach is limited in practice to situations where only a few conflicts arise, users can deal with tentative data, and the data semantics are simple


Transactions with Replicated Data

Till now: consideration of only single operations on a set of replicas. But: objects in transactional systems can also be replicated to enhance availability and performance. How to deal with atomic sequences of operations?
• The effect of transactions on replicated objects should be the same as if they had been performed one at a time on a single set of objects
• This property is called one-copy serializability
• Each replica manager provides concurrency control and recovery of its own objects
• Assumption for the presented methods: two-phase locking is used
• Replication makes recovery more complicated… when a replica manager recovers, it restores its objects with information from other managers


Architectures for Replicated Transactions

• Assumption: a front end sends requests to one of a group of replica managers
• In the primary copy approach, all front ends communicate with the primary replica manager, which propagates updates to backups
• In other schemes, front ends may communicate with any replica manager and coordination between the managers has to take place. The manager receiving a request is responsible for the cooperation process
• (Note: the rules as to how many managers are involved vary with the replication scheme)
• Question: propagate requests immediately or at the end of a transaction?
  - In the primary copy scheme, we can wait until the end of the transaction (concurrency control is applied at the primary)
  - If transactions access the same objects at different managers, requests need to be propagated immediately so that concurrency control can be applied
• Two-phase commit protocol
  - Becomes a two-level nested 2PC. If a coordinator or worker is a replica manager, it will communicate with the other replica managers that it passed requests to during the transaction


Read One/Write All (ROWA)

• Used for transactions
• One replica manager is required for a read request, all replica managers for a write request
  - Every write operation must be performed at all managers, each of which applies a write lock
  - Each read operation is performed by a single manager, which sets a read lock
• Consider pairs of operations by different transactions on the same object:
  - Any pair of write operations will require conflicting locks at all of the managers
  - A read operation and a write operation will require conflicting locks at a single manager

[Figure: transaction T performs getBalance(A) at one replica manager holding A; transaction U performs deposit(B,3) at all replica managers holding B]
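A minimal sketch of the locking pattern, with simplified single-threaded "locks" that merely detect conflicts instead of blocking (all names are illustrative, not from the lecture): reads lock one replica manager, writes lock them all.

```python
# Minimal sketch (assumption): read-one/write-all with conflict-detecting locks.

class ReplicaManager:
    def __init__(self):
        self.objects = {}        # name -> value
        self.locks = {}          # name -> (mode, set of transaction ids)

    def lock(self, name, mode, tx):
        held = self.locks.get(name)
        if held and (mode == "write" or held[0] == "write") and tx not in held[1]:
            raise RuntimeError("conflicting lock, transaction must wait")
        if held and held[0] == mode:
            held[1].add(tx)
        else:
            self.locks[name] = (mode, {tx})

def rowa_read(managers, name, tx):
    rm = managers[0]                      # any single manager suffices
    rm.lock(name, "read", tx)
    return rm.objects.get(name, 0)

def rowa_write(managers, name, value, tx):
    for rm in managers:                   # every manager takes a write lock
        rm.lock(name, "write", tx)
        rm.objects[name] = value

managers = [ReplicaManager() for _ in range(3)]
rowa_write(managers, "A", 10, tx="T")
try:
    rowa_read(managers, "A", tx="U")      # U's read conflicts with T's write lock
except RuntimeError as e:
    print(e)                              # conflicting lock, transaction must wait
```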


Available Copies Replication

• The simple read one/write all scheme is not realistic:
  - It cannot be carried out if some of the replica managers are unavailable, either because they have crashed or because of a communication failure
• The available copies replication scheme is designed to allow some managers to be temporarily unavailable
  - A read request can be performed at any available replica manager
  - Write requests are performed by the receiving manager and all other available managers in the group
  - As long as the set of available managers does not change, local concurrency control achieves one-copy serializability in the same way as read one/write all
  - Problems occur if a manager fails or recovers during the progress of conflicting transactions


Read One/Write All Available (ROWA-A)

• T's getBalance(A) is performed by X, whereas T's deposit(B,3) is performed by M, N and P
• At X, T has read A and has locked it. Therefore U's deposit(A,3) is delayed until T finishes
• Local concurrency control achieves one-copy serializability provided the set of replica managers does not change
• …but we have managers failing and recovering

[Figure: transaction T: getBalance(A); deposit(B,3) and transaction U: getBalance(B); deposit(A,3). Replica managers X and Y hold copies of A; managers M, N and P hold copies of B; U's deposit(A,3) is delayed at X]


Replica Manager Failure

• A replica manager can fail by crashing and is replaced by a new process, which restores the state from a recovery file
• Front ends use timeouts to detect a manager failure. In case of a timeout, another manager is tried
• If a replica manager is currently recovering, it is not yet up to date and rejects requests (and the front end tries another replica manager)
• For one-copy serializability, failures and recoveries have to be serialised with respect to transactions
  - A transaction observes when a failure occurs
  - One-copy serializability is not achieved if different transactions make conflicting failure observations
  - In addition to local concurrency control, some global concurrency control is required to prevent inconsistent results between a read in one transaction and a write in another transaction


Replica Manager Failure: read/write Conflict

• Replica manager X fails just after T has performed getBalance(A)
• Replica manager N fails just after U has performed getBalance(B)
• Both managers fail before T and U have performed their deposit operations. Therefore T's deposit(B,3) will be performed at replica managers M and P (all available managers for B), and U's deposit(A,3) will be performed at manager Y (the only available manager for A)
• Concurrency control at X does not prevent U from updating A at Y
• Concurrency control at N does not prevent T from updating B at M and P

[Figure: as before, X and Y hold copies of A, M, N and P hold copies of B; X and N have failed]


Additional Concurrency Control: Local Validation

Goal: ensure that no failure or recovery occurred during a transaction's progress
• Before a transaction commits, it checks for failures and recoveries of the replica managers it has contacted
  - T would check whether N is still unavailable and whether X, M and P are still available
  - If this is the case, T can commit
    - This implies that X failed after T validated and before U validated
    - We have: N fails → T commits → X fails → U validates
  - U checks whether N is still available (no) and X still unavailable - therefore U must abort
• In case of a failure or recovery:
  - When a transaction has observed a failure, its local validation communicates with the failed manager to ensure that it has not yet recovered
  - The check whether replica managers have failed since the start of a transaction can be combined with the 2PC protocol


Network Partitions

• Part of the network fails, creating sub-groups which cannot communicate with one another
• Replication schemes assume partitions will be repaired
  - Operations done during a partition must not cause inconsistency
  - Optimistic schemes (e.g. available copies with validation) allow all operations and resolve inconsistencies when a partition is repaired
  - Pessimistic schemes (e.g. quorum consensus) prevent inconsistency, e.g. by limiting availability in all but one sub-group

[Figure: a network partition separates transaction T, which performs deposit(B,3) at the replica managers in one sub-group, from transaction U, which performs withdraw(B,4) at the managers in the other sub-group; all managers hold copies of B]


Available Copies with Validation

• Optimistic approach
• The available copies algorithm is applied within each partition
• Maintains the normal level of availability for read operations, even during partitions
• When a partition is repaired, the possibly conflicting transactions that took place in the separate partitions are validated:
  - If the validation fails, then some steps must be taken to overcome the inconsistencies
  - If there had been no partition, one of a pair of transactions with conflicting operations would have been delayed or aborted
  - As there has been a partition, pairs of conflicting transactions have been allowed to commit in different partitions; the only choice after the event is to abort one of them, which requires making changes in the objects and, in some cases, compensating effects in the real world
  - The optimistic approach is only feasible with applications where such compensating actions can be taken


Quorum Consensus Methods

• To prevent transactions in different partitions from producing inconsistent results, make a rule that operations can be performed in only one of the partitions
• Replica managers in different partitions cannot communicate, thus each sub-group decides independently whether it can perform operations
• A quorum is a sub-group of replica managers whose size gives it the right to perform operations. The right could be given by having the majority of the replica managers in the partition
• In quorum consensus schemes, update operations may be performed by a subset of the replica managers forming a quorum
  - The other replica managers have out-of-date copies
  - Version numbers or timestamps can be used to determine which copies are up to date
  - Operations are applied only to copies with the current version number


Gifford's Quorum Consensus

• Assign a number of votes to each data copy at a replica manager
• A vote is a weighting giving the desirability of using a particular copy
• Thus, groups of replica managers can be configured to provide different performance or reliability characteristics
• Each read operation must obtain a read quorum of R votes before it can read from any up-to-date copy
• Each write operation must obtain a write quorum of W votes before it can do an update operation
• R and W are set for a group of replica managers such that
  - W > half the total votes
  - R + W > total number of votes for the group
• In case of a partition it is not possible to perform conflicting operations on the same data file in different partitions
• Performance and reliability of write (resp. read) operations increase with decreasing W (resp. R)
• Main disadvantage: performance of read operations is degraded
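The two constraints on R and W can be checked mechanically. A small sketch (an assumed helper, not from the slides):

```python
# Minimal sketch (assumption): checking Gifford's quorum conditions
# W > half of the total votes and R + W > total votes.

def valid_quorum_config(votes, R, W):
    """votes: list of votes assigned to the copies; R, W: quorum sizes."""
    total = sum(votes)
    return W > total / 2 and R + W > total

print(valid_quorum_config([1, 1, 1], R=2, W=2))  # True  (majority quorums)
print(valid_quorum_config([1, 1, 1], R=1, W=3))  # True  (read one / write all)
print(valid_quorum_config([2, 1, 1], R=2, W=3))  # True  (cf. Example 2 below)
print(valid_quorum_config([1, 1, 1], R=1, W=2))  # False (read and write quorums need not overlap)
```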


Quora

Three examples of the voting algorithm:
a) A correct choice of read and write set
b) A choice that may lead to write-write conflicts
c) A correct choice, corresponding to ROWA (read one, write all)


Gifford's Quorum Consensus

• Before a read operation, a read quorum is collected:
  - Make enquiries at replica managers to find a set of copies the sum of whose votes is not less than R (not all of these copies need be up to date)
  - As each read quorum overlaps with every write quorum, every read quorum is certain to include at least one current copy
  - The read operation may be applied to any up-to-date copy
• Before a write operation, a write quorum is collected:
  - Make enquiries at replica managers to find a set with up-to-date copies the sum of whose votes is not less than W
  - If there are insufficient up-to-date copies, then an out-of-date file is replaced with a current one, to enable the quorum to be established
  - The write operation is then applied by each replica manager in the write quorum, the version number is incremented, and completion is reported to the client
  - The files at the remaining available managers are then updated in the background
• Two-phase read/write locking is used for concurrency control
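A sketch of the read path under these rules (illustrative names and data layout, not from the lecture): collect copies until their votes reach R, then read from the copy with the highest version number.

```python
# Minimal sketch (assumption): collecting a read quorum and reading a current copy.

def read_with_quorum(copies, R):
    """copies: list of dicts {"votes": int, "version": int, "value": ...}
    representing the reachable replica managers. Returns the value of a
    current copy, or raises if no read quorum can be collected."""
    quorum, votes = [], 0
    for copy in copies:                  # enquire until R votes are reached
        quorum.append(copy)
        votes += copy["votes"]
        if votes >= R:
            break
    if votes < R:
        raise RuntimeError("read quorum cannot be obtained")
    # the quorum overlaps every write quorum, so it contains a current copy
    current = max(quorum, key=lambda c: c["version"])
    return current["value"]

copies = [
    {"votes": 1, "version": 2, "value": "new"},
    {"votes": 1, "version": 1, "value": "old"},
    {"votes": 1, "version": 2, "value": "new"},
]
print(read_with_quorum(copies, R=2))   # 'new'
```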


Gifford's Quorum Consensus Examples

                                    Example 1   Example 2   Example 3
Latency (ms)         Replica 1          75          75          75
                     Replica 2          65         100         750
                     Replica 3          65         750         750
Voting               Replica 1           1           2           1
configuration        Replica 2           0           1           1
                     Replica 3           0           1           1
Quorum sizes         R                   1           2           1
                     W                   1           3           3

Derived performance of the file suite (blocking probability = probability that a quorum cannot be obtained, assuming a probability of 0.01 that any single replica manager is unavailable):

Read     Latency (ms)                   65          75          75
         Blocking probability         0.01      0.0002    0.000001
Write    Latency (ms)                   75         100         750
         Blocking probability         0.01      0.0101        0.03
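The blocking probabilities in the table can be reproduced by enumerating which managers are up, assuming independent failures with probability 0.01 per manager (a sketch, not part of the slides):

```python
# Sketch (assumption): verifying the blocking probabilities of the table above.

from itertools import product

def blocking_probability(votes, quorum, p=0.01):
    """Probability that fewer than `quorum` votes are reachable,
    with each manager independently unavailable with probability p."""
    blocked = 0.0
    for up in product([True, False], repeat=len(votes)):
        reachable = sum(v for v, u in zip(votes, up) if u)
        if reachable < quorum:
            prob = 1.0
            for u in up:
                prob *= (1 - p) if u else p
            blocked += prob
    return blocked

print(round(blocking_probability([2, 1, 1], 2), 6))   # read,  Example 2: 0.000199 ~ 0.0002
print(round(blocking_probability([2, 1, 1], 3), 6))   # write, Example 2: 0.010099 ~ 0.0101
print(round(blocking_probability([1, 1, 1], 1), 6))   # read,  Example 3: 1e-06
print(round(blocking_probability([1, 1, 1], 3), 6))   # write, Example 3: 0.029701 ~ 0.03
```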


Gifford's Quorum Consensus Examples

Example 1
• Is configured for a file with a high read-to-write ratio, with several weak representatives and a single replica manager
• Replication is used for performance, not reliability
• The replica manager can be accessed in 75 ms, and the two clients can access their weak representatives on local discs in 65 ms, resulting in lower latency and less network traffic

Example 2
• Is configured for a file with a moderate read-to-write ratio which is accessed mainly from one local network. The local replica manager has 2 votes and the remote managers 1 vote each
• Reads can be done at the local replica manager, but writes must access one local and one remote manager. If the local manager fails, only reads are allowed

Example 3
• Is configured for a file with a very high read-to-write ratio
• Reads can be done at any replica manager and the probability of the file being unavailable is small. But writes must access all managers


Conclusion

• Replication is used for…
  - performance gain
  - fault tolerance
  - high availability
• Several consistency models for different applications
• Several architectures for realizing those models
• Problem all the time: replication introduces new coordination overhead

• That means: synchronization, voting, transactions, … are basic mechanisms for coordinating distributed applications; replication has the purpose of increasing the quality of the services offered by these applications
