CS551
Distributed Hash Tables
Structured Systems
Bill Cheng
http://merlot.usc.edu/cs551-f12
Copyright © William C. Cheng

Computer Communications - CSCI 551

Chord [Stoica01a]


Chord
A structured peer-to-peer system
  Map key to value
Emphasis on good algorithmic performance
  uses consistent hashing
  O(log N) route storage, O(log N) lookup cost, O(log^2 N) cost to join/leave
  vs. FreeNet w/emphasis on anonymity
Easy if static, but must deal with node arrivals and departures


Compare Search in Several Peer-to-Peer Systems
Napster: central search engine
Freenet: search towards keys, but no guarantees
Chord: map keys to linear search space
  keep pointers (fingers) into exponential places around space
  probabilistic (depends on hashing)


Hashing Nodes and Data
[Figure: identifier circle 0..7 with nodes at 0, 1, 3 and keys 1, 2, 6; SUCCESSOR(1)=1, SUCCESSOR(2)=3, SUCCESSOR(6)=0]
Nodes hash IP addresses to key space
  because this hashing is random, can expect nodes to be evenly distributed in key space
Store data in the successor of the data item's key
Property: If each node maintains successor, ... can find any data item
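The successor rule can be checked against the example circle with a short Python sketch. The names chord_id and successor are illustrative, and the use of SHA-1 is an assumption (the slide only says that nodes hash IP addresses).

```python
import hashlib
from bisect import bisect_left

M = 3  # identifier bits; the example circle has 2**3 = 8 positions


def chord_id(name, m=M):
    """Hash a name (for example an IP address) into the m-bit key space.

    SHA-1 is an assumption made for this sketch.
    """
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(key, nodes):
    """Return the first node at or clockwise after key on the circle."""
    ring = sorted(nodes)
    i = bisect_left(ring, key)
    return ring[i % len(ring)]  # wrap past the largest node to the smallest


nodes = [0, 1, 3]
for key in (1, 2, 6):
    print(f"SUCCESSOR({key})={successor(key, nodes)}")
```

A data item named "report.txt" (a made-up name) would then be stored at successor(chord_id("report.txt"), nodes).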


Hashing Nodes and Data (Cont...)
Nodes have a successor pointer
  but O(n) performance
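To see why successor pointers alone give O(n) lookups, the walk can be sketched in Python. This is a minimal illustration, not the protocol itself: the dictionary succ and the function name lookup_by_successor are hypothetical.

```python
def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # interval wraps past zero


def lookup_by_successor(start, key, succ):
    """Follow successor pointers from start until the owner of key is found.

    succ maps each node id to its successor's id. Returns (owner, hops).
    """
    node, hops = start, 0
    while not in_half_open(key, node, succ[node]):
        node = succ[node]
        hops += 1
    return succ[node], hops + 1  # one last hop to reach the owner


# The three-node example ring: 0 -> 1 -> 3 -> 0
succ = {0: 1, 1: 3, 3: 0}
print(lookup_by_successor(1, 6, succ))

# A ring with n nodes: the worst-case walk visits almost every node.
n = 16
big_ring = {i: (i + 1) % n for i in range(n)}
print(lookup_by_successor(0, n - 1, big_ring))
```

The hop count grows linearly with the number of nodes between the starting node and the key's owner.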


Improving Search Performance with Finger Tables
Finger tables enable logarithmic lookup
  i-th finger of node x is successor of x+2^(i-1)
  at each step, we halve the remaining distance (in key space) to the target
[Figure: node x with fingers at x+2^0, x+2^1, x+2^2, x+2^3, x+2^4, i.e. at distances 1, 2, 4, 8, ... from x]
Challenge: maintaining finger tables!


Improving Search Performance with Finger Tables (Cont...)
Ex: look for key y
[Figure: node x with fingers at x+2^156, x+2^157, x+2^158, x+2^159]


Improving Search Performance with Finger Tables (Cont...)
Ex: look for key y
  case 1: y is just beyond x+2^159
    forward to successor(x+2^159)
    way more than half the distance to y


Improving Search Performance with Finger Tables (Cont...)
Ex: look for key y
  case 2: y is just inside x+2^159
    forward to successor(x+2^158)
    a little over half the distance to y


Improving Search Performance with Finger Tables (Cont...)
Ex: look for key y
  case 3: y is just beyond x+2^158
    forward to successor(x+2^158)
    way more than half the distance to y
  and so on...
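The three cases can be summarized by one rule: forward to the largest finger start that does not pass y. A small Python sketch of that arithmetic follows. It assumes, for simplicity, that a node sits at or just after each finger start and before y; the function names are illustrative.

```python
from fractions import Fraction


def best_finger_exponent(distance):
    """Largest k with 2**k <= distance, so that x + 2**k does not pass y."""
    return distance.bit_length() - 1


def fraction_covered(distance):
    """Exact fraction of the distance from x to y covered by jumping 2**k."""
    k = best_finger_exponent(distance)
    return Fraction(2 ** k, distance)


cases = {
    "case 1, y just beyond x+2^159": 2 ** 159 + 1,
    "case 2, y just inside x+2^159": 2 ** 159 - 1,
    "case 3, y just beyond x+2^158": 2 ** 158 + 1,
}
for label, d in cases.items():
    k = best_finger_exponent(d)
    more_than_half = fraction_covered(d) > Fraction(1, 2)
    print(f"{label}: use x+2^{k}, covers more than half: {more_than_half}")
```

Because 2^k <= distance < 2^(k+1), the jump always covers more than half of the remaining distance, which is where the logarithmic hop count comes from.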


Finger Tables Example
i-th finger of node x is successor of x+2^(i-1)

[Figure: identifier circle 0..7 with nodes 0, 1, 3; finger intervals drawn for node 1]
finger[1].start=2   finger[1].interval=[finger[1].start, finger[2].start)
finger[2].start=3   finger[2].interval=[finger[2].start, finger[3].start)
finger[3].start=5   finger[3].interval=[finger[3].start, 1)

node 0    keys: 6
  start  int.   succ.
  1      [1,2)  1
  2      [2,4)  3
  4      [4,0)  0

node 1    keys: 1
  start  int.   succ.
  2      [2,3)  3
  3      [3,5)  3
  5      [5,1)  0

node 3    keys: 2
  start  int.   succ.
  4      [4,5)  0
  5      [5,7)  0
  7      [7,3)  0
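The tables in this example can be recomputed from the definition of the i-th finger. The Python sketch below uses a global view of the ring for brevity, which a real Chord node would not have; the function names are illustrative.

```python
def successor(key, ring):
    """First node at or clockwise after key; ring is a sorted list of ids."""
    for n in ring:
        if n >= key:
            return n
    return ring[0]  # wrap around the circle


def finger_table(node, nodes, m):
    """Return rows (start, interval, succ) for fingers i = 1..m of node."""
    ring = sorted(nodes)
    size = 2 ** m
    rows = []
    for i in range(1, m + 1):
        start = (node + 2 ** (i - 1)) % size
        end = (node + 2 ** i) % size  # next finger's start; the node itself when i == m
        rows.append((start, (start, end), successor(start, ring)))
    return rows


nodes = [0, 1, 3]
for n in nodes:
    print(f"node {n}")
    for start, (lo, hi), succ in finger_table(n, nodes, m=3):
        print(f"  {start}  [{lo},{hi})  {succ}")
```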


Node Joins
Must keep successors and finger table current
Use successors for correctness
  can always fall back on them to find a key
Use finger table for performance
  must update it, but can tolerate temporary errors
Keep successor and predecessor so we can update our neighbors
Key observation: can find successors and fingers by doing a lookup on the existing Chord ring!


Finding Predecessor and Successor
node.find_successor(key)
  n = find_predecessor(key);
  return n.successor;
node.find_predecessor(key)
  n = node;
  while (key ∉ (n,n.successor])
    n = n.closest_preceding_finger(key);
  return n;
node.closest_preceding_finger(key)
  for (i=m; i > 0; i--)
    if (finger[i].node ∈ (node,key))
      return finger[i].node;
  return node;
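The pseudocode can be exercised with a small runnable Python sketch. It is one possible transcription, not the course's reference code: the Node class, the interval helpers, and the hard-coded finger tables (taken from the three-node example with nodes 0, 1, 3) are assumptions of this sketch, and all calls are local rather than remote.

```python
def in_open(x, a, b):
    """True if x lies in the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


class Node:
    def __init__(self, ident):
        self.id = ident
        self.finger = []  # finger[i-1] holds the pseudocode's finger[i].node

    @property
    def successor(self):
        return self.finger[0]

    def find_successor(self, key):
        return self.find_predecessor(key).successor

    def find_predecessor(self, key):
        n = self
        while not in_half_open(key, n.id, n.successor.id):
            n = n.closest_preceding_finger(key)
        return n

    def closest_preceding_finger(self, key):
        for f in reversed(self.finger):  # i = m down to 1
            if in_open(f.id, self.id, key):
                return f
        return self


# Ring with nodes 0, 1, 3 and m = 3, finger successors as in the example tables.
ring = {i: Node(i) for i in (0, 1, 3)}
ring[0].finger = [ring[1], ring[3], ring[0]]
ring[1].finger = [ring[3], ring[3], ring[0]]
ring[3].finger = [ring[0], ring[0], ring[0]]

for start, key in ((3, 2), (1, 6), (0, 1)):
    owner = ring[start].find_successor(key)
    print(f"node {start} looks up key {key}: stored at node {owner.id}")
```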


Join Example

before node 6 joins

node 0    keys: 6
  start  int.   succ.
  1      [1,2)  1
  2      [2,4)  3
  4      [4,0)  0

node 1    keys: 1
  start  int.   succ.
  2      [2,3)  3
  3      [3,5)  3
  5      [5,1)  0

node 3    keys: 2
  start  int.   succ.
  4      [4,5)  0
  5      [5,7)  0
  7      [7,3)  0

after node 6 joins

node 0    keys: (none)
  start  int.   succ.
  1      [1,2)  1
  2      [2,4)  3
  4      [4,0)  6

node 1    keys: 1
  start  int.   succ.
  2      [2,3)  3
  3      [3,5)  3
  5      [5,1)  6

node 3    keys: 2
  start  int.   succ.
  4      [4,5)  6
  5      [5,7)  6
  7      [7,3)  0

node 6    keys: 6
  start  int.   succ.
  7      [7,0)  0
  0      [0,2)  0
  2      [2,6)  3

when new node enters, it establishes its successor and predecessor and then builds its finger table, and moves any keys it now "owns"
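The join steps for node 6 can be traced with a short Python sketch. It uses a global view of the ring to find the neighbors, whereas a real node would do a lookup on the existing ring; the function join and the keys dictionary are hypothetical names.

```python
M = 3  # identifier bits, matching the 0..7 circle in the example


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def successor(key, ring):
    """First node at or clockwise after key; ring is a sorted list of ids."""
    for n in ring:
        if n >= key:
            return n
    return ring[0]


def join(new_id, ring, keys):
    """Add new_id (not yet in ring), hand over its keys, build its fingers.

    keys maps node id -> set of stored keys and is updated in place.
    """
    succ = successor(new_id, ring)
    pred = max((n for n in ring if n < new_id), default=max(ring))
    # The new node now owns the keys in (pred, new_id], held so far by succ.
    moved = {k for k in keys[succ] if in_half_open(k, pred, new_id)}
    keys[succ] -= moved
    keys[new_id] = moved
    new_ring = sorted(ring + [new_id])
    fingers = [successor((new_id + 2 ** i) % 2 ** M, new_ring) for i in range(M)]
    return pred, succ, fingers, new_ring


keys = {0: {6}, 1: {1}, 3: {2}}
pred, succ, fingers, ring = join(6, [0, 1, 3], keys)
print("predecessor:", pred, "successor:", succ)
print("finger successors of node 6:", fingers)
print("keys per node:", keys)
```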


Robustness
Stabilization algorithm to confirm ring is correct
  every 30s, ask successor for its predecessor
  fix your own successor based on this
  successor fixes its predecessor if necessary
  also, pick and verify a random finger table entry
    rebuild finger table entries this way
  important observation: finger tables can be incorrect for some time (between network sizes of N and 2N)
Dealing with unexpected failures:
  keep successor list of r successors
  can use these to replicate data
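The successor and predecessor repair can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the method names stabilize and notify are chosen for this sketch, the 30 s timer is replaced by calling stabilize by hand, and finger repair and the successor list are left out.

```python
def in_open(x, a, b):
    """True if x lies in the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def find_successor(self, key):
        # Plain successor walk: correct but O(n); fingers omitted for brevity.
        n = self
        while not in_half_open(key, n.id, n.successor.id):
            n = n.successor
        return n.successor

    def join(self, known):
        """Join through a node already in the ring; only the successor is set."""
        self.predecessor = None
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        """Ask the successor for its predecessor and adopt it if it is closer."""
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        """The successor fixes its predecessor if candidate is a better one."""
        if self.predecessor is None or in_open(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate


# Existing ring 0 -> 1 -> 3 -> 0, then node 6 joins via node 0.
n0, n1, n3, n6 = Node(0), Node(1), Node(3), Node(6)
n0.successor, n1.successor, n3.successor = n1, n3, n0
n0.predecessor, n1.predecessor, n3.predecessor = n3, n0, n1
n6.join(n0)
for node in (n6, n3, n0, n1):  # one stabilization round per node
    node.stabilize()
print([(n.id, n.successor.id, n.predecessor.id) for n in (n0, n1, n3, n6)])
```

After the round, node 3 points to node 6 as its successor and node 0 records node 6 as its predecessor, so the ring is consistent again without any central coordination.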


Applications
File Systems
[Figure: root-block (public key, signature, H(D)) -> dictionary block D (H(F)) -> inode block F (H(B1), H(B2)) -> data blocks B1, B2]
Multicast and Anycast (using rendezvous)
[Figure: receiver (R) registers (id,R); sender (S) sends (id,data), which is delivered to receiver (R) as (R,data)]


Chord Performance
Performance dominated by lookup cost
  how long does it take to get to the node that stores a key?
Chord promises few O(log N) hops on the overlay
  but, on the physical network, this can be quite far
  this is often the problem with overlay networks
[Figure: overlay identifiers 0, 32, 64, 96, 120, 128, 255 placed on physical nodes in the USA and China]
