CS3250 Distributed Systems Lecture 5

More on TCP/IP

The Internet Protocol (IP) is a network-level protocol which provides an unreliable, connection-less service that delivers packets (called datagrams) from a source IP address to a destination IP address. It thus performs routing of packets, choosing a path over the network via intermediate routers.

Internet Protocol Datagram Format

The general form of an IP datagram consists of a header, which includes the source and destination IP addresses and the length of the datagram in octets (bytes), followed by a data area. The format of the data area is not specified by IP and can be used to transmit arbitrary data. Of course the format of this area may depend on higher-level protocols in the application and transport layers. In particular the datagram data area will usually start with a TCP (or UDP) header, followed by the TCP (or UDP) data itself. This data in turn will normally contain a header of the application layer protocol followed by application data.

| Datagram Header | Datagram Data Area |

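IP Header Format (IPv4)

 0       4       8               16    19        24           31
+---------------------------------------------------------------+
| VERS  | HLEN  | SERVICE TYPE  |         TOTAL LENGTH          |
+---------------------------------------------------------------+
|        IDENTIFICATION         |FLAGS|     FRAGMENT OFFSET     |
+---------------------------------------------------------------+
| TIME TO LIVE  |   PROTOCOL    |        HEADER CHECKSUM        |
+---------------------------------------------------------------+
|                       SOURCE IP ADDRESS                       |
+---------------------------------------------------------------+
|                    DESTINATION IP ADDRESS                     |
+---------------------------------------------------------------+
|              IP OPTIONS (if any)              |    PADDING    |
+---------------------------------------------------------------+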

The version field (VERS) in the header (4 bits) specifies the version of the IP protocol (usually IPv4). The IP software at routers and at the destination checks this field before processing the datagram and rejects the datagram if it cannot handle this version.

The header length field (HLEN) is also 4 bits long and specifies the header length in 32-bit words. The minimum value is 5 (20 octets), but the value may be higher if there are option fields in the header.

The type of service (TOS) field is eight bits long, of which only 6 are used in IPv4. The first three bits specify the priority level (precedence) of the datagram from zero (lowest) to 7 (highest). If a router supports different priority levels then higher priority packets will be given precedence. Typically levels 6 and 7 are used for control information and ensure that control signals and routing information can be exchanged between routers even when the network is congested. The remaining three bits in use can indicate the type of service required (low delay, high throughput, high reliability) if router software and available network routes support different levels of service.

The total length field specifies the length of the datagram (including the header) in octets (bytes). As this field is 16 bits, the maximum size of an IP datagram is 65535 octets and the length of the data area is (TOTAL LENGTH - 4*HLEN) octets.

© A Barnes, 2004
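The fields described so far can be picked out of the first four octets of a header. The following is a minimal Python sketch, not part of the original notes: the function name parse_ipv4_header and the sample header are illustrative assumptions, and only the fields discussed above are decoded.

```python
import struct

def parse_ipv4_header(packet: bytes) -> dict:
    """Decode VERS, HLEN, precedence and TOTAL LENGTH (illustrative helper)."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header is at least 20 octets long")
    # Network byte order: two 8-bit fields followed by the 16-bit TOTAL LENGTH.
    vers_hlen, service_type, total_length = struct.unpack("!BBH", packet[:4])
    version = vers_hlen >> 4      # top 4 bits: VERS
    hlen = vers_hlen & 0x0F       # bottom 4 bits: HLEN in 32-bit words
    if version != 4:
        raise ValueError(f"cannot handle IP version {version}")
    if hlen < 5:
        raise ValueError("HLEN must be at least 5")
    return {
        "version": version,
        "hlen": hlen,
        "precedence": service_type >> 5,          # first 3 bits of SERVICE TYPE
        "total_length": total_length,
        "data_length": total_length - 4 * hlen,   # TOTAL LENGTH - 4*HLEN
    }

# Hypothetical header: version 4, HLEN 5, precedence 0, TOTAL LENGTH 1420;
# the remaining 16 octets are left as zeros because they are not decoded here.
sample = bytes([0x45, 0x00]) + struct.pack("!H", 1420) + bytes(16)
info = parse_ipv4_header(sample)
print(info["data_length"])
```

For the sample header the data area is 1420 - 4*5 = 1400 octets long.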


CS3250/L5

Fragmentation

The next three fields control fragmentation of IP datagrams. Packets may be routed over many different kinds of intermediate network over which the sender has little control. Different types of network (in the data link layer) support different maximum packet sizes. Thus an IP packet may need to be fragmented at some intermediate router in order to transmit it over a particular link. Once fragmented, the packets are not reassembled until they reach the destination address. Hence if a datagram is fragmented at one router, each fragment must be given an IP header of the above form to allow it to be transmitted separately to the destination.

Suppose a datagram with a data area of 1400 octets and a header of minimum size (20 octets) needs to be transmitted over a data link which supports a maximum packet size of 656 octets. Suppose also that the data link header comprises 36 octets. Thus the largest IP datagram that can be transmitted has a size of 620 octets, and the IP datagram needs to be split into three IP datagrams: the first two have a length of 620 octets (20 for the header and 600 for data) and the third has a length of 220 octets (20 for the header and 200 for data).

Original datagram (1420 octets: 20 octets header + 1400 data octets):

  | Header | Data 600 octets | Data 600 octets | Data 200 octets |

Fragment 1 (offset 0, MF = 1):      | Fragment 1 Header | Data 600 octets |
Fragment 2 (offset 600, MF = 1):    | Fragment 2 Header | Data 600 octets |
Fragment 3 (offset 1200, MF = 0):   | Fragment 3 Header | Data 200 octets |

The identification field is a 16-bit integer value which uniquely identifies a datagram. As each datagram is transmitted by the source this number is incremented. It is copied into each fragment header and indicates to the eventual receiver that the fragments all belong to the same original datagram.

The offset fields in the three fragments are set to 0, 600 and 1200 respectively to indicate the position of each data fragment in the original datagram. (The offsets are quoted here in octets; in the header itself the 13-bit FRAGMENT OFFSET field counts units of 8 octets, so the stored values are 0, 75 and 150.) The total length fields in the three fragments are set to 620, 620 and 220 respectively. The third bit in the flags field is a 'more fragments' (MF) bit: if set (=1) it indicates that more fragments follow the current one and if clear (=0) it indicates that the current fragment is the last one. All other header fields except the checksum are copied from the header of the original datagram.

The fragments are then transmitted and (may) eventually arrive at their final destination, not necessarily in the order in which they were transmitted. However the three 'fragmentation' fields and the total length field enable the IP software at the receiver to determine when it has all the fragments from the original datagram and to reassemble them. In outline this is done as follows: when the receiver gets a fragment in which the MF bit is zero, it can deduce from the offset and the total length field of this fragment what the length of the original datagram was, namely 1420 = 1200 + 220 octets. It then collects all the other fragments with the same identification number (and same source address) and can deduce from the offset fields and total length fields whether it has all the required fragments and, if so, in which order they need to be reassembled. If all are present the original datagram is reconstructed. If one or more fragments are still missing after a certain time-out interval, the whole datagram is discarded.
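The fragmentation arithmetic and the receiver's completeness test can be sketched in Python as follows. This is an illustration under stated assumptions rather than real IP software: the function names are hypothetical, offsets are kept in octets as in the worked example, every fragment is assumed to carry a header of the same length, and the rounding of each fragment's data area down to a multiple of 8 octets reflects the units of the real offset field.

```python
def fragment(data_len: int, header_len: int, max_datagram: int):
    """Return (offset, total_length, mf) for each fragment, offsets in octets."""
    # Largest data area per fragment, rounded down to a multiple of 8 octets.
    max_data = (max_datagram - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("link cannot carry any data after the IP header")
    fragments = []
    offset = 0
    while offset < data_len:
        size = min(max_data, data_len - offset)
        more = offset + size < data_len          # MF is clear only on the last one
        fragments.append((offset, header_len + size, 1 if more else 0))
        offset += size
    return fragments

def is_complete(received, header_len: int = 20) -> bool:
    """True if the fragments received so far cover the whole original data area."""
    last = [f for f in received if f[2] == 0]
    if not last:
        return False                             # final fragment not yet seen
    offset, total_length, _ = last[0]
    data_len = offset + total_length - header_len
    covered = 0
    for off, length, _ in sorted(received):      # process in offset order
        if off > covered:
            return False                         # a gap: some fragment is missing
        covered = max(covered, off + length - header_len)
    return covered >= data_len

# The worked example: 1400 data octets, 20-octet header, 620-octet limit.
frags = fragment(1400, 20, 620)
print(frags)
# Fragments may arrive in any order; completeness does not depend on it.
print(is_complete([frags[2], frags[0], frags[1]]))
print(is_complete([frags[2], frags[0]]))
```

A real receiver would also key the collection on the identification number and source address, and would discard a partial set after the time-out.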


If the second bit in the flags field (the 'do not fragment' bit; the first bit is reserved) is set, it indicates that the datagram must not be fragmented; if it cannot be transmitted over a particular data link it is (depending on the routing software) re-routed or discarded and an error message sent back to the source.

For maximum efficiency the original datagram size should be chosen as large as possible consistent with the constraint that datagrams should be transmitted across all intermediate networks without fragmentation. If the packets are too large, fragmentation will occur and this has considerable processing overheads. If the IP packets are too small, then the ratio of 'payload' to header bytes is decreased and space in data-link level packets is not utilised fully.

More Header Fields

The next field in the IP header is the time to live (TTL) field. This 8-bit integer value is set by the sender and indicates the maximum number of seconds that the datagram is allowed to remain on the internet before being discarded. When the datagram arrives at an intermediate router it is stored and then forwarded to the next router. If the store and forward process takes T seconds, the TTL field is decremented by T (rounded up to the nearest integer). If the TTL field reaches zero before the datagram reaches its final destination, the datagram is discarded and an error message sent back to the source. This simple mechanism prevents datagrams travelling around the network forever when there are errors in routing tables. In practice with modern routers the TTL field is decremented by 1 at each router, so that the TTL field sets the maximum number of 'hops' allowed between source and destination.

The 8-bit protocol field contains an indication of the higher level protocol used to create the datagram data area (usually TCP or UDP but there are other possibilities).

The 16-bit checksum field contains a checksum constructed from the datagram header only; it is used to detect whether the header has been corrupted during transmission. The checksum is constructed by breaking the header into 16-bit words (ignoring the checksum field), adding them all together using ones-complement arithmetic and then taking the ones complement of the result. This process is repeated at intermediate routers and at the final destination, and if the checksums fail to match the datagram is discarded. Note that the checksum is constructed from the IP header only. If corruption of data occurs frequently in the underlying network, higher level protocols should include their own checksums on the data portion or risk losing data.

The next four octets contain the IP address of the source in network byte order and this is followed by the IP address of the destination machine.

These are followed by zero or more IP options. These control such things as: recording the route taken by datagrams across the internet (each router adds its IP address to the datagram header as it forwards it); recording a route as above but with time-stamps also added to the header as the datagram is forwarded by each router; specifying a route across the internet (the datagram header contains a list of intermediate IP addresses that partially or completely determine the route that the datagram should follow); and imposing security or handling restrictions. There may be a field of zeros to pad the IP header out to a whole number of 32-bit words.
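The header checksum described above can be sketched in Python as follows. The function names and the 20-octet sample header are assumptions chosen for illustration; the checksum field is taken to occupy octets 10 and 11 of the header, as in the header layout, and is treated as zero while the sum is formed.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Add 16-bit words using ones-complement (end-around carry) arithmetic."""
    if len(data) % 2:
        data += b"\x00"                              # pad to a whole 16-bit word
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)     # fold the carry back in
    return total

def ip_checksum(header: bytes) -> int:
    """Checksum of an IP header, with the checksum field treated as zero."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return ~ones_complement_sum(zeroed) & 0xFFFF

def header_is_intact(header: bytes) -> bool:
    """Recompute the checksum and compare it with the value in the header."""
    (stored,) = struct.unpack("!H", header[10:12])
    return stored == ip_checksum(header)

# Hypothetical 20-octet header whose checksum field holds 0xb861.
header = bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7")
print(hex(ip_checksum(header)))
print(header_is_intact(header))

# A router that decrements TTL (octet 8) must recompute the checksum.
forwarded = header[:8] + bytes([header[8] - 1]) + header[9:]
print(header_is_intact(forwarded))
```

The last line shows why the checksum is the one field that cannot simply be copied when a header is modified in transit.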


User Datagram Protocol (UDP)

This is a transport layer protocol which provides an unreliable, connection-less packet delivery service from a source end-point (IP address and port number) to a destination end-point (IP address and port number). It operates on top of the network level IP protocol, which delivers packets from a source IP address to a destination IP address. Thus in effect UDP only has to differentiate between different ports on host machines. The UDP protocol also checks that the data has not been corrupted in transit and then forwards it to the specified port so that it can be delivered to the required process. Ports are normally buffered so that data is not lost if it arrives before the process is ready to receive it.

Each UDP message or datagram consists of two parts: a UDP header and a UDP data area.

 0                               16                           31
+---------------------------------------------------------------+
|          SOURCE PORT          |       DESTINATION PORT        |
+---------------------------------------------------------------+
|      UDP MESSAGE LENGTH       |         UDP CHECKSUM          |
+---------------------------------------------------------------+
|                          DATA .....                           |
+---------------------------------------------------------------+

The header consists of four 16-bit words: source port, destination port, the UDP message length in octets (including the 8 octets for the header) and finally a 16-bit checksum. Note that source and destination addresses are not part of the UDP header. As we have seen, the IP header contains these addresses and to duplicate them in the UDP header would increase the overall packet length unnecessarily. The minimum UDP length is 8 (for a header alone) and the maximum is 65515 (20 octets less than 65535, to allow room for an IP header of minimum length without violating the maximum length for IP datagrams).

The checksum calculation in UDP is optional and may be omitted, for example to avoid unnecessary overheads when operating across a highly reliable network. If no checksum is computed, the checksum field is set to zero to indicate this. Otherwise a checksum is computed using ones-complement arithmetic. Note that if the checksum result is zero then it is changed to all ones (recall that in ones-complement notation there are two bit-patterns representing zero, namely all zeros and all ones, and hence there is no ambiguity).

The UDP checksum uses more information than is stored in the UDP datagram alone. To compute the checksum, the UDP software first pads out the datagram with a zero octet (if necessary) so that its length is a whole number of 16-bit words. For the purposes of calculating the checksum only, it then prepends a 12-octet pseudo-header to the UDP datagram. The whole datagram (both UDP header and UDP data) and its pseudo-header are then split into 16-bit words (ignoring the 16 bits of the checksum field), these are all added using ones-complement arithmetic and the ones complement of the result is stored in the checksum field. Note that the pseudo-header and padding octet (if used) are not transmitted to the network; after the checksum calculation they are discarded. The pseudo-header has the form:

 0               8               16                           31
+---------------------------------------------------------------+
|                       SOURCE IP ADDRESS                       |
+---------------------------------------------------------------+
|                    DESTINATION IP ADDRESS                     |
+---------------------------------------------------------------+
|     ZEROS     |     PROTO     |          UDP LENGTH           |
+---------------------------------------------------------------+

The 8-bit protocol field contains a code denoting the transport layer protocol used (17 for UDP). Other protocols such as TCP use the same checksum scheme, but of course use a different protocol number here. The UDP length field is the length of the UDP datagram in octets (excluding the pseudo-header).

To check that the datagram has not been corrupted in transmission, the checksum is verified at the final destination as follows: the pseudo-header is reconstructed by inspecting the IP header to extract the source and destination IP addresses, the PROTOCOL field and the UDP length (calculated from the IP TOTAL LENGTH and HLEN header fields as UDP LENGTH = TOTAL LENGTH - 4*HLEN). This is prepended to the datagram and the result padded out with a zero octet if necessary so that its length is a whole number of 16-bit words (ignoring the 16-bit checksum field itself), and the checksum is calculated as above. The result is compared with the checksum received and if they match the datagram is accepted; otherwise it is discarded.

Note that UDP and IP violate the strict protocol layering scheme, as the transport layer UDP software needs to inspect the network layer IP header to verify the checksum at the destination. Also at the source machine UDP often adds a skeleton IP header to the UDP datagram, with the source and destination IP addresses and the protocol field already filled in, and then passes it to the IP software which fills in the remaining fields of the IP header. Furthermore the strict layering may be violated at the source since the source IP address for a multi-homed host depends on the routing (i.e. IP) software.

The diagram below shows how a UDP datagram is encapsulated inside a single IP datagram, which in turn is encapsulated in a frame each time it is transmitted across a single physical network. Of course if the maximum frame size is smaller than the IP datagram, the latter will be fragmented and then each separate fragment will be incorporated inside a network frame.

                           | UDP header | UDP Data Area   |
               | IP header | IP Data Area                 |
| Frame header | Frame Data Area                          |
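The UDP checksum calculation and its verification at the destination, as described above, can be sketched in Python. All names, addresses and port numbers below are illustrative assumptions; in particular the receiver here takes the UDP length from the length of the datagram it is handed, standing in for the value derived from the IP header.

```python
import socket
import struct

UDP_PROTO = 17  # protocol number placed in the PROTO field of the pseudo-header

def ones_complement_sum(data: bytes) -> int:
    """Add 16-bit words with end-around carry, padding with a zero octet."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """12 octets: source address, destination address, zeros, PROTO, UDP length."""
    return struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                       socket.inet_aton(dst_ip), 0, UDP_PROTO, udp_length)

def udp_checksum(src_ip: str, dst_ip: str, datagram: bytes) -> int:
    """Checksum over pseudo-header + datagram, checksum field treated as zero."""
    zeroed = datagram[:6] + b"\x00\x00" + datagram[8:]
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(datagram)) + zeroed)
    checksum = ~total & 0xFFFF
    return checksum or 0xFFFF        # a zero result is sent as all ones

def udp_checksum_ok(src_ip: str, dst_ip: str, datagram: bytes) -> bool:
    """Verify a received datagram; a zero field means no checksum was computed."""
    (received,) = struct.unpack("!H", datagram[6:8])
    if received == 0:
        return True
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(datagram)) + datagram)
    return total == 0xFFFF           # summing in the checksum itself gives all ones

# Hypothetical datagram: ports 1234 -> 53, 8-octet header plus 5 data octets.
SRC, DST = "192.168.0.1", "192.168.0.199"
unsent = struct.pack("!HHHH", 1234, 53, 13, 0) + b"hello"
sent = unsent[:6] + struct.pack("!H", udp_checksum(SRC, DST, unsent)) + unsent[8:]
corrupted = sent[:-1] + b"p"         # one data octet altered in transit
print(udp_checksum_ok(SRC, DST, sent), udp_checksum_ok(SRC, DST, corrupted))
```

Because the addresses enter the sum through the pseudo-header, a datagram delivered to the wrong IP address also fails the check, which is the benefit bought by the layering violation noted above.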


Transmission Control Protocol (TCP)

TCP is a transport layer protocol that provides a reliable, connection-oriented delivery service between a communication endpoint (IP address and port) on one machine and a communication endpoint on another. TCP is built on top of the unreliable connection-less service provided by IP in the network layer. It has a number of characteristic features:

stream orientation: we think of the data as a stream of bits divided into 8-bit octets. The stream delivery system at the receiver end passes the application the same sequence of octets as the sender application passed to TCP.

virtual circuit connection: before transfer can start, both sending and receiving applications interact with their local operating systems to establish a connection. One application places a call (active open) and the other application must accept the call (passive open). When communication is over, the connection is closed in an orderly manner. A connection can be thought of as two communication endpoints (port number and IP address pairs), and because a connection consists of two endpoints many connections can share a port number on one machine.

reliability: this is provided on top of the underlying unreliable service provided by IP by using a system of positive acknowledgement of the receipt of datagrams and retransmission of datagrams for which acknowledgement has not been received after a certain time-out interval.

buffered transfer: applications send data to the stream in arbitrary sized chunks; these are usually buffered until there is enough to fill a reasonably sized segment before being sent across the virtual connection, where they are buffered at the receiver's end and then retrieved by the receiving application in arbitrary sized chunks. There is however a push mechanism to force an immediate transfer of data in a smaller segment across the network without waiting for the buffer to fill.

unstructured stream: TCP breaks the stream into segments in a way that does not honour any structure of the data stream from the application level.

full duplex connection: connections provided by TCP/IP allow concurrent transfer of data in both directions with no apparent interaction. TCP/IP may use 'piggy-backing' of acknowledgements on data segments it is sending over the connection. It is possible for an application to terminate flow in one direction, in which case the flow becomes half-duplex.

TCP Segment Format

TCP splits the sequence of octets (bytes) received from the application layer into segments. These segments do not bear any relation to the message structure at the application level. As with UDP datagrams, a TCP segment consists of a segment header followed by a data section. TCP segments sent and received during the lifetime of a connection can (and do) vary in length from 20 octets (a minimal header and no data section) up to a maximum segment size. The maximum size of a segment is decided by the TCP software when the connection is established.

The size of TCP segments is restricted by the maximum size of an IP packet (2^16 - 1 = 65535 octets) less the size of the IP header (at least 20 octets) and so can never exceed 65515 octets. As the TCP segment contains a TCP header of at least 20 octets, the maximum amount of data is restricted to 65495 octets. However in practice the network interface layer will define a maximum transfer unit (MTU) for each network, and this defines the maximum IP packet size that can be transmitted onto this network. Normally the MTU is at most a few thousand octets (1500 octets for Ethernet, for example) and so is considerably smaller than the theoretical limit of 65535 octets for IP packets.


Thus in practice the maximum TCP segment length is equal to MTU - 20 octets. This value includes the TCP header (which is also at least 20 octets in length) and so the maximum length of the data section of the segment is equal to MTU - 40 octets.
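The size limits above are simple subtractions, shown here as a short Python sketch. The constant and function names are illustrative assumptions, both headers are taken to be of minimum length, and the 1500-octet MTU is used only as an example value.

```python
IP_MAX_DATAGRAM = 2**16 - 1   # 65535 octets, from the 16-bit TOTAL LENGTH field
MIN_IP_HEADER = 20            # octets, HLEN = 5
MIN_TCP_HEADER = 20           # octets, HLEN = 5

def max_segment_length(mtu: int = IP_MAX_DATAGRAM) -> int:
    """Largest TCP segment (header included) that fits in one IP datagram."""
    return mtu - MIN_IP_HEADER

def max_segment_data(mtu: int = IP_MAX_DATAGRAM) -> int:
    """Largest data section once the TCP header is also subtracted."""
    return max_segment_length(mtu) - MIN_TCP_HEADER

print(max_segment_length(), max_segment_data())          # theoretical limits
print(max_segment_length(1500), max_segment_data(1500))  # example MTU of 1500
```

With options present in either header the available space shrinks further, by four octets for each extra 32-bit word.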

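 0       4           10          16              24           31
+---------------------------------------------------------------+
|          SOURCE PORT          |       DESTINATION PORT        |
+---------------------------------------------------------------+
|                     DATA SEQUENCE NUMBER                      |
+---------------------------------------------------------------+
|                ACKNOWLEDGEMENT SEQUENCE NUMBER                |
+---------------------------------------------------------------+
| HLEN  | RESERVED  | CODE BITS |          WINDOW SIZE          |
+---------------------------------------------------------------+
|           CHECKSUM            |        URGENT POINTER         |
+---------------------------------------------------------------+
|               OPTIONS (if any)                |    PADDING    |
+---------------------------------------------------------------+
|                          DATA ......                          |
+---------------------------------------------------------------+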

In the following discussion we will assume a connection has been established between two endpoints A and B (each an IP address and port number) and will consider a segment being sent from A to B. The first two fields are 16-bit and hold the source and destination port numbers (i.e. the port numbers of A and B respectively). Here and below, for a segment being sent from B to A the roles of A and B are reversed. As with UDP, the source and destination IP addresses are not part of the TCP header, but are retrieved from the IP header.

The source and destination port numbers are followed by two 32-bit fields: the sequence number gives the position in the data stream of the first (lowest-numbered) data octet contained in the current segment. It allows segments to be reassembled in the correct order by B even if IP delivers them out of order. The acknowledgement number identifies the lowest octet number in the data stream flowing from B to A that A has NOT yet received. Thus the sequence number refers to the data stream flowing in the same direction as the segment, whilst the acknowledgement number refers to the data stream flowing in the opposite direction.

The 4-bit header length (HLEN) field specifies the length of the header in 32-bit words and its minimum value is 5 (that is 20 octets), but it may be higher if there are TCP options in the header (cf. the HLEN field in the IP header). The TCP header does NOT contain the total length of the segment; however the receiver can determine the length of the segment by inspecting the total length field in the IP header and subtracting the length of the IP header. Here again we have an example of the violation of the strict protocol layering scheme, as the transport layer TCP software at the destination needs to inspect the network layer IP header to determine the length of a TCP segment.

The next two fields are both 6 bits long: the first is reserved for future use and is not used in current versions of TCP, whilst the second, the CODE BITS field, is used to determine the purpose(s) of the segment.


The six bits have the following meaning if set:

  bit 10   URG   the Urgent Pointer field is valid
  bit 11   ACK   the Acknowledgement number field is valid
  bit 12   PSH   requests immediate push to the receiving application
  bit 13   RST   reset the connection
  bit 14   SYN   synchronise sequence numbers at start of connection
  bit 15   FIN   sender has reached the end of its stream
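As an illustration of the header layout and the code bits, the following Python sketch decodes the fixed 20-octet part of a TCP header. The function name and the sample field values are assumptions made for the example; options and the data area are not examined.

```python
import struct

# Code bits in header order: bit 10 (URG) down to bit 15 (FIN).
FLAG_NAMES = ("URG", "ACK", "PSH", "RST", "SYN", "FIN")

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed part of a TCP header (illustrative helper)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 octets long")
    (src_port, dst_port, seq, ack,
     hlen_code, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    hlen = hlen_code >> 12            # top 4 bits: header length in 32-bit words
    code_bits = hlen_code & 0x3F      # bottom 6 bits; the 6 reserved bits lie between
    # Bit 10 of the 16-bit word has value 0x20, bit 15 has value 0x01.
    flags = [name for i, name in enumerate(FLAG_NAMES) if code_bits & (0x20 >> i)]
    return {
        "source_port": src_port,
        "destination_port": dst_port,
        "sequence_number": seq,
        "acknowledgement_number": ack,
        "header_octets": 4 * hlen,
        "flags": flags,
        "window_size": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }

# Hypothetical segment header: ports 1025 -> 80, HLEN 5, ACK and PSH set.
sample = struct.pack("!HHIIHHHH", 1025, 80, 1000, 2000,
                     (5 << 12) | 0x18, 4096, 0, 0)
fields = parse_tcp_header(sample)
print(fields["header_octets"], fields["flags"])
```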

If the ACK flag is set, it indicates that the current segment contains an acknowledgment (it may also contain data, if acknowledgments are being piggy-backed). If this bit is clear, there is no acknowledgment in the current segment and the value of the acknowledgment field is ignored.

If the RST flag is set, it indicates that the connection is to be terminated without the normal orderly shut-down process. It is also used during the set-up of a connection if either party wishes to refuse or back out of a connection request.

When sending data to TCP an application may issue a push request to indicate to the TCP software that a segment containing this data should be transmitted (pushed) to the network as soon as possible without waiting for the transmit buffer to fill. In these circumstances the PSH flag is set in the segment header to indicate that, at the receiving end, the data in the segment should be sent to the receiving application without waiting for the receive buffer to fill. Typically this is used when short messages entered interactively at the keyboard need to be exchanged without undue delay by two communicating applications.

The Urgent Pointer field allows a segment to contain urgent (or out of band) data. The receiving application is notified of the urgent data immediately on receipt by the TCP software, regardless of its position in the incoming data stream. Typically it is used for control signals (such as aborting a program at the receiving end). When the URG bit in the code bits field is set, the urgent pointer field in the header is valid and specifies the position in the segment where the urgent data ends (it always starts at the start of the data area of the segment). After processing all the urgent data, the remaining part of the segment is processed in the normal way.

The SYN flag is used in the procedure that sets up a connection and will be discussed in the next lecture. The FIN flag is used in the procedure that closes down a connection in an orderly manner and will also be discussed in the next lecture.

The 16-bit window size field specifies the current sliding window size. For a segment sent from A to B, it specifies the maximum number of data octets that B is allowed to send without receiving an acknowledgement from A. As acknowledgements are received the window 'slides forward' over the data stream and more octets can then be transmitted. The window size is used for flow control across the connection so that (for example) a slow receiver is not swamped by a fast sender.

The checksum field is used for a checksum computed over the TCP header and data area in exactly the same way as the checksum for UDP (it uses a pseudo-header as for UDP but with a different code, namely 6, in the PROTO field). Thus TCP/IP violates the strict layering of protocols in the same way as UDP/IP.

The option fields, if present, specify various options, the most important of which is the maximum segment size option. Not all TCP segments need be the same size. This option is used to negotiate a maximum segment size to use for this connection. During the set-up of the connection each partner uses this option field to specify the size in octets of the largest segment it is prepared to accept. The smaller of these two values is then adopted as the maximum segment size to use for this connection. If a host does not specify a maximum segment size, a default of 536 data octets is assumed (a 576-octet IP datagram less 20 octets each for minimal IP and TCP headers). This allows machines with very different buffer sizes to communicate over an intervening network connection without causing receive buffer overflow at either end.

The padding field consists of zero octets used to pad out the header so that its length is a whole number of 32-bit words (cf. the padding field in IP headers). The header is followed by the data area itself.

Sliding Windows

TCP uses a positive acknowledgement system with retransmission of unacknowledged segments. Segments are resent if not acknowledged within a certain time-out interval. However, rather than waiting for each segment to be acknowledged before sending the next one (which would mean that the network connection was idle for considerable periods), TCP uses a sliding window scheme to improve throughput rates. Only those octets lying inside the window can be sent (in a suitable segment) without waiting for acknowledgement. As octets are acknowledged, the window slides forward over the octet stream. Those to the right of the window cannot be sent until the window slides over them, and those to the left of the window have already been sent and acknowledged.

Note that the acknowledgement number (and the data sequence number) refer to octets and NOT to segments. Normally when a segment arrives at the receiver all the octets it contains will be acknowledged (by using a number one larger than the last octet in the segment). However, occasionally a segment that arrives may not all fit into the receive buffer, and so the excess octets are discarded and only those stored in the buffer are acknowledged. This will cause the sender to eventually resend the discarded octets.
In the diagram below we have a window of size 6, and data octets 1-3 have already been sent and acknowledged (i.e. an acknowledgement with sequence number 4 has been received). Octets 4-7 in the window have been sent (but not acknowledged) and octets 8-9 could be sent immediately. However octets 10, 11 etc. cannot be sent until further acknowledgements arrive.

  octet sequence nos:   1  2  3 [ 4  5  6  7 | 8  9 ] 10  11  12  ...

  [ = start of window,  | = sent pointer,  ] = end of window (6 octet window)
  1-3:         octets sent and acknowledged
  4-7:         octets sent but not acknowledged
  8-9:         octets that can be sent immediately
  10 onwards:  octets that can't be sent yet

If no acknowledgment with octet number 5 (or above) arrives within a certain time-out interval, a segment starting at octet 4 will be resent.

In practice the window size is usually much larger than this (up to 65535 octets) and typically can contain enough octets to fill several segments.

Suppose an acknowledgement arrives with octet number 6, indicating that octets 4 and 5 have arrived at the destination; then the window slides forward two places as shown in the diagram below. Now octets 8-11 can be sent immediately.

  octet sequence nos:   1  2  3  4  5 [ 6  7 | 8  9  10  11 ] 12  ...

  [ = start of window,  | = sent pointer,  ] = end of window (6 octet window)
  1-5:         octets sent and acknowledged
  6-7:         octets sent but not acknowledged
  8-11:        octets that can be sent immediately
  12 onwards:  octets that can't be sent yet
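The two diagrams can be reproduced with a small model of the sender's side of the window. This is a minimal Python sketch rather than real TCP: the class and method names are hypothetical, the window size is fixed, and time-outs and retransmission are left out.

```python
class SlidingWindowSender:
    """Tracks which octets may be sent without waiting for acknowledgement."""

    def __init__(self, window_size: int, first_octet: int = 1):
        self.window_size = window_size
        self.window_start = first_octet   # lowest unacknowledged octet
        self.next_to_send = first_octet   # the 'sent pointer'

    def sendable(self) -> range:
        """Octet numbers inside the window that have not yet been sent."""
        window_end = self.window_start + self.window_size   # exclusive bound
        return range(self.next_to_send, window_end)

    def send(self, count: int) -> int:
        """Send up to `count` octets; return how many the window allowed."""
        count = min(count, len(self.sendable()))
        self.next_to_send += count
        return count

    def acknowledge(self, ack_number: int) -> None:
        """ack_number is the lowest octet the receiver has NOT yet received."""
        if self.window_start < ack_number <= self.next_to_send:
            self.window_start = ack_number   # the window slides forward

# Reproduce the example: window of 6, octets 1-3 acknowledged, 4-7 sent.
sender = SlidingWindowSender(window_size=6)
sender.send(3)
sender.acknowledge(4)
sender.send(4)
before = list(sender.sendable())   # octets that can be sent immediately
sender.acknowledge(6)              # octets 4 and 5 have arrived
after = list(sender.sendable())
print(before, after)
```

Acknowledgements that are stale, or that refer to octets not yet sent, are ignored by acknowledge in this model.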

Flow Control

Each acknowledgment sent contains a window advertisement, which indicates how many additional octets the receiver is currently prepared to accept. Thus, if the receive buffer is becoming full because the receiving application is consuming data at a slower rate than it is arriving via the connection, the receiver will send an acknowledgement with a small window size. When the receive buffer is becoming empty, an acknowledgement with a larger window advertisement can be sent to indicate that the receiver can receive data at a greater rate.

An acknowledgment with a window advertisement of size 0 is a request for the other end of the connection to temporarily stop sending data (except for urgent OOB data). This allows a receiver to indicate that its receive buffer is temporarily full. Then, when data is transferred from the buffer to the receiving application, an acknowledgement (with the same acknowledgement number) can be sent with a non-zero window advertisement so that the sender can resume transmission.

On receiving a window advertisement with a value smaller than the number of unsent octets in the sender's sliding window, the sender should decrease the size of its sliding window. This reduces the chance of the receive buffer overflowing. However, if the value of the window advertisement is larger, the sender may increase the size of its sliding window. This allows a degree of flow control, since the larger the size of the sliding window, the more octets can be sent without waiting for acknowledgement and so (other things being equal) the faster the transmission rate.

Window advertisements can also be sent with segments containing no acknowledgement, requesting (in effect) that the other end of the connection changes its window size and hence its average transmission rate.

The initial sizes of the sliding windows to be used at each end of the connection are advertised by each partner as part of the procedure which sets up the connection. This set-up procedure will be discussed in the next lecture.
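How a receiver might derive its window advertisement from the free space in its receive buffer can be sketched as follows. This Python model is an assumption made for illustration (the class name, buffer size and method names are hypothetical): octets are assumed to arrive in order, and the advertisement is simply the unused buffer space.

```python
class Receiver:
    """Receive buffer that advertises its free space with each acknowledgement."""

    def __init__(self, buffer_size: int, first_octet: int = 1):
        self.buffer_size = buffer_size
        self.buffered = 0                  # octets waiting for the application
        self.next_expected = first_octet   # lowest octet not yet received

    def acknowledgement(self):
        """Return (acknowledgement number, window advertisement)."""
        return self.next_expected, self.buffer_size - self.buffered

    def receive(self, count: int):
        """Store as many of `count` in-order octets as fit; discard the excess."""
        accepted = min(count, self.buffer_size - self.buffered)
        self.buffered += accepted
        self.next_expected += accepted
        return self.acknowledgement()

    def application_reads(self, count: int):
        """The application drains the buffer, which reopens the window."""
        self.buffered -= min(count, self.buffered)
        return self.acknowledgement()

receiver = Receiver(buffer_size=8)
first = receiver.receive(6)              # 6 octets stored, little room left
second = receiver.receive(4)             # only 2 fit; window closes to 0
third = receiver.application_reads(5)    # same ack number, non-zero window
print(first, second, third)
```

The third acknowledgement carries the same acknowledgement number as the second but a non-zero advertisement, which is the signal for the sender to resume.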
