
Packets & Protocols

Editor: Chris Metz • [email protected]

Not All Packets Are Equal, Part I: Streaming Video Coding and SLA Requirements

Jason Greengrass, John Evans, and Ali C. Begen • Cisco

In this first part of a two-part article, the authors consider the network factors that impact the viewers' quality of experience (QoE) for IP-based video-streaming services such as IPTV. They describe the IP service-level requirements for a transported video service and explain MPEG encoding to help readers better understand the impact that packet loss has on viewers' QoE.

The Internet Protocol (IP) is becoming the dominant network technology for video transport. In turn, video is becoming an increasingly significant component of IP network traffic. With increased deployment in services such as IPTV and video on demand (VoD), services that users previously received via traditional broadcast and cable formats are now delivered by IP or Multiprotocol Label Switching (MPLS) networks that also deliver conventional Internet services. Some industry forecasts predict that by 2012, Internet traffic will be 75 times larger than in 2002, and all forms of video traffic combined will account for close to 90 percent of all consumer Internet traffic (http://newsroom.cisco.com/visualnetworkingindex/).

IP-based video service providers have a range of technology options for supporting the necessary service-level requirements — commonly specified in service-level agreements (SLAs) — for delivering quality video service to viewers. They must first understand these requirements in detail in order to be able to determine the relative benefits of different network technology approaches and to choose between them. For consumer video services, real-time streaming applications such as IPTV have the most stringent SLA requirements. With video-streaming applications, a receiver client requests a video stored on a server or produced in real time; the source server streams the video to the receiver, which starts to play out the video before all video stream data have been received.


Video streaming is used both to broadcast video channels, which are commonly delivered as IP multicast, and for VoD, which is delivered as IP unicast. (With unicast, traffic goes to a single receiver — that is, point-to-point — whereas with multicast, traffic goes to a group of interested receivers, or point-to-multipoint.) IP-based streaming video is most commonly transported as a data stream encoded using Moving Picture Experts Group (MPEG) standards and transported via the Real-time Transport Protocol (RTP) over the User Datagram Protocol (UDP) over IP. MPEG standards define the encoding used for the actual video stream, whereas RTP payload formats for MPEG [1, 2] define how real-time audio and video data are formatted for RTP transport.

In this first installment of a two-part series, we'll look at SLA requirements for streaming video services in general and explain the principles of MPEG encoding to help you better understand how IP packet loss affects viewers' quality of experience (QoE) specifically.

Video SLA Requirements

We can define the key SLA requirements for an IP-based video transport service in terms of delay, jitter, and loss. Let’s examine network SLA requirements for a real-time streaming IPTV service.

Network Delay

One-way network delay characterizes the time difference between the receipt of an IP packet at a defined network ingress point and its transmission at a defined network egress point.


The IETF defines a metric for measuring one-way delay [3]. The delays a network induces comprise four components: propagation delay along the network path, switching delays and queuing delays at network elements on the path, and serialization delay — that is, the time it takes to transmit the bits of the packet sequentially onto a link. In addition, application performance might be subject to network control protocol processing delays (such as multicast processing) and delays due to processing in application end-systems.

Network delay can affect end-user interactivity by adding to the "finger-to-eye" delay, which determines the time it takes for the user to change from one TV channel to another (known as the "channel-change time"). Typically, service providers aim to achieve channel-change times that are less than two seconds, although they can be significantly larger. The round-trip network transmission delay from a set-top box (STB) to a video-streaming server is just one component of the total finger-to-eye delay, which can also include remote control and STB processing (typically ~50 milliseconds), IP multicast processing (~50 ms), receiver de-jitter buffer (~50 to 100 ms), decryption delay (~50 ms), random access-point acquisition delay (~0 to 3 seconds), and MPEG decoder buffer (~500 ms to 2 seconds).

Service provider networks are generally built hierarchically, with a core network providing interconnectivity to regional aggregation networks, which aggregate the local access connections to subscribers. The impact of network delays on channel-change time is generally constrained to the delay between the receiver and the first IP multicast-enabled router or aggregation device — that is, it's confined to the access and aggregation network delays rather than being impacted by the core network.

Nonetheless, we must bound end-to-end network delay to bound the resulting network jitter, which affects the receiver de-jitter buffer sizing and hence impacts channel-change time. So, for video-streaming applications, service providers typically target one-way network delays of less than 100 ms (in some cases much less) to achieve overall channel-change times of 1 to 2 seconds. The Differentiated Services (Diffserv) IP quality of service (QoS) architecture is used to control queuing delays and ensure that service providers can meet their network delay SLAs.
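To see how these components add up, here's a minimal sketch of a channel-change budget using the typical figures quoted above; the exact values chosen (mid-range where the text gives a range, plus an assumed 100-ms one-way network delay target) are illustrative assumptions, not measurements.

```python
# Rough finger-to-eye (channel-change) delay budget, in milliseconds.
# Component values are the "typical" figures quoted in the text; the
# one-way network delay is an assumed target, not a measured value.
components_ms = {
    "remote control + STB processing": 50,
    "IP multicast processing": 50,
    "receiver de-jitter buffer": 100,         # 50-100 ms range, worst case
    "decryption": 50,
    "random access-point acquisition": 1500,  # 0-3 s range, midpoint
    "MPEG decoder buffer": 1000,              # 500 ms - 2 s range, midpoint
    "one-way network delay (target)": 100,    # < 100 ms SLA target
}

total_ms = sum(components_ms.values())
for name, value in components_ms.items():
    print(f"{name:35s} {value:6d} ms")
print(f"{'estimated channel-change time':35s} {total_ms:6d} ms")
```

With these mid-range assumptions the total lands near three seconds, which illustrates why providers work to keep every component, network delay included, as small as practical.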

Network Jitter

Network jitter is the variation in network delay caused by factors such as fluctuations in queuing and scheduling delays at network elements. We can generally consider jitter to be the variation of the one-way delay for two consecutive packets [4]. Receivers use de-jitter buffers to remove the delay variation the network causes. If a video de-jitter buffer is appropriately sized to accommodate the maximum value of network jitter possible, jitter won't delay play-out beyond the worst-case end-to-end network delay. Conversely, if the de-jitter buffer is too small to accommodate the maximum network jitter, then buffer underflows can occur — that is, the buffer will be empty when the decoder needs to process a frame, resulting in a lost packet and potential video impairment. A de-jitter buffer sized too large adds unnecessarily to the end-to-end delay, which might increase channel-change time or decrease VoD responsiveness; thus, in general, receiver de-jitter buffers sized greater than 100 ms over the maximum network jitter are excessive. Diffserv IP QoS mechanisms are used to control network delays, and hence bound maximum network jitter.
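The sizing rule above reduces to a simple check: the de-jitter buffer must cover the maximum expected jitter, but anything much more than 100 ms beyond that only adds delay. A small sketch, with an assumed worst-case jitter value for illustration:

```python
def classify_dejitter_buffer(buffer_ms: float, max_jitter_ms: float,
                             slack_ms: float = 100.0) -> str:
    """Apply the sizing rule from the text to a proposed de-jitter buffer."""
    if buffer_ms < max_jitter_ms:
        return "too small: risk of buffer underflow and lost frames"
    if buffer_ms > max_jitter_ms + slack_ms:
        return "excessive: adds unnecessary end-to-end delay"
    return "reasonable: covers jitter without excessive added delay"

# Example: 40 ms of measured worst-case jitter (assumed value).
for proposed in (25, 60, 200):
    print(proposed, "ms ->", classify_dejitter_buffer(proposed, max_jitter_ms=40))
```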

Packet Loss

Packet loss characterizes the packet drops that occur between a defined network ingress point and a defined network egress point. We consider a packet lost if it doesn't arrive at the specified egress point within a defined time period. The IETF defines a metric for measuring the one-way packet loss rate (PLR) [5]. Assuming the receiver de-jitter buffer is appropriately sized, network packet loss has three primary causes:

• Congestion. When congestion occurs, queues build up and the network drops packets. To control loss due to congestion, we can apply Diffserv IP QoS mechanisms and employ capacity planning processes.

• Lower-layer errors. Bit errors, which might occur due to noise or attenuation in the transmission medium, can result in dropped packets. In practice, actual bit-error rates vary depending on the underlying layer-1 or layer-2 technologies used, which are different for different parts of the network. For instance, fiber-based optical links might support bit-error rates as low as 1 errored bit in 10^13 transmitted bits (1e-13), whereas asymmetric digital subscriber line (ADSL) services might have bit-error rates as high as 1e-3. Some link-layer technologies employ reliability mechanisms, such as forward error correction (FEC), to recover from commonly occurring bit-error cases and thus reduce the effective PLR (the sketch after this list turns these bit-error rates into rough packet loss rates).

• Network element failures. Most networks are built resiliently; nonetheless, network element failures, such as link or router failures, can result in losses of network connectivity, which cause packets to be dropped until the network connectivity is restored around the failed network element. The resulting packet loss period depends on the capabilities of the underlying network technologies and implementations that have been employed in the network.
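As a rough way to relate the bit-error rates above to packet loss, the sketch below assumes independent bit errors, no link-layer FEC, and a roughly 1,400-byte packet (in line with the MPEG/RTP/UDP/IP encapsulation described later); under those assumptions, a packet survives only if every one of its bits arrives intact.

```python
# Rough conversion from bit-error rate (BER) to packet loss rate (PLR),
# assuming independent bit errors and no FEC: a packet is dropped if any
# of its bits is errored.
def packet_loss_rate(ber: float, packet_bytes: int = 1400) -> float:
    bits = packet_bytes * 8
    return 1.0 - (1.0 - ber) ** bits

for ber in (1e-13, 1e-6, 1e-3):
    print(f"BER {ber:>7.0e} -> PLR ~ {packet_loss_rate(ber):.3e}")
```

At a BER of 1e-13 the resulting PLR is on the order of 1e-9, whereas at 1e-3 nearly every packet would contain an errored bit, which is why such links rely on link-layer mechanisms such as FEC to bring the effective PLR down.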

Figure 1. Blocks, macroblocks, and slices. A block is an 8 × 8 matrix of pixels representing a small chunk of brightness or color. A macroblock contains several blocks, and a series of consecutive macroblocks forms a slice.


Figure 2. MPEG pictures. (a) An I-frame carries a complete video picture. (b) A group of pictures (GoP) includes an I-frame and all the pictures leading up to the next I-frame.

Most service providers use the Diffserv architecture to help engineer their IP networks to support the required delay, jitter, and loss rates [6]. We can't use Diffserv, however, to control packet loss caused by network element failures or lower-layer errors. Therefore, even where Diffserv is deployed, packet loss can occur, which might result in visual impairments to an impacted video service. To determine the impact that such packet losses have on a transported MPEG video stream, we must first understand the principles of MPEG encoding.

MPEG Video Encoding

A digital snapshot or frame of a black-and-white 640 × 480 pixel standard-definition (SD) television image that uses 8 bits to represent each pixel's grayscale consumes 2.45 Mbits of memory. If we updated this image with a frame rate of 30 frames per second (fps), the resulting video bandwidth requirement would be roughly 70 Mbps. This requirement increases significantly when we add more bits to encode the image and add color and audio channels. Typical uncompressed video bit rates are 270 Mbps for SD and 1.485 Gbps for high-definition (HD) sources. Access network bandwidth constraints make streaming such high-bandwidth streams into homes impractical, so we use compression to reduce the video stream's bit rate.
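The arithmetic behind these figures is simply width × height × bits per pixel × frame rate; a small sketch reproducing the grayscale example (the 24-bit color case is an added illustration, not a figure from the text):

```python
# Uncompressed video bit rate = width x height x bits_per_pixel x frames_per_second.
def raw_bitrate_mbps(width: int, height: int, bits_per_pixel: int, fps: float) -> float:
    return width * height * bits_per_pixel * fps / 1e6

# The article's example: 8-bit grayscale SD at 30 fps ("roughly 70 Mbps").
print(f"640x480, 8-bit gray, 30 fps  : {raw_bitrate_mbps(640, 480, 8, 30):6.1f} Mbps")

# Hypothetical 24-bit color at the same resolution (assumed format, for scale only).
print(f"640x480, 24-bit color, 30 fps: {raw_bitrate_mbps(640, 480, 24, 30):6.1f} Mbps")

# Memory for one grayscale frame, in Mbits (about 2.46; the text rounds to 2.45).
print(f"bits per frame: {640 * 480 * 8 / 1e6:.2f} Mbit")
```

The quoted 270-Mbps SD and 1.485-Gbps HD rates are higher still because they correspond to standard studio serial-interface formats, where 10-bit sampling and blanking intervals push the rate above the simple products computed here.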

MPEG has produced several standards for video compression that providers can use for IP-based services, including MPEG-2 [7], which is the most widely used encoding scheme for television applications today. Newer encoding schemes such as MPEG-4 [8] and VC-1 [9] (from the Society of Motion Picture and Television Engineers [SMPTE]) are becoming more widespread; they offer potential bit-rate reductions of roughly a factor of two relative to MPEG-2 with comparable quality. SD television that uses MPEG-2 has a video bit rate reduced to approximately 3.75 Mbps, whereas HD television has a bit rate of approximately 18 Mbps. Comparatively, typical MPEG-4 Part 10 encoding uses roughly 3 Mbps for SD and 9 Mbps for HD.

An MPEG encoder converts and compresses a video signal into a series of pictures or frames. Generally, only limited change occurs between one frame and the next, so an encoder can compress the video signal significantly by transmitting only the differences. MPEG uses three fundamental techniques to achieve compression:

• Subsampling reduces color information, to which the eye is less sensitive.

• Spatial compression or intra coding removes redundant information within frames using the property that pixels within a single frame are related to their neighbors.

• Temporal compression or interframe coding removes redundant information between frames.

The following sections describe the MPEG encoding structure and components.

Blocks, Macroblocks, and Slices

Each MPEG frame can contain block, macroblock, and slice information:

• A block is an 8 × 8 matrix of pixels, or corresponding discrete cosine transform information, that represents a small chunk of brightness (luma) or color (chroma) within the frame.

• A macroblock contains several blocks that define a section of the frame's brightness component and spatially corresponding color components.

• A slice is a series of consecutive macroblocks; every slice contains at least one macroblock, although the number of macroblocks within a slice can vary, and their position might change from picture to picture.

Figure 1 illustrates the relationship between blocks, macroblocks, and slices.
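As a quick sanity check on Figure 1, the following sketch counts macroblocks and macroblock rows for a 720 × 480 SD frame and a 1920 × 1080 HD frame like those drawn in the figure. It assumes the common 16 × 16 macroblock with the six-block (four luma plus Cr and Cb) layout shown in Figure 1, and one slice per macroblock row; both are assumptions about a typical layout rather than requirements of the standard.

```python
import math

def frame_geometry(width: int, height: int, mb_size: int = 16):
    """Count 16x16 macroblocks and macroblock rows for a given frame size."""
    mbs_x = math.ceil(width / mb_size)
    mbs_y = math.ceil(height / mb_size)   # macroblock rows (the slice rows in Figure 1)
    macroblocks = mbs_x * mbs_y
    blocks = macroblocks * 6              # 4 luma + Cr + Cb blocks, as drawn in Figure 1
    return mbs_x, mbs_y, macroblocks, blocks

for name, (w, h) in {"SD 720x480": (720, 480), "HD 1920x1080": (1920, 1080)}.items():
    mbs_x, mbs_y, mbs, blocks = frame_geometry(w, h)
    print(f"{name}: {mbs_x} x {mbs_y} macroblocks = {mbs} ({blocks} blocks)")
```

Under these assumptions an SD frame has 30 macroblock rows and an HD frame has 68, which is why a single slice covers a far smaller fraction of an HD picture than of an SD picture.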

Frames

MPEG-2 has three frame types. Intra, or I-frames, carry a complete video picture, like the one in Figure 2a. They're coded without reference to other frames and might use spatial compression but don't use temporal compression. Spatial compression uses the property that pixels within a single frame are related to their neighbors; by removing spatial redundancy, the size of the encoded frame can be reduced and prediction can be used at the decoder to reconstruct the frame. A received I-frame provides the reference point for decoding a received MPEG stream.

Predictive-coded, or P-frames, predict the frame to be coded from a preceding I-frame or P-frame using temporal compression. P-frames can provide increased compression compared to I-frames, with a P-frame typically 20 to 70 percent the size of an associated I-frame. Finally, bi-directionally predictive-coded, or B-frames, use the previous and next I-frame or P-frame as their reference points for motion compensation. B-frames provide further compression, typically 5 to 40 percent the size of an associated I-frame.

Group of Pictures

In MPEG encoding, frames are arranged into groups of pictures (GoPs) like the one in Figure 2b. A GoP includes the I-frame and all subsequent frames leading up to the next I-frame. GoPs typically have 12 or 15 frames, which support National Television System Committee (NTSC) and Phase Alternating Line (PAL) standards at 30 fps (interlaced) and 25 fps, respectively. Many possible GoP structures exist, and the makeup of I-, P-, and B-frames within a GoP is determined by the source video signal's format, any bandwidth constraints on the encoded video stream (which determine the required compression ratio), and possible constraints on the encoding or decoding delay. A typical 15-frame GoP structure has one I-frame, four P-frames, and 10 B-frames. We can describe a regular GoP structure by its total number of frames (that is, the GoP size) as well as by how many B-frames occur between its P-frames. Figure 3 shows a 15:2 GoP structure — that is, a GoP size of 15 frames with two B-frames between reference frames.

Figure 3. A 15:2 group of pictures structure. The structure's total size is 15 frames, with two B-frames between each P-frame.

We can look at MPEG frame sizes from two video clips to see how the differing frames' sizes will change with the video's nature. We used two video clips from SMPTE for all the testing and analysis we describe: a low-motion clip ("Susie") and a high-motion clip ("Football"). We encoded both clips with a constant video bit rate of 4 Mbps using MPEG-2 Main Profile/Main Level (http://en.wikipedia.org/wiki/MPEG-2#Video_profiles_and_levels) at a 704 × 480 resolution, with a 15:2 GoP at 29.97 fps. Figure 4a illustrates the variation in frame sizes for low- and high-motion content, respectively. A high-motion clip will typically require larger P-frames and B-frames because the differences between the frames are significant and the temporal redundancy will be low, such that the encoder reduces the I-frame size to fit the imposed encoding rate accordingly.

In comparison, the low-motion clip has a much greater difference between its I-frame size and P- and B-frame sizes. Our analysis for HD clips displayed the same characteristics. Figures 4b and 4c show the total bytes per frame type for a single GoP for the low- and high-motion clips, respectively. As you can see, in the low-motion clip, the I-frame constitutes almost half the total number of bytes in the GoP, whereas in the high-motion clip, the B-frames constitute 59 percent of the total, and the I-frame accounts for only 11 percent. For a given packet-loss duration, a greater probability exists that the loss will affect an I-frame in the low-motion clip than in the high-motion clip due to the I-frame's larger size in the low-motion clip. A lower probability exists that a given outage will affect an I-frame in the high-motion clip, but its high level of motion and greater difference between frames means impairments to P- and B-frames could still be noticeable.

We also noted that in addition to inserting a new I-frame at the start of a new GoP, some encoders will generate a new I-frame at a scene change. Hence, although the encoder configuration might constrain the target GoP size, the actual GoP size varies depending on the encoder implementation.
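To connect the 15:2 structure to the byte shares in Figure 4, the sketch below expands a GoP description into its display-order frame pattern and then estimates how the GoP's bytes divide across frame types for assumed relative frame sizes. The two ratio sets were chosen so the results land near the low- and high-motion shares quoted above, so treat them as illustrative rather than measured.

```python
def gop_pattern(size: int = 15, b_between: int = 2) -> list[str]:
    """Display-order frame types for a regular GoP, e.g. 15:2 -> I BB P BB P ..."""
    pattern = ["I"]
    while len(pattern) < size:
        pattern.extend(["B"] * min(b_between, size - len(pattern)))
        if len(pattern) < size:
            pattern.append("P")
    return pattern

def byte_shares(pattern: list[str], rel_size: dict[str, float]) -> dict[str, float]:
    """Fraction of GoP bytes per frame type, given relative frame sizes (I = 1.0)."""
    weights = {t: pattern.count(t) * rel_size[t] for t in "IPB"}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

pattern = gop_pattern(15, 2)
print("".join(pattern))  # IBBPBBPBBPBBPBB: one I, four P, ten B frames

# Assumed relative sizes (I-frame = 1.0): low motion has a dominant I-frame,
# high motion has relatively larger P- and B-frames.
for label, rel in (("low motion",  {"I": 1.0, "P": 0.17, "B": 0.05}),
                   ("high motion", {"I": 1.0, "P": 0.68, "B": 0.53})):
    shares = byte_shares(pattern, rel)
    print(label, {t: f"{s:.0%}" for t, s in shares.items()})
```

With these assumed ratios the I-frame's share comes out near 46 percent for the low-motion case and near 11 percent for the high-motion case, in line with the Figure 4 percentages discussed above.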

Frame Encoding and Transmission Sequences

The frame order within a GoP depends on the interrelationships between frames — that is, which frame is used as the datum or reference for the information another frame carries.

Figure 4. Bytes per frame type. (a) We examined both a low-motion clip, "Susie," and a high-motion clip, "Football," to determine the average size in bytes for each frame type. We then looked at a single group of pictures (GoP) for (b) Susie and (c) Football to determine the total number of bytes for each frame type. (Source: Society of Motion Picture and Television Engineers; used with permission.)

Figure 5. Frame reference relationships within a group of pictures (GoP) in MPEG-2. The I-frame directly or indirectly will be the source of all the temporal encoding within a GoP.

The I-frame provides direct reference for the B-frames immediately following it within its GoP. It also provides reference for the first P-frame in the GoP. Ultimately, the I-frame directly or indirectly will be the source of all the temporal encoding within a GoP. This is a key consideration when looking at the video quality impairments IP packet loss causes.

Because B-frames use bidirectional prediction, those at the end of one GoP might use the I-frame in the subsequent GoP for reference. We call this an open GoP structure, and it provides additional coding efficiencies. The first P-frame in a GoP takes reference from the I-frame; subsequent P-frames take reference from the preceding P-frame. Finally, a B-frame takes reference from its immediate reference frame in both directions; this could be a P- or an I-frame. In MPEG-2 encoded video, the B-frame won't provide reference to any other frame. MPEG-4 Part 10, however, can use B-frames as reference frames in hierarchical GoPs. This is one technique MPEG-4 Part 10 employs to achieve greater compression than MPEG-2. Figure 5 depicts which frames provide reference to which other frames within a GoP; the arrow's tail indicates the reference source frame.

Due to dependencies between frames, their display order isn't the same as their transmission order. Figure 6a shows the encoder input order and decoder display order, which are the same. In Figure 6a, frame I1 provides reference for frame P4. In turn, both frames I1 and P4 provide reference for frames B2 and B3. To decode the B-frames, the decoder must have already decoded the associated reference frames. So, the decoder will need to receive and decode frames I1 and P4 before it can decode frames B2 and B3. Similarly, frames B5 and B6 depend on P4 and P7. Figure 6b shows the corresponding transmission order (that is, the encoder output or decoder input order). Where the encoder uses an open GoP structure, frames B14 and B15 use the I-frame in the next GoP for reference.
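The reordering between Figures 6a and 6b follows a single rule: a reference frame (I or P) must be sent before the B-frames that precede it in display order, because those B-frames need it for decoding. Here is a minimal sketch of that rule for one GoP; the open-GoP tail, where B14 and B15 wait for the next GoP's I-frame, is only noted in a comment.

```python
def transmission_order(display_order: list[str]) -> list[str]:
    """Reorder display-order frames so every B-frame follows both its references.
    B-frames are buffered until the next reference (I or P) frame is emitted."""
    out: list[str] = []
    pending_b: list[str] = []
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)      # wait for the forward reference frame
        else:                            # I or P: emit it, then the waiting B-frames
            out.append(frame)
            out.extend(pending_b)
            pending_b.clear()
    out.extend(pending_b)                # open-GoP tail: would follow the next GoP's I-frame
    return out

display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7", "B8", "B9",
           "P10", "B11", "B12", "P13", "B14", "B15"]
print(transmission_order(display))
# ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6', 'P10', 'B8', 'B9', 'P13', 'B11', 'B12', 'B14', 'B15']
```

Applied across GoP boundaries, the same rule places the previous GoP's B14 and B15 immediately after I1, which matches the order shown in Figure 6b.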

MPEG Encapsulation within IP

To transport MPEG-encoded video over IP networks, MPEG frame information is encapsulated within MPEG Transport Stream (TS) packets, which are in turn transported in IP packets. A typical IP packet for transporting MPEG video contains seven 188-byte MPEG-TS packets (see Figure 7). An MPEG frame can span multiple IP packets, and a single packet can contain information from two consecutive frames, so the loss of a single packet can result in a loss of information from two frames. Each packet might also contain service information as well as audio and video MPEG packets.
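A quick sketch of the packet arithmetic implied by Figure 7, treating the article's 4-Mbps MPEG-2 test stream as pure TS payload; ignoring the layer-2 header, TS packet headers, and any audio or service-information packets is a simplification for illustration.

```python
TS_PACKET = 188              # bytes per MPEG-TS packet
TS_PER_IP = 7                # TS packets per IP packet (Figure 7)
RTP, UDP, IPV4 = 12, 8, 20   # header sizes in bytes (Figure 7)

payload = TS_PER_IP * TS_PACKET          # 1,316 bytes of TS data per IP packet
ip_packet = payload + RTP + UDP + IPV4   # 1,356 bytes at the IP layer

video_bps = 4_000_000        # assumed 4-Mbps MPEG-2 stream, as in the test clips
fps = 29.97

packets_per_sec = video_bps / (payload * 8)
bytes_per_frame = video_bps / 8 / fps
packets_per_frame = bytes_per_frame / payload

print(f"IP packet size: {ip_packet} bytes ({payload} bytes of TS payload)")
print(f"~{packets_per_sec:.0f} packets/s for a {video_bps / 1e6:.0f}-Mbps stream")
print(f"average frame ~{bytes_per_frame:.0f} bytes, i.e. ~{packets_per_frame:.1f} IP packets")
```

Because an average frame spans roughly a dozen IP packets and a large I-frame spans many more, even a short loss burst can touch several frames at once.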

MPEG encoding employs temporal compression and exploits the redundancy between the subsequent frames in a video to achieve greater compression. The resulting dependencies within the encoded bit stream, however, mean that even short durations of IP packet loss could cause significant visual impairment if a service provider doesn't make use of loss-concealment techniques at the decoder. In the second part of this article, we'll study the impact that different durations of IP packet loss have on video QoE. We'll look at different types of visual impairment resulting from such packet loss and will compare impairments for different loss durations for both standard- and high-definition services encoded by MPEG-2.

References

1. D. Hoffman et al., RTP Payload Format for MPEG1/MPEG2 Video, IETF RFC 2250, Jan. 1998; ftp://ftp.rfc-editor.org/in-notes/rfc2250.txt.
2. J. van der Meer et al., RTP Payload Format for Transport of MPEG-4 Elementary Streams, IETF RFC 3640, Nov. 2003; ftp://ftp.rfc-editor.org/in-notes/rfc3640.txt.
3. G. Almes, S. Kalidindi, and M. Zekauskas, A One-Way Delay Metric for IPPM, IETF RFC 2679, Sept. 1999; ftp://ftp.rfc-editor.org/in-notes/rfc2679.txt.
4. C. Demichelis and P. Chimento, IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), IETF RFC 3393, Nov. 2002; ftp://ftp.rfc-editor.org/in-notes/rfc3393.txt.
5. G. Almes, S. Kalidindi, and M. Zekauskas, A One-Way Packet Loss Metric for IPPM, IETF RFC 2680, Sept. 1999; ftp://ftp.rfc-editor.org/in-notes/rfc2680.txt.
6. C. Filsfils and J. Evans, "Deploying Diffserv in IP/MPLS Backbone Networks for Tight SLA Control," IEEE Internet Computing, vol. 9, no. 1, 2005, pp. 58–65.
7. ISO/IEC 13818, Generic Coding of Moving Pictures and Associated Audio Information (MPEG-2), Int'l Standards Organization/Int'l Electrotechnical Commission, 2007.
8. ISO/IEC 14496-10, Coding of Audio-Visual Objects — Part 10: Advanced Video Coding, Int'l Standards Organization/Int'l Electrotechnical Commission, 2004.
9. SMPTE 421M, "Television — VC-1 Compressed Video Bitstream Format and Decoding Process," Society of Motion Picture and Television Engineers, 2006.

[Figure 6 frame sequences — (a) display order: I1 B2 B3 P4 B5 B6 P7 B8 B9 P10 B11 B12 P13 B14 B15, followed by the next GoP's I-frame; (b) transmission order: I1, then B14 and B15 of the previous GoP, then P4 B2 B3 P7 B5 B6 P10 B8 B9 P13 B11 B12, followed by the next GoP's I-frame and this GoP's B14 and B15.]

Figure 6. Frame order. (a) The encoder input order and decoder display order are the same, but (b) the transmission order for the same group of pictures is different.

[Figure 7 packet layout: L2 header (18 bytes) | IPv4 header (20) | UDP header (8) | RTP header (12) | seven MPEG-2 TS packets (188 bytes each).]

Figure 7. MPEG/RTP/UDP/IPv4 encapsulation. MPEG frame information is encapsulated within MPEG Transport Stream (TS) packets. Seven 188-byte MPEG-TS packets are typically carried in each IP packet.

Jason Greengrass is a technical lead engineer in Cisco's Network Solutions Integration and Test Engineering organization. He specializes in the testing and deployment of triple-play network solutions, with a focus on IP-based video quality testing and analysis. He is a Cisco Certified Internetwork Expert in Routing and Switching. Contact him at [email protected].

John Evans is a distinguished consulting engineer within Cisco's Development Organization, where he works on the definition of network architectures for service providers. His technology focus spans IP routing and core IP/MPLS technologies, traffic engineering/management, quality of service, and video transport. Evans has an MSc in communications engineering from the University of Manchester Institute of Science and Technology. He authored Deploying IP and MPLS QoS for Multiservice Networks (Elsevier, 2007) and has filed numerous patents in his focus technology areas. He is a member of the IEEE. Contact him at [email protected].

Ali C. Begen is a software engineer in the Video and Content Platforms Research and Advanced Development Group at Cisco, where he participates in video transport and distribution projects. His interests include networked entertainment, multimedia transport protocols, and content distribution. Begen has a PhD in electrical and computer engineering from the Georgia Institute of Technology. He is a member of the IEEE and the ACM and served as the local chair for IEEE ICSC 2008. Contact him at [email protected].