Proxy Servers

Author: Caren Cooper
Chapter 1

Proxy Servers

1.1

What is a proxy server

1.1.1

Definition

Before describing a proxy server and its functionality, let us first look at the definition of the word proxy. Most dictionaries [1] define the word proxy as follows:
• Pronunciation: ˈpräk-sē
• Inflected form(s): plural proxies
• Meaning:
1. the agency, function, or office of a deputy who acts as a substitute for another
2. (a) authority or power to act for another (b) a document giving such authority; specifically: a power of attorney authorizing a specified person to vote corporate stock
3. a person authorized to act for another: PROCURATOR

1.2

Proxy server

How does the above definition fit into the world of electronics and the internet? Put simply, a proxy server is a courier of information. It acts as a middleman between one computer and another: all network traffic flowing between those two computers passes through the proxy. This may be done to remain anonymous, or simply to bridge a connection between two networks, but the reasons vary considerably.

[1] Reference to website


CHAPTER 1. PROXY SERVERS


There are many different types of proxy servers. Depending on the purpose, proxy servers can relay many common protocols, for example:
• HTTP: a one-way request to retrieve web pages.
• SOCKS proxy: a newer protocol that allows relaying of far more types of data, whether TCP or UDP.
• SSL: an extension to the HTTP proxy server that allows relaying of TCP data, similar to a SOCKS proxy server. It was created mainly to allow encryption of web page requests.
• FTP: a protocol for transferring files between two computers.
• ICQ: a popular instant messaging system.
In this thesis the protocol is irrelevant, as it is the proxy functionality in general that is discussed.

1.2.1

When proxy servers are useful

Proxy servers are widely used in today’s internet and, as stated above, for a variety of reasons. Popular uses of proxies include:
1. Permitting and restricting client access to the internet based on the client IP address
2. Caching documents
3. Selectively controlling access to the internet and subnets based on the submitted URL
4. Providing internet access for companies using private networks
5. Converting data to HTML format so it is readable by a browser
In this thesis, caching (2) and something that can be seen as an extension of data conversion (5) are the main interests. Instead of converting a textual page to HTML, the interest lies in converting data into another form that the receiver can decode. Caching is studied in this chapter, while the conversion of data, which will be referred to as transcoding, is studied in the next.

A typical setup using a proxy server is shown in Figure 1.1. The figure depicts an organization with an internal network. The proxy server is the only computer that can communicate with the internet directly. All the organization’s computers (labelled clients in the figure) use the organization’s proxy server. The proxy server receives a request from a client’s browser in the form of a URL. It then retrieves the requested information and forwards it to the client.


Figure 1.1: a typical organization network setup
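The setup in Figure 1.1 implies that each client is configured to send its requests to the proxy rather than directly to the origin server. A minimal sketch of such a client configuration in Python is given here. The proxy address proxy.example.org:3128 and the helper name make_proxy_opener are placeholders assumed for illustration; they do not come from the text.

```python
import urllib.request

# Hypothetical proxy address, assumed for illustration only.
PROXY_URL = "http://proxy.example.org:3128"


def make_proxy_opener(proxy_url=PROXY_URL):
    """Build an opener that routes HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener()
# No request is sent here; a real client would call, for example,
# opener.open("http://www.example.org/") to fetch a page via the proxy.
```

The client only needs to know the proxy address; which origin server is contacted is decided by the URL in each request.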

1.2.2

Caching documents

Web caching is the temporary storage of web objects (such as HTML documents) for later retrieval. There are three significant advantages to web caching:
• reduced bandwidth consumption (fewer requests and responses need to go over the network)
• reduced server load (fewer requests for a server to handle)
• reduced latency (responses for cached requests are available immediately, and closer to the client being served)
Both bandwidth and server load are expensive. The less bandwidth is consumed, the lower the cost for all parties involved in transporting the content from the origin server to the requesting client. Server load is an equally important factor: the lower the server load, the lower the cost for the origin server. Latency matters most to the requesting client: the lower the latency, the higher the user satisfaction.

Caching can be performed by the client application, and is built into most web browsers. A number of products extend or replace the built-in caches with systems that offer larger storage, more features, or better performance. In any case, these systems cache objects from many servers, but all for a single user.

Caching can also be performed in the middle, between the client and the server, as part of a proxy. Proxy caches are often located near network gateways to reduce the bandwidth required over expensive dedicated internet connections. These systems serve many users (clients) with cached objects from many servers. Caching is more effective on the proxy server than on each client system. It saves disk space, because only a single copy is cached, and documents that are referenced by multiple clients can be cached more efficiently. The system administrator can predict which documents are worth caching for a long time and which are not. Caching also makes it possible to keep browsing cached documents even if a web server, or even the external network, is down, as long as one can connect to the proxy server. This improves service to remote network resources, such as files hosted at busy FTP sites that are often unavailable remotely but may be cached locally.

Finally, caches can be placed directly in front of a particular server, to reduce the number of requests that the server must handle. Most proxy caches can be used in this fashion, but this form has a different name (reverse cache, inverse cache, or sometimes httpd accelerator) to reflect the fact that it caches objects for many clients but from (usually) only one server.

1.2.3

Communication using a proxy server

The proxy server acts as both a server system and a client system. It is a server when accepting requests from clients, and a client when its software connects to remote servers to retrieve documents. A typical procedure using a proxy is:
1. The client sends a request to the proxy, providing the information necessary to address the document.
2. The proxy receives the request, translates it if necessary, and either
• has a local copy, which it sends to the client, or
• has no local copy and acts as a client itself, requesting the document from the remote server. It may cache the document, and it sends a copy to the client.
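The two cases of step 2 can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the class name CachingProxy, the callable fetch_from_origin and the stand-in fake_origin are hypothetical, no real network traffic is involved, and every fetched document is cached without any size limit.

```python
class CachingProxy:
    """Toy model of the procedure above: serve from cache, else fetch."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin is a hypothetical callable: URL -> document.
        self.fetch_from_origin = fetch_from_origin
        self.cache = {}  # URL -> cached document

    def handle_request(self, url):
        """Return (document, served_from_cache) for a client request."""
        if url in self.cache:
            # Step 2, first case: a local copy exists.
            return self.cache[url], True
        # Step 2, second case: act as a client towards the origin server.
        document = self.fetch_from_origin(url)
        self.cache[url] = document  # assumption: always cache what was fetched
        return document, False


origin_requests = []


def fake_origin(url):
    """Stand-in for a remote server; records each request it receives."""
    origin_requests.append(url)
    return "contents of " + url


proxy = CachingProxy(fake_origin)
first = proxy.handle_request("http://www.example.org/index.html")
second = proxy.handle_request("http://www.example.org/index.html")
```

The second request is answered from the local copy, so the stand-in origin server is contacted only once.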

1.3

The cache replacement problem

The following section describes the cache replacement algorithm as proposed in [2]. The problem statement makes use of a number of parameters for each cached object. It does not, however, take into account network parameters such as latency or network throughput. Because of the way the internet is built (on the IP protocol), network parameters are highly dynamic. The difficulty of using network parameters as costs is that they can change instantaneously, and the reasons for such changes are diverse (e.g. traffic congestion, server load, a different packet route). There are caching algorithms that use network parameters as a cost, see section ??. In this discussion, however, only parameters whose values are known with certainty are used.

[2] Proxy Cache Replacement Algorithms: A History-Based Approach


Every proxy server has a predefined, limited amount of space it can use to cache documents. Once the content in the cache area reaches this limit, a cache replacement policy must be employed to update the cache content with more recently requested web objects. Each cached web object is characterized by its so-called staleness, which relates to the need to contact the origin server to validate the cached copy. Staleness arises because the cache server is not aware of changes to the original object. Each proxy cache server implementation must therefore include a specific mechanism for dealing with staleness.

The cache replacement problem is defined by introducing a set of parameters that monitor the web cache content replacement process. Web cache content can be modeled as a hash table in which each row is associated with a particular cached object, so the number of rows is bounded by the number of cached objects. Each object is identified by its stored object filename, along with a number of related attributes. The attributes are chosen so that cache replacement can be supported. The most important factors for cache replacement are an object’s staleness status, its frequency of access and its retrieval rate. The full list of attributes is given in Table 1.1.

Attribute   Description
$C$         total available cache area size
$N$         number of objects in cache
$s_i$       server on which object resides
$b_i$       size of object in kilobytes
$t_i$       time the object was logged
$c_i$       time the object was cached
$l_i$       time the object was last changed
$af_i$      number of cache accesses since the object was last accessed
$key_i$     object original copy identification (e.g. URL)
$pop_i$     popularity of object
$df_i$      dynamic frequency of object

Table 1.1: The most important attributes of each cached object i

Definition 1.

The popularity of a web object $i$ is defined by

\[ pop_i = \frac{hits_i}{hits_{tot}} \]

where $hits_i$ is the number of hits for cached object $i$ and $hits_{tot} = \sum_{j=1}^{N} hits_j$. Therefore $pop_i$ expresses the popularity of an object as a fraction of all hits, and $0 \le pop_i \le 1$.
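As a small worked example of Definition 1, the following Python sketch computes the popularity of every object from a table of hit counts. The function name, the hit counts and the handling of an empty hit table are assumptions made for illustration only.

```python
def popularity(hits):
    """Map each object key to hits_i / hits_tot, following Definition 1."""
    hits_tot = sum(hits.values())
    if hits_tot == 0:
        # Assumption: with no hits recorded, every popularity is taken as 0.
        return {key: 0.0 for key in hits}
    return {key: count / hits_tot for key, count in hits.items()}


# Made-up hit counts for three cached objects.
example_hits = {"/index.html": 6, "/logo.png": 3, "/about.html": 1}
pop = popularity(example_hits)
```

With these counts the total is 10 hits, so the three popularities are 6/10, 3/10 and 1/10, and together they sum to 1.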

Definition 2.

The staleness ratio of a cached object $i$ is defined by

\[ StRatio_i = \frac{c_i - l_i}{now - c_i} \]

Here the numerator is the time interval between the time the object was last modified and the time it was cached. The denominator is the time the object has spent in cache. It holds that $StRatio_i \ge 0$, since:
• $c_i - l_i \ge 0$, presumably because the copy stored at time $c_i$ cannot have been modified after it was fetched
• $now - c_i \ge 0$ ($now$ is the current time)
Since $c_i - l_i$ is a fixed value and $now - c_i$ increases over time, the lower the value of $StRatio_i$, the more stale object $i$ is. A low value is the main indicator that the object has been in cache for a long period.

Definition 3.

The dynamic frequency of a cached object $i$ is defined by

\[ df_i = \frac{pop_i}{af_i} \]

where $af_i$ is the number of accesses to other objects since object $i$ was last referenced (as defined in Table 1.1). We assume $af_i \ne 0$, since we only consider objects that have already been cached and that therefore reside in cache after at least one reference to another object. Hence, the higher the value of $df_i$ for a particular object, the more popular and the more recently referenced that object is.

A cache server has to support mechanisms that determine whether an object can be cached or not. If there is not enough space, one or more objects have to be removed in order to free sufficient space for new objects. The cache replacement process must guarantee enough space for the incoming objects. There are therefore two possible actions for each cached object: either the object remains stored in cache, or it is purged from cache. A function is needed to identify the action that should be taken for each cached object.

Definition 4.

The action function of a cached object $i$ is defined by

\[ act_i = \begin{cases} 0 & \text{if object } i \text{ will be purged from cache} \\ 1 & \text{otherwise} \end{cases} \]
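Definitions 2 and 3 can be evaluated directly from the attributes in Table 1.1. The Python sketch below does so for a single object. The class CachedObject, the function names and all numeric values are assumptions chosen for illustration, and the staleness ratio is only computed for now strictly later than the caching time, since the ratio is undefined otherwise.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    """Subset of the Table 1.1 attributes needed for Definitions 2 and 3."""
    key: str     # key_i: identification of the original copy (e.g. URL)
    c: float     # c_i: time the object was cached
    l: float     # l_i: time the object was last changed
    af: int      # af_i: accesses to other objects since the last reference
    pop: float   # pop_i: popularity from Definition 1


def staleness_ratio(obj, now):
    """StRatio_i = (c_i - l_i) / (now - c_i); requires now > c_i."""
    return (obj.c - obj.l) / (now - obj.c)


def dynamic_frequency(obj):
    """df_i = pop_i / af_i; the text assumes af_i != 0."""
    return obj.pop / obj.af


# Made-up values: times are in seconds on an arbitrary clock.
obj = CachedObject(key="/index.html", c=1000.0, l=400.0, af=4, pop=0.6)
ratio_early = staleness_ratio(obj, now=1200.0)  # 600 / 200
ratio_late = staleness_ratio(obj, now=4000.0)   # 600 / 3000
df = dynamic_frequency(obj)                     # 0.6 / 4
```

The numerator stays at 600 seconds while the denominator grows, so the ratio drops as the object ages in cache, which matches the reading that a lower ratio means a more stale object.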

We are now ready to formulate the cache replacement problem in terms of mathematical programming, under our definition of dynamic frequency. In [3] a similar approach was introduced, where cache replacement was defined as an optimization problem in the class of NP-hard problems. Here, both the staleness ratio $StRatio_i$ and the dynamic frequency $df_i$ are considered in order to define the cache replacement problem.

[3] C. Aggarwal, J. Wolf, and P. S. Yu, “Caching on the World Wide Web,” IEEE Trans. Knowledge Data Engrg. 11(1), 1999, 94–107.


Problem Statement. Suppose that $N$ is the number of objects in the cache, and $C$ is the total capacity of the cache area. The cache replacement problem is to:

\[ \text{Maximize } \sum_{i=1}^{N} \langle \text{objective term, not yet specified} \rangle \quad \text{subject to } \sum_{i=1}^{N} act_i \cdot b_i \le C \]

In the optimization formula, $StRatio_i$ and $df_i$ are used as weight factors characterizing each cached object; they capture both the object’s popularity and its update status. The basic goal of the proposed cache replacement algorithm is to keep the most frequently used, non-stale objects in cache.
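Since the objective term is left open, the following Python sketch only illustrates the shape of the problem: choosing the values of the action function so that the capacity constraint holds. It assumes, purely for illustration, that each object carries a single non-negative weight (for instance some combination of the staleness ratio and the dynamic frequency) and a positive size. The greedy rule used here is a simple heuristic that is not guaranteed to be optimal and is not claimed to be the method of the cited works.

```python
def choose_actions(objects, capacity):
    """Return {key: act} with act = 1 (keep) or 0 (purge).

    objects: list of (key, size_kb, weight) tuples, where weight is an
    assumed per-object score; the source leaves the objective open.
    Greedy heuristic: keep objects in order of weight per kilobyte while
    the capacity constraint sum(act_i * b_i) <= C still holds.
    """
    actions = {key: 0 for key, _, _ in objects}
    used = 0.0
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    for key, size_kb, _ in ranked:
        if used + size_kb <= capacity:
            actions[key] = 1
            used += size_kb
    return actions


# Made-up objects: (key, b_i in kilobytes, assumed weight).
objects = [
    ("/index.html", 20, 0.45),
    ("/video.mpg", 900, 0.90),
    ("/logo.png", 50, 0.30),
    ("/old.html", 40, 0.02),
]
actions = choose_actions(objects, capacity=100)
kept_size = sum(size for key, size, _ in objects if actions[key] == 1)
```

With a capacity of 100 kilobytes the sketch keeps the two small objects with the best weight per kilobyte and purges the other two, using 70 kilobytes in total.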

1.4

Caching algorithms

From Wikipedia: In computer science, a cache is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data are expensive (usually in terms of access time) to fetch or compute relative to reading the cache. Once the data are stored in the cache, future use can be made by accessing the cached copy rather than refetching or recomputing the original data, so that the average access time is lower. Caches have proven extremely effective in many areas of computing, because access patterns in typical computer applications have locality of reference. There are several sorts of locality, but we mainly mean that the same data are often used several times, with accesses that are close together in time, or that data near to each other are accessed close together in time.

This section gives a detailed overview of a number of popular caching algorithms. Once the space reserved for caching is full, a choice has to be made as to which content will be discarded in order to cache new content.
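To make the discard choice concrete, the Python sketch below uses LRU (least recently used, one of the algorithms listed in this section) as one possible rule. It is a minimal illustration under simplifying assumptions: the class name LRUCache is hypothetical, and capacity is counted in number of objects rather than in kilobytes.

```python
from collections import OrderedDict


class LRUCache:
    """Cache holding at most `capacity` objects; evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        """Store a value, discarding the least recently used one if full."""
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used object
cache.put("c", 3)    # cache is full, so "b" is discarded
remaining = list(cache.items)
```

This rule relies on the locality of reference described above: an object that was used recently is assumed to be more likely to be used again than one that has not been touched for a while.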


1.4.1

FIFO

1.4.2

Size

1.4.3

LRU

1.4.4

LFU

1.4.5

Historically-based caching algorithms

HLRU HLFU

1.4.6

Greedy Dual

1.5

Evaluation of caching algorithms

1.5.1

Trace-driven evaluation