CSCI-651: Sequential Search; Binary Search

Author: Lynette Hicks
13 February 2002

Search. Let’s imagine we have a list of data elements, each of which has an associated key. As an example, the student record system is a huge database of records (containing names, addresses, grades, course lists, etc.) about each of you, and the records are accessed through your social security numbers (student IDs) - the key. So let’s imagine that keys are just Strings. We might then have:

    class ArrayItem {
        String key;
        Object data;
    }

    ArrayItem[] keys = new ArrayItem[101];

which gives us an array with 101 elements (indexed 0 through 100) of ArrayItems. If we are searching for a particular key, you can imagine indexing through the array with a for loop that contains the statement:

    if (searchKey.compareTo(keys[i].key) == 0) return i;

That is, we scan through the array looking for a match with searchKey. If we find it, we return the array index where it was found, so that we can subsequently access the data field at that array index to do whatever processing is necessary.

If the key we’re looking for is not in the list, we only discover this by sequencing through the entire list, so if we have (in general) n items in the list, the computational effort (number of loop executions, say) in a failed search is proportional to n.

What is the cost (computational effort) of a successful search? That depends on where we find the item we’re looking for. It might take just one step - or n steps, or anything in between. A more useful question is: what is the average cost of a successful search?
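The loop sketched above can be wrapped in a complete method; the class name SeqSearchDemo, the method name seqSearch, and the sample data are ours, added for illustration:

```java
// Sequential search over an array of ArrayItems.
// Returns the index of the matching key, or -1 if the key is absent.
class ArrayItem {
    String key;
    Object data;
    ArrayItem(String key, Object data) { this.key = key; this.data = data; }
}

class SeqSearchDemo {
    // n is the number of "live" elements actually stored in the array.
    static int seqSearch(String searchKey, ArrayItem[] keys, int n) {
        for (int i = 0; i < n; i++) {
            if (searchKey.compareTo(keys[i].key) == 0) return i;  // match found
        }
        return -1;  // failed search: we had to scan all n elements
    }

    public static void main(String[] args) {
        ArrayItem[] keys = new ArrayItem[101];
        keys[0] = new ArrayItem("111-22-3333", "Alice");
        keys[1] = new ArrayItem("222-33-4444", "Bob");
        System.out.println(seqSearch("222-33-4444", keys, 2));  // prints 1
        System.out.println(seqSearch("999-99-9999", keys, 2));  // prints -1
    }
}
```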

Let’s consider this a little more generally. Suppose that it costs c1 to match the first key, c2 to match the second, ..., cn to match the nth. And suppose that we do a long series of searches, and that N1 of them are devoted to finding the first element, N2 are in search of the second element, ..., Nn are directed to finding the nth element. The total number of searches is

    N1 + N2 + ... + Nn = N.

Some of the Ni might be 0, meaning that we never searched for the ith item in this particular sequence of N searches. Then we would certainly say that the average cost is the total cost of all the searches divided by the number of searches:

    avg cost = (c1N1 + c2N2 + ... + cnNn)/N = f1c1 + f2c2 + ... + fncn

where fi = Ni/N, the frequency (fraction of searches) with which the ith item was searched for. Plainly, the f’s add up to one.

In some applications (searching a static directory, one which almost never changes its contents), we might actually know the search frequencies, and we would plug those numbers into the formula to compute the average search cost. (In fact, if we did know the search frequencies, we would probably take some pains to locate the highest-frequency item at the top of the list, the next-highest in the second position, etc., so as to minimize the average search cost.) Mostly, however, we don’t know


the frequencies, and we typically make an assumption (of ignorance): we assume that all items are retrieved with uniform frequency, that is, all the f’s have the same value. What value is that? If n numbers are the same and add to one, then each must be 1/n. So we have the special formula for the average, under the uniform access assumption:

    avg cost = (1/n)(c1 + c2 + ... + cn)

Now in sequential search, the cost (measured in the number of times the comparison statement above is executed) is ci = i, so we get the simplification

    avg cost = (1/n)(1 + 2 + ... + n) = (1/n)(n(n+1)/2) = (n+1)/2

a result that most of you know - except you may have forgotten why it turns out that way.
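As a quick sanity check on the algebra, a small program (the class and method names are ours) can compare the brute-force average of the costs ci = i with the closed form (n+1)/2:

```java
// Average cost of a successful sequential search under uniform access:
// brute-force average of the per-item costs c_i = i, versus (n+1)/2.
class AvgCostCheck {
    static double avgCost(int n) {
        double total = 0;
        for (int i = 1; i <= n; i++) total += i;  // cost of finding the ith item is i
        return total / n;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println(avgCost(n));     // prints 500.5
        System.out.println((n + 1) / 2.0);  // prints 500.5 -- the closed form agrees
    }
}
```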

Thus, if we’re searching lists of many thousands (or millions) of items, sequential search is not the way to go. The computational cost of finding an item (on average) is proportional to the number of items in the list. If the list doubles in size, so does the average cost of successful search (as does the cost of a failing search).

Note that the results above also hold if we store the data items in a linked list. Note also that the linked list requires twice as much memory space for references (pointers), since each linked-list element contains two references: one to the data object, and one to the next element in the list. Of course, we will generally be willing to pay the price of this extra memory requirement to gain the flexibility of adding elements to and removing elements from the linked list at relatively low cost. We’ll say more about this below.
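A minimal node class (the name ListNode is ours) makes the two references per element explicit:

```java
// Each linked-list element carries two references: one to the data
// object (e.g., an ArrayItem-style record) and one to the next node.
class ListNode {
    Object data;    // reference to the data object
    ListNode next;  // reference to the next element -- the extra per-item cost
}
```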

What about adding and removing data items from an unordered list? For an array-based list, we can keep track of the index of the last element in the list (we assume we have MAX elements in the array, but that the number of "live" elements is typically less than MAX), and adding a new element is accomplished by incrementing last and (if last is less than MAX) inserting the new ArrayItem at that index. Removing an element from an unordered list (once you’ve found where the element is in the list) is just as easy: replace the element with the one in position last and decrement last. Thus adding and removing (once the item to be removed has been located) are "unit-cost" operations.

And the same is true for unordered linked lists. We can add an element by "pushing" it onto the front of the list: no big deal in computational effort. For removal, once we’ve found where the item is in the list, we make the usual changes to the next references in the list to delete the item.
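The array-based add and remove described above can be sketched as follows; the class and field names are ours, the items are plain Strings to keep the sketch self-contained, and last is the index of the last live element (-1 when the list is empty):

```java
// Unit-cost add and remove for an unordered array-based list.
class UnorderedList {
    static final int MAX = 100;
    String[] items = new String[MAX];
    int last = -1;  // index of the last live element; -1 when empty

    boolean add(String item) {
        if (last + 1 >= MAX) return false;  // no room left in the array
        items[++last] = item;               // increment last, insert there: unit cost
        return true;
    }

    // i must index a live element (found by a prior search).
    void removeAt(int i) {
        items[i] = items[last];  // overwrite with the element in position last
        items[last--] = null;    // and decrement last
    }
}
```

Note that removal preserves no ordering, which is exactly why it only works for an unordered list.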

Should we put the list elements in (ascending) key order, the way that phone books and dictionaries are arranged? Does this improve the average look-up cost of sequential search? NO! Only the cost of a failing search is improved, since we don’t have to go all the way through the list to discover that what we’re looking for is not there. (How much is the average failing cost of sequential search on a key-ordered list with n elements?)
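The early-exit sequential search on a key-ordered list can be sketched as follows (the class and method names and the sample keys are ours): a failing search stops as soon as it passes the position where the key would have to be.

```java
// Sequential search on a key-ordered array of Strings.
class OrderedSeqSearch {
    static int search(String searchKey, String[] keys, int n) {
        for (int i = 0; i < n; i++) {
            int cmp = searchKey.compareTo(keys[i]);
            if (cmp == 0) return i;   // found the key
            if (cmp < 0) return -1;   // keys[i] is already larger: key can't be here
        }
        return -1;                    // ran off the end of the list
    }

    public static void main(String[] args) {
        String[] keys = { "adams", "jones", "smith" };
        System.out.println(search("jones", keys, 3));  // prints 1
        System.out.println(search("baker", keys, 3));  // prints -1, after only 2 comparisons
    }
}
```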


Binary Search

Ordered lists are essential, however, if we use the binary search method. Here we exploit the ordering of the keys in a very elegant and effective way. We compute the index of the middle element of the array, and see if the key stored there matches the search key. If so, we’re done. If not, we see if the searchKey is smaller than the middle key. If it is, we arrange to "start over" and search the lower half of the array. Else, we arrange to search the upper half of the array. This technique is repeated until we find what we’re looking for - or we have reduced the size of the search list to 0, in which case we know that the searchKey is not in the list.

Consider the method below: a static method, which can be invoked from any class by referring to the class name in which the method is defined. It is not an instance method, capable of being invoked only on the objects defined by its class. Java uses the keyword static to denote class methods. As above, we’ll assume that all keys are of type String. The String class is one of the built-in Java classes that implements the Comparable interface, so we’ll code our binary search process in generic terms.

    public static int binarySearch(Comparable searchKey, ArrayItem[] A) {
        int lo = 1;             // data stored in indices 1...n; location 0 empty
        int hi = A.length - 1;  // length is an attribute of arrays, not a method
        int mid;
        while (lo <= hi) {
            mid = (lo + hi) / 2;
            int cmp = searchKey.compareTo(A[mid].key);
            if (cmp == 0)
                return mid;        // found it: return the index of the match
            else if (cmp < 0)
                hi = mid - 1;      // searchKey is smaller: search the lower half
            else
                lo = mid + 1;      // searchKey is larger: search the upper half
        }
        return -1;                 // search list reduced to size 0: key not present
    }