Comparison of fitness scaling functions in genetic algorithms with applications to optical processing

Farzad A. Sadjadi
School of Physics and Astronomy, University of Minnesota, Twin Cities

ABSTRACT
Many optical and image processing tasks reduce to the optimization of some set of parameters. Genetic algorithms can optimize these parameters even when the functions they map are fairly complicated, but only to the point where the fitness functions they are given can differentiate between good results and the best result. Trouble can arise when the optimal point sits in a region (in a three-dimensional example) such as a plateau, where all the surrounding points have very nearly the same fitness. If there are multiple peaks in close proximity, all of nearly the same fitness but separated by very deep divides, the algorithm will have trouble "hopping" from one to the other. One way to overcome these obstacles is to scale the fitness values given by the fitness function, thereby gently modifying the fitness function from the point of view of the algorithm and rewarding the more fit solutions to a higher precision than would naturally occur. Four such scaling methods are compared based upon their handling of a sample set of optical processing data. Success is determined by comparing variance over time, selection pressure over time, and best-of-generation plots.

Keywords: Genetic Algorithms, Fitness Scaling, Linear Scaling, Evolutionary Programming, Optical Processing, Kolmogorov distance, Optimization.

1. INTRODUCTION
With the recent expansion in the use of genetic algorithms in all manner of optical processing, one pressing issue remains to be fully dealt with: how can one design a fitness function that addresses both the algorithm's need for a clear "winner" in each generation and the specific optical processing task at hand? Since it is far too much to ask at this time to find the best way to optimize the specific processing task that any given engineer might have, the goal of this paper is to explore what manner of fitness scaling can offer the most benefits to a generic processing task. Fitness scaling is a step, following the computation of individual fitness values by whatever fitness function a given processing task requires, wherein the fitness values for the whole generation are fit to some predetermined pattern. In this paper, four scaling functions will be used with four fitness functions. The fourth fitness function is not analytic; it represents a generic data set where only a finite set of points is known.

1.1 Simple genetic algorithms
Simple genetic algorithms search for the optimal set of variables by using the "survival of the fittest" concept1-4. Note that much of the nomenclature of this field is borrowed from genetics. Initially, the algorithm produces an array of random numbers, typically referred to as the initial population. Each row of this array represents one set of values for the variables; rows are referred to as individuals. Each variable is represented in the individual by a set of adjacent numbers (or genes); this set is referred to as a chromosome. Since there are several genes per chromosome, but each variable is ultimately only one number, these genes must somehow be combined into a single value, called a phenotype. After the initial population is generated, each individual must be reduced to its phenotype form (one number per chromosome).
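The encoding just described can be sketched in a few lines. This is an illustrative sketch only — the sizes, function names, and search range below are mine, not the paper's settings — and it assumes binary genes, as the algorithm in this paper uses:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative sizes, not the paper's actual settings.
n_individuals = 6        # rows of the population array
n_chromosomes = 2        # one chromosome per variable
genes_per_chrom = 5      # genes (bits) per chromosome

# Initial population: one row per individual, chromosomes side by side.
population = rng.integers(0, 2, size=(n_individuals,
                                      n_chromosomes * genes_per_chrom))

def phenotypes(individual, lo=0.0, hi=10.0):
    """Collapse each chromosome into one number (the phenotype), then map
    the raw value (0 .. 2**G - 1) onto a search range [lo, hi]."""
    out = []
    for c in range(n_chromosomes):
        bits = individual[c * genes_per_chrom:(c + 1) * genes_per_chrom]
        raw = int("".join(str(b) for b in bits), 2)
        out.append(lo + (hi - lo) * raw / (2 ** genes_per_chrom - 1))
    return out

print(phenotypes(population[0]))   # two values, each in [0, 10]
```

The all-zeros chromosome maps to the bottom of the range and the all-ones chromosome to the top, so no part of the user-specified search range is unreachable.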
Optical Information Systems II, edited by Bahram Javidi, Demetri Psaltis, Proceedings of SPIE Vol. 5557 (SPIE, Bellingham, WA, 2004) 0277-786X/04/$15 · doi: 10.1117/12.563910

This form is then fed through a fitness function that assigns a fitness value to each individual based upon the properties of its chromosomes' phenotypes. The fitness function is the function one wishes to optimize; the fitness values assigned are the property of the function that is to be maximized (minimization can be achieved by simply inverting the fitness function, since the algorithm can only maximize). The next step is to create a new population (the next generation) and repeat the fitness test. To generate this new generation from the old, there are three primary mechanisms: reproduction, crossover, and mutation. Reproduction is the simplest, since it is just the copying of one or more individuals from one generation to the next, based upon their fitness values. Crossover is the signature feature of genetic algorithms. This process takes random pairs of individuals and randomly exchanges segments of matching chromosomes. This is best explained with a simple example. Suppose the two selected individuals each have two chromosomes, with five genes per chromosome. Crossover might mean that in the first chromosome, the first two genes from each individual are exchanged, and in the second chromosome, the last four genes are exchanged. This process produces two new individuals for the new generation. No genetic material can cross from one chromosome to another, nor can any gene leave its original column. The two individuals are selected based upon their fitness values, such that the more fit the individual, the more likely it is to be selected for crossover. Mutation also selects individuals based upon fitness, but this process targets the weaker individuals. Mutation means that within a selected individual, a randomly chosen gene is rewritten with a newly generated random number. Mutation probabilities dictate not only how many individuals are selected, but also how many genes within an individual are mutated.
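The crossover and mutation mechanics described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function names and the per-gene mutation probability are assumptions.

```python
import random

def crossover(parent_a, parent_b, n_chromosomes, genes_per_chromosome):
    """Exchange a random segment within each matching chromosome.

    As in the text: genes never move between chromosomes, and no gene
    leaves its original column.
    """
    child_a, child_b = parent_a[:], parent_b[:]
    for c in range(n_chromosomes):
        start = c * genes_per_chromosome
        # Random segment boundaries inside this chromosome only.
        i = random.randint(0, genes_per_chromosome - 1)
        j = random.randint(i, genes_per_chromosome - 1)
        for k in range(start + i, start + j + 1):
            child_a[k], child_b[k] = child_b[k], child_a[k]
    return child_a, child_b

def mutate(individual, p_gene=0.02):
    """Rewrite each gene with a fresh random bit with probability p_gene."""
    return [random.randint(0, 1) if random.random() < p_gene else g
            for g in individual]

a = [0] * 10
b = [1] * 10
child_a, child_b = crossover(a, b, n_chromosomes=2, genes_per_chromosome=5)
# Column-by-column, the pair of children holds the same genes as the parents.
assert all(child_a[k] + child_b[k] == a[k] + b[k] for k in range(10))
```

Because the exchange is column-preserving, diversity enters only through which segments are swapped and through mutation, matching the constraints stated above.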

Fig. 1. Genetic Algorithm Flowchart

The genetic algorithm used in these trials is best explained by Fig. 1. There are a few steps included in my algorithm that were not explained in the overview above, namely the "mapping of the phenotype to the search range", "statistics", and "mating" stages. This algorithm uses binary gene encoding, and thus can generate a raw phenotype only as large as

2^G – 1 ,    (1)

where G is the number of genes specified by the user. The raw phenotype conversion also cannot accommodate negative values. Therefore, in order to generate phenotypes within a user-specified search range, the raw phenotype must be mapped to the desired search range after the initial phenotype conversion. The statistics step records the maximum, minimum, and mean fitness values for the generation, as well as the standard deviation of the raw fitness array. It also records the phenotype corresponding to the best fitness. For generations after the first, in addition to the per-generation statistics just mentioned, global statistics such as the best overall fitness (and its associated phenotypes) and the generation at which this fitness was identified are recorded and updated with each new generation. At the end of the run, the algorithm plots the best-fit individual of each generation onto 2D and 3D plots of the test function. The algorithm also saves the statistics information, as well as all user input for that specific trial, into an Excel document for later analysis. The mating stage, which immediately follows the reproduction stage, is only a precursor for the crossover step. This stage randomly reorganizes the population array so that during crossover, adjacent individuals are crossed. In effect, the "random selection based on fitness" preamble to the traditional crossover stage is performed by two separate stages in this algorithm. The probability of crossover is therefore interpreted as how many pairs of individuals undergo crossover. The crossover stage starts at the top of the population array and moves down until it has performed crossover as many times as prescribed by the probability of crossover (the percent of the population replaced by crossover corresponds to the probability of crossover).

1.2 Scaling functions
All scaling functions can (theoretically1,2) be divided into three categories: linear, sigma truncation, and power law.
Linear scaling methods (such as flinear below) usually have constants that are not problem-dependent, but that may depend on the population characteristics (max, min, mean, etc.). Sigma methods include problem-dependent data. Power scaling takes into account the raw fitness values themselves. For these trials, the four methods of scaling all fall under the category of population-dependent, linear scaling. One of the most common scaling techniques is traditional linear scaling1,2. This remaps the fitness value of each individual using the following equation

flinear = a + b · fraw ,    (2)

where a and b are constants defined by the user. For these trials, the values of a and b were tied to specific characteristics of the population (see Table 1). Another scaling option is rank scaling2. This is a two-step process: first, all individuals are sorted by their raw fitness scores (they are "ranked"); then new fitness values are computed based solely on their rank using

franked = p – 2(r – 1)(p – 1)/(N – 1) ,    (3)

where r is the rank of the individual, p is the desired selection pressure (the best/median ratio), and N is the size of the population. Exponential scaling also begins by ranking all the individuals, but the new fitness values are instead computed with

fexponential = m^(r – 1) ,    (4)

where each individual's new fitness is m times that of the previous individual. A low m can result in high selection pressure and all that that implies: premature convergence, possibly isolating the entire population at a local maximum rather than the true maximum. Top scaling is probably the simplest scaling method. Using this approach, several of the top individuals have their fitness set to the same value (which is proportional to the population size), with all remaining individuals having their fitness values set to zero. This simple concept yields

ftop = s · N for r ≤ c,  and  ftop = 0 for r > c ,    (5)

where s is some proportionality constant, c is the number of individuals that will be scaled up, and N is the size of the population. Since this gives several individuals identical fitness levels, regardless of how different their raw scores might be, the diversity of the succeeding generations is increased. All four of these scaling options have arbitrary user inputs. In an attempt at fairness, the values used in the following tests were selected randomly, but only after they proved not to dramatically affect the outcome of the trials. This was done so that no one scaling method would have an advantage due to better tuning by the user. At the end of the paper, another, possibly better, alternative will be discussed.

1.3 Fitness functions
All four scaling methods will be measured against each other given a constant set of population parameters (population size, individual length, etc.). Mutation and crossover rates will also be held constant throughout all the trials. Four fitness functions will be used to demonstrate the scaling methods' effects. The first two are from the De Jong five2 (F1 and F3, specifically), the first being

f1(xi) = Σ_{i=1}^{v} xi^2 ,    (6)

where v is the number of variables (in this and all following examples, v = 2) or dimensions. For these trials, this function was slightly modified (see Fig. 2): the dome was inverted (now opening downwards) and the peak was moved


off of the origin to (5,5,5). The x- and y-axis values ranged from 0 to 10. This function tests the algorithm's ability to focus on the true maximum, because there is little differentiation between two points very close to each other when both are close to the peak (the function flattens out at the peak). In order to select the true maximum, the algorithm must be able to discriminate between fitness levels that are very close. The second fitness function is

f2(xi) = Σ_{i=1}^{v} integer(xi) .    (7)

The function integer(x) effectively rounds the values of x to the nearest integer. The function looks like the side of a hill with hundreds of flat terraces (see Fig. 3). Here the x- and y-axis values ranged from –10 to 10. The third test function for these trials,

f3(xi) = integer(x1) – 10 ,    (8)

was very similar to f2 above, the primary difference being that this one more closely resembles a staircase. This function offers a hidden difficulty: since its maximum fitness is zero, the algorithm must be able to handle such a value without conflict. This means there should be no point in the algorithm where anything is divided by the maximum fitness. This may seem an innocent quirk, but it has the potential to disrupt some algorithms. Note that this function is symmetric in the y-axis (see Fig. 4). Both f2 and f3 have multiple solutions with equally maximal fitness values (the highest points form a plateau rather than a peak). This can cause problems when the population prematurely converges: if this happens, then there is nothing in a specific terrace to indicate that there are higher fitness levels elsewhere. The final test function is not analytical, but rather a plot of actual data. In this case the z-axis represents the Kolmogorov distance5 between two targets, while the x-axis and y-axis represent pairs of transmission/reception polarization angles. There are numerous hills and valleys, with steep walls and plateaus (see Fig. 5); the peak resides on the edge of a plateau, next to a steep drop-off. Since this is a real data set, the function only has integer inputs (ranging from 1 to 24), which are the coordinates of the Kolmogorov distance in the data array. Therefore, since the algorithm naturally produces double-precision output, the phenotype generated by the algorithm must be rounded down to the nearest integer (in this case, all digits to the right of the decimal are dropped). This function represents a typical real-world application of a genetic algorithm (albeit one for which, since this data set is quite small, a genetic algorithm is overkill). The Kolmogorov distance is calculated with

f4 = ∫ | p(x | ω1) p1 – p(x | ω2) p2 | dx ,    (9)

where p is the conditional probability density function, x is the received (and as yet unclassified) signature, and p1 and p2 are the a priori probabilities of classes ω1 and ω2.

Fig. 2. 3D view of f1.
Fig. 3. 3D view of f2.
Fig. 4. 3D view of f3.
Fig. 5. 3D view of f4.
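As a reference sketch, the four scaling rules of Eqs. (2)-(5) might be coded as below. This is a rough illustration, not the paper's implementation; the function names and the convention that rank 1 is the fittest individual are assumptions on my part.

```python
import numpy as np

def _ranks(f_raw):
    """Rank 1 = highest raw fitness (an assumed convention)."""
    order = np.argsort(-f_raw)
    r = np.empty(len(f_raw))
    r[order] = np.arange(1, len(f_raw) + 1)
    return r

def linear_scale(f_raw, a, b):
    """Eq. (2): f = a + b * f_raw."""
    return a + b * f_raw

def rank_scale(f_raw, p):
    """Eq. (3): fitness from rank only; best gets p, median gets 1."""
    N = len(f_raw)
    r = _ranks(f_raw)
    return p - 2 * (r - 1) * (p - 1) / (N - 1)

def exponential_scale(f_raw, m):
    """Eq. (4): f = m**(r - 1); successive ranks differ by a factor m."""
    return m ** (_ranks(f_raw) - 1)

def top_scale(f_raw, s, c):
    """Eq. (5): the c best individuals all get s*N, everyone else zero."""
    N = len(f_raw)
    f = np.zeros(N)
    f[np.argsort(-f_raw)[:c]] = s * N
    return f
```

Note how rank, exponential, and top scaling discard the raw fitness magnitudes entirely and keep only the ordering, which is exactly why they can rescue a plateau where raw fitness values are nearly indistinguishable.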


2. DATA AND RESULTS
For these trials, all the user input parameters for the four fitness scaling functions were functionally identical. All the constants were the same, and all the fitness dependencies were coded the same way (see Table 1). While the user controls for the genetic algorithm were identical for all tests with a specific fitness function, they varied slightly between fitness functions. The actual values for all trials are given in Table 2.

Table 1. Scaling Variable Definitions

Function      Variable   Definition
Linear        a          max(raw)
Linear        b          – min(raw)/N
Ranked        p          6
Exponential   m          2
Top           s          0.5
Top           c          9

Table 2. Algorithm Variable Definitions

Variable          f1     f2     f3     f4
Genes             20     20     20     20
Variables          2      2      2      2
Individuals      300    300    300    300
Generations      250    250    300    300
Crossover (%)     98     98     98     98
Mutation (%)       2      2      2      2
Range Min.         0    –10    –10      1
Range Max.        10     10     10     24

Table 3. Scaling Results

Fitness   Scaling       Time (sec)   Best Fitness   x-coordinate      y-coordinate
f1        True          –            5.0000         5.0000            5.0000
f1        None          45.49        4.9996         4.9878            5.0166
f1        Linear        45.99        5.0000         4.9978            4.9980
f1        Ranked        45.91        5.0000         5.0000            5.0000
f1        Exponential   42.79        5.0000         5.0000            5.0000
f1        Top           42.94        4.9969         5.0510            5.0218
f2        True          –            20.0           9.5 ≤ x ≤ 10.0    9.5 ≤ y ≤ 10.0
f2        None          36.49        19             8.9498            9.9746
f2        Linear        33.03        20             9.6769            9.6118
f2        Ranked        34.72        20             9.5047            9.9722
f2        Exponential   32.78        20             9.5304            9.9089
f2        Top           34.48        19             8.6951            9.8721
f3        True          –            0.0000         9.5 ≤ x ≤ 10.0    –10.0 ≤ y ≤ 10.0
f3        None          49.31        0.0000         9.8533            9.9820
f3        Linear        68.29        0.0000         9.7516            7.6598
f3        Ranked        80.94        0.0000         9.5297            –1.3893
f3        Exponential   73.23        0.0000         9.9998            –6.4696
f3        Top           46.72        0.0000         9.9044            –7.6487
f4        True          –            1.9878         9                 4
f4        None          64.13        1.9878         9                 4
f4        Linear        65.21        1.9878         9                 4
f4        Ranked        80.16        1.9878         9                 4
f4        Exponential   80.06        1.9878         9                 4
f4        Top           79.52        1.9878         9                 4

2.1 Results
Two of the best ways to monitor the progress of an algorithm are variance and convergence (or selection pressure). Variance is, as expected, the square of the standard deviation of the raw fitness scores. Convergence occurs when the mean and maximum fitness scores are close; selection pressure is a good gauge of this. Time of execution, while a function of too many variables to be a primary gauge of success, nevertheless provides some information about the strengths of the scaling method, since the time is recorded for the computation of the same number of generations, with the same test function, and (ideally) the same background tasks; the only difference between the trials was the scaling function.

Perhaps the most important measure is whether or not the algorithm found the true maximum of the fitness function. For the fitness functions that have multiple maxima, a certain amount of variation can be expected. Since the only termination criterion used for these tests was the number of generations, some runs converged while others did not. This leads to some variation among the results for those fitness functions with discernible peaks as well. All four measures of effectiveness were recorded for each scaling function when applied to each fitness function.

One of the best ways to monitor genetic diversity is to display the best-fit individual from each successive generation on a plot of the test function (see Figs. 10-29). This way one can see to what degree each generation has learned from the previous, and also how each generation places on the surface of the function. While this method conceals the order in which the points are placed on the surface, general trends can still be identified. For instance, in Fig. 10 one can see a disproportionate number of points lying on the x = 5 and y = 5 coordinates. With linear, ranked, and exponential scaling (Figs. 11-13) such a pattern is suppressed, with exclusive congregation about the maximum point. Similar congregations occur in each of the four test functions for ranked and exponential scaling. Top scaling seems to be the most diverse option: in each of the four test functions, the diversity displayed in Figs. 14, 19, 24, and 29 is greater than with no scaling activated. Linear scaling displayed virtually no difference in diversity in Fig. 21 or 26.

Fig. 6. Variance per generation for f1.
Fig. 7. Variance per generation for f2.
Fig. 8. Variance per generation for f3.
Fig. 9. Variance per generation for f4.

Variance is also a good measure of diversity, and it can be easily visualized with generation information intact. For the first test function (see Fig. 6), only top scaling was able to increase variance, while ranked and exponential scaling gave virtually indistinguishable variance graphs. For the second test function (see Fig. 7), linear scaling and top scaling improved variance, with linear scaling increasing variance dramatically. Again, ranked and exponential scaling were


nearly identical, and both gave very low variance. For the third test function (see Fig. 8), linear and top scaling started off with higher variance, but top scaling died off over time (after about 100 generations it was only just higher than ranked or exponential scaling). Ranked and exponential scaling were yet again the lowest. The final test function (see Fig. 9) had the lowest variance overall, but the previous trends continued nonetheless: linear scaling gave the highest variance, though it was not far separated from no scaling, and top scaling started out even with no scaling but dropped even with ranked and exponential scaling after about 100 generations. What seems to be happening is that top scaling starts off with, and maintains, a higher variance than no scaling, but the variance then drops as the population converges to the maximum. Table 3 shows the time of execution and best fitness (with corresponding coordinates) for each test function, as found by the genetic algorithm with each scaling function and with no scaling activated. The true maximum and its coordinates are also given. Note that since all but the first test function round their coordinates to some degree, only the first test function is a true test of search precision.
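The two per-generation measures discussed above can be computed directly from the raw fitness array. A small sketch follows; the max/mean definition of selection pressure used here is an assumption of mine, chosen because mean close to max signals convergence.

```python
import numpy as np

def generation_stats(f_raw):
    """Per-generation diversity and convergence gauges.

    Variance is the squared standard deviation of the raw fitness scores,
    as in the text. Selection pressure is taken here (an assumed
    definition) as the max/mean ratio: a ratio near 1 means the mean has
    caught up with the maximum, i.e. the population has converged.
    """
    f_raw = np.asarray(f_raw, dtype=float)
    variance = f_raw.std() ** 2
    mean = f_raw.mean()
    pressure = f_raw.max() / mean if mean != 0 else np.inf
    return variance, pressure

var, sp = generation_stats([1.0, 2.0, 3.0, 4.0])
print(round(var, 4), round(sp, 4))   # 1.25 1.6
```

Tracking these two numbers across generations reproduces the kind of curves shown in Figs. 6-9: a scaling method that keeps variance high while pressure drifts toward 1 is converging without collapsing diversity.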

3. SUMMARY AND CONCLUSION
Each of the four scaling functions was applied in an identical manner to the four test functions. The goal was to observe what positive effects, if any, fitness scaling has in a series of real-world trials. The results clearly indicate that scaling has a dramatic effect on genetic diversity and on the rate of convergence. Depending on the method of scaling, diversity can be increased or suppressed. Linear scaling (as it was used here, with the two user variables tied to the population as shown in Table 1) increases diversity to a degree that depends on the function. If one were purely interested in finding a solution quickly, then exponential scaling is the method of choice: in each of the test functions, it was able to promptly focus in on the maximum and minimize deviation from that point over the successive generations. Ranked scaling was the only other method to select (5,5) as the maximum of the first test function, and to do so with five-digit accuracy. But if the fitness function were more complicated than any tested here, with multiple peaks, and genetic diversity needed to be preserved for many generations to explore all those peaks, top scaling seems to be the best choice. In two of the four tests, top scaling was able to match the true maximum (in the first test function, it was less than 0.062% off) while still maintaining high variance; indeed, at the start, the variance was higher than in the case without scaling. It was mentioned earlier that there is a dilemma when it comes to the user input for the various scaling functions. One solution is to tie these variables to some population statistic (as done for linear scaling in these trials). While this may seem the best method, it is not optimal. One of the best solutions when something needs to be dynamically optimized is to apply a compact genetic algorithm (CGA) to the problem.
These algorithms keep track of their progress using a constantly updated probability vector rather than a large population (this relieves CGAs of the need for fitness scaling). In fact, since only one or two individuals' fitness values are calculated per generation, the algorithm can run in near real time. By applying a CGA to optimize the variable controls for the fitness scaling in a simple genetic algorithm, genetic optimization can be taken to its fullest realization.
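The probability-vector idea behind a compact GA can be sketched as follows. This is an illustrative toy on the OneMax problem (maximize the number of 1 bits), not the implementation proposed above; all parameter names are mine.

```python
import random

def compact_ga(fitness, n_bits, pop_size=50, generations=200, seed=1):
    """Minimal compact GA sketch: a probability vector stands in for the
    population. Each generation, two individuals are sampled from the
    vector and the vector is nudged toward the winner by 1/pop_size."""
    random.seed(seed)
    p = [0.5] * n_bits            # probability that each bit is 1
    step = 1.0 / pop_size         # update size of the simulated population
    for _ in range(generations):
        a = [1 if random.random() < q else 0 for q in p]
        b = [1 if random.random() < q else 0 for q in p]
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        for i in range(n_bits):
            # Only bits on which the two samples disagree carry information.
            if winner[i] != loser[i]:
                if winner[i] == 1:
                    p[i] = min(1.0, p[i] + step)
                else:
                    p[i] = max(0.0, p[i] - step)
    return p

# OneMax: fitness is simply the bit count, so sum() works as the fitness.
p = compact_ga(sum, n_bits=8)
print([round(q, 2) for q in p])   # probabilities driven toward 1
```

Only two fitness evaluations occur per generation, which is what makes the near-real-time behavior mentioned above plausible; the probability vector plays the role that fitness scaling plays in a population-based GA.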

REFERENCES
1. Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York, 1994.
2. D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, Reading, MA, 1989.
3. J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, MA, 1998.
4. K. F. Man, K. S. Tang, and S. Kwong, Genetic Algorithms: Concepts and Designs, Springer-Verlag, London, 1999.
5. P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall International, London, 1982.

Fig. 10. Best of generation for f1, no scaling.
Fig. 11. Best of generation for f1, linear scaling.
Fig. 12. Best of generation for f1, ranked scaling.
Fig. 13. Best of generation for f1, exponential scaling.
Fig. 14. Best of generation for f1, top scaling.
Fig. 15. Best of generation for f2, no scaling.
Fig. 16. Best of generation for f2, linear scaling.
Fig. 17. Best of generation for f2, ranked scaling.
Fig. 18. Best of generation for f2, exponential scaling.
Fig. 19. Best of generation for f2, top scaling.
Fig. 20. Best of generation for f3, no scaling.
Fig. 21. Best of generation for f3, linear scaling.
Fig. 22. Best of generation for f3, ranked scaling.
Fig. 23. Best of generation for f3, exponential scaling.
Fig. 24. Best of generation for f3, top scaling.
Fig. 25. Best of generation for f4, no scaling.
Fig. 26. Best of generation for f4, linear scaling.
Fig. 27. Best of generation for f4, ranked scaling.
Fig. 28. Best of generation for f4, exponential scaling.
Fig. 29. Best of generation for f4, top scaling.
