The PDF Estimation Problem

Scientific Computing and Numerical Analysis Seminar

October 5, 2010


Outline

The Big Picture
Basic Probability Theory
Hermite Polynomial Interpolation
Histogram Interpolation
Kernel Estimation
Data Regeneration


The Big Picture
Continuum-Microscopic (CM) Method Steps
1. Create a microscopic system
2. Run the microscopic updating scheme for a small number of time steps
3. Average the results and send these values to the macro-scale
4. Run the macroscopic updating scheme


The Big Picture

Goal: Perform Step 1 of the CM algorithm by utilizing past information from the micro-scale.
Track the evolution of the microscopic variables by tracking their probability distribution functions (PDFs).
Use these PDFs to predict the PDF of each variable at the desired future point in time.


Probability Theory
The Probability Distribution Function (PDF)
A random variable X is defined by its set of possible values Ω and its probability distribution function f(X).
The probability that X takes on a value between x and x + dx is given by $\int_x^{x+dx} f(x)\,dx$.
f(x) is normalized so that its integral over Ω is 1:
$$\int_{\Omega} f(X)\,dX = 1$$


Probability Theory
Expectation
The expected value (or mean) of a random variable X with PDF f(x) is given by:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
More generally, the expectation of any function g(X), with respect to the PDF f(X), is given by:
$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$
If f(X) is unknown, the expectation can be approximated by the average over the given data values:
$$E(g(X)) \approx \frac{1}{N}\sum_{i=1}^{N} g(X_i)$$
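As a quick illustration (a minimal sketch, not from the talk), the sample-average approximation can be checked on synthetic data; the choice of g and the standard-normal samples are assumptions made for the example:

```python
# Approximate E[g(X)] by the sample average of g over data X_i.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)   # samples X_i; here the true PDF is standard normal

g = lambda x: x**2                   # any function g(X); for the standard normal, E[X^2] = 1
estimate = np.mean(g(data))          # E[g(X)] ~ (1/N) sum_i g(X_i)
print(estimate)                      # close to 1
```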

Probability Theory

The Cumulative Distribution Function (CDF)
The CDF is defined as:
$$F(x) = \int_{-\infty}^{x} f(X)\,dX$$
In words, F(x) represents the probability that X takes on a value between −∞ and x.
The CDF will be useful for Data Regeneration.


Probability Theory
Joint Probability Distribution Function (JPDF)
Given random variables $X_1, X_2, \ldots, X_N$, the JPDF $f(X_1, X_2, \ldots, X_N)$ is defined so that the probability that $X_1 \in (x_1, x_1 + dx_1)$, $X_2 \in (x_2, x_2 + dx_2)$, ..., $X_N \in (x_N, x_N + dx_N)$ is given by:
$$\int_{x_1}^{x_1+dx_1} \int_{x_2}^{x_2+dx_2} \cdots \int_{x_N}^{x_N+dx_N} f(x_1, x_2, \ldots, x_N)\,dx_1\,dx_2 \cdots dx_N$$
The JPDF can be written as a product of single-variable PDFs, $f(x_1)\,f(x_2)\cdots f(x_N)$, if the variables are independent.

Probability Theory
Common Distribution Functions
Uniform Distribution:
$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases}$$
Normal Distribution:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
(Plots of the Uniform and Normal distributions.)

Probability Theory

The PDF Estimation Problem
A classic problem of probability theory: given a set of data, the goal is to determine the PDF f(X) that produced that data.
Common techniques: Series Expansions, Histogram Interpolation, and Kernel Estimation.


Probability Theory
The PDF Estimation Problem
Each technique will be tested on a set of data produced by a normal distribution with mean 0 and standard deviation 1.


Probability Theory
Error Estimation
The error for each technique will be estimated by computing the Mean Square Error (MSE):
$$E\!\left((f(x) - \hat{f}(x))^2\right)$$
where E indicates expectation, or average. In practice the MSE is approximated by:
$$E\!\left((f(x) - \hat{f}(x))^2\right) \approx \frac{1}{n}\sum_{i=1}^{n} \left(f(x_i) - \hat{f}(x_i)\right)^2$$

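A minimal sketch of this error measure, assuming the test case above (a standard-normal true PDF) and an estimate already evaluated on a grid of points; the function names are illustrative:

```python
# Approximate the MSE E[(f(x) - f_hat(x))^2] by averaging over evaluation points.
import numpy as np

def true_pdf(x, mu=0.0, sigma=1.0):
    """Normal density used as the known PDF of the test case."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def mean_square_error(f_hat_values, x):
    """f_hat_values are the estimate's values at the points x."""
    return np.mean((true_pdf(x) - f_hat_values)**2)
```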

Hermite Polynomial Expansion
The goal is to estimate the underlying PDF f(x). f(x) can be approximated by a truncated series expansion:
$$f(x) \approx \sum_{n=0}^{N} c_n H_n(x)$$
where $c_n$ are coefficients and $H_n(x)$ are a set of basis functions. For this demonstration, we choose $H_n(x)$ to be the orthogonal Hermite polynomials.

Hermite Polynomial Expansion
The Hermite polynomials are defined as:
$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}$$
The Hermite polynomials are orthogonal on $(-\infty, \infty)$ with respect to the weight $e^{-x^2}$, meaning:
$$\int_{-\infty}^{\infty} H_m(x) H_n(x)\, e^{-x^2}\,dx = \begin{cases} 0 & \text{if } m \ne n \\ n!\,2^n\sqrt{\pi} & \text{if } m = n \end{cases}$$
The orthogonality of $H_n(x)$ will allow for easy computation of the $c_n$ coefficients.
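The orthogonality relation can be checked numerically, for example with Gauss-Hermite quadrature; this is a small sketch, not part of the talk, and it assumes numpy's physicists' Hermite polynomials (`hermgauss`, `hermval`):

```python
# Verify the orthogonality relation for the physicists' Hermite polynomials
# using Gauss-Hermite quadrature, which integrates against the weight exp(-x^2).
import numpy as np
from numpy.polynomial import hermite as H
from math import factorial, sqrt, pi

nodes, weights = H.hermgauss(50)             # 50-point Gauss-Hermite rule

def hermite_inner_product(m, n):
    """Approximate the integral of H_m(x) H_n(x) exp(-x^2) over the real line."""
    cm = np.zeros(m + 1); cm[m] = 1.0        # coefficient vector selecting H_m
    cn = np.zeros(n + 1); cn[n] = 1.0        # coefficient vector selecting H_n
    return np.sum(weights * H.hermval(nodes, cm) * H.hermval(nodes, cn))

print(hermite_inner_product(3, 5))                                   # ~0, since m != n
print(hermite_inner_product(4, 4), factorial(4) * 2**4 * sqrt(pi))   # both ~680.6
```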

Hermite Polynomial Interpolation
Start from the expansion:
$$\hat{f}(x) = \sum_{n=0}^{N} c_n H_n(x)$$
Multiply both sides by $H_m(x)\, e^{-x^2}$:
$$\hat{f}(x)\, H_m(x)\, e^{-x^2} = \sum_{n=0}^{N} c_n H_n(x) H_m(x)\, e^{-x^2}$$
Integrate over $(-\infty, \infty)$ and exchange the sum and the integral:
$$\int_{-\infty}^{\infty} \hat{f}(x)\, H_m(x)\, e^{-x^2}\,dx = \int_{-\infty}^{\infty} \sum_{n=0}^{N} c_n H_n(x) H_m(x)\, e^{-x^2}\,dx = \sum_{n=0}^{N} \int_{-\infty}^{\infty} c_n H_n(x) H_m(x)\, e^{-x^2}\,dx$$

Hermite Polynomial Expansion
By orthogonality, only the $n = m$ term of the sum survives:
$$\int_{-\infty}^{\infty} \hat{f}(x)\, H_n(x)\, e^{-x^2}\,dx = c_n\, n!\,2^n\sqrt{\pi}$$
$$c_n = \frac{1}{n!\,2^n\sqrt{\pi}} \int_{-\infty}^{\infty} \hat{f}(x)\, H_n(x)\, e^{-x^2}\,dx$$
$$c_n = \frac{1}{n!\,2^n\sqrt{\pi}}\, E\!\left(H_n(x)\, e^{-x^2}\right)$$
$$c_n \approx \frac{1}{n!\,2^n\sqrt{\pi}}\, \frac{1}{N}\sum_{i=1}^{N} H_n(x_i)\, e^{-x_i^2}$$

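A minimal sketch of this coefficient estimate and the resulting expansion, using numpy's physicists' Hermite polynomials; the sample data and the function names are assumptions made for illustration:

```python
# Estimate c_n ~ (1 / (n! 2^n sqrt(pi))) * mean(H_n(x_i) * exp(-x_i^2)) from data,
# then evaluate f_hat(x) = sum_n c_n H_n(x).
import numpy as np
from numpy.polynomial import hermite as H
from math import factorial, sqrt, pi

def hermite_coefficients(data, n_terms):
    """Estimate the expansion coefficients c_0, ..., c_N from data samples."""
    coeffs = np.zeros(n_terms + 1)
    weight = np.exp(-data**2)
    for n in range(n_terms + 1):
        basis = np.zeros(n + 1)
        basis[n] = 1.0                        # coefficient vector selecting H_n
        Hn_at_data = H.hermval(data, basis)   # H_n(x_i) for every sample
        coeffs[n] = np.mean(Hn_at_data * weight) / (factorial(n) * 2**n * sqrt(pi))
    return coeffs

def hermite_pdf_estimate(x, coeffs):
    """Evaluate f_hat(x) = sum_n c_n H_n(x)."""
    return H.hermval(x, coeffs)

# Example on the test case: 10,000 standard-normal samples, 10 terms.
rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)
c = hermite_coefficients(data, n_terms=10)
f_hat = hermite_pdf_estimate(np.linspace(-4, 4, 200), c)
```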

Hermite Polynomial Expansion
Results for different numbers of terms in the expansion (plots for 6, 10, 20, and 40 terms).

Hermite Polynomial Expansion

Number of Terms    MSE
6                  0.03374
10                 0.00109
20                 0.00291
40                 0.01914

$\hat{f}(x)$ does poorly at the edges of the domain.
Errors are due to truncation of terms.
Approximation theory says the error between $f(x)$ and $\sum_{n=0}^{N} c_n H_n(x)$ should decrease as N increases if the $c_n$ are computed exactly.

Histogram Interpolation
One of the oldest and most common PDF estimation techniques. The first step is to establish the bins into which the data will be sorted. Given a starting point $x_0$ and bin width $h$, the bins can be established as $[x_0 + mh,\ x_0 + (m+1)h]$. The histogram is then defined as:
$$\hat{f}(x) = \frac{1}{nh}\,(\text{number of } X_i \text{ in the same bin as } x)$$
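A minimal sketch of this estimator; the function and variable names (histogram_estimate, x0, h) are illustrative, not from the talk:

```python
# Piecewise-constant histogram estimate f_hat(x) = (count in x's bin) / (n * h),
# with bins [x0 + m*h, x0 + (m+1)*h).
import numpy as np

def histogram_estimate(x, data, x0, h):
    """Evaluate the histogram density estimate at the points x."""
    n = len(data)
    data_bins = np.floor((data - x0) / h).astype(int)              # bin index of each sample
    x_bins = np.floor((np.atleast_1d(x) - x0) / h).astype(int)     # bin index of each query point
    counts = np.array([(data_bins == b).sum() for b in x_bins])
    return counts / (n * h)
```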

Histogram Interpolation
$\hat{f}(x)$ is a piecewise-constant estimate of the underlying PDF $f(x)$. If a continuous function approximation is needed, $\hat{f}(x)$ can be interpolated (e.g. with splines). The choice of bin endpoints and width will create different results:
Wide bins: smooth, and blur details in the data.
Narrow bins: not enough data per bin, so the resulting approximation is very spiky.

Histogram Interpolation
Results for different bin widths (plots for h = 0.8, 0.5, 0.2, and 0.05).


Histogram Interpolation

Bin Width    MSE
0.8          3.045e-5
0.5          1.589e-5
0.2          3.753e-5
0.05         2.094e-4

The optimal bin width can be found by solving an error-minimization problem (provided f(X) is known).
Formulas exist to estimate the optimal bin width for data that is close to normally distributed.
Example: Sturges' formula, $k = \log_2 n + 1$, where k is the number of bins.
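As a small illustration (the helper name and the range-based bin width are assumptions, not part of the talk), Sturges' formula can be applied as:

```python
# Sturges' formula: k = log2(n) + 1 bins, with a bin width spanning the data range.
import numpy as np

def sturges_bins(data):
    """Return the number of bins k and the corresponding bin width h."""
    n = len(data)
    k = int(np.ceil(np.log2(n))) + 1
    h = (data.max() - data.min()) / k
    return k, h
```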

Kernel Estimation
Another very popular PDF estimation technique. Similar to histograms, but instead of creating separate bins into which data is collected and counted, $\hat{f}(X)$ is computed as a sum of functions centered at each data point:
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$
Here h is still a "width" parameter, and n is the number of data points $X_i$.

Kernel Estimation
The "kernel" function K is usually a symmetric probability distribution function, like a normal distribution:
$$K\!\left(\frac{x - X_i}{h}\right) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - X_i}{h}\right)^2}$$
With this choice, $\hat{f}(X)$ is a sum of normal distributions.

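A minimal sketch of this Gaussian kernel estimate, written to be consistent with $\hat{f}(x) = \frac{1}{nh}\sum_i K((x - X_i)/h)$; the function names are illustrative:

```python
# Gaussian kernel density estimate: each data point contributes a normal bump.
import numpy as np

def gaussian_kernel(u):
    """Standard normal kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_estimate(x, data, h):
    """Evaluate f_hat at the points x with bandwidth h."""
    x, data = np.atleast_1d(x), np.asarray(data)
    u = (x[:, None] - data[None, :]) / h     # scaled distances, shape (len(x), n)
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)
```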

Kernel Estimation
$\hat{f}(X)$ is a smooth, differentiable function. There is no need to choose where to center the bins; here the "bins" are centered at each data point and overlap with one another. As in histogram interpolation, there are various methods for choosing h, including minimization of the mean square error (which requires knowledge of f(X)) and others such as least-squares cross-validation.

Kernel Estimation
Results for the test case with different h values (plots for h = 0.5, 0.2, 0.1, and 0.05).

Kernel Estimation

Width h    MSE
0.5        2.983e-4
0.2        1.896e-5
0.1        2.345e-5
0.05       5.110e-5

The estimated function $\hat{f}(X)$ is a smooth function, requires only a choice of width, and has low errors.
Conclusion: Kernel Estimation will be used to estimate the PDFs of data from microscopic variables in the CM model.

Data Regeneration
Given a PDF, we need to generate a set of data from it to assign to the elements or particles in the micro-system.
This is done by computing the cumulative distribution function $F(x) = \int_{-\infty}^{x} f(X)\,dX$.
The values of F(x) range from 0 to 1.
A random number generator is used to pick a value $c \in [0, 1]$.
A root-finding algorithm is then used to solve $F(x) - c = 0$ for x (the desired data point).
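A minimal sketch of this inverse-CDF regeneration step, assuming a density estimate f_hat whose mass lies essentially inside the bracket [lo, hi]; the integration routine, bracketing interval, and function names are assumptions made for the example:

```python
# Regenerate data from a density estimate: build F(x) by numerical integration,
# then solve F(x) - c = 0 for each uniform random value c.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def regenerate_data(f_hat, n_samples, lo=-10.0, hi=10.0, seed=0):
    """Draw n_samples points distributed according to the density f_hat."""
    rng = np.random.default_rng(seed)
    cdf = lambda x: quad(f_hat, lo, x)[0]     # F(x), the CDF of f_hat starting at lo
    samples = []
    for c in rng.uniform(0.0, 1.0, n_samples):
        samples.append(brentq(lambda x: cdf(x) - c, lo, hi))   # root of F(x) - c
    return np.array(samples)
```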

Data Regeneration

PDF → CDF → Data Set


Summary
Kernel Estimation will be used to estimate the PDFs of various variables of the microscopic system.
These PDFs will be collected over time during the microscopic evolution.
A new PDF at the desired future point in time will be extrapolated from these saved PDFs.
A new micro-system will be created at the future point in time based on these predicted PDFs.

References

Silverman, B.W. "Density Estimation for Statistics and Data Analysis", Chapman and Hall, 1986.


Seminar Speakers

We need volunteers to give a talk at this seminar on the following dates: October 20, October 27, November 3, and November 10.
