The PDF Estimation Problem
Scientific Computing and Numerical Analysis Seminar
October 5, 2010
Outline
- The Big Picture
- Basic Probability Theory
- Hermite Polynomial Interpolation
- Histogram Interpolation
- Kernel Estimation
- Data Regeneration
The Big Picture
Continuum-Microscopic Method Steps
1. Create a microscopic system
2. Run the microscopic updating scheme for a small number of time steps
3. Average the results and send these values to the macro-scale
4. Run the macroscopic updating scheme
The Big Picture
Goal: Perform Step 1 of the CM algorithm by utilizing past information from the micro-scale
- Track the evolution of the microscopic variables by tracking their probability distribution functions (PDFs)
- Use these PDFs to predict the PDF of each variable at the desired future point in time
Probability Theory
The Probability Distribution Function (PDF)
- A random variable X is defined by its set of possible values Ω and its probability distribution function f(X)
- The probability that X takes on a value between x and x + dx is given by \int_x^{x+dx} f(x)\,dx
- f(X) is normalized so that its integral over Ω equals 1:

\int_\Omega f(X)\,dX = 1
Probability Theory
Expectation
The expected value (or mean) of a random variable X with PDF f(X) is given by:

E(X) = \int_{-\infty}^{\infty} x f(x)\,dx
More generally, the expectation of any function g(X) of the random variable is given by:

E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx
If f(X) is unknown, the expectation can be approximated by the sample average of the given data values:

E(g(X)) \approx \frac{1}{N} \sum_{i=1}^{N} g(X_i)
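As a minimal sketch of the sample-average approximation (the data set and choice of g are illustrative, not from the slides), we can estimate E(g(X)) for g(x) = x^2 from draws of a standard normal, whose true second moment is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Sample-average estimate of E(g(X)) for g(x) = x^2.
# For a standard normal, E(X^2) = Var(X) = 1.
estimate = np.mean(samples**2)
print(estimate)  # close to 1.0
```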
Probability Theory
The Cumulative Distribution Function (CDF)
The CDF is defined as:

F(x) = \int_{-\infty}^{x} f(X)\,dX

In words, F(x) represents the probability that X takes on a value between −∞ and x.
The CDF will be useful for Data Regeneration.
Probability Theory
Joint Probability Distribution Function (JPDF)
Given random variables X_1, X_2, ..., X_N, the JPDF f(X_1, X_2, ..., X_N) is defined so that the probability that X_1 ∈ (x_1, x_1 + dx_1), X_2 ∈ (x_2, x_2 + dx_2), ..., X_N ∈ (x_N, x_N + dx_N) is given by:

\int_{x_1}^{x_1+dx_1} \int_{x_2}^{x_2+dx_2} \cdots \int_{x_N}^{x_N+dx_N} f(x_1, x_2, \ldots, x_N)\,dx_1\,dx_2 \cdots dx_N

If the variables are independent, the JPDF can be written as a product of single-variable PDFs: f(x_1) f(x_2) \cdots f(x_N).
Probability Theory
Common Distribution Functions
Uniform Distribution:

f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases}

Normal Distribution:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

[Figures: plots of the uniform and normal distributions]
Probability Theory
The PDF Estimation Problem
- A classic problem of probability theory
- Given a set of data, the goal is to determine the PDF f(X) that produced that data
- Common techniques: series expansions, histogram interpolation, and kernel estimation
Probability Theory
The PDF Estimation Problem
Each technique will be tested on a set of data produced by a normal distribution with mean 0 and standard deviation 1.
Probability Theory
Error Estimation
The error for each technique will be estimated by computing the Mean Square Error (MSE), where E indicates expectation (the average):

E\left(\left(\hat{f}(x) - f(x)\right)^2\right) \approx \frac{1}{n} \sum_{i=1}^{n} \left(\hat{f}(x_i) - f(x_i)\right)^2
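The MSE above can be sketched as a small helper; the grid of evaluation points and the perturbed comparison density are illustrative assumptions, not part of the slides:

```python
import numpy as np

def mse(f_hat, f_true, xs):
    """Mean square error between an estimated and a true PDF,
    averaged over the evaluation points xs."""
    return np.mean((f_hat(xs) - f_true(xs)) ** 2)

def normal_pdf(x, mu=0.0, sigma=1.0):
    # PDF of N(mu, sigma^2), matching the normal distribution formula above.
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Example: compare a slightly-too-wide normal against the true standard normal.
xs = np.linspace(-4, 4, 201)
err = mse(lambda x: normal_pdf(x, sigma=1.2), normal_pdf, xs)
print(err)
```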
Hermite Polynomial Expansion
The goal is to estimate the underlying PDF f(x). f(x) can be approximated by a truncated series expansion:

f(x) \approx \sum_{n=0}^{N} c_n H_n(x)

where c_n are coefficients and H_n(x) are a set of basis functions. For this demonstration, we choose H_n(x) to be the orthogonal Hermite polynomials.
Hermite Polynomial Expansion
The Hermite polynomials are defined as:

H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}

The Hermite polynomials are orthogonal on (−∞, ∞) with weight e^{-x^2}, meaning:

\int_{-\infty}^{\infty} H_m(x) H_n(x) e^{-x^2}\,dx = \begin{cases} 0 & \text{if } m \ne n \\ n!\,2^n \sqrt{\pi} & \text{if } m = n \end{cases}

The orthogonality of H_n(x) will allow for easy computation of the c_n coefficients.
Hermite Polynomial Expansion
To find the coefficients, multiply the expansion by H_m(x) e^{-x^2} and integrate over the real line:

\hat{f}(x) = \sum_{n=0}^{N} c_n H_n(x)

\hat{f}(x) H_m(x) e^{-x^2} = \sum_{n=0}^{N} c_n H_n(x) H_m(x) e^{-x^2}

\int_{-\infty}^{\infty} \hat{f}(x) H_m(x) e^{-x^2}\,dx = \int_{-\infty}^{\infty} \sum_{n=0}^{N} c_n H_n(x) H_m(x) e^{-x^2}\,dx

\int_{-\infty}^{\infty} \hat{f}(x) H_m(x) e^{-x^2}\,dx = \sum_{n=0}^{N} \int_{-\infty}^{\infty} c_n H_n(x) H_m(x) e^{-x^2}\,dx
Hermite Polynomial Expansion
By orthogonality, only the m = n term survives, which gives the coefficients:

\int_{-\infty}^{\infty} \hat{f}(x) H_n(x) e^{-x^2}\,dx = c_n\, n!\,2^n \sqrt{\pi}

c_n = \frac{1}{n!\,2^n \sqrt{\pi}} \int_{-\infty}^{\infty} \hat{f}(x) H_n(x) e^{-x^2}\,dx

c_n = \frac{1}{n!\,2^n \sqrt{\pi}}\, E\!\left(H_n(X)\, e^{-X^2}\right)

c_n \approx \frac{1}{n!\,2^n \sqrt{\pi}} \cdot \frac{1}{N} \sum_{i=1}^{N} H_n(x_i)\, e^{-x_i^2}
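The coefficient formula above can be sketched with NumPy's physicists'-convention Hermite module; the data set, number of terms, and variable names are illustrative assumptions, not from the slides:

```python
import numpy as np
from numpy.polynomial.hermite import hermval  # physicists' Hermite H_n
from math import factorial, sqrt, pi

rng = np.random.default_rng(1)
data = rng.normal(size=50_000)  # illustrative sample from the test distribution

N_TERMS = 10
# c_n ~ (1 / (n! 2^n sqrt(pi))) * mean(H_n(x_i) * exp(-x_i^2))
coeffs = np.zeros(N_TERMS)
weights = np.exp(-data**2)
for n in range(N_TERMS):
    basis = np.zeros(n + 1)
    basis[n] = 1.0  # coefficient vector selecting H_n in hermval
    coeffs[n] = np.mean(hermval(data, basis) * weights) / (factorial(n) * 2**n * sqrt(pi))

# Evaluate the estimated PDF f_hat(x) = sum_n c_n H_n(x) on a grid.
xs = np.linspace(-3, 3, 121)
f_hat = hermval(xs, coeffs)
print(f_hat[60])  # value at x = 0; the true standard normal pdf(0) is about 0.399
```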
Hermite Polynomial Expansion
Results for different numbers of terms in the expansion:
[Figures: estimated PDFs with 6, 10, 20, and 40 terms]
Hermite Polynomial Expansion

Number of Terms | MSE
6               | 0.03374
10              | 0.00109
20              | 0.00291
40              | 0.01914
- \hat{f}(x) does poorly at the edges of the domain
- Errors are due to truncation of terms
- Approximation theory says the error between f(x) and \sum_{n=0}^{N} c_n H_n(x) should decrease as N increases, provided the c_n are computed exactly
Histogram Interpolation
- One of the oldest, most common PDF estimation techniques
- The first step is to establish the bins into which data will be sorted
- Given a starting point x_0 and bin width h, the bins are the intervals [x_0 + mh, x_0 + (m+1)h]
- The histogram estimate is defined as:

\hat{f}(x) = \frac{1}{nh}\,(\text{number of } X_i \text{ in the same bin as } x)
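The histogram estimate can be sketched directly from its definition; the data set, starting point x0, and bin width are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=10_000)

def histogram_pdf(x, data, x0, h):
    """Histogram estimate f_hat(x) = (count of X_i in x's bin) / (n * h)."""
    n = len(data)
    m = np.floor((x - x0) / h)                   # index of the bin containing x
    in_bin = np.floor((data - x0) / h) == m      # data points in that same bin
    return np.count_nonzero(in_bin) / (n * h)

# Estimate near the peak of the standard normal (true value is about 0.399).
print(histogram_pdf(0.0, data, x0=-4.0, h=0.2))
```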
Histogram Interpolation
- \hat{f}(x) is a piecewise-constant estimate of the underlying PDF f(x)
- If a continuous function approximation is needed, \hat{f}(x) can be interpolated (e.g. with splines)
- The choice of bin endpoints and width will create different results
- Wide bins: smooth out and blur details in the data
- Narrow bins: not enough data per bin, so the resulting approximation is very spiky
Histogram Interpolation
Results for different bin widths: h = 0.8, 0.5
[Figures: histogram estimates for h = 0.8 and h = 0.5]
Histogram Interpolation
Results for different bin widths: h = 0.2, 0.05
[Figures: histogram estimates for h = 0.2 and h = 0.05]
Histogram Interpolation

Bin Width | MSE
0.8       | 3.045e-5
0.5       | 1.589e-5
0.2       | 3.753e-5
0.05      | 2.094e-4
- The optimal bin width can be found by solving an error minimization problem (provided f(X) is known)
- Formulas exist to estimate the optimal bin width for data that is close to normally distributed
- Example: Sturges' formula: k = log2(n) + 1, where k is the number of bins
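Sturges' formula can be sketched as follows; rounding log2(n) up to get a whole number of bins is an assumption on our part, since the slide states the formula without a rounding rule:

```python
import math

def sturges_bins(n):
    """Sturges' formula: k = log2(n) + 1 bins, with log2(n) rounded up
    (an assumed rounding convention) so k is a whole number."""
    return math.ceil(math.log2(n)) + 1

print(sturges_bins(1024))  # → 11
```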
Kernel Estimation
- Another very popular PDF estimation technique
- Similar to histograms, but instead of creating separate bins into which data is collected and counted, \hat{f}(X) is computed as a sum of functions centered at each data point:

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)

- h is still a "width" parameter, and n is the number of data points X_i
Kernel Estimation
- The "kernel" function K is usually a symmetric probability distribution function, like a normal distribution:

\frac{1}{h} K\!\left(\frac{x - X_i}{h}\right) = \frac{1}{\sqrt{2\pi h^2}}\, e^{-\frac{1}{2}\left(\frac{x - X_i}{h}\right)^2}

- With this choice, \hat{f}(X) is a sum of normal distributions
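A minimal sketch of the Gaussian-kernel estimate above, assuming an illustrative data set and bandwidth (neither is specified at this point in the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=5_000)

def kde(x, data, h):
    """Gaussian kernel density estimate:
    f_hat(x) = (1/(n h)) * sum_i K((x - X_i)/h), with K the standard normal."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * np.sqrt(2.0 * np.pi))

# Estimate at the peak of the standard normal (true value is about 0.399).
print(kde(0.0, data, h=0.2))
```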
Kernel Estimation
- \hat{f}(X) is a smooth, differentiable function
- There is no need to choose where to center bins; here the "bins" are centered at each data point and overlap with one another
- As in histogram interpolation, there are various methods for choosing h
- Methods include minimization of the mean square error (which requires knowledge of f(X)) and others such as least-squares cross-validation
Kernel Estimation
Results for the test case with different h values:
[Figures: kernel estimates for h = 0.5, 0.2, 0.1, 0.05]
Kernel Estimation

Bandwidth h | MSE
0.5         | 2.983e-4
0.2         | 1.896e-5
0.1         | 2.345e-5
0.05        | 5.110e-5
- The estimated function \hat{f}(X) is smooth, requires only a choice of bandwidth, and has low errors
- Conclusion: kernel estimation will be used to estimate the PDFs of data from microscopic variables in the CM model
Data Regeneration
- Given a PDF, we need to generate a set of data from it to assign to the elements or particles in the micro-system
- This is done by computing the cumulative distribution function F(x) = \int_{-\infty}^{x} f(X)\,dX
- The values of F(x) range from 0 to 1
- A random number generator is used to pick a value c ∈ [0, 1]
- A root-finding algorithm is then used to solve F(x) − c = 0 for x (the desired data point)
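The steps above can be sketched for the standard normal; using bisection as the root-finder and the closed-form CDF via erf are our illustrative choices, since the slides do not name a specific algorithm:

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal: F(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_via_inverse_cdf(cdf, lo=-10.0, hi=10.0, tol=1e-10):
    """Draw one sample: pick c in [0, 1], then solve F(x) - c = 0 by bisection."""
    c = random.random()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < c:
            lo = mid   # root is above mid, since F is increasing
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(0)
samples = [sample_via_inverse_cdf(normal_cdf) for _ in range(2_000)]
mean = sum(samples) / len(samples)
print(mean)  # close to 0 for the standard normal
```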
Data Regeneration
[Figure: PDF → CDF → Data Set]
Summary
- Kernel estimation will be used to estimate PDFs of various variables of the microscopic system
- These PDFs will be collected over time during the microscopic evolution
- A new PDF at the desired future point in time will be extrapolated from these saved PDFs
- A new micro-system will be created at that future time based on the predicted PDFs
References
Silverman, B.W. "Density Estimation for Statistics and Data Analysis", Chapman and Hall, 1986.
Seminar Speakers
We need volunteers to give a talk at this seminar on the following dates: October 20, 27 and November 3, 10