Machine Learning and Causal Inference With Positive Definite Kernels

Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
http://www.kyb.tuebingen.mpg.de/bs
Tübingen, July 1, 2009


Empirical Inference
•  Statistical inference from data (observations, measurements)
•  Example 1: perceiving & acting
   “The brain is nothing but a statistical decision organ” (Barlow)
•  Example 2: doing science
   “If your experiment needs statistics [i.e., inference], you ought to have done a better experiment.” (Rutherford)
•  Can one always do a better experiment? Are there hard inference problems?
   •  High dimensionality – consider many factors simultaneously to find the regularity
   •  Complex regularities – nonlinear, nonstationary, etc.
   •  Weak prior knowledge – e.g., no mechanistic models for the data
   •  Need for large data sets – processing requires computers and machine learning methods



Empirical Inference, II
•  Classical science (Rutherford, ...) had a selection bias: it tried to avoid hard inference problems (they would have been impossible to solve anyway, without computers).
•  However, there are regularities in the world that require the solution of hard inference problems (using machine learning).
•  Examples:
   •  Design of autonomous systems: systems that interact with a complex world cannot be programmed with simple rules
   •  Prediction for complex biological systems, e.g., splice form recognition (Rätsch et al., 2008)
   •  ...
•  Caveats:
   •  Machine learning methods usually do not yield comprehensible results
   •  They require skilled users who come up with sensible hypotheses


Overview
•  Statistical Learning Theory – Ulrike von Luxburg
•  Vision and Image Processing – Christoph Lampert
•  Robot Learning – Jan Peters
•  Kernel Algorithms – Yasemin Altun, Arthur Gretton
•  Machine Learning in Neuroscience – Jeremy Hill
•  Machine Learning in Computational Biology – Karsten Borgwardt*
•  Causal and Probabilistic Inference – Dominik Janzing

* Joint group with the MPI for Developmental Biology

Example of a Pattern Recognition Algorithm

[Figure: training points from two classes with class means µ(X) and µ(Y); a test point is assigned to the class whose mean is closer, i.e., the decision is based on µ(X) − µ(Y).]
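The figure itself is not reproduced in this text; assuming it illustrates the usual mean-based toy classifier (with µ+ and µ− denoting the two class means), the decision rule can be written as:

```latex
% Simple mean-based classifier (assumed reading of the slide's figure):
% assign x to the class whose empirical mean is closer.
\mu_{\pm} = \frac{1}{m_{\pm}} \sum_{i \,:\, y_i = \pm 1} x_i, \qquad
w = \mu_{+} - \mu_{-}, \qquad
c = \tfrac{1}{2}\,(\mu_{+} + \mu_{-}), \qquad
y = \operatorname{sgn}\,\langle w,\; x - c \rangle .
```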

Reproducing Kernel Hilbert Space (RKHS)

(or some other positive definite k)

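The slide's formulas did not survive the extraction; as a hedged reconstruction, the standard construction uses a Gaussian kernel (or some other positive definite k), the canonical feature map, and the reproducing property:

```latex
% Standard RKHS construction (assumed to match the slide's missing equations).
k(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right), \qquad
\Phi(x) = k(x, \cdot), \qquad
\langle \Phi(x), \Phi(x') \rangle_{\mathcal H} = k(x, x'), \qquad
f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal H} .
```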

Kernel PCA

Schölkopf, Smola & Müller, 1998. Contains LLE, Laplacian Eigenmaps, and (in the limit) Isomap as special cases with data-dependent kernels (Ham et al., 2004).
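A minimal sketch of kernel PCA as described in Schölkopf, Smola & Müller (1998): form the kernel matrix, center it in feature space, and project onto the leading eigenvectors. The kernel choice and bandwidth below are illustrative assumptions.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Project data onto the leading principal components in a Gaussian RKHS."""
    # Gaussian (RBF) kernel matrix
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    # Center the kernel matrix (corresponds to centering in feature space)
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecomposition; keep the leading components
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Projections of the training points onto the kernel principal components
    return eigvecs * np.sqrt(np.maximum(eigvals, 0))

# Example usage on toy data
Z = kernel_pca(np.random.RandomState(0).randn(100, 5), n_components=2)
```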

Pattern Recognition Algorithm in the RKHS

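The slide's equations are not in the extracted text; a minimal sketch of the kernelized mean classifier (the RKHS version of the earlier toy algorithm), assuming a Gaussian kernel and the standard textbook offset term:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian kernel matrix between row sets A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mean_classifier_predict(Xtr, ytr, Xte, gamma=1.0):
    """Assign each test point to the class whose kernel mean is closer."""
    pos, neg = Xtr[ytr == 1], Xtr[ytr == -1]
    # Offset compares the squared RKHS norms of the two kernel means
    b = 0.5 * (rbf(neg, neg, gamma).mean() - rbf(pos, pos, gamma).mean())
    f = rbf(Xte, pos, gamma).mean(1) - rbf(Xte, neg, gamma).mean(1) + b
    return np.sign(f)
```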

Support Vector Machines

[Figure: two classes mapped into feature space by Φ; the kernel evaluates the inner product there, k(x, x') = ⟨Φ(x), Φ(x')⟩.]

• sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)

• unique solution found by convex QP
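For reference, a sketch of the standard objects behind these bullets: the decision function as a sparse expansion in support vectors, obtained from the usual soft-margin dual QP (the constant C belongs to that standard formulation rather than being stated on the slide):

```latex
% SVM decision function and dual QP (standard soft-margin formulation).
f(x) = \operatorname{sgn}\!\Big(\sum_{i=1}^{m} \alpha_i\, y_i\, k(x_i, x) + b\Big),
\qquad
\max_{\alpha}\; \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\; \sum_{i=1}^{m} \alpha_i y_i = 0 .
```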

Massive Data: Classifying astronomical spectra
GAIA: a 2011 ESA mission; stereoscopic survey down to V = 20, detecting about one billion stars, several million galaxies, and 0.5 million quasars.
•  MPI for Astronomy (Heidelberg) will use SVMs to
   1. classify objects into {star, quasar, galaxy}
   2. for stars, estimate temperature, surface gravity, composition, and the line-of-sight interstellar extinction
•  Data: 80 measurements per object, in the optical spectrum

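A hedged illustration (not the MPIA pipeline; the data, labels, and hyperparameters are invented for the example) of how an SVM classifier and a support vector regressor might be applied to 80-dimensional spectral measurements:

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.RandomState(0)
X = rng.randn(600, 80)                        # 80 measurements per object (synthetic)
obj_class = rng.randint(0, 3, 600)            # 0=star, 1=quasar, 2=galaxy (synthetic labels)
temperature = 5000 + 100 * X[:, 0] + rng.randn(600)   # synthetic stellar temperature

# 1. Classify objects into {star, quasar, galaxy}
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, obj_class)

# 2. For stars, estimate a physical parameter (here: temperature) by SV regression
stars = obj_class == 0
reg = SVR(kernel="rbf", C=10.0).fit(X[stars], temperature[stars])
```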

Part 2: Kernel Means
(joint work with A. Gretton, K. Fukumizu, B. Sriperumbudur, A. Smola, L. Song)



Large distance ⇒ one can find a function distinguishing the two samples

(done in the Gaussian RKHS)
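The preceding definition slides did not survive the extraction; as a sketch, the quantity under discussion, the maximum mean discrepancy (MMD), is the RKHS distance between the two kernel means and can be estimated directly from kernel matrices (the Gaussian kernel and bandwidth are illustrative choices):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2_biased(X, Y, gamma=1.0):
    """Biased estimate of MMD^2 = || mu(X) - mu(Y) ||_H^2 between two samples."""
    return rbf(X, X, gamma).mean() - 2 * rbf(X, Y, gamma).mean() + rbf(Y, Y, gamma).mean()

# Large values suggest a function in the RKHS can distinguish the two samples
rng = np.random.RandomState(0)
print(mmd2_biased(rng.randn(200, 2), rng.randn(200, 2) + 1.0))
```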


Combine this with [equation not recovered from the extraction].

Discussion: solves a high-dimensional optimization problem…


• Assume we have densities, the kernel is shift invariant, k(x, y) = k(x − y), and all Fourier transforms exist. Note that µ is invertible iff convolution with k is injective, i.e., iff the support of k̂ is all of ℝ^d (Sriperumbudur, Fukumizu, Gretton, Schölkopf, 2008).
• E.g., µ is invertible if k̂ has full support. For restricted classes of distributions, weaker conditions suffice (e.g., if supp(k̂) has nonempty interior, µ is invertible for distributions with compact support).
• Example: p a source of incoherent light, I the indicator function of an aperture. In Fraunhofer diffraction, the intensity image is (proportional to) the convolution p ∗ |F I|². Setting k = |F I|², this equals the kernel mean µ[p].
• This imaging process is not invertible for the class of all light sources, but it is if we restrict the class (e.g., to compact support).
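One way to see the invertibility condition (a hedged reconstruction of the reasoning, since the slide's equations are missing): for a shift-invariant kernel, the kernel mean of a density is a convolution, and the convolution theorem moves the question to the Fourier domain:

```latex
% Kernel mean of a density p with a shift-invariant kernel, and its Fourier form.
\mu[p](x) = \int k(x - y)\, p(y)\, dy = (k * p)(x)
\quad\Longrightarrow\quad
\widehat{\mu[p]} = \hat{k}\,\hat{p} .
```

So p can be recovered from µ[p] wherever k̂ does not vanish; if supp(k̂) = ℝ^d, no information about p is lost.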


Our main application of independence tests

•  J. Friedman: when you're doing science, you're not interested in prediction, but in the nature of the association.
•  One step further: when you're doing science, you're not interested in association, but in causality.


Causal Inference
•  Reichenbach's Common Cause Principle links causality and probability: if X and Y are dependent, then there is a Z causally influencing both; Z screens X and Y from each other (given Z, X and Y become independent):

   [Diagram: Z with arrows to X and to Y]

•  Pearl et al.: causal DAGs; for each vertex we have Xi = fi(DirectCausesi, Noisei), with independent noise terms.
•  They satisfy the Causal Markov Condition: given its direct causes, a variable is independent of its non-effects ⇒ causal DAGs can be estimated up to Markov equivalence.
•  Useless for the simplest graphs (the 2-variable case).
•  Simplify this model to Xi = fi(Parentsi) + Noisei (think of it as a local linearization – note that this will not always make sense!)
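For concreteness, the functional causal model and the factorization implied by the Causal Markov Condition can be written as follows (standard notation, not taken verbatim from the slide):

```latex
% Functional causal model with jointly independent noises, and the implied
% Markov factorization over the DAG.
X_i = f_i(\mathrm{PA}_i, N_i), \quad N_1, \dots, N_n \ \text{jointly independent}
\qquad\Longrightarrow\qquad
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\!\left(x_i \mid \mathrm{pa}_i\right),
```

with the additive-noise restriction used below being X_i = f_i(PA_i) + N_i.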

Causal Inference with Additive Noise, 2-Variable Case
Identifiability: forward model y = f(x) + n with x, n independent. Question: when is there a backward model of the same form, x = g(y) + m with y, m independent?

[Figure: X →? Y; example scatter plots for f(x) = x and f(x) = x + x³]

Hoyer, Janzing, Mooij, Peters, Schölkopf: Nonlinear causal discovery with additive noise models. NIPS 21, 2009.
Mooij, Janzing, Peters, Schölkopf: Regression by dependence minimization and its application to causal inference. ICML 2009.
Peters, Janzing, Gretton, Schölkopf: Detecting the direction of causal time series. ICML 2009.
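A minimal sketch of the additive-noise approach (regress in both directions and prefer the direction in which the residuals are more nearly independent of the input); the regression method, HSIC estimator, and bandwidths below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def rbf(a, b, gamma=1.0):
    # Gaussian kernel matrix for 1-D samples a and b
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def hsic(x, y, gamma=1.0):
    """Biased HSIC estimate: (1/n^2) tr(K H L H); small values suggest independence."""
    n = len(x)
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return np.trace(rbf(x, x, gamma) @ H @ rbf(y, y, gamma) @ H) / n**2

def anm_direction(x, y):
    """Return 'x->y' or 'y->x' by comparing residual dependence in both directions."""
    def residual_dep(a, b):
        reg = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(a[:, None], b)
        return hsic(a, b - reg.predict(a[:, None]))
    return "x->y" if residual_dep(x, y) < residual_dep(y, x) else "y->x"

# Toy example: y = x + x^3 + noise should typically be identified as x->y
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 200)
y = x + x**3 + 0.1 * rng.randn(200)
print(anm_direction(x, y))
```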

Intuition


(Hoyer, Janzing, Mooij, Peters, Schölkopf, 2008)

Causal Inference Method


Experiments


Other Experiments


Independence-based Regression (Mooij et al., 2009)
•  Problem: many regression methods assume a particular noise distribution; if this assumption is incorrect, the residuals may remain dependent on the input.
•  Solution: in the regression objective, minimize the dependence of the residuals rather than maximizing the likelihood of the data.
•  Use HSIC as the dependence measure.

Mooij, Janzing, Peters, Schölkopf: Regression by dependence minimization and its application to causal inference. ICML 2009

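A rough sketch of the idea (not the authors' estimator or optimizer): parameterize the regression function as a kernel expansion and minimize an HSIC dependence measure between the inputs and the residuals, rather than a likelihood. The kernel choice, regularizer, warm start, and optimizer are assumptions made for the example:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def hsic(x, y, gamma=1.0):
    n = len(x)
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return np.trace(rbf(x, x, gamma) @ H @ rbf(y, y, gamma) @ H) / n**2

def fit_by_dependence_minimization(x, y, gamma=1.0, lam=1e-3):
    """Fit f(x) = sum_j alpha_j k(x, x_j) by minimizing HSIC(x, y - f(x))."""
    K = rbf(x, x, gamma)                       # kernel expansion on the training inputs
    def objective(alpha):
        r = y - K @ alpha                      # residuals
        return hsic(x, r) + lam * alpha @ K @ alpha      # dependence + smoothness penalty
    alpha0 = np.linalg.solve(K + lam * len(x) * np.eye(len(x)), y)   # ridge warm start
    alpha = minimize(objective, alpha0, method="L-BFGS-B").x
    return lambda xnew: rbf(xnew, x, gamma) @ alpha
```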

Detection of Confounders (Janzing et al., 2009)

[Diagram: X, Y, and a possible latent confounder (marked '?')]

•  Confounded additive noise (CAN) models
•  Estimate (u(T), v(T)) using dimensionality reduction
•  If E_X or E_Y is close to zero, output 'no confounder'
•  Identifiability result for small noise

Janzing, Peters, Mooij, Schölkopf: Identifying latent confounders using additive noise models. UAI 2009
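A hedged reconstruction of the model equations (following the form in Janzing et al., UAI 2009; the slide's own equations are not in the extracted text):

```latex
% Confounded additive noise (CAN) model with a latent variable T.
X = u(T) + E_X, \qquad Y = v(T) + E_Y,
\qquad T,\ E_X,\ E_Y \ \text{jointly independent}.
```

One reading of the 'no confounder' output above: if E_X (or E_Y) is estimated to be essentially zero, the corresponding variable is nearly a deterministic function of T, so the data can be explained without a separate latent confounder.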


Part 3: Some Applications of Kernel Machines


Robot Learning
•  Learning to Control
•  Fast Model Learning for Inverse Dynamics
•  Learning Operational Space Control
•  Learning Motor Primitives
•  Imitation Learning for Initialization
•  Reinforcement Learning for Self-Improvement
•  Perceptual Coupling

Kober and Peters: Learning Motor Primitives for Robotics. ICRA 2009.
Chiappa et al.: Using Bayesian Dynamical Systems for Motion Template Libraries. NIPS 21, 2009.
Kober and Peters: Policy Search for Motor Primitives in Robotics. NIPS 21, 2009.
Nguyen-Tuong et al.: Local GP Regression for Real Time Online Model Learning and Control. NIPS 21, 2009.
Peters and Nguyen: Real-Time Learning of Resolved Velocity Control on a Mitsubishi PA-10. ICRA 2008.
Nguyen-Tuong and Peters: Local GP Regression for Real-time Model-based Robot Control. IROS 2008.

Implicit Surface Approximation

Dragon 1: 440K points – decreasing regularisation
Thai Statue: 5M points
Dragon 2: 3.6M points

Walder et al., 2007
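A rough sketch of the generic RBF implicit-surface idea (fit a function that is zero on the surface and nonzero at points offset along the normals); this is the classic construction, not necessarily the specific formulation or regularization scheme of Walder et al. (2007), and all parameters are illustrative:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def fit_implicit_surface(points, normals, eps=0.01, gamma=50.0, reg=1e-6):
    """Fit f with f ~ 0 on the surface and f ~ +/-eps at offsets along the normals.

    The zero level set of the returned function approximates the surface.
    """
    X = np.vstack([points, points + eps * normals, points - eps * normals])
    t = np.concatenate([np.zeros(len(points)),
                        eps * np.ones(len(points)),
                        -eps * np.ones(len(points))])
    model = KernelRidge(kernel="rbf", gamma=gamma, alpha=reg).fit(X, t)
    return model.predict   # evaluate on a 3D grid and extract the zero level set

# Toy example: points sampled from a unit sphere, with outward normals
rng = np.random.RandomState(0)
p = rng.randn(500, 3)
p /= np.linalg.norm(p, axis=1, keepdims=True)
f = fit_implicit_surface(p, p)
```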

The Morphing Problem

[Figure: two shapes I1 and I2 together with their morph ½(I1 + I2)]

Head Morphing

(signed distance and normals, no landmark points, no color) (Schölkopf, Steinke & Blanz, 2006)

with the Dept. of Physiology, MPI for Biological Cybernetics

Markerless Automatic Tracking of 3D Surfaces
•  Geometry and color approximated by 4D implicit and 3D regression
•  Mesh is initialized and tracked by minimizing an energy involving
   •  surface distance
   •  color change
   •  acceleration
   •  mesh regularity

Walder et al., 2009

Thank you for your attention.