Data Analytics Applications for Future HPC

Author: Thomas McBride
Stavanger, 20/6/2013

http://euchina2013.cloudcom.org/

Data Analytics Applications for Future HPC

Ron S. Kenett, Michael Ashcroft

Agenda
• Background
• Big Data
• Social Networks
• Bayesian Networks
• Open Source


Big Data
• Managing Risk and Costs in OSS Adoption
• Usability
• Biotechnology
• Web Services

Open Source Software (OSS)

Some estimates indicate that by 2016 the prevalence of OSS will exceed 95% in commercial applications.

Managing Risk and Costs in OSS Adoption

www.riscoss.eu


XWiki Community


NodeXL is a free and open source network analysis and visualization package for Microsoft Excel 2007/2010. It is a popular package, similar to other network visualization tools such as Pajek, UCINET, and Gephi.

HuffingtonPost user network on Twitter

Overall Graph Metrics:
Vertices: 4393
Unique Edges: 4750
Edges With Duplicates: 0
Total Edges: 4750
Self-Loops: 0
Reciprocated Vertex Pair Ratio: 0.000210570646451885
Reciprocated Edge Ratio: 0.000421052631578947
Connected Components: 1
Single-Vertex Connected Components: 0
Maximum Vertices in a Connected Component: 4393
Maximum Edges in a Connected Component: 4750
Maximum Geodesic Distance (Diameter): 4
Average Geodesic Distance: 3.474606
Graph Density: 0.000246189810996713
Modularity: Not Applicable
NodeXL Version: 1.0.1.238
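The ratio metrics in this report can be checked against the raw counts. The sketch below assumes the usual directed-graph definitions (density as edges over ordered vertex pairs, reciprocated ratios as fractions of connected pairs and of edges). The single reciprocated vertex pair is an assumption inferred from the reported ratios; it is not stated on the slide.

```python
# Reproduce the ratio metrics of the HuffingtonPost graph from its raw counts.
vertices = 4393
edges = 4750                # directed, no duplicates, no self-loops
reciprocated_pairs = 1      # assumption: inferred from the reported ratios

# Directed graph density: edges present / ordered vertex pairs possible.
density = edges / (vertices * (vertices - 1))

# Each reciprocated pair uses two edges, so connected pairs = edges - pairs.
connected_pairs = edges - reciprocated_pairs
reciprocated_vertex_pair_ratio = reciprocated_pairs / connected_pairs
reciprocated_edge_ratio = 2 * reciprocated_pairs / edges

print(f"Graph density:                  {density:.6e}")
print(f"Reciprocated vertex pair ratio: {reciprocated_vertex_pair_ratio:.6e}")
print(f"Reciprocated edge ratio:        {reciprocated_edge_ratio:.6e}")
```

Under these assumptions the three computed values agree with the figures reported by NodeXL, which suggests the network is almost entirely one-directional.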

Top Tweeters in Entire Graph: SheSeauxSaditty, BlkSportsOnline, yayayarndiva, alaa, huffingtonpost, JohnLusher, milrecetas, UrbanGem, HuffPostPol, RCdeWinter

geveryone Twitter NodeXL SNA Map and Report for Tuesday, 11 June 2013 at 22:21 UTC

The graph represents a network of 1,414 Twitter users whose recent tweets contained "geveryone", taken from a data set limited to a maximum of 10,000 tweets. The network was obtained on Tuesday, 11 June 2013 at 22:21 UTC. There is an edge for each "replies-to" relationship in a tweet. There is an edge for each "mentions" relationship in a tweet. There is a self-loop edge for each tweet that is not a "replies-to" or "mentions". The tweets were made over the 7-day, 1-hour, 49-minute period from Tuesday, 04 June 2013 at 20:23 UTC to Tuesday, 11 June 2013 at 22:12 UTC.

XWiki Community Data
1) Mailing list archives:
   XWiki users mailing list: http://lists.xwiki.org/pipermail/users/
   XWiki devs mailing list: http://lists.xwiki.org/pipermail/devs/
2) IRC chat archives: http://dev.xwiki.org/xwiki/bin/view/IRC/WebHome
3) Commits (via git): https://github.com/xwiki
4) Code review comments are available on GitHub as well, but I am not aware of a fast way to retrieve them all. GitHub may offer APIs for retrieving them, so a simple script could be written for the purpose.
5) Everything about bugs and releases: http://jira.xwiki.org

XWiki Communities - example
Based on interactions between active users, 2008-2012

[Figure: network of active users grouped into communities A, B, C and D]

XWiki Communities - example
Changes over time, 2008-2012

[Figure: communities A, B, C and D shown for three periods: Jan-Jun 2008, Jan-Jun 2010 and Jan-Jun 2012]

Change in emphasis of various roles

XWiki Communities - example
Satellite community members

[Figures: two views of communities A, B, C and D highlighting satellite community members]

Usability

The Seven Layers of a Decision Support for User Interface Design (DSUID)
• The lowest layer – user activity
• The second layer – page hit attributes (raw data)
• The third layer – transition analysis (dynamic data)
• The fourth layer – usability score (quantitative values)
• The fifth layer – usage statistics (descriptive statistics)
• The sixth layer – statistical analysis (clusters, BN)
• The top layer – interpretation

Kenett, R.S., Harel, A. and Ruggeri, F. (2009). Controlling the Usability of Web Services, International Journal of Software Engineering and Knowledge Engineering, 19 (5), pp. 627-651.

Usability

The fifth layer

Variable          Mean   StDev   CoefVar  Minimum      Median  Maximum
TextSize          20109  15817   78.66    0.000000000  17941   72359
PageSize          79577  259545  326.16   0.000000000  40273   2352613
PercentExitLoss   14.25  15.95   111.99   0.000000000  10.00   100.00
AccessibilitySco  15.50  17.67   113.97   0.0178       8.74    76.70
PerformanceScore  46.46  25.13   54.10    3.53         41.70   100.00
ReadabilityScore  2780   4746    170.71   0.000000000  1409    43135
UsabilityScore    73.04  27.15   37.17    14.22        69.61   139.40
UsabilityAlert    58.83  39.54   67.21    0.000000000  52.50   222.00
AvgDownloadTime   2.164  4.243   196.09   0.000000000  1.000   27.000
AvgReadingTime    153    1141    746.28   0.000000000  10.0    12288
AvgSeekingTime    62.2   508.6   818.12   0.000000000  10.0    5624.0

Units marked on the slide: bytes (next to the size rows) and sec (next to the time rows).
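The CoefVar column is the coefficient of variation, that is, the standard deviation expressed as a percentage of the mean. As a quick check, the sketch below recomputes it for a few rows of the table; the small differences for some rows presumably come from the Mean and StDev values having been rounded before they were displayed.

```python
def coef_var(mean, stdev):
    """Coefficient of variation in percent: 100 * StDev / Mean."""
    return 100.0 * stdev / mean

# (Mean, StDev, reported CoefVar) copied from the fifth-layer table.
rows = {
    "TextSize": (20109, 15817, 78.66),
    "PageSize": (79577, 259545, 326.16),
    "UsabilityScore": (73.04, 27.15, 37.17),
    "AvgReadingTime": (153, 1141, 746.28),
}

for name, (mean, stdev, reported) in rows.items():
    computed = coef_var(mean, stdev)
    print(f"{name:15s} computed={computed:8.2f}  reported={reported:8.2f}")
```

Values far above 100, such as those of the time variables, indicate heavily skewed distributions in which a few pages dominate the spread.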

Usability

The sixth layer
• Cluster analysis
• Bayesian networks

[Figures: cluster analysis and Bayesian network outputs; the two Bayesian network slides are marked 20% and 15%]

When the share of high average reading time drops from 20% to 15%, usability improves.

Biotechnology

Peterson, J. and Kenett, R.S. (2011). Modelling Opportunities for Statisticians Supporting Quality by Design Efforts for Pharmaceutical Development and Manufacturing, Biopharmaceutical Report, 18 (2), pp. 6-16.

[Figures: first 3 days; after 3 weeks]

Web Services

Adaptive Web Services Testing

[Diagram labels: Service Provider, Service Broker, Service Consumer, Service Profile, Service Changes; Test Provider, Test Broker, Test Consumer, Test Profile, Test Changes; Testing; Adaptive Control: Dynamic Test Selection, Dynamic Test Reconfiguration, Dynamic Test Scheduling, Dynamic Service Binding]

Bai, X., Kenett, R.S. and Yu, W. (2012). Risk Assessment and Adaptive Group Testing of Semantic Web Services, International Journal of Software Engineering and Knowledge Engineering, 22 (5), pp. 595-620. www.worldscientific.com/toc/ijseke/22/05

Web Services

Group Testing with Control Architecture

[Diagram labels: Risk-Based Testing; Testing Controller; Test Case Selection; WS Testing; Test Cases; Test Case Ranking; Service Ranking; Services; SUT Selection; Service Selection; Service Repository; Service Provider, Service Broker, Service Consumer; Test Provider, Test Broker, Test Consumer; Test Results; Window Size Selection; Test Case Potency Evaluation; Service Evaluation; Testing Optimizer; More Layers?]

Web Services

Risk Assessment of Semantic Web Services

Risk Exposure = Failure Probability * Importance

[Diagram labels, in slide order: Failure Probability; Experience-Based Estimation; Ontology Model; Complexity Analysis; Importance Analysis; Workflow Model; Experience-Based Estimation; Ontology Model; Dependency Analysis; Usage Model]
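The risk exposure formula lends itself to a simple ranking of services for test prioritisation. The sketch below is illustrative only: the `Service` class, the service names and all numbers are hypothetical, and importance is assumed to be a relative weight such as a usage share.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    failure_probability: float  # estimated probability of failure, in [0, 1]
    importance: float           # relative weight (assumed: share of usage)

    @property
    def risk_exposure(self) -> float:
        # Risk Exposure = Failure Probability * Importance
        return self.failure_probability * self.importance

# Hypothetical services and estimates, for illustration only.
services = [
    Service("S1", failure_probability=0.02, importance=0.50),
    Service("S2", failure_probability=0.10, importance=0.30),
    Service("S3", failure_probability=0.25, importance=0.05),
]

# Services with the highest exposure would be tested first.
ranked = sorted(services, key=lambda s: s.risk_exposure, reverse=True)
for s in ranked:
    print(f"{s.name}: risk exposure = {s.risk_exposure:.4f}")
```

Note that the most failure-prone service (S3) does not come first here, because its low importance keeps its exposure below that of S2.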

Web Services


Amazon Associates Web Services (AAWS)
• AAWS WSDL: 19 operations with complicated data structures
• Ontology Model: 514 classes and 1096 dependency relationships

Web Services

Usage-Based Importance Analysis
• In SOA, data, services and service compositions are decoupled from each other.
• The more they are used, the more important they are.

[Diagram: data items D1-D3, services S1-S4 and composite services CS1-CS2]

Web Services


Workflow-Based Failure Analysis
• The failure probability of a composite service is estimated from the failure probability of each constituent service and the control constructs that connect them.
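As a rough illustration of workflow-based failure analysis, the sketch below combines constituent failure probabilities for two common control constructs. It assumes independent failures and uses textbook combination rules (a sequence fails if any step fails; a choice is a probability-weighted mixture); these rules, the function names and the numbers are assumptions for illustration, not the rules of the cited paper.

```python
from math import prod

def fail_sequence(probs):
    """Sequence: the composite fails if any constituent fails (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)

def fail_choice(branches):
    """Choice: exactly one branch runs; branches = [(branch_probability, p_fail), ...]."""
    return sum(weight * p_fail for weight, p_fail in branches)

# Hypothetical composite: S1, then either S2 (70% of runs) or S3 (30%), then S4.
p_choice = fail_choice([(0.7, 0.05), (0.3, 0.20)])
p_composite = fail_sequence([0.01, p_choice, 0.02])

print(f"choice block failure probability:      {p_choice:.4f}")
print(f"composite service failure probability: {p_composite:.6f}")
```

Even with reliable individual services, the composite failure probability grows quickly with workflow length, which is why the control structure matters for the estimate.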

Bayesian Networks Overview

Bayesian networks are stochastic models that decompose complex multivariate systems into networks of much simpler conditional distributions.

We do this by decomposing a system by the chain rule:
– P(A,B,C,D,E) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(E|A,B,C,D)

Note: The joint distribution is equal to the product of the conditional distributions. (We lose no information.)

We can represent this graphically:
• Random variables are represented by nodes.
• Each node has associated with it a conditional probability distribution.
• Each node has incoming edges from the nodes associated with the variables upon which that node's conditional probability distribution is conditional.

[Figure: fully connected network over nodes A, B, C, D and E, labeled P(A), P(B|A), P(C|A,B), P(D|A,B,C) and P(E|A,B,C,D)]
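That the chain rule loses no information can be verified numerically. The sketch below builds an arbitrary joint distribution over three binary variables (the random weights are purely illustrative), derives each conditional by marginalisation, and checks that their product recovers every entry of the joint.

```python
import itertools
import random

random.seed(0)
states = list(itertools.product([0, 1], repeat=3))        # all (a, b, c)
weights = [random.random() for _ in states]
total = sum(weights)
joint = {s: w / total for s, w in zip(states, weights)}   # arbitrary P(A,B,C)

def marginal(prefix):
    """P(first len(prefix) variables = prefix), summing out the remaining ones."""
    return sum(p for s, p in joint.items() if s[:len(prefix)] == prefix)

def chain_rule(a, b, c):
    """P(A) * P(B|A) * P(C|A,B), each factor computed from the joint."""
    p_a = marginal((a,))
    p_b_given_a = marginal((a, b)) / p_a
    p_c_given_ab = joint[(a, b, c)] / marginal((a, b))
    return p_a * p_b_given_a * p_c_given_ab

max_error = max(abs(chain_rule(*s) - joint[s]) for s in states)
print(f"largest |chain rule - joint| = {max_error:.2e}")
```

The difference is at the level of floating-point rounding, for any joint distribution with non-zero entries.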

We simplify the decomposition by exploiting known or discovered conditional independencies. Two variables X and Y are conditionally independent given Z when P(X | Y, Z) = P(X | Z).

Node  Conditional Independencies
A     -
B     C and E, given A
C     B, given A
D     A and E, given B and C
E     A, B and D, given C

[Figure: network over nodes A, B, C, D and E; the slide labels read P(A), P(B|A), P(C|A,B), P(D|A,B,C) and P(E|A,B,C,D)]
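The payoff of exploiting conditional independencies can be seen by counting parameters. The sketch below assumes all five variables are binary (an assumption; the slides do not give state counts) and uses the parent sets implied by the independence table: A for B, A for C, B and C for D, C for E.

```python
def n_parameters(parents, n_states=2):
    """Free parameters of a discrete network: (k - 1) * k**|parents| per node."""
    return sum((n_states - 1) * n_states ** len(ps) for ps in parents.values())

# Full chain-rule decomposition: every node depends on all earlier nodes.
full = {
    "A": [], "B": ["A"], "C": ["A", "B"],
    "D": ["A", "B", "C"], "E": ["A", "B", "C", "D"],
}

# Parent sets after dropping the independencies listed in the table.
simplified = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}

print("full decomposition:      ", n_parameters(full))        # 1 + 2 + 4 + 8 + 16
print("simplified decomposition:", n_parameters(simplified))  # 1 + 2 + 2 + 4 + 2
```

The gap widens rapidly with more variables or more states per variable, since the full decomposition grows exponentially while the simplified one grows with the largest parent set.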

Bayesian Networks and Big Data

These models can be learnt automatically from data. Learning a Bayesian network involves learning:
• The structure of the network (which encodes the conditional independencies present).
• The parameters of the conditional distributions associated with each variable/node.

[Diagram labels: Data; Causes; Events; Actions; Predictive analytics applications; Diagnostic applications]

Model Building and Parameter Learning
• Prior knowledge regarding the structural relationships can be encoded using white and black lists.
• Structural constraints can be included, such as limiting the maximum number of parents.
• The number of parameters required remains constant, regardless of the size of the dataset.

Model Comparison

Classical algorithm: Greedy Search

Inputs:
1. The space of all possible topologies, T.
2. A number of transitions which can take us from one topology to another:
● Insert Edge
● Remove Edge
● Reverse Edge
Given a topology, t, call the set of topologies that can be transitioned to from t the neighbours of t.

Output: A topology corresponding to a local maximum in T.

3. Begin at a random topology, t.
4. Score t.
5. Repeat:
● Score all neighbours of the current topology.
● Transition to the highest scoring neighbour, as long as it is better than the current topology.
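The greedy search can be sketched in a few lines. The version below is a minimal illustration under several assumptions that are not in the slides: variables are binary, the score is BIC (any decomposable score would do), and the search starts from the empty graph rather than a random topology so that the run is reproducible. The toy data set is hypothetical: variables 0 and 1 are strongly associated and variable 2 is independent of both.

```python
import itertools
import math
from collections import Counter

def family_score(data, child, parents):
    """BIC contribution of one binary node given its parent set."""
    parents = sorted(parents)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(n * math.log(n / marg[cfg]) for (cfg, _), n in joint.items())
    return loglik - 0.5 * math.log(len(data)) * 2 ** len(parents)

def score(data, dag):
    return sum(family_score(data, v, ps) for v, ps in dag.items())

def is_acyclic(dag):
    """Depth-first search along parent links; meeting a node still on the stack is a cycle."""
    on_stack, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in on_stack:
            return False
        on_stack.add(v)
        ok = all(visit(p) for p in dag[v])
        done.add(v)
        return ok
    return all(visit(v) for v in dag)

def neighbours(dag):
    """Topologies reachable by one edge insertion, removal or reversal."""
    for a, b in itertools.permutations(dag, 2):
        if a in dag[b]:                                   # edge a -> b exists
            removed = {**dag, b: dag[b] - {a}}
            yield removed
            reversed_ = {**removed, a: removed[a] | {b}}
            if is_acyclic(reversed_):
                yield reversed_
        elif b not in dag[a]:                             # a and b not adjacent
            inserted = {**dag, b: dag[b] | {a}}
            if is_acyclic(inserted):
                yield inserted

def greedy_search(data, n_vars):
    current = {v: frozenset() for v in range(n_vars)}     # assumption: empty start
    current_score = score(data, current)
    while True:
        best = max(neighbours(current), key=lambda g: score(data, g))
        best_score = score(data, best)
        if best_score <= current_score:                   # local maximum reached
            return current, current_score
        current, current_score = best, best_score

# Hypothetical data: columns 0 and 1 agree in 160 of 180 rows; column 2 is independent.
data = ([(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 40
        + [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)] * 5)
dag, best = greedy_search(data, 3)
print({v: sorted(ps) for v, ps in dag.items()}, round(best, 2))
```

On this data the search links variables 0 and 1 and leaves variable 2 isolated. The direction of the recovered edge is arbitrary, because both orientations encode the same independencies and receive the same score.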

[Video: a search over possible topologies using Inatas System Modeler. The network is augmented with decision and utility nodes, and the search is over equivalence classes of topologies ordered by set-theoretic inclusion of conditional independencies.]

Model Comparison
• Multiple topologies can encode the same set of conditional independencies. In this case we say they are Markov equivalent.
• For multiple reasons, it becomes attractive to search over Markov equivalence classes of topologies rather than topologies themselves.

[Figure: two networks over A, B, C, D and E. One is labeled P(A), P(B|A), P(C|A), P(D|B,C), P(E|C); the other P(B), P(A|B), P(C|A), P(D|B,C), P(E|C)]

Two Bayesian networks are Markov equivalent if and only if they share the same skeleton (edges ignoring direction) and the same v-structures (head-to-head meetings of edges).
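This criterion is easy to check mechanically. The sketch below represents each network as a map from node to parent list and compares skeletons and v-structures. It uses the standard reading of a v-structure as a head-to-head meeting whose two parents are not themselves adjacent. Networks `g1` and `g2` follow the two sets of labels on the slide; `g3` is a hypothetical third network added for contrast.

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edges of the network."""
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """Pairs of non-adjacent parents meeting head-to-head at a common child."""
    skel = skeleton(dag)
    return {(frozenset((a, b)), child)
            for child, ps in dag.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in skel}

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Node -> parents, read off the conditional distributions on the slide.
g1 = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
g2 = {"A": ["B"], "B": [], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
# Hypothetical: same skeleton, but B -> A <- C adds a new v-structure.
g3 = {"A": ["B", "C"], "B": [], "C": [], "D": ["B", "C"], "E": ["C"]}

print(markov_equivalent(g1, g2))
print(markov_equivalent(g1, g3))
```

Reversing the edge between A and B leaves the skeleton and the single v-structure at D unchanged, so `g1` and `g2` are equivalent, whereas `g3` is not.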


Incomplete Data
• The previous discussion assumed that the training data was complete.
• In cases where data is incomplete, the same procedures can be followed after 'completing' the data set using one of a number of methods, including:
– Expectation Maximisation
– Gibbs Sampler
• The complexity of these algorithms is extremely large, and it is often prohibitive when dealing with large (>100,000) datasets.
• Fortunately, it is possible to ignore incomplete data on a local basis due to the decomposability of the scoring criteria. This generally suffices in large cases.
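One way to read "ignoring incomplete data on a local basis" is that each node's counts use every record in which that node and its parents are observed, even if other variables are missing in the same record. The sketch below illustrates this interpretation; the function name and the records are hypothetical, with `None` marking a missing value.

```python
from collections import Counter

def family_counts(data, child, parents):
    """Count (parent configuration, child state) over the rows in which
    every variable of this family is observed (None marks a missing value)."""
    counts = Counter()
    for row in data:
        family = [row[v] for v in (*parents, child)]
        if None not in family:
            counts[(tuple(family[:-1]), family[-1])] += 1
    return counts

# Hypothetical records over variables A, B, C with some values missing.
data = [
    {"A": 1, "B": 1, "C": None},
    {"A": 1, "B": None, "C": 0},
    {"A": 0, "B": 0, "C": 0},
    {"A": None, "B": 1, "C": 1},
]

counts_b = family_counts(data, "B", ["A"])   # uses records 1 and 3
counts_c = family_counts(data, "C", ["A"])   # uses records 2 and 3
print(dict(counts_b))
print(dict(counts_c))
```

Discarding every record with any missing value would leave only one of the four records here, whereas the local approach uses two records per family.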


Online Adaptability
• Structural learning is expensive.
• Ideally, we wish models to adapt to new data rather than require relearning.
• Parameter learning can continue after model generation. Bayesian methods again offer a high quality, simple solution.
• Ongoing likelihood analysis of data can indicate if complete structural relearning may be necessary.
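A minimal sketch of online parameter learning for one discrete node, assuming a Dirichlet prior represented by pseudo-counts (the class name, prior value and data stream are illustrative assumptions). Each new observation is first scored under the current parameters and then absorbed by incrementing a count, so the running log-likelihood comes at no extra cost.

```python
import math
from collections import defaultdict

class OnlineCPT:
    """Conditional probability table updated one observation at a time.

    Dirichlet pseudo-counts act as the prior; each update adds one count.
    """
    def __init__(self, n_states, prior=1.0):
        self.counts = defaultdict(lambda: [prior] * n_states)

    def prob(self, parent_cfg, state):
        row = self.counts[parent_cfg]
        return row[state] / sum(row)

    def update(self, parent_cfg, state):
        """Return the log predictive probability, then absorb the observation."""
        log_p = math.log(self.prob(parent_cfg, state))
        self.counts[parent_cfg][state] += 1
        return log_p

# Hypothetical stream of (parent configuration, observed state) pairs.
cpt = OnlineCPT(n_states=2)
stream = [(("a1",), 1), (("a1",), 1), (("a1",), 0), (("a1",), 1)]
log_likelihood = sum(cpt.update(cfg, x) for cfg, x in stream)

print(f"P(state=1 | a1) after the stream: {cpt.prob(('a1',), 1):.3f}")
print(f"running log-likelihood:           {log_likelihood:.4f}")
```

A sustained drop in the average log predictive probability of incoming data, relative to its historical level, would be one possible signal that the structure itself needs relearning.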

RESTful Bayesian Networks

Remotely manage your models through an HTTPS-based API.

Inatas Web API
– Learning
– Inference and Decision Making
– Online Adaption
– Ensemble models
– Exact and Monte Carlo methods
– Dynamic resources allocation
– Secure (HTTPS + private/public key encryption if desired)
– Wide range of density estimation methods from kernel and wavelet based non-parametric methods
– Open source plug-ins.

• Ashcroft, M. (2012). An Introduction to Bayesian Networks in Systems and Control, 18th International Conference on Automation and Computing (ICAC).
• Bai, X., Kenett, R.S. and Yu, W. (2012). Risk Assessment and Adaptive Group Testing of Semantic Web Services, International Journal of Software Engineering and Knowledge Engineering, 22 (5), pp. 595-620. http://www.worldscientific.com/toc/ijseke/22/05
• Chickering, D. (2002). Learning equivalence classes of Bayesian network structures, Journal of Machine Learning Research, 2, pp. 445-498.
• Chickering, D. (2002). Optimal structure identification with greedy search, Journal of Machine Learning Research, 3, pp. 507-554.
• Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks, Artificial Intelligence, 42, pp. 393-405.
• Gruber, A. and Ben Gal, I. (2012). Efficient Bayesian Network Learning for System Optimization in Reliability Engineering, Quality Technology & Quantitative Management, 9 (1), pp. 97-114.
• Jensen, F.V. (2001). Bayesian Networks and Decision Graphs, Springer.
• Harel, A., Kenett, R.S. and Ruggeri, F. (2009). Modeling Web Usability Diagnostics on the basis of Usage Statistics, in Statistical Methods in eCommerce Research, W. Jank and G. Shmueli (editors), Wiley.
• Kenett, R.S., Harel, A. and Ruggeri, F. (2009). Controlling the Usability of Web Services, International Journal of Software Engineering and Knowledge Engineering, 19 (5), pp. 627-651.
• Kenett, R.S. and Raanan, Y. (2010). Operational Risk Management: a practical approach to intelligent data analysis, Wiley and Sons. http://www.wiley.com/WileyCDA/WileyTitle/productCd-047074748X.html
• Kenett, R.S., Gruber, A. and Ben Gal, I. (2012). Applications of Bayesian Networks to Small, Mid-size and Massive Data, The 6th International Workshop on Applied Probability (IWAP 2012), Jerusalem, June 11-14.
• Kenett, R.S. (2012). Applications of Bayesian Networks to Operational Risks, Healthcare, Biotechnology and Customer Surveys, Proceedings of the 22nd Colombian Statistics Symposium, The National University of Colombia, Bucaramanga, Colombia, July 17-22. http://ssrn.com/abstract=2172713
• Kenett, R.S. (2012). Risk Analysis in Drug Manufacturing and Healthcare, in Statistical Methods in Healthcare, Faltin, F., Kenett, R.S. and Ruggeri, F. (editors in chief), John Wiley and Sons.
• Koski, T. and Noble, J. (2009). Bayesian Networks – An Introduction, John Wiley & Sons, UK.
• Pearl, J. (1985). Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning (UCLA Technical Report CSD-850017), Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, pp. 329-334.
• Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed., Cambridge University Press, UK.