Stavanger, 20/6/2013
http://euchina2013.cloudcom.org/
Data Analytics Applications for Future HPC
Ron S. Kenett
Michael Ashcroft
Agenda
• Background
• Big Data
• Social Networks
• Bayesian Networks
• Open Source
Big Data
Managing Risk and Costs in OSS Adoption
Usability
Biotechnology
Web Services
Open Source Software (OSS)
Some estimates indicate that by 2016 the prevalence of OSS will exceed 95% in commercial applications.
Managing Risk and Costs in OSS Adoption
www.riscoss.eu
XWiki Community
NodeXL is a free and open source network analysis and visualization software package for Microsoft Excel 2007/2010. It is a popular package similar to other network visualization tools such as Pajek, UCINet, and Gephi
HuffingtonPost user network on Twitter
Overall Graph Metrics:
• Vertices: 4393
• Unique Edges: 4750
• Edges With Duplicates: 0
• Total Edges: 4750
• Self-Loops: 0
• Reciprocated Vertex Pair Ratio: 0.000210570646451885
• Reciprocated Edge Ratio: 0.000421052631578947
• Connected Components: 1
• Single-Vertex Connected Components: 0
• Maximum Vertices in a Connected Component: 4393
• Maximum Edges in a Connected Component: 4750
• Maximum Geodesic Distance (Diameter): 4
• Average Geodesic Distance: 3.474606
• Graph Density: 0.000246189810996713
• Modularity: Not Applicable
• NodeXL Version: 1.0.1.238
Top Tweeters in Entire Graph: SheSeauxSaditty BlkSportsOnline yayayarndiva alaa huffingtonpost JohnLusher milrecetas UrbanGem HuffPostPol RCdeWinter
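The overall graph metrics above can be reproduced from a directed edge list. The following is a minimal sketch, not NodeXL's implementation: the function name and the formulas are assumptions, chosen because they agree with the reported figures (density = edges / (V·(V−1)), reciprocated edge ratio = 2/4750, reciprocated vertex pair ratio = 1/4749).

```python
def directed_graph_metrics(edges):
    """Compute a few overall metrics from (source, target) pairs.

    Hypothetical helper: the formulas are assumed, not taken from NodeXL.
    """
    unique_edges = set(edges)
    vertices = {v for edge in unique_edges for v in edge}
    non_loop = {(s, t) for s, t in unique_edges if s != t}
    # An edge is reciprocated when the reverse edge is also present.
    reciprocated_edges = {(s, t) for s, t in non_loop if (t, s) in non_loop}
    # Unordered vertex pairs joined by at least one edge.
    vertex_pairs = {frozenset(e) for e in non_loop}
    reciprocated_pairs = {frozenset(e) for e in reciprocated_edges}
    n = len(vertices)
    return {
        "vertices": n,
        "unique_edges": len(unique_edges),
        "self_loops": len(unique_edges) - len(non_loop),
        "reciprocated_vertex_pair_ratio":
            len(reciprocated_pairs) / len(vertex_pairs) if vertex_pairs else 0.0,
        "reciprocated_edge_ratio":
            len(reciprocated_edges) / len(non_loop) if non_loop else 0.0,
        "density": len(non_loop) / (n * (n - 1)) if n > 1 else 0.0,
    }


# Toy edge list (made up for illustration).
example = directed_graph_metrics([("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")])

# Density recomputed from the vertex and edge counts reported on the slide.
slide_density = 4750 / (4393 * 4392)
```

With 4393 vertices and 4750 edges, `slide_density` agrees with the reported Graph Density of 0.000246189810996713, which suggests the directed-graph formula is the one in use.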
geveryone Twitter NodeXL SNA Map and Report for Tuesday, 11 June 2013 at 22:21 UTC
The graph represents a network of 1,414 Twitter users whose recent tweets contained "geveryone", taken from a data set limited to a maximum of 10,000 tweets. The network was obtained on Tuesday, 11 June 2013 at 22:21 UTC.
• There is an edge for each "replies-to" relationship in a tweet.
• There is an edge for each "mentions" relationship in a tweet.
• There is a self-loop edge for each tweet that is not a "replies-to" or "mentions".
The tweets were made over the 7-day, 1-hour, 49-minute period from Tuesday, 04 June 2013 at 20:23 UTC to Tuesday, 11 June 2013 at 22:12 UTC.
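The three edge rules described above can be sketched as a small function. This is an illustration only: the tweet record layout (`author`, `reply_to`, `mentions`) is a hypothetical format, not the NodeXL importer's data model.

```python
def tweet_edges(tweets):
    """Turn tweet records into (source, target, relationship) edges.

    Each tweet is a dict with hypothetical keys:
    'author' (str), 'reply_to' (str or None), 'mentions' (list of str).
    """
    edges = []
    for tweet in tweets:
        author = tweet["author"]
        reply_to = tweet.get("reply_to")
        mentions = tweet.get("mentions", [])
        if reply_to:
            edges.append((author, reply_to, "replies-to"))
        for user in mentions:
            edges.append((author, user, "mentions"))
        if not reply_to and not mentions:
            # A tweet that neither replies nor mentions becomes a self-loop.
            edges.append((author, author, "tweet"))
    return edges


# Made-up sample data.
sample_tweets = [
    {"author": "ann", "reply_to": "bob", "mentions": []},
    {"author": "bob", "reply_to": None, "mentions": ["cat", "dan"]},
    {"author": "cat", "reply_to": None, "mentions": []},
]
sample_edges = tweet_edges(sample_tweets)
```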
XWiki Community Data
1) Mailing list archives:
   XWiki users mailing list: http://lists.xwiki.org/pipermail/users/
   XWiki devs mailing list: http://lists.xwiki.org/pipermail/devs/
2) IRC chat archives: http://dev.xwiki.org/xwiki/bin/view/IRC/WebHome
3) Commits (via git): https://github.com/xwiki
4) Code review comments are available on GitHub as well, but I am not aware of a fast way to retrieve them all. Maybe GitHub has some APIs for retrieving them and a simple script could be written for the purpose.
5) Everything about bugs and releases: http://jira.xwiki.org
XWiki Communities - example
Based on interactions between active users, 2008-2012
[Figure: network graph with communities labelled A, B, C, D]
XWiki Communities - example
Changes over time, 2008-2012
[Figure: three network snapshots (Jan-Jun 2008, Jan-Jun 2010, Jan-Jun 2012), each with communities labelled A, B, C, D]
Change in emphasis of various roles
XWiki Communities - example
[Figure: network graph with communities labelled A, B, C, D]
Satellite community members
XWiki Communities - example
[Figure: network graph with community labels A, B, C, D]
Satellite community members
Usability
The Seven Layers of a Decision Support for User Interface Design (DSUID)
• The lowest layer – user activity
• The second layer – page hit attributes (raw data)
• The third layer – transition analysis (dynamic data)
• The fourth layer – usability score (quantitative values)
• The fifth layer – usage statistics (descriptive statistics)
• The sixth layer – statistical analysis (clusters, BN)
• The top layer – interpretation
Kenett, R.S., Harel, A. and Ruggeri, F. (2009). Controlling the Usability of Web Services, International Journal of Software Engineering and Knowledge Engineering, 19 (5), pp. 627-651.
Usability
The fifth layer
Units noted on the slide: bytes (sizes), sec (times).

Variable            Mean     StDev   CoefVar      Minimum   Median   Maximum
TextSize           20109     15817     78.66  0.000000000    17941     72359
PageSize           79577    259545    326.16  0.000000000    40273   2352613
PercentExitLoss    14.25     15.95    111.99  0.000000000    10.00    100.00
AccessibilitySco   15.50     17.67    113.97       0.0178     8.74     76.70
PerformanceScore   46.46     25.13     54.10         3.53    41.70    100.00
ReadabilityScore    2780      4746    170.71  0.000000000     1409     43135
UsabilityScore     73.04     27.15     37.17        14.22    69.61    139.40
UsabilityAlert     58.83     39.54     67.21  0.000000000    52.50    222.00
AvgDownloadTime    2.164     4.243    196.09  0.000000000    1.000    27.000
AvgReadingTime       153      1141    746.28  0.000000000     10.0     12288
AvgSeekingTime      62.2     508.6    818.12  0.000000000     10.0    5624.0
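The CoefVar column in the fifth-layer statistics appears to be the coefficient of variation, 100 × StDev / Mean. The sketch below recomputes it for a few rows; the function name is an assumption, and small differences from the slide are expected because the slide's Mean and StDev are already rounded.

```python
def coef_var(mean, stdev):
    """Coefficient of variation as a percentage: 100 * stdev / mean."""
    return 100.0 * stdev / mean


# (Mean, StDev, reported CoefVar) copied from the descriptive statistics.
rows = {
    "TextSize": (20109, 15817, 78.66),
    "PageSize": (79577, 259545, 326.16),
    "UsabilityScore": (73.04, 27.15, 37.17),
}

# Recomputed value next to the reported one, for comparison.
recomputed = {
    name: (round(coef_var(mean, stdev), 2), reported)
    for name, (mean, stdev, reported) in rows.items()
}
```

Values well above 100, such as those for the reading and seeking times, indicate heavily skewed distributions where the mean alone is a poor summary.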
Usability
The sixth layer
Usability
Cluster analysis
Usability
Bayesian networks
Usability
20%
Usability
15%
When high average reading time drops to 15% from 20%, usability improves
Biotechnology
Peterson, J. and Kenett, R.S. (2011), Modelling Opportunities for Statisticians Supporting Quality by Design Efforts for Pharmaceutical Development and Manufacturing, Biopharmaceutical Report, 18 (2), pp. 6-16.
Biotechnology
First 3 days
After 3 weeks
Web Services
Adaptive Web Services Testing
[Figure: architecture diagram. Elements shown: Service Provider, Service Broker, Service Consumer, Service Profile, Service Changes, Dynamic Service Binding; Testing, Test Provider, Test Broker, Test Consumer, Test Profile, Test Changes; Adaptive Control, Dynamic Test Selection, Dynamic Test Reconfiguration, Dynamic Test Scheduling]
Bai X., Kenett, R.S. and Yu W. (2012). Risk Assessment and Adaptive Group Testing of Semantic Web Services. International Journal of Software Engineering and Knowledge Engineering, 22 (5), pp. 595-620. www.worldscientific.com/toc/ijseke/22/05
Web Services
Group Testing with Control Architecture
Risk-Based Testing
[Figure: architecture diagram. Elements shown: Testing Controller, Testing Optimizer, WS Testing; Test Case Selection, Test Case Ranking, Test Case Potency Evaluation, Test Cases, Test Results, Window Size Selection; Service Selection, Service Ranking, Service Evaluation, SUT Selection, Services, Service Repository; Service Provider, Service Broker, Service Consumer; Test Provider, Test Broker, Test Consumer; "… More Layers?"]
Web Services
Risk Assessment of Semantic Web Services
Risk Exposure = Failure Probability * Importance
[Figure: diagram of the two factors. Elements shown: Failure Probability, Importance Analysis, Experience Based estimation (twice), Ontology Model (twice), Workflow Model, Usage Model, Complexity Analysis, Dependency Analysis]
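The risk exposure formula above can be applied directly to rank services for testing. This is a minimal sketch: the service names, probabilities and importance values are made up for illustration, and the function names are hypothetical.

```python
def risk_exposure(failure_probability, importance):
    """Risk Exposure = Failure Probability * Importance."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must lie in [0, 1]")
    return failure_probability * importance


def rank_by_risk(services):
    """Return service names ordered from highest to lowest risk exposure.

    services maps a name to a (failure_probability, importance) pair.
    """
    return sorted(
        services,
        key=lambda name: risk_exposure(*services[name]),
        reverse=True,
    )


# Illustrative values only (not taken from the slides).
services = {
    "search": (0.10, 0.9),
    "checkout": (0.02, 1.0),
    "reviews": (0.30, 0.2),
}
ranking = rank_by_risk(services)
```

Here a moderately unreliable but heavily used service outranks a very unreliable but rarely used one, which is the point of weighting failure probability by importance.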
Web Services
Amazon Associates Web Services (AAWS)
AAWS WSDL
Ontology Model
19 operations with complicated data structures
514 classes and 1096 dependency relationships
Web Services
Usage-Based Importance Analysis
• In SOA, data, services and service compositions are decoupled from each other.
[Figure: usage graph linking nodes D1-D3, S1-S4 and CS1-CS2]
The more they are used, the more important they are.
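One simple way to turn "the more they are used, the more important they are" into a number is to take each item's share of all recorded uses. This is an assumed illustration, not the measure defined in the cited paper; the usage log is invented.

```python
from collections import Counter


def usage_importance(usage_log):
    """Importance of each item as its share of all recorded uses."""
    counts = Counter(usage_log)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical invocation log of services S1..S3.
usage_log = ["S1", "S2", "S1", "S3", "S1", "S2"]
importance = usage_importance(usage_log)
```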
Web Services
Workflow-Based Failure Analysis
• To estimate the failure probability of a composite service based on the failure probability of each constituent service and the control constructs that combine them.
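As a rough illustration of combining constituent failure probabilities through control constructs, the sketch below handles two constructs under strong assumptions: failures are independent, a sequence fails if any step fails, and a choice takes each branch with a known probability. These rules and function names are assumptions for illustration, not the formulas of the cited paper.

```python
from math import prod


def sequence_failure(failure_probs):
    """Failure probability of services run in sequence.

    Assumes independent failures; the sequence fails if any step fails.
    """
    return 1.0 - prod(1.0 - p for p in failure_probs)


def choice_failure(branches):
    """Failure probability of a choice construct.

    branches is a list of (branch_probability, failure_probability) pairs;
    the branch probabilities are assumed to sum to 1.
    """
    return sum(weight * p for weight, p in branches)


# Illustrative numbers only.
seq = sequence_failure([0.1, 0.2])                  # 1 - 0.9 * 0.8
choice = choice_failure([(0.7, 0.1), (0.3, 0.5)])   # 0.07 + 0.15
# A composite service: a sequence followed by a choice.
composite = sequence_failure([seq, choice])
```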
Bayesian Networks Overview
Bayesian networks are stochastic models that decompose complex multivariate systems into networks of much simpler conditional distributions.
• We do this by decomposing a system by the chain rule:
  – $P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D)$
Note: The joint distribution is equal to the product of the conditional distributions. (We lose no information.)
We can represent this graphically:
• Random variables are represented by nodes.
• Each node has an associated conditional probability distribution.
• Each node has incoming edges from the nodes of the variables on which its conditional probability distribution is conditional.
[Figure: fully connected network over nodes A, B, C, D, E with distributions P(A), P(B|A), P(C|A,B), P(D|A,B,C), P(E|A,B,C,D)]
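The claim that the chain rule loses no information can be checked numerically. The sketch below builds an arbitrary joint distribution over three binary variables (randomly generated, purely illustrative), derives each conditional by marginalisation, and multiplies them back together.

```python
import itertools
import random

random.seed(0)

# An arbitrary joint distribution over three binary variables.
states = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in states]
total = sum(weights)
joint = {state: w / total for state, w in zip(states, weights)}


def marginal(prefix):
    """Probability that the first len(prefix) variables equal prefix."""
    return sum(p for s, p in joint.items() if s[:len(prefix)] == prefix)


def chain_rule_product(state):
    """Multiply P(x_i | x_0, ..., x_{i-1}) over all variables in order."""
    product = 1.0
    for i in range(len(state)):
        # P(x_i | earlier) = P(x_0..x_i) / P(x_0..x_{i-1})
        product *= marginal(state[:i + 1]) / marginal(state[:i])
    return product


# Largest absolute difference between the joint and the chain-rule product.
max_error = max(abs(chain_rule_product(s) - joint[s]) for s in states)
```

`max_error` is zero up to floating-point rounding, whatever joint distribution is used.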
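• We simplify the decomposition by exploiting known or discovered conditional independencies. X and Y are conditionally independent given Z when $P(X \mid Y, Z) = P(X \mid Z)$.

Node   Conditional Independencies
A      -
B      C and E, given A
C      B, given A
D      A and E, given B and C
E      A, B and D, given C

[Figure: the network over A, B, C, D, E, with P(C|A,B), P(D|A,B,C) and P(E|A,B,C,D) reduced using the independencies in the table]
$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A)\,P(D \mid B,C)\,P(E \mid C)$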
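The benefit of exploiting the independencies in the table can be quantified by counting free parameters. Applying them leaves B and C with parent A, D with parents B and C, and E with parent C. The sketch assumes every variable is binary, which the slides do not state.

```python
def free_parameters(parents, cardinality=2):
    """Number of free parameters in a discrete Bayesian network.

    parents maps each node to its list of parents. Every variable is
    assumed to have the same number of states (binary by default).
    """
    return sum(
        (cardinality - 1) * cardinality ** len(node_parents)
        for node_parents in parents.values()
    )


# Full chain-rule decomposition: each node depends on all earlier nodes.
full = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["A", "B", "C"],
    "E": ["A", "B", "C", "D"],
}

# Decomposition after applying the conditional independencies.
simplified = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
}

full_count = free_parameters(full)              # 1 + 2 + 4 + 8 + 16
simplified_count = free_parameters(simplified)  # 1 + 2 + 2 + 4 + 2
```

For binary variables the count drops from 31 to 11, and the gap widens quickly as variables or states are added.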
Bayesian Networks and Big Data
These models can be learnt automatically from data. Learning a Bayesian network involves learning:
• The structure of the network (that encodes the conditional independencies present).
• The parameters of the conditional distributions associated with each variable/node.
[Figure: diagram relating Data, Causes, Events and Actions to predictive analytics applications and diagnostic applications]
Model Building and Parameter Learning
• Prior knowledge regarding the structural relationships can be encoded using white and black lists.
• Structural constraints can be included, such as limiting the maximum number of parents.
• The number of parameters required remains constant, regardless of the size of the dataset.
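The structural constraints listed above can be expressed as a simple filter on candidate edge sets. In this sketch a white list is read as "edges that must be present" and a black list as "edges that must be absent", which is a common convention but an assumption here; the function name is hypothetical.

```python
def satisfies_constraints(edges, whitelist=(), blacklist=(), max_parents=None):
    """Check a set of directed (parent, child) edges against constraints.

    whitelist: edges that must be present (assumed meaning).
    blacklist: edges that must not be present (assumed meaning).
    max_parents: optional upper bound on the parents of any node.
    """
    edges = set(edges)
    if not set(whitelist) <= edges:
        return False
    if edges & set(blacklist):
        return False
    if max_parents is not None:
        parent_counts = {}
        for _parent, child in edges:
            parent_counts[child] = parent_counts.get(child, 0) + 1
        if any(count > max_parents for count in parent_counts.values()):
            return False
    return True


# Illustrative network: A -> B, A -> C, B -> D, C -> D, C -> E.
network = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")}
ok = satisfies_constraints(
    network, whitelist=[("A", "B")], blacklist=[("E", "A")], max_parents=2
)
too_many_parents = satisfies_constraints(network, max_parents=1)
```

A search procedure would call such a check on each candidate topology before scoring it, so that forbidden structures are never considered.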
Model Comparison
Classical algorithm: Greedy Search
Inputs:
1. The space of all possible topologies, T.
2. A number of transitions which can take us from one topology to another:
   ● Insert Edge
   ● Remove Edge
   ● Reverse Edge
Given a topology, t, call the set of topologies that can be transitioned to from t the neighbours of t.
Output: A topology corresponding to a local maximum in T.
3. Begin at a random topology, t.
4. Score t.
5. Repeat:
   ● Score all neighbours of the current topology.
   ● Transition to the highest scoring neighbour, as long as it is better than the current topology.
A video of a search over possible topologies using Inatas System Modeler. Network augmented with decision and utility nodes, and search is over equivalence classes of topologies ordered by set-theoretic inclusion of conditional independencies
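The greedy search over topologies can be sketched in a few lines. This is a simplified illustration, not the Inatas implementation: it searches over individual DAGs rather than equivalence classes, starts from a caller-supplied topology (empty by default) instead of a random one, and takes the scoring function as a parameter. The toy score used in the example is made up; a real score would be computed from data.

```python
import itertools


def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edges; fail if none exist."""
    remaining = set(edges)
    active = set(nodes)
    while active:
        roots = {n for n in active
                 if not any(child == n for _, child in remaining)}
        if not roots:
            return False
        active -= roots
        remaining = {(p, c) for p, c in remaining if p not in roots}
    return True


def neighbours(nodes, edges):
    """Topologies reachable by one insert, remove or reverse (kept acyclic)."""
    result = []
    for a, b in itertools.permutations(nodes, 2):
        if (a, b) in edges:
            result.append(edges - {(a, b)})                   # remove edge
            reversed_edges = (edges - {(a, b)}) | {(b, a)}    # reverse edge
            if is_acyclic(nodes, reversed_edges):
                result.append(reversed_edges)
        elif (b, a) not in edges:
            inserted = edges | {(a, b)}                       # insert edge
            if is_acyclic(nodes, inserted):
                result.append(inserted)
    return result


def greedy_search(nodes, score, start=frozenset()):
    """Hill-climb until no neighbour scores higher than the current topology."""
    current = frozenset(start)
    current_score = score(current)
    while True:
        best = max(neighbours(nodes, current), key=score, default=None)
        if best is None or score(best) <= current_score:
            return current, current_score
        current, current_score = best, score(best)


# Toy score (an assumption for illustration): closeness to a known target.
nodes = ["A", "B", "C", "D", "E"]
target = frozenset({("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")})
found, found_score = greedy_search(nodes, lambda t: -len(t ^ target))
```

With this toy score every step has a strictly better neighbour until the target is reached, so the search cannot get stuck. With a data-driven score the result is only a local maximum, which is why the choice of starting point matters.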
Model Comparison
• Multiple topologies can encode the same sets of conditional independencies. In this case we say they are Markov equivalent.
• For multiple reasons, it becomes attractive to search over Markov equivalence classes of topologies rather than topologies themselves.
[Figure: two networks over A, B, C, D, E that differ only in the direction of the edge between A and B. One has distributions P(A), P(B|A), P(C|A), P(D|B,C), P(E|C); the other has P(B), P(A|B), P(C|A), P(D|B,C), P(E|C)]
Two Bayesian networks are Markov equivalent if and only if they share the same skeleton (edges ignoring direction) and the same v-structures (head-to-head meetings of edges whose parent nodes are not themselves adjacent).
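The skeleton and v-structure criterion translates directly into code. The sketch below represents a network as a set of (parent, child) edges and counts a head-to-head meeting as a v-structure only when the two parents are not adjacent; the function names are hypothetical.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the graph: edges with direction ignored."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Triples (a, child, b) where a -> child <- b and a, b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, node_parents in parents.items():
        for a, b in combinations(sorted(node_parents), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(edges_1, edges_2):
    """True when both graphs share a skeleton and a set of v-structures."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and v_structures(edges_1) == v_structures(edges_2))


# Two networks differing only in the direction of the A-B edge.
net_1 = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")}
net_2 = {("B", "A"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")}
# Reversing C -> E instead creates a new v-structure A -> C <- E.
net_3 = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("E", "C")}

same_class = markov_equivalent(net_1, net_2)
different_class = markov_equivalent(net_1, net_3)
```

Reversing the A-B edge keeps the single v-structure B → D ← C intact, so the first two networks are equivalent; reversing C → E is not harmless because it makes C a new head-to-head node.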
Incomplete Data
• The previous discussion assumed that the training data was complete.
• In cases where data is incomplete, the same procedures can be followed after 'completing' the data set using one of a number of methods, including:
  – Expectation Maximisation
  – Gibbs Sampler
• The computational cost of these algorithms is very high, and it is often prohibitive when dealing with large (>100,000) datasets.
• Fortunately, it is possible to ignore incomplete data on a local basis due to the decomposability of the scoring criteria. This generally suffices in large cases.
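Ignoring incomplete data "on a local basis" can be read as follows: because the score decomposes into one term per node and its parents, each term can be computed from the records in which that particular family is fully observed. The sketch below shows the counting step under that reading; the record format (dicts with `None` for missing values) and the function name are assumptions.

```python
def family_counts(records, child, parents):
    """Count (parent values, child value) combinations for one family.

    A record is skipped only if it lacks a value needed by this family,
    so a record with gaps elsewhere still contributes here.
    """
    counts = {}
    for record in records:
        values = [record.get(name) for name in (*parents, child)]
        if any(value is None for value in values):
            continue
        key = (tuple(values[:-1]), values[-1])
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up records; None marks a missing value.
records = [
    {"A": 1, "B": 0, "C": None},
    {"A": 1, "B": None, "C": 1},
    {"A": 0, "B": 0, "C": 1},
]
counts_b = family_counts(records, child="B", parents=["A"])
counts_c = family_counts(records, child="C", parents=["A"])
```

No record here is complete across all three variables except the last, yet each family still gets two usable observations.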
Online Adaptability
• Structural learning is expensive.
• Ideally, we wish models to adapt to new data rather than require relearning.
• Parameter learning can continue after model generation. Bayesian methods again offer a high quality, simple solution.
• Ongoing likelihood analysis of data can indicate if complete structural relearning may be necessary.
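One standard Bayesian approach to continued parameter learning for discrete nodes is to keep Dirichlet pseudo-counts per parent configuration and add one count for each new observation. The sketch below illustrates that idea; it is not necessarily the method the authors use, and the class name and uniform prior are assumptions.

```python
from collections import defaultdict


class OnlineCPT:
    """Conditional probability table updated one observation at a time.

    Uses a symmetric Dirichlet prior with prior_count pseudo-observations
    per state (an assumed default of 1.0).
    """

    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        self.prior_count = prior_count
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, parent_config, state):
        """Record one observation of state under the given parent values."""
        self.counts[parent_config][state] += 1.0

    def probability(self, parent_config, state):
        """Posterior mean estimate of P(state | parent_config)."""
        row = self.counts[parent_config]
        total = sum(row.values()) + self.prior_count * len(self.states)
        return (row[state] + self.prior_count) / total


cpt = OnlineCPT(states=["low", "high"])
before = cpt.probability(("a",), "high")     # prior only
for observed in ["high", "high", "low", "high"]:
    cpt.update(("a",), observed)
after = cpt.probability(("a",), "high")      # (3 + 1) / (4 + 2)
```

Each update is a constant-time increment, so the parameters track incoming data without revisiting the structure or the historical records.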
RESTful Bayesian Networks
Remotely manage your models through an HTTPS-based API.
Inatas Web API
– Learning
– Inference and Decision Making
– Online Adaption
– Ensemble models
– Exact and Monte Carlo methods
– Dynamic resources allocation
– Secure (HTTPS + private/public key encryption if desired)
– Wide range of density estimation methods from kernel and wavelet based non-parametric methods
– Open source plug-ins.
• Ashcroft, M. (2012). An Introduction to Bayesian Networks in Systems and Control, 18th International Conference on Automation and Computing (ICAC).
• Bai, X., Kenett, R.S. and Yu, W. (2012). Risk Assessment and Adaptive Group Testing of Semantic Web Services, International Journal of Software Engineering and Knowledge Engineering, 22 (5), pp. 595-620. http://www.worldscientific.com/toc/ijseke/22/05
• Chickering, D. (2002). Learning equivalence classes of Bayesian network structures, Journal of Machine Learning Research, 2, pp. 445-498.
• Chickering, D. (2002). Optimal structure identification with greedy search, Journal of Machine Learning Research, 3, pp. 507-554.
• Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks, Artificial Intelligence, 42, pp. 393-405.
• Gruber, A. and Ben Gal, I. (2012). Efficient Bayesian Network Learning for System Optimization in Reliability Engineering, Quality Technology & Quantitative Management, 9 (1), pp. 97-114.
• Jensen, F.V. (2001). Bayesian Networks and Decision Graphs, Springer.
• Harel, A., Kenett, R.S. and Ruggeri, F. (2009). Modeling Web Usability Diagnostics on the basis of Usage Statistics, in Statistical Methods in eCommerce Research, W. Jank and G. Shmueli (editors), Wiley.
• Kenett, R.S., Harel, A. and Ruggeri, F. (2009). Controlling the Usability of Web Services, International Journal of Software Engineering and Knowledge Engineering, 19 (5), pp. 627-651.
• Kenett, R.S. and Raanan, Y. (2010). Operational Risk Management: a practical approach to intelligent data analysis, Wiley and Sons. http://www.wiley.com/WileyCDA/WileyTitle/productCd-047074748X.html
• Kenett, R.S., Gruber, A. and Ben Gal, I. (2012). Applications of Bayesian Networks to Small, Mid-size and Massive Data, The 6th Inter. Workshop on Applied Probability – IWAP 2012, Jerusalem, June 11-14th.
• Kenett, R.S. (2012). Applications of Bayesian Networks to Operational Risks, Healthcare, Biotechnology and Customer Surveys, Proceedings of the 22nd Colombian Statistics Symposium, The National University of Colombia, Bucaramanga, Colombia, July 17-22nd. http://ssrn.com/abstract=2172713
• Kenett, R.S. (2012). Risk Analysis in Drug Manufacturing and Healthcare, in Statistical Methods in Healthcare, Faltin, F., Kenett, R.S. and Ruggeri, F. (editors in chief), John Wiley and Sons.
• Koski, T. and Noble, J. (2009). Bayesian Networks – An Introduction, John Wiley & Sons, UK.
• Pearl, J. (1985). Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning (UCLA Technical Report CSD-850017), Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, pp. 329-334.
• Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed., Cambridge University Press, UK.