Mining moving flock patterns in large spatio-temporal datasets using a frequent pattern mining approach

Andres Oswaldo Calderon Romero

March 2011

Course Title: Geo-Information Science and Earth Observation for Environmental Modelling and Management

Level: Master of Science (MSc.)

Course Duration: September 2009 – March 2011

Consortium partners: University of Southampton (UK), Lund University (Sweden), University of Warsaw (Poland), University of Twente, Faculty ITC (The Netherlands)

GEM thesis number: 2011–

Mining moving flock patterns in large spatio-temporal datasets using a frequent pattern mining approach
by Andres Oswaldo Calderon Romero

Thesis submitted to the University of Twente, faculty ITC, in partial fulfilment of the requirements for the degree of Master of Science in Geo-information Science and Earth Observation for Environmental Modelling and Management.

Thesis Assessment Board

Chairman: Prof. Dr. Menno-Jan Kraak
External Examiner: Dr. Jadu Dash
First Supervisor: Dr. Otto Huisman
Second Supervisor: Dr. Ulanbek Turdukulov

Disclaimer

This document describes work undertaken as part of a programme of study at the University of Twente, Faculty ITC. All views and opinions expressed therein remain the sole responsibility of the author, and do not necessarily represent those of the university.

Abstract

Modern data acquisition techniques such as the Global Positioning System (GPS), Radio-Frequency Identification (RFID) and mobile phones have resulted in the collection of huge amounts of data in the form of trajectories during the past years. The popularity of these technologies and the ubiquity of mobile devices seem to indicate that the amount of spatio-temporal data will increase at accelerated rates in the future. Many previous studies have focused on efficient techniques to store and query trajectory databases. Early approaches to recovering information from this kind of data include single-predicate range and nearest-neighbour queries. However, these are unable to capture collective behaviour and correlations among moving objects. Recently, a new interest in querying patterns capturing 'group' or 'common' behaviour has emerged. An example of this type of pattern is the moving flock: a group of moving objects that move together (within a predefined distance of each other) for a certain continuous period of time. Current algorithms to discover moving flock patterns report problems in scalability and in the way the discovered patterns are reported. The field of frequent pattern mining has faced similar problems during the past decade, and has sought to provide efficient and scalable techniques which successfully deal with those issues. This research proposes a framework which integrates techniques for clustering, frequent pattern mining, postprocessing and visualization in order to discover and analyse moving flock patterns in large trajectory datasets. The proposed framework was tested and compared with a current method (the BFE algorithm). Synthetic datasets simulating trajectories generated by a large number of moving objects were used to test the scalability of the framework. Real datasets with different contexts and characteristics were used to assess the performance and analyse the discovered patterns.
The framework proves to be efficient, scalable and modular. This research shows that moving flock patterns can be generalized as frequent patterns, and that state-of-the-art algorithms for frequent pattern mining can be used to detect them. This research also develops preliminary visualizations of the most relevant findings. Appropriate interpretation of the results demands further analysis in order to display the most relevant information.

Keywords: Frequent pattern mining, Flock patterns, Trajectory datasets.

Acknowledgements

I would like to express my sincere gratitude to my first supervisor, Dr. Otto Huisman, and second supervisor, Dr. Ulanbek Turdukulov, for their great support and guidance during this research. I think I was the most fortunate student for having the chance to work with such great scientists. I very much appreciate your support, critical comments and suggestions. Thank you so much!!!

I would also like to thank Petter Pilesjo, Malgorzata Roge-Wisniewska, Andre Kooiman and Louise van Leeuwen for their valuable help at different stages of my studies. A special "Thank you!!!" goes to all my GEM friends for the wonderful time we had together. You were my second family during the past months and I will never forget you. I will miss you a lot.

I would like to dedicate this thesis to my parents, Marcelo and Esperanza, my brother and sisters, Carlos, Paola and Carolina, and my little nephew and niece, Chris and Gabi. Thank you for believing in me even when I found it difficult to believe in myself. I owe you much more than this.

Finally, I want to thank my fiancée. Nancy, you are the love of my life. Thank you for all your infinite love, support and patience during all this time. I love you!!!

Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Research identification
    1.3.1 Research objectives
    1.3.2 Research questions
    1.3.3 Innovation aimed at
    1.3.4 Related work
  1.4 Thesis structure

2 Framework Definition
  2.1 Identifying patterns in moving objects
  2.2 Basic Flock Pattern algorithm
  2.3 Finding frequent patterns in traditional databases
    2.3.1 Shopping basket analysis: an example
    2.3.2 Maximal and Closed frequent patterns
  2.4 Proposed Framework
    2.4.1 Getting a final set of disks per timestamp
    2.4.2 From trajectories to transactions
    2.4.3 Frequent Pattern Mining Algorithms
    2.4.4 Postprocessing Stage
  2.5 Flock Interpretation

3 Implementation
  3.1 BFE Implementation
  3.2 Synthetic Generators
  3.3 Synthetic Datasets
  3.4 Internal Comparison
  3.5 Framework Implementation
  3.6 Computational Experiments
  3.7 Validation

4 Study Cases
  4.1 Tracking Icebergs in Antarctica
    4.1.1 Implications and possible applications
    4.1.2 Data cleaning and preparation
    4.1.3 Computational experiments
    4.1.4 Results
    4.1.5 Findings in iceberg tracking
  4.2 Pedestrian movement in Beijing
    4.2.1 Implications and possible applications
    4.2.2 Data cleaning and preparation
    4.2.3 Computational experiments
    4.2.4 Results
    4.2.5 Findings in pedestrian movement

5 Discussion
  5.1 Implementation and Performance Issues
    5.1.1 Impact of trajectory size
    5.1.2 Possible solutions
  5.2 Interpretation Issues
    5.2.1 Number of patterns and quality of the results
    5.2.2 Overlapping problem and alternatives

6 Conclusions and Recommendations
  6.1 Summary of the Research
  6.2 Recommendations

References

Appendices

A Main source code of the framework implementation

List of Figures

1.1 A flock pattern example: {T1, T2, T3}. Ti illustrates different trajectories, ci encloses a disk in which trajectories are considered close to each other and ti represents consecutive time intervals (after [82]).
2.1 BFE Algorithm for computing the set of final disks per timestamp and for joining and reporting final flock patterns (source: [82]).
2.2 BFE pruning stages. (a) The initial set of disks. (b) Only disks which surpass μ are retained (μ = 3). (c) Redundant disks with subset members are removed.
2.3 Shopping Basket Analysis example (source: [33]).
2.4 A trajectory dataset example.
2.5 Example of a flock where different interpretations can apply.
3.1 Oldenburg network representation.
3.2 San Joaquin network representation.
3.3 Comparison of internal execution time for the SJ25KT60 dataset.
3.4 Comparison of internal execution time for the SJ50KT55 dataset.
3.5 Systematic diagram for the proposed framework.
3.6 Overlapping problem during the generation of final disks.
3.7 Performance of the BFE algorithm and the proposed framework with different values for ε in the SJ25KT60 dataset. The additional parameters were set as μ = 5 and δ = 3.
3.8 Performance of the BFE algorithm and the proposed framework with different values for ε in the SJ50KT55 dataset. The additional parameters were set as μ = 9 and δ = 3.
3.9 Visualization of the results from BFE (left) and the proposed framework (right). BFE displays 448 flocks while the proposed framework displays 104.
4.1 Reported positions for all icebergs in the Iceberg dataset (1978, 1992–2009).
4.2 The circumpolar and coastal currents (West and East wind drifts) around the Antarctic continent (source: [93]).
4.3 Spatial location of Antarctic krill catches (dotted and lined regions). Black areas illustrate ice shelves and fast ice during summer (source: [63]).
4.4 Comparison between the BFE algorithm and the proposed framework performance for different values of ε in the Icebergs06 dataset.
4.5 General view of the discovered patterns in the Icebergs06 dataset. Arrows indicate the direction of the flocks.
4.6 Detail of discovered flocks in the Icebergs06 dataset. Arrows indicate the direction of the flocks.
4.7 General view of the discovered patterns from January 01 to February 15.
4.8 General view of the discovered patterns from June 03 to August 17.
4.9 Distribution of points in the study area. Left shows the sparse distribution around China. Right focuses on the 5th Ring Road area in Beijing (source: [98]).
4.10 Comparison of both methods with different values for ε in the Beijing dataset.
4.11 General view of the discovered flocks in the Beijing dataset.
4.12 Close-up around the region which concentrates the largest number of flocks. Some universities and IT institutions are highlighted.
4.13 Patterns shorter than 5 km during workdays. The circle encloses the major concentration around the TSP region. Arrows highlight other locations.
4.14 Patterns showing different routes connecting the TSP area with the South. Yellow patterns go from TSP to the South; green patterns show the return.
5.1 Example of reported flocks with different values of ε.

List of Tables

2.1 Transactional version of the dataset from Figure 2.4.
3.1 Data format from generator.
3.2 Synthetic Datasets.
3.3 Number of combinations required for specific time intervals in the SJ50KT55 dataset.
3.4 Number of flocks generated before and after the postprocessing phase for BFE and the proposed framework in the SJ25KT60 dataset.
3.5 Number of flocks generated before and after the postprocessing phase for BFE and the proposed framework in the SJ50KT55 dataset.
4.1 Iceberg trajectories during 2006 in Antarctica.
4.2 Number of flocks generated before and after postprocessing in the Icebergs06 dataset.
4.3 Description of the discovered flock patterns in the Icebergs06 dataset. The first column corresponds to tags in Figures 4.5 and 4.6.
4.4 GPS log trajectories in Beijing.
4.5 Number of flocks generated before and after postprocessing in the Beijing dataset.
4.6 Description of the discovered flock patterns in the Beijing dataset.

Chapter 1

Introduction

1.1 Background

Modern data acquisition techniques such as the Global Positioning System (GPS), Radio-Frequency Identification (RFID), mobile phones, wireless sensor networks, and general surveys have resulted in the collection of huge amounts of geographic data during the past years. The popularity of these technologies and the ubiquity of mobile devices seem to indicate that the amount of georeferenced data will increase at accelerated rates in the future. However, and despite the growing demand, there are few tools available for proper analysis of spatio-temporal datasets. The natural complexities in data handling, accuracy, privacy and sheer volume have turned the analysis of spatial data into a challenging task. Traditional spatial analysis techniques are not an effective solution: they were developed at a time when access to geodata was limited and its quality poor and, as a result, they cannot offer the scalability needed to manage the increasing dimensionality of data. Therefore, there is an urgent need for new and efficient techniques to support the analysis and extraction of valuable information from voluminous and complex spatio-temporal datasets.

Trajectory data associated with moving objects is one of the fields in which volume has increased considerably. Early approaches to recovering information from this kind of data include single-predicate range and nearest-neighbour queries, for instance, "find all the moving objects inside area A between 10:00 AM and 2:00 PM" or "how many cars drove between Main Square and the Airport on Friday". Recently, diverse studies have focused on querying patterns capturing group behaviour in moving object databases, for instance: moving clusters, convoy queries and flock patterns [42, 47, 43, 82, 54]. Flock pattern detection is particularly relevant due to the characteristics of the objects of study (animals, pedestrians, vehicles or natural phenomena), how they interact with each other and how they move together [50, 31].
[82] define moving flock patterns as groups of entities moving in the same direction while staying close to each other for the duration of a given time interval (Figure 1.1). A group of trajectories is considered close together if there exists a disk of a given radius that encloses all of them. The current approach to discovering moving flock patterns consists of finding a suitable set of disks at each time instance and then merging the results from one time instance to the next. As a consequence, the performance and the number of final patterns depend on the number of disks and how they are combined.

Figure 1.1: A flock pattern example: {T1, T2, T3}. Ti illustrates different trajectories, ci encloses a disk in which trajectories are considered close to each other and ti represents consecutive time intervals (after [82]).

In parallel, some areas of traditional data mining have also focused on discovering frequent patterns in general attribute data. Association rule learning and frequent pattern mining [37] are popular and well-researched methods for discovering interesting relations between variables in large databases. Frequent patterns are itemsets, subsequences, or substructures that appear in a dataset with a frequency no less than a user-specified threshold. Initially, association rule learning and frequent pattern mining algorithms were designed to solve a specific task in the commerce sector [33]. However, the approach shares interesting similarities with the problem of finding moving flock patterns, for example, the efficient handling of candidates and combinations [1, 39].
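To make the notion of a frequent pattern concrete, the following is a minimal brute-force sketch in Python. The basket contents and the support threshold are invented for illustration; real miners such as Apriori or FP-growth avoid enumerating every candidate itemset, but the definition being checked is the same.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive frequent itemset mining: report every itemset that appears
    in at least `min_support` of the given transactions."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # support = number of transactions containing the whole candidate
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
                found = True
        if not found:
            # downward closure: if no itemset of this size is frequent,
            # no larger itemset can be frequent either
            break
    return frequent

# Hypothetical shopping baskets
baskets = [{"bread", "milk"}, {"bread", "beer"},
           {"bread", "milk", "beer"}, {"milk"}]
result = frequent_itemsets(baskets, min_support=2)
```

With these baskets, `("bread", "milk")` is frequent (it appears in two of the four baskets) while `("beer", "milk")` is not.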

1.2 Problem statement

Proposed algorithms to discover flock patterns scan the data in order to find disks which can be joined between consecutive time instances. The number of possible disks in a given time interval can be quite large, and the cost of joining those disks between time intervals can be quite expensive. The handling and analysis of all possible combinations has a direct impact on the algorithm's performance. [82, 10] have tested some heuristics and approximations aiming to reduce the number of disks evaluated. However, experimental results still show large response times. In addition, the number and quality of the discovered flock patterns make it particularly difficult to perform a proper interpretation of the results.

1.3 Research identification

Traditional data mining techniques, such as association rule learning and, particularly, frequent pattern mining, were faced with similar combination and interpretation issues. This investigation aims to define a new methodology to mine moving flock patterns in trajectory datasets based on the frequent pattern mining approach, aiming to tackle the aforementioned drawbacks. Procedures and conceptualization will be outlined, together with their validity and usefulness, using synthetic and real study cases.

1.3.1 Research objectives

In order to accomplish this purpose there are three main objectives:

1. To conceptualise an appropriate procedure to fit the concept of moving flock patterns into the frequent pattern mining methodology.
2. To implement a framework for pattern recognition in moving object datasets based on the methodology proposed.
3. To test the performance of the resulting framework using study cases with real and synthetic datasets.

1.3.2 Research questions

1. For design:
   (a) How can the basic concepts of the frequent pattern mining approach be applied to spatio-temporal datasets?
   (b) How can existing methods and data structures be adapted to fit the specific requirements of frequent pattern mining algorithms?
   (c) What would be an appropriate method to visualize and interpret the results?

2. For testing:
   (a) How does the proposed framework perform on datasets with different characteristics?
   (b) Which parameters and characteristics are the most important in determining the algorithm's performance?
   (c) Is the proposed framework applicable to different contexts and phenomena?
   (d) Are the results from the framework useful and interpretable?

1.3.3 Innovation aimed at

Innovation in this research is aimed at the implementation of a novel moving flock pattern framework which adapts traditional frequent pattern mining techniques in order to reduce the number of combinations and to improve the understanding of the results. The scalability and performance of the proposed framework will be tested with synthetic and real datasets in the context of human movement (pedestrians) and natural phenomena (icebergs). Generation and visualization of the most relevant results will also be explored.


1.3.4 Related work

Due to the increasing collection of movement datasets, interest in querying patterns which describe collective behaviour has also increased. [82] enumerate three groups of 'collective' patterns in moving object databases: moving clusters, convoy queries and flock patterns. Both moving clusters [42, 47, 53] and convoy queries [43, 44] have in common that they are based on clustering algorithms, mainly density-based algorithms such as DBSCAN [21]. The main differences between these two techniques are how they join clusters between two consecutive time intervals and the use, in convoy queries, of an extra parameter to specify a minimum duration. Although these methods are closely related to flock patterns, they differ from the latter technique in that the resulting clusters do not assume a predefined shape.

Previous work on the detection of moving flock patterns is reported by [30] and [10]. They introduce the use of disks with a predefined radius to identify groups of trajectories moving together in the same direction. All trajectories which lie inside the disk at a particular time instance are considered a candidate pattern. The main limitation of this procedure is that there is an infinite number of possible placements of the disk at any time instance. Indeed, [30] have shown that the discovery of fixed flocks, patterns where the same entities stay together during the entire interval, is an NP-hard problem. [82] are the first to present an exact solution for reporting flock patterns in polynomial time, including variants that can work effectively in real time. Their work reveals that a polynomial-time solution can be found by identifying a discrete number of locations at which to place the centre of the flock disk. They propose the Basic Flock Evaluation (BFE) algorithm, based on time-joins and combinations, and four other algorithms based on heuristics to reduce the total number of candidate disks to be combined and, thus, the overall computational cost of the BFE algorithm. However, pseudo-code and experimental results still show relatively high computational complexity, long response times and a large number of discovered flocks, which makes interpretation difficult.

Recently, [88] have proposed a new moving flock pattern definition and developed a corresponding algorithm based on the notion of spatio-temporal coherence. The experimental results focus on finding flock patterns in pedestrian datasets. Although they used a real dataset collected in a national park in the Netherlands, it is too small to appropriately test the scalability of the algorithm. An interesting contribution of this study is a comparison framework of existing flock detection approaches according to the classification criteria recently introduced by [92] for collective movement. In order to reduce response time, spatial data structures and indexes have also been tested, e.g. the k-d tree and some variations. [10] have applied skip-quadtrees, which use compressed quadtrees as the bottom-level structure. However, their study only explores flock identification in single time intervals, so the inclusion of temporal variables was not considered.

Traditional data mining techniques, and particularly the field of frequent pattern mining, have addressed the number of combinations by reducing the number of elements to be combined or by compacting the size of the dataset. [1] have applied pruning techniques based on the downward-closure property, which guarantees that all the subsets of a frequent pattern must also be frequent. Using this property, the authors identified invalid candidates and removed them from the analysis. However, this technique still scans the dataset repeatedly. [36] proposed an intermediate layer which organizes the records in a compact data structure called the frequent-pattern tree (FP-tree). The main advantages of this methodology are compression of the dataset, minimization of scans and detection of patterns without candidate generation [36, 12]. Recently, [39] have proposed a novel and improved FP-tree structure applied in different contexts, for instance: market baskets, association rules and sequential patterns. [75] have applied this methodology successfully to find co-orientation patterns in satellite imagery. Their empirical results show an improvement of around one order of magnitude with respect to the traditional approach. Recently, the Linear time Closed itemset Miner (LCM) [81] has demonstrated remarkable performance on dense databases using Binary Decision Diagrams, a compact graph-based data structure. Frequent patterns can be efficiently processed by using algebraic operations, and LCM requires linear time to mine frequent patterns when the data compression works well. A performance comparison of LCM and other state-of-the-art techniques can be consulted in [9, 28].

However, [33] show how frequent pattern mining may generate a huge number of frequent patterns, and the situation is even worse when there exist long patterns in the data, because if a pattern is frequent, each of its subpatterns is frequent as well. This clearly increases the complexity of analysis and understanding. To overcome this problem, Closed and Maximal pattern mining were proposed [7, 69]. The general idea is to report just the longest patterns, omitting their subpatterns. The aforementioned techniques have been applied successfully to diverse scenarios such as bioinformatics [17, 16], GIS [60, 35] and marketing [96, 24]. The interested reader should refer to [37] for a complete survey of the current status of the frequent pattern mining approach.
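The distinction between closed and maximal patterns can be illustrated with a short sketch. The itemsets and support counts below are hypothetical, and the sketch assumes the frequent sets have already been mined by some other algorithm:

```python
def closed_and_maximal(frequent):
    """Given {itemset (frozenset): support}, return the closed and the
    maximal patterns. Closed: no proper superset has the same support.
    Maximal: no proper superset is frequent at all."""
    closed, maximal = set(), set()
    for s, sup in frequent.items():
        supersets = [t for t in frequent if s < t]  # proper supersets
        if not any(frequent[t] == sup for t in supersets):
            closed.add(s)
        if not supersets:
            maximal.add(s)
    return closed, maximal

# Hypothetical mined supports
freq = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
    frozenset({"c"}): 2,
}
closed, maximal = closed_and_maximal(freq)
```

Here {a} is closed but not maximal (its superset {a, b} is frequent but with a lower support), {b} is neither (it is absorbed by {a, b} at the same support), and {a, b} and {c} are both closed and maximal; every maximal pattern is closed, but not vice versa.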

1.4 Thesis structure

The remainder of the thesis is outlined as follows. Chapter 2 explains the basic concepts used to identify patterns in moving objects. The Basic Flock Pattern algorithm is introduced together with the formal definition of a moving flock pattern. Afterwards, frequent pattern mining in traditional databases is briefly discussed in order to explain in more depth relevant concepts used in the following chapters. Then, the general steps of the proposed framework are explained. Finally, a discussion about possible flock interpretations is presented.

Chapter 3 concentrates on implementation and technical issues. The first part explains the methods and technologies used in the development of the BFE algorithm. It then explains the generation and main characteristics of the synthetic datasets used to test the implementation. Later, it focuses on the internal comparison between the two phases of the BFE algorithm. Afterwards, the main issues in the implementation of the proposed framework are described. The final part of the chapter presents a performance comparison between BFE and the proposed framework using the aforementioned synthetic datasets.

Chapter 4 focuses on study cases with real datasets. Two different types of moving entities are studied: pedestrians and icebergs. The chapter presents tests similar to those evaluated with synthetic datasets, together with justification, possible applications and a discussion of results.


Chapter 5 provides a more detailed discussion of the framework implementation. The main point of discussion is the impact of the size of trajectories on the framework's performance. The discussion then focuses on the limitations and alternatives of the techniques used in the framework with respect to the understanding and interpretation of the results. Finally, Chapter 6 presents the conclusions and recommendations.


Chapter 2

Framework Definition

2.1 Identifying patterns in moving objects

Due to the increasing availability of spatial databases, different methodologies have been explored in order to find meaningful information hidden in this kind of data. New understanding of how diverse entities move in a spatial context has proved useful in topics as diverse as sports [41], socio-economic geography [23], animal migration [20] and security and surveillance [58, 71].

Early approaches to recovering information from spatio-temporal datasets include ad-hoc queries aimed at answering single-predicate range or nearest-neighbour queries, for instance, "find all the moving objects inside area A between 10:00 AM and 2:00 PM" or "how many cars drove between Main Square and the Airport on Friday". Spatial query extensions in common GIS software packages and DBMSs are able to run this type of query; however, these techniques try to find the best solution by exploring one spatial object at a time according to some distance metric (usually Euclidean). As a result, it is difficult to capture collective behaviour and correlations among the involved entities using this type of query.

Recently, a new interest in querying patterns capturing 'group' or 'common' behaviour among moving entities has emerged. Of particular interest is the development of approaches to identify groups of moving objects which share a strong relationship and interaction in a defined spatial region during a given time duration. Some examples of these kinds of approaches are moving clusters [47, 42], convoy queries [43] and flock patterns [30, 10, 82]. Although different interpretations can be taken, a flock pattern refers to a predefined number of entities which stay close enough together during at least a given time interval. The challenge of identifying this kind of movement pattern is particularly relevant due to the intrinsic interactions among the members of the flock, especially in the context of animals, pedestrians or vehicles.
In this research an alternative framework for discovering moving flock patterns is proposed. Part of this framework is based on an existing state-of-the-art algorithm, extended to take advantage of well-known and tested frequent pattern mining algorithms in the area of association rule learning. The details of these concepts and the methodology used to build the proposed framework will be discussed in the following sections.


Figure 2.1: BFE Algorithm for computing the set of final disks per timestamp and for joining and reporting final flock patterns (source: [82]).

2.2 Basic Flock Pattern algorithm

Flock pattern finding was first introduced by [31] and [50]; however, they did not consider the notion of duration in time. In a first approximation to identify flocks, just two variables were used: a constant maximum distance among moving objects (ε), which represents the diameter of a disk, and a minimum number of moving objects (μ) which should lie inside that disk. Later, [30] added a minimum time duration (δ) to be considered as a parameter of a flock. Initial experiments showed that finding an appropriate location for the disk was not a trivial problem. It is shown in [30] that discovering the longest duration flock pattern is an NP-hard problem; for that reason, that work presented only approximation algorithms. Recently, [82] introduced an on-line algorithm to find moving flock patterns called the Basic Flock Evaluation algorithm (BFE). This appears to be the first work to present exact solutions for reporting flock patterns in polynomial time. It was decided to adopt the general definition for moving flock patterns used in [82], illustrated in Figure 1.1. It defines a dataset of trajectories and the parameters ε, μ and δ as inputs:

Definition. Given are a set of trajectories τ, a minimum number of trajectories μ > 1 (μ ∈ ℕ), a maximum distance ε > 0 (ε ∈ ℝ) and a minimum time duration δ > 1 (δ ∈ ℕ). A flock pattern Flock(μ, ε, δ) reports all maximal size collections F of trajectories where: for each f_k in F the number of trajectories in f_k is greater than or equal to μ (|f_k| ≥ μ), and there exist δ consecutive time instances such that, for every time instance t_i among them, there is a disk with center c_k^{t_i} and radius ε/2 covering all points in f_k^{t_i}.

The general operation of the BFE algorithm can be explained in two parts. The first function (left in Figure 2.1) aims to build a final set of disks which, per each timestamp, brings together a minimum number of objects that remain close enough to each other.
The second part (right in Figure 2.1) joins candidate disks across consecutive timestamps if and only if the number of objects they have in common reaches the minimum value μ. In


Figure 2.2: BFE pruning stages. (a) The initial set of disks. (b) Only disks which contain at least μ trajectories are retained (μ = 3). (c) Redundant disks whose members are subsets of another disk are removed.

addition, the minimum duration parameter δ must be satisfied for the objects to be reported. The first section of the algorithm uses a grid-based index to organize the set of locations at every time instance and identify pairs of points which are less than ε units of distance from one another. For each pair of points it is possible to generate two disks with radius ε/2 that have those points on their circumference. These two disks are considered as candidates before testing whether they contain the required minimum number μ of trajectories. In large datasets, the number of possible pairs of points, and therefore of disk candidates, can be huge. Additional tasks are therefore required in order to eliminate redundancy in the initial set of disks: if the complete set of trajectories within a disk also appears in another disk, only one of them should be kept. The algorithm organizes the initial set of disks in a KD-Tree structure, so it is easy to detect groups of disks which intersect each other and then to check whether the members of one disk are a superset or subset of those of another. Figure 2.2 illustrates the pruning stages used to calculate a valid set of final disks. When a final set of disks has been found for consecutive time instances, the second part of the algorithm compares the disks in each set one by one to find those which have a minimum number μ of trajectories in common. When a new timestamp is explored, the new disks which match the requirements are joined with the previously stored candidates. The moment one of them reaches the duration δ, it is immediately reported. However, the number of disks in a given time instance can be quite large, and the cost of joining those disks into flock patterns can be quite expensive. BFE limits the number of candidates by storing just those with time duration δ. As a consequence, BFE reports flocks with a fixed time duration.
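The two candidate centers for a pair of points can be derived with elementary geometry: both lie on the perpendicular bisector of the segment joining the points, at distance √((ε/2)² − (d/2)²) from its midpoint, where d is the distance between the points. The following sketch (not the thesis implementation; class and method names are illustrative) computes them:

```java
import java.util.*;

public class DiskCenters {
    // Given two points at most eps apart, return the centers of the two
    // circles of radius eps/2 whose circumference passes through both points.
    static double[][] centers(double x1, double y1, double x2, double y2, double eps) {
        double mx = (x1 + x2) / 2.0, my = (y1 + y2) / 2.0;   // midpoint of the pair
        double dx = x2 - x1, dy = y2 - y1;
        double d = Math.sqrt(dx * dx + dy * dy);             // distance between the points
        if (d > eps || d == 0) return new double[0][];       // no valid candidate disks
        double h = Math.sqrt(eps * eps / 4.0 - d * d / 4.0); // offset along the bisector
        double ux = -dy / d, uy = dx / d;                    // unit vector perpendicular to the segment
        return new double[][] {
            { mx + h * ux, my + h * uy },
            { mx - h * ux, my - h * uy }
        };
    }

    public static void main(String[] args) {
        // Two points exactly eps apart: both candidate centers collapse onto the midpoint.
        double[][] c = centers(0, 0, 2, 0, 2.0);
        System.out.println(Arrays.toString(c[0]) + " " + Arrays.toString(c[1])); // [1.0, 0.0] [1.0, 0.0]
    }
}
```

When d equals ε the square root vanishes and the two candidates coincide; when d is smaller, the two disks approach the pair from opposite sides of the segment.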

2.3 Finding frequent patterns in traditional databases

Frequent patterns are itemsets, subsequences, or substructures that appear in a dataset with frequency no less than a user-specified threshold [37]. The issue of unveiling interesting patterns in databases under different contexts has been a recurrent research topic during the last 15 years. General data mining has become widely recognized as a critical field by companies of all types. As a part of data mining methods, the task of association rule learning has studied different frequent pattern mining algorithms to identify relevant trends in datasets in different disciplines [17, 60, 96]. One of the areas where the techniques of association rule learning and frequent pattern mining have most often been applied is in analysing data and market trends in the transactions of customers of large supermarkets and stores [1]. Usually this task has been called ‘the shopping basket problem’, even though the methods derived to solve it can be applied in different contexts [33]. In this chapter these techniques will be referred to as ‘Shopping Basket Algorithms’ to facilitate their explanation and reference. The shopping basket problem represents an attempt by a retailer to discover which items its customers frequently purchase together [79]. The goal is an understanding of the behaviour of a typical customer and the identification of valuable items and relationships among them. For this kind of problem the input is a database with information about the items purchased. When a customer pays for his or her products at the cashier, a record with the bought items is inserted into the database. In a general view, it is enough to capture just the transaction ID and the product ID (one record per item purchased). This is known as the {TID:itemset} schema. As the records in the database usually refer to transactions, these databases are called transactional databases. The goal of shopping basket analysis is to find sets of items (itemsets) that are “associated”; the fact of their association is often called an association rule [79].
For instance, if we know that a high percentage of customers buy milk and bread at the same time during their visits to a supermarket, this relationship represents an association rule. It can be used to formulate new marketing strategies, promotions, the introduction of new products, catalog design, cross-marketing or shelf space planning [33]. It is usual to locate associated items in different aisles, with high-profit or new products between them, to ensure they are exposed to more customers [79]. [24] discussed other case studies in commerce and marketing where different association rule methods are explored. In recent years many improvements and new techniques have been developed and proposed in order to enhance and take advantage of the benefits of association rule analysis.

2.3.1 Shopping basket analysis: an example

Given the small example illustrated by Figure 2.3, we can take as input a database of 4 transactions. Visually, it is easy to identify that Milk is present in 3 out of 4 transactions. It is also easy to see that Bread appears in every transaction where Milk is present. Therefore, we can report the pair Milk and Bread as a frequent pattern and, for example, infer an association rule such as:

Milk ⇒ Bread    [support: 0.75, confidence: 1]

where support and confidence are two measures of the rule's interestingness. A support of 0.75 means that the number of transactions involving Milk is equal to 75% (3 out of 4) of the total number of transactions in the database. A confidence of 1 means that Bread appears in all (100%) of the transactions where Milk appears. These two measures are used to assess the quality of the obtained rules, which in large databases can be numerous, and they correspond to the parameters minimum support and minimum confidence in most association rule algorithms.


Figure 2.3: Shopping Basket Analysis example (source: [33])

The process of retrieving a complete set of association rules from large databases can be divided in two parts. First, all possible itemsets which exceed the support threshold are found. This group is called the frequent itemsets and refers to the most frequent patterns in the database. The techniques used to discover the set of frequent itemsets are also called frequent pattern mining algorithms. Then, from the frequent itemsets, strong associations are generated among the members of each itemset. Depending on the size of an itemset, all possible combinations among its members are computed to obtain pairs of antecedent and consequent statements which define a rule. The confidence value is used at this stage to report just the most significant rules.
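The second part of the process can be sketched for a single frequent itemset: every split of the itemset into a non-empty antecedent and consequent is a candidate rule, and confidence(A ⇒ B) = support(A ∪ B) / support(A) filters the significant ones. The class name and toy data below are illustrative, not part of the thesis implementation:

```java
import java.util.*;

public class RuleGen {
    static double support(List<Set<String>> db, Set<String> itemset) {
        return (double) db.stream().filter(t -> t.containsAll(itemset)).count() / db.size();
    }

    // Split one frequent itemset into every antecedent => consequent pair and
    // keep only the rules whose confidence reaches minConf.
    static List<String> rules(List<Set<String>> db, List<String> itemset, double minConf) {
        List<String> out = new ArrayList<>();
        int n = itemset.size();
        for (int mask = 1; mask < (1 << n) - 1; mask++) {   // non-empty proper subsets
            Set<String> a = new TreeSet<>(), b = new TreeSet<>();
            for (int i = 0; i < n; i++) ((mask >> i & 1) == 1 ? a : b).add(itemset.get(i));
            double conf = support(db, new HashSet<>(itemset)) / support(db, a);
            if (conf >= minConf) out.add(a + " => " + b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("Milk", "Bread", "Eggs"),
            Set.of("Milk", "Bread"),
            Set.of("Bread", "Butter"),
            Set.of("Milk", "Bread", "Butter"));
        // Bread => Milk has confidence 0.75 and is filtered out at minConf = 0.9
        System.out.println(rules(db, List.of("Milk", "Bread"), 0.9)); // [[Milk] => [Bread]]
    }
}
```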

2.3.2 Maximal and Closed frequent patterns

Although the first generation of algorithms designed to mine association rules aimed to find the complete group of frequent itemsets, in large databases with low values for the minimum support threshold this number can be huge [33]. This is because if an itemset is frequent, each of its subsets is frequent as well. Long itemsets will contain a large number of shorter frequent subsets. For instance, consider a long itemset I = {a1, a2, ..., a100} with 100 items, usually called a 100-itemset (for its number of members). It will contain C(100,1) frequent 1-itemsets, C(100,2) frequent 2-itemsets, and so on, where C(n,k) denotes the binomial coefficient. The total number of frequent itemsets it would contain is:

C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30

This magnitude is obviously too large to handle, even for computer applications. To overcome this drawback the concepts of closed frequent pattern and maximal frequent pattern are used. A pattern α is a closed frequent pattern if α is frequent and there exists no other pattern, with the same support, that contains α. On the other hand, a pattern α is a maximal frequent pattern if α is frequent and there exists no other frequent pattern that contains α. For example:

α = {a1, a2, a3, a4 : 2}    (α is maximal)
β = {a1, a2, a3 : 4}    (β is closed but not maximal)

The set of maximal frequent patterns is important because it contains the longest patterns, from which any frequent pattern exceeding the minimum support can be generated. [33] provides a detailed theoretical definition. For clarification, these two concepts can be illustrated with an additional example. Suppose a database D contains 4 transactions:

D = {a1, a2, ..., a100; a1, a2, ..., a100; a20, a21, ..., a80; a40, a41, ..., a60}

Note that the first transaction appears twice. The minimum support is min_sup = 2. A complete search for all itemsets would generate a vast number of combinations. However, the closed frequent itemset approach will find only 3 itemsets:

C = {{a1, a2, ..., a100 : 2}; {a20, a21, ..., a80 : 3}; {a40, a41, ..., a60 : 4}}

The set of closed frequent itemsets contains complete information to generate the remaining frequent itemsets with their corresponding support. It is possible to derive, for example, {a50, a51 : 4} from {a40, a41, ..., a60 : 4} or {a90, a91, a92 : 2} from {a1, a2, ..., a100 : 2}. On the other hand, we obtain just one maximal frequent pattern, in this case:

M = {{a1, a2, ..., a100 : 2}}

From this result it is known that {a50, a51} and {a90, a91, a92} are frequent patterns, although it is not possible to assert their actual support counts.
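The same example can be checked mechanically on a scaled-down analogue of D (four items instead of one hundred, so brute-force enumeration is feasible). The sketch below is illustrative only; it finds the frequent itemsets by exhaustive search and then filters the closed and maximal ones by their definitions:

```java
import java.util.*;

public class ClosedMaximal {
    // support count of every itemset over the item universe that meets minSup
    static Map<Set<String>, Long> frequent(List<Set<String>> db, List<String> items, int minSup) {
        Map<Set<String>, Long> freq = new HashMap<>();
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<String> is = new HashSet<>();
            for (int i = 0; i < items.size(); i++)
                if ((mask >> i & 1) == 1) is.add(items.get(i));
            long sup = db.stream().filter(t -> t.containsAll(is)).count();
            if (sup >= minSup) freq.put(is, sup);
        }
        return freq;
    }

    // closed: frequent with no frequent proper superset of the same support
    static List<Set<String>> closed(Map<Set<String>, Long> freq) {
        List<Set<String>> out = new ArrayList<>();
        for (Map.Entry<Set<String>, Long> e : freq.entrySet()) {
            boolean dominated = freq.entrySet().stream().anyMatch(f ->
                f.getKey().size() > e.getKey().size()
                && f.getKey().containsAll(e.getKey())
                && f.getValue().equals(e.getValue()));
            if (!dominated) out.add(e.getKey());
        }
        return out;
    }

    // maximal: frequent with no frequent proper superset at all
    static List<Set<String>> maximal(Map<Set<String>, Long> freq) {
        List<Set<String>> out = new ArrayList<>();
        for (Map.Entry<Set<String>, Long> e : freq.entrySet()) {
            boolean hasSuper = freq.keySet().stream().anyMatch(k ->
                k.size() > e.getKey().size() && k.containsAll(e.getKey()));
            if (!hasSuper) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // scaled-down analogue of the database D in the text (first transaction twice)
        List<Set<String>> db = List.of(
            Set.of("a", "b", "c", "d"), Set.of("a", "b", "c", "d"),
            Set.of("b", "c"), Set.of("c"));
        Map<Set<String>, Long> freq = frequent(db, List.of("a", "b", "c", "d"), 2);
        System.out.println(freq.size() + " frequent, " + closed(freq).size()
            + " closed, " + maximal(freq).size() + " maximal");
        // 15 frequent, 3 closed, 1 maximal
    }
}
```

As in the text, the full frequent set (15 itemsets here) collapses to 3 closed itemsets ({a,b,c,d}:2, {b,c}:3, {c}:4) and a single maximal one ({a,b,c,d}:2).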

2.4 Proposed Framework

Current frequent pattern mining algorithms developed in the area of association rule learning have made tremendous progress, bringing efficient and scalable algorithms for discovering frequent itemsets in transactional databases which can be applied on numerous research frontiers. Therefore, the main aim of the remainder of this thesis is to explore a methodology which allows the identification of moving flock patterns using traditional and powerful algorithms for association rule mining. In order to accomplish this goal, a framework comprising 4 steps is proposed:

1. Obtain a final set of valid clusters in each timestamp.
2. Construct a transactional version of the trajectory dataset based on the disks visited by each trajectory.
3. Apply a frequent pattern mining algorithm to the generated database.
4. Perform postprocessing procedures to check consecutiveness, prune duplicates and report patterns.

Each of the steps of the proposed framework is explained in the remainder of this chapter.


2.4.1 Getting a final set of disks per timestamp

The first step of the framework is to identify a final set of clusters in each timestamp. Although the first step of the BFE algorithm is affected by the number of trajectories, the initial implementation showed acceptable response times in preliminary testing on large synthetically generated datasets (see Section 3.4). This fact promoted its use as the first step in the proposed framework. The main objective here is the generation of a final set of disks which cluster the trajectories into groups according to proximity. This step still uses the parameter ε to define the diameter of the disks and μ for pruning procedures to reduce the number of valid disks. For simplicity, the BFE algorithm and the proposed framework use a fixed disk shape (a circle with a predefined radius) and the Euclidean distance metric; however, different shapes and metrics could be used. Indeed, alternative spatial clustering techniques, such as DBSCAN or grid-based methods, which allow the identification of dense regions with a minimum number of trajectories, could be used at this stage. These issues are discussed further in Section 5.2.2.

2.4.2 From trajectories to transactions

In a general sense, spatio-temporal datasets are comprised of information about the location of an entity at a specific time. Each entry in the dataset reflects an observation of a point, which in turn describes a specific trajectory. To be able to analyse trends in the data, we assume that spatio-temporal datasets contain at least 4 fields: the trajectory ID to which a point belongs, the time when it was measured, and the X, Y coordinates of the location. In order to use frequent pattern mining algorithms, the input database should follow the {TID:itemset} schema (see Section 2.3). The ID of the trajectory can be used to identify its corresponding transaction, but it is necessary to define an item ID which collects information about the time and location of each point. A unique ID is tagged to each disk generated in the first step of the framework. In addition, information about which trajectories visited a disk in a particular time interval is stored in a separate table, so it is possible to obtain a transactional version of the trajectory by matching the time and location of a point with the ID of the corresponding disk. A specific disk will represent a particular region in space and time, and each trajectory can be translated according to the disks which it visits during its lifetime. This concept is illustrated in the following example. Figure 2.4 shows a dataset of 7 trajectories (Ti). From that, 5 disks can be identified throughout the dataset lifetime (ci). Table 2.1 is created from the disks which are visited by each trajectory at a specific timestamp (ti). If Table 2.1 is treated as a transactional database, it is possible to apply any frequent pattern mining algorithm to find the frequent patterns. For instance, let us set the minimum support count (min_sup) to the same value as the minimum number of trajectories μ. If we use μ = 3, the patterns {C1, C2, C4 : 3} and {C3, C5 : 3} should be found.
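The translation step can be sketched as follows. Assuming the first step of the framework has already assigned each point to a disk, the transactional version is simply a map from trajectory ID to the set of visited disk IDs. The class, record and data below are illustrative (they mirror Figure 2.4), not the actual implementation:

```java
import java.util.*;

public class ToTransactions {
    // each record: trajectory ID, timestamp, and the disk visited at that timestamp
    record Visit(String tid, int t, String disk) {}

    // build the {TID : itemset} transactional version of the dataset
    static Map<String, SortedSet<String>> transactions(List<Visit> visits, Set<String> allTids) {
        Map<String, SortedSet<String>> db = new TreeMap<>();
        for (String tid : allTids) db.put(tid, new TreeSet<>()); // isolated trajectories get an empty itemset
        for (Visit v : visits) db.get(v.tid()).add(v.disk());
        return db;
    }

    public static void main(String[] args) {
        // disk assignments mirroring Figure 2.4: T1-T3 visit C1,C2,C4; T4-T6 visit C3,C5
        List<Visit> visits = new ArrayList<>();
        for (String tid : List.of("T1", "T2", "T3")) {
            visits.add(new Visit(tid, 1, "C1"));
            visits.add(new Visit(tid, 2, "C2"));
            visits.add(new Visit(tid, 3, "C4"));
        }
        for (String tid : List.of("T4", "T5", "T6")) {
            visits.add(new Visit(tid, 1, "C3"));
            visits.add(new Visit(tid, 2, "C5"));
        }
        Set<String> tids = Set.of("T1", "T2", "T3", "T4", "T5", "T6", "T7");
        System.out.println(transactions(visits, tids)); // {T1=[C1, C2, C4], ..., T7=[]}
    }
}
```

Running this reproduces Table 2.1, including the empty itemset for the isolated trajectory T7.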
These patterns contain the information about the trajectory members and duration of the possible moving flock patterns. A complete set of all frequent patterns is not necessary; the set of maximal frequent patterns will retrieve the required information. The main advantage of using this approach is that the longest flock patterns are reported. The maximal or closed sets of frequent patterns avoid the need to set a parameter δ to limit the duration of the patterns. In the proposed framework, the parameter δ is only used to set the minimum duration allowed, but flocks of any duration will be reported. By contrast, BFE used δ to report flocks


Figure 2.4: A trajectory dataset example.

Table 2.1: Transactional version of the dataset from Figure 2.4.

TID   Disk IDs
T1    ⟨C1, C2, C4⟩
T2    ⟨C1, C2, C4⟩
T3    ⟨C1, C2, C4⟩
T4    ⟨C3, C5⟩
T5    ⟨C3, C5⟩
T6    ⟨C3, C5⟩
T7    ∅

with this specific time duration in order to minimize the number of intermediate flocks to be combined. As a result, the final number of flocks reported by the proposed framework is significantly smaller than the number of flocks reported by BFE. Although the patterns are considered valid output from the frequent pattern mining algorithms, they require additional checking before they can be reported as valid flocks.

2.4.3 Frequent Pattern Mining Algorithms

Since [1], many improvements and new methods have been proposed by the scientific community to find frequent patterns in an efficient and robust way. The most popular solutions involve the use of compact data structures which compress the original database, such as FP-Trees [36, 39] and Binary Decision Diagrams [80, 81, 61]. Their main principles have resulted in different implementations depending on the context, and they have also inspired additional variations in order to find representative types of patterns such as maximal and closed frequent itemsets. The Frequent Itemset Mining Implementations repository (FIMI) [26] is one of the most important initiatives to discuss and analyse the computation time and memory performance of the most relevant algorithms in this topic. In addition, it collects open source code and sample datasets from the original authors. [25, 27] give an introductory survey of the state-of-the-art methods and techniques as well as their performance with different types of datasets and parameters. Given the needs of the proposed framework, the technique which showed the best results with the preliminary datasets was the Linear time Closed itemset Miner (LCM) [81]. LCM demonstrated remarkable efficiency using extremely low support values in dense datasets, two characteristics present in mining moving flock patterns. LCM is a backtracking (or depth-first) algorithm based on recursive calls. The algorithm takes a frequent itemset P as input and generates new itemsets by adding an unused item to P. Then, for each new frequent itemset, it makes a recursive call with respect to P. The process ends when no new items can be added. A detailed description of the algorithm can be found in [80, 81].
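The backtracking idea can be illustrated with a heavily simplified sketch (it omits the occurrence-deliver technique and closure operations that make LCM efficient; names and data are illustrative). Each recursive call extends the current itemset with items that come later in a fixed order, pruning extensions that fall below the minimum support:

```java
import java.util.*;

public class DepthFirstMiner {
    static List<Set<String>> db;
    static int minSup;
    static List<String> items;                               // fixed item order avoids duplicate itemsets
    static List<List<String>> frequent = new ArrayList<>();

    static int support(List<String> itemset) {
        return (int) db.stream().filter(t -> t.containsAll(itemset)).count();
    }

    // extend p only with items that come after its last item in the fixed order
    static void mine(List<String> p, int start) {
        for (int i = start; i < items.size(); i++) {
            List<String> q = new ArrayList<>(p);
            q.add(items.get(i));
            if (support(q) >= minSup) {  // prune: supersets of an infrequent set are infrequent
                frequent.add(q);
                mine(q, i + 1);          // recursive call on the extended itemset
            }
        }
    }

    public static void main(String[] args) {
        db = List.of(Set.of("a", "b", "c"), Set.of("a", "b"), Set.of("a", "c"), Set.of("b", "c"));
        minSup = 2;
        items = List.of("a", "b", "c");
        mine(new ArrayList<>(), 0);
        System.out.println(frequent); // [[a], [a, b], [a, c], [b], [b, c], [c]]
    }
}
```

Note {a, b, c} (support 1) is never generated, because the branch below {a, b} is cut as soon as the extension fails the support test.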

2.4.4 Postprocessing Stage

As discussed above, information about the time and location of each point of the trajectories was encoded into unique disk IDs. Once the LCM algorithm retrieves the set of frequent patterns, it is necessary to decode this information and check the quality and validity of the flocks. It is possible that the members of a valid frequent pattern belong to disks in non-consecutive times, so it is necessary to check this requirement, in addition to the minimum duration (δ), before reporting it as a valid flock. As in the BFE algorithm, it is necessary to prune possible duplicate patterns. Because a fixed diameter is used to define the disks, it is inevitable that some disks overlap others. Points belonging to different disks in the same time interval lead to the generation of redundant patterns. An additional scan is needed in order to identify and remove repeated flocks. Alternatives to avoid this behaviour are discussed in Section 5.2.2.
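The consecutiveness check can be sketched as follows: once the timestamps of a pattern's disks are decoded, they are split into maximal runs of consecutive values and only runs of at least δ timestamps are kept (illustrative code, not the thesis implementation):

```java
import java.util.*;

public class Consecutiveness {
    // returns [start, end] timestamp pairs of consecutive runs with length >= delta
    static List<int[]> validSegments(SortedSet<Integer> times, int delta) {
        List<int[]> out = new ArrayList<>();
        Integer start = null, prev = null;
        for (int t : times) {
            if (prev == null || t != prev + 1) {  // a new run starts here
                if (prev != null && prev - start + 1 >= delta) out.add(new int[]{start, prev});
                start = t;
            }
            prev = t;
        }
        if (prev != null && prev - start + 1 >= delta) out.add(new int[]{start, prev});
        return out;
    }

    public static void main(String[] args) {
        // suppose the disks of one maximal pattern covered timestamps 1..4 and 7..8
        SortedSet<Integer> ts = new TreeSet<>(List.of(1, 2, 3, 4, 7, 8));
        for (int[] seg : validSegments(ts, 3))
            System.out.println(seg[0] + ".." + seg[1]); // only 1..4 survives (7..8 is shorter than delta)
    }
}
```

A single maximal pattern spanning two such runs would therefore yield two candidate flocks, each checked against δ independently.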

2.5 Flock Interpretation

Although a formal definition was stated previously in this document, different interpretations of a flock are possible depending on the application. Figure 2.5 illustrates a case where, according to the context and nature of the moving objects, diverse sets of patterns can be derived. The different interpretations are supported by the concepts of maximal and closed frequent patterns in the implementation of the proposed framework. Let us set μ = 3. If a maximal frequent pattern approach is implemented using a minimum support count equal to μ (min_sup = 3), a moving flock pattern with members {T1, T2, T3} from time t1 to t6 would be identified. This is the general scenario used in the tests to measure the performance of the framework. A second alternative uses the closed frequent pattern approach. In this case, four flock patterns could be identified, with different start times and numbers of members: {T1, T2, T3} from time t1 to t6, {T1, T2, T3, T4} from time t2 to t4, {T1, T2, T3, T5} from time t3 to t5 and {T1, T2, T3, T4, T5} from time t3 to t4. This brings more detail about the interaction among the moving objects, but it increases considerably the number of final flocks. However, it was useful during the validation stage because it generated a set of patterns similar to that generated by the BFE algorithm. Finally, based on the maximal frequent pattern approach, a third alternative is proposed, performing a further analysis over the additional trajectories. After identification of the


Figure 2.5: Example of a flock where different interpretations can apply.

core members of the flock (leaders), the additional points are treated as followers of the core trajectories. In this fashion, just one flock will be reported from the example, where {T1, T2, T3}, from t1 to t6, are the leader trajectories. T4, joining the flock from time t2 until t4, and T5, joining it from t3 to t5, will be tagged as the corresponding followers. This last interpretation is semantically more appropriate because it reflects the intrinsic attraction and repulsion forces present, especially, in social entities such as animals or pedestrians. For instance, it is able to represent how a person joins a crowd, interacts with its members for a moment and then leaves. However, this approach needs additional processing, and the format of the results requires a more suitable representation. This interpretation was implemented in the visualization of the patterns generated from the real datasets.
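Under this third interpretation, classifying members is straightforward once each trajectory's join and leave times within the flock are known: trajectories spanning the whole flock lifetime are leaders, the rest are followers. The sketch below (illustrative names; intervals taken from the Figure 2.5 discussion) applies this to the example:

```java
import java.util.*;

public class LeadersFollowers {
    // span: trajectory ID -> [join, leave] timestamps inside the flock
    static Map<String, List<String>> classify(Map<String, int[]> span, int start, int end) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("leaders", new ArrayList<>());
        out.put("followers", new ArrayList<>());
        for (Map.Entry<String, int[]> e : span.entrySet()) {
            // a leader is present for the flock's entire lifespan
            boolean full = e.getValue()[0] == start && e.getValue()[1] == end;
            out.get(full ? "leaders" : "followers").add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> span = new LinkedHashMap<>();
        span.put("T1", new int[]{1, 6});
        span.put("T2", new int[]{1, 6});
        span.put("T3", new int[]{1, 6});
        span.put("T4", new int[]{2, 4});
        span.put("T5", new int[]{3, 5});
        System.out.println(classify(span, 1, 6)); // {leaders=[T1, T2, T3], followers=[T4, T5]}
    }
}
```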


Chapter 3

Implementation

3.1 BFE Implementation

An implementation of the BFE algorithm was developed with two goals in mind. First, to understand the bottleneck processes during the execution of the method: the experimental results in [82] showed high response times when dealing with large datasets, but do not clarify which parts of the algorithm are most affected. Second, an available implementation of the BFE algorithm would be useful, as parts of the code could be re-used in the development of the proposed framework and for testing the results. Based on the pseudo-code published in [82], a version of the BFE algorithm was developed using several open source libraries and utilities. An initial attempt used the Java 1.6 programming language connected to the spatial functions provided by PostGIS [72]. Spatial queries were used to calculate the optimal location of the final set of disks in the first stage of the BFE algorithm. However, this approach showed low performance due to multiple read/write operations and indexing. Together with this, the difficult integration of SQL results with efficient spatial data structures (e.g. KD-Tree) was also a limitation. An alternative was an application written in 100% pure Java, which allows working with the data in main memory, avoiding multiple read/write operations. The JTS Topology Suite (JTS) [85] was used for this purpose. It is an API for processing linear geometry which provides a complete, simple and robust implementation of distance and topological functions on the 2-dimensional plane. JTS implements the geometry model defined in the Simple Features Specification for SQL by the OpenGIS Consortium [65]. The software is published under the GNU Lesser General Public License (LGPL). Although JTS supports almost all the spatial functions offered by PostGIS, it requires efficient data structures to manage attribute data. Fastutil [78] is a fast and compact library which extends the default Java Collections Framework. It provides type-specific maps, lists, sets and trees with a small memory footprint and fast access and insertion, minimizing the number of read/write operations. It was developed by the Laboratory for Web Algorithmics (LAW) at the University of Milan. The source code and API are released as free software under the Apache License 2.0. Additional data management (especially for storing the resulting patterns) and some query verification were performed using PostGIS and OpenJUMP GIS [86].


3.2 Synthetic Generators

Many different approaches have been proposed in order to model moving entities under different criteria and scenarios. [70, 73, 48, 13] represent alternative efforts to recreate the movements and dynamics of diverse real-world entities such as pedestrians, cars and even fishing ships. In this research, a group of synthetic datasets was created using a framework for generating moving objects, as described in [13, 14], to test the initial implementation of the BFE algorithm. An important characteristic provided by this generator is the possibility of making moving objects follow a given network. In addition to supplying the network, one can set distinct parameters, e.g. the number of objects, the number of intervals and the maximum speed. Each edge in the network and each trajectory is associated with a road category and a probability, permitting varying movement speeds and lifetime durations. The source code and sample networks are available on the project's website at [11].

3.3 Synthetic Datasets

[11] provides a set of examples and resources which can be used in the online demo or the downloadable version of the generator. To begin with, a relatively small dataset collecting the positions of 1000 random moving objects in the German city of Oldenburg was used to test the BFE implementation explained above. The network data (edge and node files) are available on the website. The simulated data collect the latitude and longitude of generated points during 140 time slices. The total number of stored locations is 57016 points. Figure 3.1 illustrates the network used for this dataset and Table 3.1 shows the output format of the generator. The Oldenburg dataset was useful to test the final implementation and results of the BFE algorithm, but it was too small to test the scalability of the method. Two additional synthetic datasets were created using the network of San Joaquin, also provided on the project's website. Figure 3.2 illustrates this network. The first dataset collects 992140 simulated locations for 25000 moving objects during 60 timestamps. The second one collects 50000 trajectories from 2014346 points during 55 timestamps. Table 3.2 summarizes the main information about the synthetic datasets used at this stage, together with a tag name which will be used in the remainder of the thesis.

3.4 Internal Comparison

Using the large datasets previously generated and the implementation of the BFE algorithm, a set of tests was performed to analyse the performance of the technique. The main idea of these tests was to identify bottlenecks and differences between the two internal phases of the algorithm. For each time interval in the dataset, the execution times for getting the final set of disks and for joining possible flocks were recorded separately. At the end of each test, the individual times for each interval were summed up. Figure 3.3 shows the performance of the BFE algorithm on the SJ25KT60 dataset with the ε value ranging from 50 to 300 metres. The values for the minimum number of trajectories (μ) and minimum time duration (δ) were set to 5 trajectories and 3 consecutive timestamps respectively. A similar test was performed using the SJ50KT55 dataset with different values for ε; here the parameters μ and δ were set to 9 trajectories and 3 consecutive timestamps respectively. The time performance for this case can be seen in Figure 3.4.


[Figure: line plot “Change in ε [SJ25KT60−P992140M5D3]”, processing time (s) versus ε (m), for the series “Getting Flocks” and “Getting Disks”.]

Figure 3.3: Comparison of internal execution time for the SJ25KT60 dataset.

[Figure: line plot “Change in ε [SJ50KT55−P2014346M9D3]”, processing time (s) versus ε (m), for the series “Getting Flocks” and “Getting Disks”.]

Figure 3.4: Comparison of internal execution time for the SJ50KT55 dataset.


Table 3.3: Number of combinations required for specific time intervals in the SJ50KT55 dataset.

Time interval | Disks generated | Previous flocks and disks | Combinations needed | Time for disks (s) | Time for flocks (s)
10 | 2112 | 3469 | 7326528 | 4.3 | 15.9
11 | 2070 | 3331 | 6895170 | 6.3 | 16.4
12 | 2121 | 3414 | 7241094 | 4.2 | 16.4
13 | 2031 | 3283 | 6667773 | 4.0 | 15.6
14 | 1918 | 3094 | 5934292 | 5.0 | 14.2
15 | 1950 | 2929 | 5711550 | 4.2 | 13.5

As shown in Figures 3.3 and 3.4, increasing the radius of the disks affects both stages of the algorithm. However, it is clear that after a critical point (around 150 metres in SJ25KT60 and 200 metres in SJ50KT55) the most affected step is the combination and checking of possible flocks. While for low magnitudes of ε joining possible flocks is slightly faster than getting a final set of disks, for larger ε values the latter step is much faster than the former. This can be explained by the number of combinations required in the second part of the algorithm. As the radius of the disks increases, each disk encloses more trajectories. As a result, the number of disks which exceed the minimum number of trajectories rises considerably. The disks in each time interval have to be compared one by one with the disks generated in the next time interval plus the set of candidate disks identified up to that moment. If the sizes of those sets are large enough, the cost of combining all their elements grows rapidly. Table 3.3 illustrates the problem. It shows a segment of the SJ50KT55 dataset between time intervals 10 and 15 with an ε value of 300 metres. In this instance, around 2000 new disks are generated at each timestamp. As the number of stored disks is also large (approximately 3000), the number of combinations is significantly high. On average, it takes more than three times longer to analyse such a large number of combinations than to generate the set of final disks for this dataset.

3.5 Framework Implementation

A functional prototype of the proposed framework was implemented in Java 1.6. To build the proposed framework, it was decided to keep the first part of the BFE algorithm but to address the combinatorial problem using a frequent pattern mining approach. A systematic diagram of the proposed framework is shown in Figure 3.5, and its pseudo-code is presented in Algorithm 3.1 at the end of this section. The framework takes as input a plain text file with the same format as that generated by the synthetic generator (see Table 3.1). The initial step of the framework implementation re-uses the procedure of the BFE implementation that calculates the final set of disks for each timestamp (line 2 in Algorithm 3.1). At this stage, an efficient data structure was introduced to associate the point locations of each trajectory with their respective disks in order to generate a transactional version of the dataset (lines 3 to 9 in Algorithm 3.1). It is expected that, from a disk ID (ci.id),


Figure 3.5: Systematic diagram for the proposed framework.

the values for the points contained by it (ci.points) and its time interval (ci.time) can be retrieved. As only those point locations which lie inside a valid disk are associated, trajectories beyond a threshold distance (ε) from all others are pruned at this stage. Consequently, in most cases the translation from trajectories to transactions results in a considerable reduction in the number of valid trajectories. However, it also introduces limitations: as BFE uses a fixed distance to cluster the trajectories, it is inevitable that some disks overlap others. As a consequence, the same point location can be associated with more than one disk. Figure 3.6 illustrates a snapshot of the Oldenburg dataset using ε = 200 (metres) and μ = 3 (trajectories). Trajectories such as T2 and T3 can easily be associated with a unique disk. On the other hand, T4 is contained by two and T1 by three different disks. While isolated locations such as T5 will not appear in the transactional version, T4 and T1 will increase the number of transactions they belong to. However, this does not appear to affect the final size of the transactional version, which turns out to be considerably smaller than the original dataset. When the transactional version of the dataset (D) is complete, it is passed, together with the minimum support threshold (min_sup), as a parameter to the LCM algorithm (line 12 in Algorithm 3.1). LCM is an independent program, written in the C programming language, available for download at [26]. Two variants of the program are available: LCM_max and LCM_closed retrieve the maximal or the closed set of frequent patterns, depending on the case. The output M (line 12 in Algorithm 3.1) will be a plain text file where each line is


Figure 3.6: Overlapping problem during the generation of final disks.

a maximal pattern, containing a set of disk IDs separated by spaces. Core trajectory membership and time consecutiveness are checked in the post-processing stage. Lines 14 to 18 in Algorithm 3.1 declare initial values to iterate through the maximal patterns. Afterwards, the time interval and trajectory members are retrieved for each disk contained in the pattern (lines 20 and 21 in Algorithm 3.1). The start and end of each flock pattern are set after checking time consecutiveness. The set of trajectories common to all the disks in a maximal pattern is considered the leader trajectories (line 23 in Algorithm 3.1). In many cases, each frequent pattern can be associated with a unique flock pattern. However, it is possible that long frequent patterns contain disks from non-consecutive time intervals. In that case, several flock patterns are reported from the same maximal pattern, provided each segment is longer than the minimum time duration (δ) (lines 26 to 29 and 32 to 34 in Algorithm 3.1). As in the BFE algorithm, the overlapping problem requires the pruning of duplicate and redundant patterns. The set of suitable flock patterns is stored in a tree structure. In this way, patterns with the same members (and the same start and end timestamps) are easily detected and excluded (additional validation in lines 26 and 32 in Algorithm 3.1). Redundant patterns occur when two patterns share exactly the same members but the time duration of one is contained within that of the longer one. Using the same data structure, this kind of pattern can also be detected, keeping just the longest one. Once the postprocessing stage finishes, the final flock patterns are saved to a file. The last phase of the framework covers the visualization of the resulting flock patterns. The key information about a specific flock pattern is its start and end timestamps and the trajectory IDs of its members.
From this information, the locations (latitude and longitude) of the members along the flock's lifetime can be queried from the original dataset. However, in large spatio-temporal datasets this could be costly. Instead, the implementation stores a line representation of the flock, together with its key information, when the flock passes the postprocessing stage. Two representations were used, depending on the context and application: firstly, a line generated from the centroids of the trajectory


members at each time interval, and secondly, the longest trajectory belonging to any of the members of the flock. The flock representation follows the Simple Features Specification for SQL published by the Open GIS Consortium, so it can be visualized in several vector-based GIS packages such as OpenJUMP or Google Earth. While OpenJUMP was useful to display the spatial extent of a flock, it was difficult to represent changes in time. Recent updates of the KML specification [91] introduce additional elements (<TimeStamp> and <TimeSpan>) for the description of spatio-temporal data. They allow the animation of vector georeferenced features, such as trajectories, in Google Earth. For simplicity, many of the final visualizations were performed separately from the main code. Python 3.1 was used to create KML files representing the final flocks reported by the postprocessing stage (see Chapter 4). The main source code of the implementation is shown in Appendix A.
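As a rough illustration of this visualization step, the snippet below emits a KML Placemark carrying a <TimeSpan> element for one flock. The class name, coordinates and dates are made up for the example, and a real exporter would also write the surrounding KML document envelope:

```java
// Minimal sketch (hypothetical names and values): building a KML Placemark
// with a <TimeSpan> so a flock's line representation can be animated in
// Google Earth. KML coordinates are "lon,lat" pairs separated by spaces.
public class FlockKml {
    static String placemark(String name, String begin, String end, String coords) {
        return "<Placemark>\n"
             + "  <name>" + name + "</name>\n"
             + "  <TimeSpan><begin>" + begin + "</begin><end>" + end + "</end></TimeSpan>\n"
             + "  <LineString><coordinates>" + coords + "</coordinates></LineString>\n"
             + "</Placemark>";
    }

    public static void main(String[] args) {
        System.out.println(placemark("flock-17", "2006-05-04", "2006-05-21",
                "-60.1,-65.2 -59.8,-65.0"));
    }
}
```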

Algorithm 3.1 Computing flocks using a frequent pattern mining algorithm
Input: parameters μ, ε and δ; set of points T
Output: flock patterns F
 1: for each new time instance ti ∈ T do
 2:   C ← call Index.Disks(T[ti], ε)              // call Algorithm 1 in Figure 2.1
 3:   for each ci ∈ C do
 4:     P ← ci.points                              // points enclosed by ci
 5:     for each pi ∈ P do
 6:       ci.time ← ti
 7:       D[pi] ← add ci.id
 8:     end for
 9:   end for
10: end for
11: min_sup ← μ
12: M ← call LCM_max(D, min_sup)                   // call LCM Algorithm [81]
13: for each max_pattern ∈ M do
14:   id0 ← max_pattern[0]
15:   c0 ← C[id0]
16:   u ← c0.points
17:   u.tstart ← c0.time
18:   n ← max_pattern.size                         // number of items in max_pattern
19:   for i = 1 to n do
20:     idi ← max_pattern[i]
21:     ci ← C[idi]
22:     if ci.time = ci−1.time + 1 then            // are disks consecutive?
23:       u ← u ∩ ci.points
24:       u.tend ← ci.time
25:     else
26:       if u.tend − u.tstart ≥ δ and u ∉ F then
27:         F ← add u
28:         u.tstart ← ci.time
29:       end if
30:     end if
31:   end for
32:   if u.tend − u.tstart ≥ δ and u ∉ F then
33:     F ← add u
34:   end if
35: end for
36: return F
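The transactional translation of lines 3 to 9 in Algorithm 3.1 can be sketched in Java as follows. The `Transactions` class, the use of plain integer IDs and the example disks are illustrative only, not the thesis implementation:

```java
import java.util.*;

// Hypothetical sketch of building the transactional dataset D: each disk
// maps to the trajectory IDs it encloses, and the translation inverts this
// so that each trajectory maps to the list of disk IDs containing it.
public class Transactions {
    static Map<Integer, List<Integer>> build(Map<Integer, int[]> disks) {
        Map<Integer, List<Integer>> d = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : disks.entrySet()) {
            for (int traj : e.getValue()) {
                // D[pi] <- add ci.id, as in lines 5-8 of Algorithm 3.1
                d.computeIfAbsent(traj, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return d;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> disks = new LinkedHashMap<>();
        disks.put(1, new int[]{2, 3, 4});  // disk 1 encloses T2, T3, T4
        disks.put(2, new int[]{1, 4});     // an overlapping disk also holds T4
        Map<Integer, List<Integer>> d = build(disks);
        System.out.println(d.get(4));      // T4 belongs to two disks: [1, 2]
    }
}
```

A trajectory that falls in no disk (such as T5 in Figure 3.6) simply never appears in the map, which is how distant trajectories are pruned.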

25






Figure 3.7: Performance of the BFE algorithm and the proposed framework for different values of ε in the SJ25KT60 dataset. The additional parameters were set as μ = 5 and δ = 3.

3.6

Computational Experiments

Using the prototype implementation of the framework and the BFE algorithm, a set of computational experiments was performed in order to evaluate the quality of the generated patterns and the execution performance of the proposed approach. The SJ25KT60 and SJ50KT55 datasets were evaluated using different parameter values. Although a direct comparison between the two methods is not completely fair because of the different characteristics of their output, it is useful for measuring whether the proposal is feasible and scalable. The results were produced on an AMD Athlon 64 X2 dual-core processor with 3 gigabytes of RAM and a 120 GB 7200 RPM hard disk, running Ubuntu Linux (kernel 2.6.32). In all cases, the experiments ran on a Java virtual machine configured with 2048 megabytes of memory. For the two datasets the diameter of the flock (ε) was varied from 50 to 300 metres. Figures 3.7 and 3.8 show the final results.

3.7

Validation

As mentioned in Section 2.4.2, the proposed framework, unlike BFE, is able to identify the longest flock patterns. In addition, depending on the definition of a flock, results can be reported in several ways. This makes it difficult to compare the output of the two methods directly; a long pattern reported by the proposed framework may be represented by several flocks from the BFE algorithm, since the latter reports flocks with a fixed time duration.



Figure 3.8: Performance of the BFE algorithm and the proposed framework for different values of ε in the SJ50KT55 dataset. The additional parameters were set as μ = 9 and δ = 3.

Tables 3.4 and 3.5 show the number of flocks reported by each technique before and after removing duplicate and redundant patterns. The strategy designed to test the validity of the flocks uses a script programmed in Java to check that the set of patterns generated by BFE is contained in the set generated by the proposed framework. After a set of tests using the outputs from Section 3.6, it was verified that every pattern in the BFE results is contained in some pattern of the proposed framework's results. Visual examination also shows that there is no significant difference between the results of the two methods. Figure 3.9 shows the results for the Oldenburg dataset. The parameters used in the representation were ε = 100 (metres), μ = 3 and δ = 3.

Table 3.4: Number of flocks generated before and after the postprocessing phase for BFE and the proposed framework in the SJ25KT60 dataset.

         BFE                  Proposed Framework
ε (m)    Original   Pruned    Original   Pruned
50       86         84        27         26
100      905        773       221        194
150      2900       2429      636        547
200      7853       5737      1604       1316
250      18320      10955     3215       2482
300      35796      18656     5904       4291
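The containment check performed by the validation script can be sketched as follows. The `Flock` type and the example values are illustrative, not the actual thesis code; a BFE flock counts as covered if some framework flock contains all of its members and spans its whole time interval:

```java
import java.util.*;

// Hypothetical sketch of the validation check: does any framework flock
// contain a given BFE flock, both in membership and in time extent?
public class Validation {
    static class Flock {
        final Set<Integer> members;
        final int tstart, tend;
        Flock(Set<Integer> m, int s, int e) { members = m; tstart = s; tend = e; }
    }

    static boolean covered(Flock bfe, List<Flock> framework) {
        for (Flock f : framework) {
            if (f.members.containsAll(bfe.members)
                    && f.tstart <= bfe.tstart && bfe.tend <= f.tend) {
                return true;  // membership and time interval both contained
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Flock lng = new Flock(new HashSet<>(Arrays.asList(1, 2, 3)), 0, 10);
        Flock bfe = new Flock(new HashSet<>(Arrays.asList(1, 2)), 3, 6);
        System.out.println(covered(bfe, Collections.singletonList(lng))); // true
    }
}
```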

Chapter 4

Case Studies

Besides the synthetic datasets, the proposed framework was evaluated with trajectories collected from real-world scenarios. Although synthetic datasets are a good approximation to reality, real datasets provide genuine characteristics and information about trajectory data. However, for technical reasons, it is very complicated to track large numbers of moving entities in real life. Limitations in equipment, access or privacy concerns are some of the factors which constrain the sources of data. After some preliminary evaluation, two real datasets from different contexts were selected to test the proposed framework. The first dataset tracks iceberg movement in Antarctica, observed with a variety of satellite sensors since 1978. The second collects movement information from a group of people around the metropolitan area of Beijing, China.

4.1

Tracking Icebergs in Antarctica

Antarctic icebergs are formed by the separation of massive sections of ice from ice shelves and glaciers. Several researchers have studied and monitored iceberg movement in Antarctica during the past three decades, using diverse technologies and for diverse purposes [6, 77, 57, 84]. The National Ice Centre (NIC) and the Brigham Young University Microwave Earth Remote Sensing Laboratory (BYU) have used a variety of satellite sensors to manually track large Antarctic icebergs and collect their positions. [6] presented a long-term analysis of Antarctic iceberg activity based on scatterometer and radiometer data. They claim that although the increase in the number of icebergs reported could be explained by advances in tracking technologies, recent calving events (icebergs or glaciers splitting into smaller masses of ice) may represent a natural variability in iceberg activity. NIC and BYU have produced an Antarctic iceberg tracking database which includes icebergs identified during 1978 and over the 1992 to 2009 period. On average, each iceberg is reported every 1 to 5 days using five different satellite instruments. The high temporal resolution of the dataset gives valuable information about the ocean currents in the study area. It is used by mariners to obtain more accurate positional information for operating in the Antarctic region. The iceberg database gathers the latitude, longitude, date and identification of 217 icebergs, with more than 15100 point locations during the study period. In addition to the


Figure 4.1: Reported positions for all icebergs in the Iceberg dataset (1978, 1992-2009).

basic information, the dataset also includes the iceberg's size and the instrument used in its tracking. Figure 4.1 illustrates the study area and the reported positions for all the icebergs in this dataset.

4.1.1

Implications and possible applications

Most iceberg movement in Antarctica is influenced by the speed and direction of winds and ocean currents in the Southern Ocean. The Southern Ocean comprises the southernmost waters of the World Ocean, generally taken to be south of 50◦ S latitude and encircling Antarctica. The Southern Ocean includes the Antarctic Circumpolar Current (ACC), which circulates around Antarctica from west to east, and the Antarctic Coastal Current, also called the East Wind Drift (EWD), which flows anti-clockwise, driven by polar winds blowing from the east [93] (Figure 4.2). The ACC and EWD are today the largest ocean currents and the major means of exchange of water between the basins of the Pacific, Atlantic and Indian oceans. It is a well-established fact that oceans play a pivotal role in global warming [4, 66]. The ACC is vital in this respect, as it picks up and cools water descending from warmer latitudes. In this sense, the Antarctic ice pack doubtless plays a key role, not just in varying the heat exchange between ocean and atmosphere, but also in reflecting motion characteristics of the currents such as direction and speed. Following changes in the currents, through the monitoring of groups of icebergs, is therefore highly relevant to understanding the behaviour and impacts of the currents on global weather patterns. Another important ecological aspect of monitoring icebergs is associated with fishery production. Antarctic krill represents a multimillion industry, with more than 100000


Figure 4.2: The circumpolar and coastal currents (West and East wind drifts) around the Antarctic continent (source: [93]).

tonnes being caught each year [64]. At the same time, the biological importance of krill in the Antarctic ecosystem has also raised increasing concern about its conservation and monitoring. However, the size of this species limits its tracking and study. Early studies noted that the overall distribution of krill matched the distribution of sea ice and ocean currents [5, 40, 62]. Figure 4.3 shows the spatial distribution of krill around Antarctica, which also coincides with the patterns found in this research (see Figure 4.5). Discovering frequent moving patterns in the icebergs could support the study of the distribution of krill and other species. Given these implications, it is clear that moving flock patterns can make interesting contributions to the study of regional and global climate, as well as of underwater biodiversity, in Antarctica.

4.1.2

Data cleaning and preparation

Some characteristics of the iceberg movement data required special treatment. Although the temporal resolution of most of the data was high, several trajectories presented jumps in time or overlapped Antarctic inland areas. It was decided to apply linear interpolation on a daily basis after removing the inland points and trajectories with fewer than 3 recorded points. The cleaned dataset contains 210876 locations forming 198 trajectories. However, because icebergs were tracked for long time periods, the associated trajectories often cover


Figure 4.3: Spatial location of Antarctic krill catches (dotted and lined regions). Black areas illustrate ice shelves and fast ice during summer (source: [63]).

several years. With the applied interpolation, the average size of each trajectory climbed to more than 1200 timestamps. To reduce the dimensionality of the dataset, the analysis focused on iceberg trajectories from 2006, since they presented the largest number of records. Table 4.1 shows the details of the final dataset.

Table 4.1: Iceberg trajectories during 2006 in Antarctica.

Dataset      Study Area   Number of      Number of   Time Intervals
                          Trajectories   Points      (Avg)
Icebergs06   Antarctica   49             16131       329

4.1.3

Computational experiments

With the selected dataset, a set of tests was performed using both the BFE algorithm and the proposed framework. This time, the ε parameter was varied on the order of kilometres. The values ranged from 100 Km to 800 Km due to the characteristics of icebergs as moving objects and the nature and extent of the study area. The other parameters remained constant at μ = 3 and δ = 3. The results of the experiments are shown in Figure 4.4. Table 4.2 summarizes the number of reported flocks before and after pruning for both methods.



Figure 4.4: Comparison between the BFE algorithm and the proposed framework performance for different values of ε in the Icebergs06 dataset.

Table 4.2: Number of flocks generated before and after postprocessing in Icebergs06 dataset.

          BFE                  Proposed Framework
ε (Km)    Original   Pruned    Original   Pruned
100       3388       2557      108        94
200       7523       4947      485        356
300       6655       3163      612        524
400       12406      4463      897        585
500       15566      5022      1355       700
600       15473      5925      645        516
700       20429      7067      715        529
800       37835      9922      1112       918

Table 4.3: Description of the discovered flock patterns in Icebergs06 dataset. The first column corresponds to tags in Figures 4.5 and 4.6.

Tag   Members              Range in Time      Length (Km)   Followers
A1    [A53B,A43F,A53A]     Jan-01 to Jul-23   455           A54: Feb-11 to Jun-30
A2    [C08,B15D,D19]       May-26 to Dec-31   1552          D18: Jun-02 to Jul-16
A3    [C08,B15D,D19]       May-04 to May-21   462
A4    [B15B,B15L,B15R]     Jun-28 to Oct-06   843
B1    [B15M,B15Q,B15A]     Feb-16 to Sep-29   632           B15I: Apr-16 to Sep-19; B15N: Feb-16 to Sep-20;
                                                            B15P: Feb-16 to Aug-23; B15K: Apr-20 to Sep-20
B2    [B15M,B15K,B15A]     May-06 to Nov-26   800           B15N: May-08 to Sep-20; B15P: May-06 to Aug-20;
                                                            B15Q: May-07 to Sep-10
B3    [B16,B15I,B17A]      Jan-01 to Dec-31   50            B15N: Apr-15 to Sep-19; B15P: Apr-15 to Aug-26;
                                                            B15M: Apr-16 to Dec-03; B15Q: Apr-11 to Dec-31;
                                                            B15A: Apr-16 to Dec-03
B4    [B15N,B15M,B15A]     Jan-01 to Sep-16   904           B15I: Apr-15 to Sep-16; B15P: Jan-01 to Aug-26;
                                                            B15K: Apr-20 to Sep-16; B15Q: Feb-16 to Sep-10

4.1.4

Results

A set of flocks corresponding to the results using ε = 200 (Km) was selected for a preliminary visualization. Despite postprocessing, a high number of patterns, with much similarity among them, remained in the results. For this reason, additional filters were used. These filters pruned flocks with a spatial length of less than 20 Km (10% of ε) and applied the alternative definition of a flock discussed in Section 2.5. This strategy merges those patterns which share one or more trajectories into leaders and followers. The final results are shown in Figures 4.5 and 4.6. Figure 4.5 shows a general overview of the discovered patterns. There is a major concentration of flocks in the south-west. A close-up of that region can be seen in Figure 4.8. Table 4.3 describes the characteristics of the discovered patterns in more detail.
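One reading of this leader/follower merge can be sketched as follows. The code is illustrative only (the class name, the use of plain integer trajectory IDs and the merging criterion are assumptions about Section 2.5, not the thesis implementation): flocks sharing trajectories are merged, trajectories present in every merged flock become leaders and the rest become followers.

```java
import java.util.*;

// Hypothetical sketch: given a group of overlapping flocks (each a set of
// trajectory IDs), the leaders are the trajectories common to all of them
// and the followers are the remaining trajectories.
public class LeaderFollower {
    static List<Set<Integer>> merge(List<Set<Integer>> flocks) {
        Set<Integer> leaders = new HashSet<>(flocks.get(0));
        Set<Integer> all = new HashSet<>();
        for (Set<Integer> f : flocks) {
            leaders.retainAll(f);  // keep only trajectories present in every flock
            all.addAll(f);
        }
        Set<Integer> followers = new HashSet<>(all);
        followers.removeAll(leaders);
        return Arrays.asList(leaders, followers);
    }

    public static void main(String[] args) {
        List<Set<Integer>> flocks = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(1, 2, 4)));
        List<Set<Integer>> r = merge(flocks);
        System.out.println(r.get(0) + " " + r.get(1)); // leaders, then followers
    }
}
```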

4.1.5

Findings in iceberg tracking

Icebergs in the Southern Ocean are in continual motion, pushed by winds and the aforementioned currents. Results from the experiments show that groups of icebergs closely follow the EWD (see Figure 4.5). As [32] states, sea ice movement is strongly related to the seasons. It can be noted that icebergs share similar behaviour. The flock results (Table 4.3) reach a minimum between January and mid February (Figure 4.7). Then, ice advances most rapidly in April and May, reaching a maximum from early June to around mid August (Figure 4.8). Finally, the number of flocks decreases rapidly at the end of


Figure 4.5: General view of the discovered patterns in Icebergs06 Dataset. Arrows indicate the direction of the flocks.

Figure 4.6: Detail of discovered flocks in Icebergs06 Dataset. Arrows indicate the direction of the flocks.


Figure 4.7: General view of the discovered patterns from January 01 to February 15.

Figure 4.8: General view of the discovered patterns from June 03 to August 17.


September and November. These findings are consistent with observations made by [19] (cited in [32]) about sea ice movement.

4.2

Pedestrian movement in Beijing

This case study is based on Beijing, using a GPS trajectory dataset provided by Microsoft Research Asia. The dataset was collected during the GeoLife project [59] by 165 anonymous users over a period of two years, from April 2007 to August 2009. Locations were recorded by different GPS loggers or smartphones, and most of them present a high sampling rate: 95% of the tracks were logged every 2 to 5 seconds or every 5 to 10 metres per point. Although some locations in the dataset are distributed over more than 30 cities in China, and even in America and Europe, the majority of the data was created in Beijing. The dataset records latitude, longitude, altitude, date and time, capturing a broad range of user movements: not just routines such as commuting to and from work, but also entertainment and sports activities. It is important to mention that users could use any kind of transportation during the tracking, so the trajectories may refer to movement by foot, car or public transport. Previous studies have explored interesting locations, travel sequences and mobility patterns in this collection [98, 99]. The dataset and further information are freely available at the project's website [59].

4.2.1

Implications and possible applications

Applications of moving patterns to urban movement have been widely discussed in recent studies [97, 99, 52, 83]. They can bring to light hidden and relevant information about trends and interactions among people. By mining people's movement history, it is possible to measure the similarity between users and make personalized recommendations for an individual. Furthermore, such patterns can be used to detect mobility problems and support decision makers in fields such as urban planning and public transport. Prediction and forecasting [49, 76, 87] are interesting topics which can be supported by this kind of knowledge. Indeed, like itemsets in association rule learning, moving flock patterns are a first step towards the discovery of trends in movement. Finding correlations between the places that people visit will be relevant and useful in fields such as location-based services and ubiquitous computing.

4.2.2

Data cleaning and preparation

The dataset groups the trajectories of each user in a separate folder. Each folder contains one or more GPS log files and, in turn, each GPS log file can store one or more trajectories. From each trajectory, the timestamp and location were extracted and coupled to a sequential identifier. After merging all the log files, a total of 15450 trajectories were obtained. However, this collection presented a sparse distribution in space and time. The region around the 5th Ring Road in the metropolitan area of Beijing, and the time period between January and April of 2009, showed the major concentration of trajectories. These two constraints were used to generate a sample dataset. The result was interpolated every minute to obtain a final dataset which contains 2562 trajectories


Figure 4.9: Distribution of points in the study area. Left shows the sparse distribution around China. Right focuses on the 5th Ring Road area in Beijing (source: [98]).

Table 4.4: GPS log trajectories in Beijing.

Dataset   Study Area       Number of      Number of   Time Intervals
                           Trajectories   Points      (Avg)
Beijing   Beijing, China   2562           264737      103

for 264737 location points. Figure 4.9 and Table 4.4 summarize the key information of this dataset.
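The fixed-rate linear interpolation used during data preparation can be sketched as follows. The class name, coordinates and times are illustrative; the idea is simply that, between two consecutive GPS fixes, a position is generated at every whole minute by linear interpolation:

```java
// Hypothetical sketch of per-minute linear interpolation between two GPS
// fixes. Times are in minutes; positions are (lat, lon) pairs.
public class Interpolate {
    static double[] positionAt(double t, double t0, double lat0, double lon0,
                               double t1, double lat1, double lon1) {
        double f = (t - t0) / (t1 - t0);  // fraction of the gap elapsed at time t
        return new double[]{lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)};
    }

    public static void main(String[] args) {
        // fixes at t = 0 and t = 4 minutes; interpolated position at t = 1
        double[] p = positionAt(1, 0, 39.90, 116.30, 4, 39.94, 116.38);
        System.out.printf("%.2f %.2f%n", p[0], p[1]); // prints 39.91 116.32
    }
}
```

Note that, as discussed in Section 4.2.5, interpolating across long tracking gaps can introduce spurious straight-line paths, so the interpolation interval has to be chosen with care.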

4.2.3

Computational experiments

Tests similar to the previous ones in this chapter were performed with the selected trajectories. The value of ε was varied from 50 to 300 metres. These values are appropriate for this kind of data, where pedestrians generally follow constrained networks. The values for the other parameters remained constant at μ = 3 and δ = 3. Figure 4.10 shows the comparison of the two methods and Table 4.5 summarizes the number of flocks obtained in each test.

Table 4.5: Number of flocks generated before and after postprocessing in Beijing dataset.

         BFE                  Proposed Framework
ε (m)    Original   Pruned    Original   Pruned
50       2486       2483      259        210
100      2954       2946      230        184
150      3182       3168      212        173
200      3285       3285      188        155
250      3373       3373      181        147
300      3473       3473      160        133


Figure 4.10: Comparison of both methods with different values of ε in the Beijing dataset.

4.2.4

Results

Filters similar to those used in the previous case study were also applied to this dataset, especially to reduce similarity among the results. Flocks corresponding to the results for ε = 100 (metres) were selected for visualization. In general, users travel between the city centre and suburbs to the south, north and east (Figure 4.11). The major concentration of patterns focuses on the north-west area between the 4th and 5th Ring Roads. Around this area, several universities and IT businesses are located (Figure 4.12). This area is referred to as TSP (Tsinghua Science Park) in this discussion. Table 4.6 describes a subset of the results with flock patterns longer than 5 Km and lasting 20 or more minutes.

4.2.5

Findings in pedestrian movement

The results were almost equally distributed between workdays (51%) and weekends (49%). Of the flocks occurring on workdays, 83.5% were shorter than 5 Km and happened between 9 AM and 7 PM. The vast majority of them were concentrated in the TSP region. It is plausible that many of the users carrying the GPS loggers were academics or researchers whose workplace is located there (see Figure 4.13). Focusing on long-distance flocks, patterns longer than 5 Km represented 20% of the data. Most of them (74%) happened on Fridays and at the weekend. This could explain the diversity in their times of occurrence. Longer patterns connected areas in the north and south of the city with the TSP region, and the city centre with the east, at different hours of the day, especially late evening. It is interesting that many of the longest flocks follow well-defined expressways and


Figure 4.11: General view of the discovered flocks in the Beijing Dataset.

Figure 4.12: Close-up of the region which concentrates the largest number of flocks. Some universities and IT institutions are highlighted.


Figure 4.13: Patterns shorter than 5 Km during workdays. The circle encloses the major concentration around the TSP region. Arrows highlight other locations.

roads. Going north, the routes of the patterns coincide with the Badaling Expressway. Many eastbound patterns follow the Jingha Expressway, which also coincides with Line 1 of the Beijing Subway. Heading south, people tend to follow the West 5th or 4th Ring Roads. For this route, it can be observed that patterns connect the roads in different ways, which can be indicative of traffic jams or changes in mobility conditions (Figure 4.14). One interesting set of patterns showed a repetitive and atypical routine during the week between April 10th and 14th, which coincided with Easter week, when a group of 4 people moved from TSP to an unknown destination in the south and returned at the same time each day during that period. Unfortunately, without additional information about the users it is impossible to explain this kind of behaviour. This shows the need for contextual data (age, gender, occupation, etc.) about the users for a better understanding of the discovered flocks. The presence of artefacts in the results also has to be mentioned. Groups of flocks moving in straight lines indicate gaps in the tracking, especially in the city centre (1st, 2nd and 3rd Ring Roads). It is possible that buildings, and especially skyscrapers, distort or block GPS signals. Although preprocessing and data cleaning can mitigate these problems, the interpolation process may introduce paths which lead to spurious routes.


Figure 4.14: Patterns showing different routes connecting the TSP area with the south. Yellow patterns go from TSP to the south; green patterns show the return.


Table 4.6: Description of the discovered flock patterns in Beijing dataset.

Date (M/DD)   Members                    Time (HH:MM)   Duration   Length   Followers
                                                        (HH:MM)    (Km)
1/17(Sat)     [1885, 8066, 10125]        10:35-11:59    01:24      29.0
1/17          [1885, 8066, 10125]        19:16-20:36    01:20      43.6
1/17          [1885, 8066, 10125]        21:11-21:26    00:15      13.8
1/17          [1885, 8066, 10125]        21:28-21:59    00:31      16.7
2/08(Sun)     [2481, 9140, 9165]         10:01-10:49    00:48      28.7
2/14(Sun)     [1471, 1476, 11068]        16:01-17:08    01:07      13.4
3/06(Fri)     [2108, 7901, 10018]        10:26-11:20    00:54      12.0
3/06          [2108, 7901, 10018]        11:28-11:59    00:31      6.8
4/10(Fri)     [244, 2069, 7992, 10157]   16:01-16:56    00:55      38.1
4/10          [244, 2069, 7992, 10157]   20:53-21:59    01:06      32.2
4/11(Sat)     [245, 2070, 7993]          13:30-15:59    02:29      8.6      10158: 13:50-15:59
4/11          [245, 2070, 7993]          16:01-16:44    00:43      36.0
4/11          [245, 2070, 7993, 10158]   20:51-21:59    01:08      30.8
4/12(Sun)     [11685, 12585, 13196]      09:15-09:35    00:20      14.8
4/12          [247, 2072, 7994, 10160]   16:01-16:48    00:47      37.2
4/12          [247, 2072, 7994, 10160]   20:40-21:59    01:19      30.5
4/12          [11686, 12586, 13197]      20:55-21:19    00:24      16.1
4/13(Mon)     [248, 2073, 7852, 10161]   16:01-16:43    00:42      37.4
4/13          [248, 2073, 7852, 10161]   20:47-21:59    01:12      31.8
4/14(Tue)     [208, 2013, 10257]         13:00-14:06    01:06      37.2
4/14          [208, 2013, 7853]          19:01-19:58    00:57      11.5     10257: 19:12-19:45
4/14          [208, 2013, 7853]          20:21-21:52    01:31      31.8     10257: 20:41-21:52
4/15(Wed)     [210, 2015, 10259]         16:01-16:34    00:33      18.6     7855: 16:01-16:27
4/19(Sun)     [214, 2019, 7954]          10:35-12:25    01:50      36.0     10120: 10:35-12:08, 12:14-12:19
4/19          [214, 2019, 7954, 10120]   18:17-18:59    00:42      30.2
4/19          [214, 2019, 7954, 10120]   19:01-19:55    00:54      14.1
4/22(Wed)     [130, 1853, 8185]          10:39-12:12    01:33      8.4      10040: 10:39-12:11; 11129: 11:26-11:36, 11:41-11:45
4/25(Sat)     [150, 1890, 10261]         12:00-12:59    00:59      13.4
4/25          [150, 1890, 10261]         20:23-21:25    01:02      31.4
4/26(Sun)     [151, 1891, 10262]         15:04-15:49    00:45      30.3

Chapter 5

Discussion

One of the main concerns of this research is understanding the performance implications of the proposed framework. For this reason it was evaluated with synthetic and real datasets of different characteristics. Although the overall size of the dataset was expected to drive the trend in performance, it is clear that this is not the only important factor. Additionally, the number of reported flocks is a critical issue that affects the proper interpretation of the final results. Although the problem of duplicate and redundant patterns is tackled in the postprocessing stage, high similarity among the patterns still remains. The causes of this, together with some strategies to reduce the number of reported flocks and enhance their interpretation, are discussed below.

5.1

Implementation and Performance Issues

5.1.1

Impact of trajectory size

Initial tests with synthetic datasets showed a clear performance advantage of the proposed approach with respect to the traditional method. However, under tests with real datasets that difference disappeared. In addition to the number of trajectories, the individual size of each trajectory also significantly affects the performance of the traditional frequent pattern algorithms used in the proposed approach. It is important to clarify that the size of a trajectory refers to its number of point locations rather than to its spatial length. It is noted by [7, 29, 38] that not just the length of the involved transactions but also the length of the resulting patterns has a direct impact on the performance of frequent pattern techniques. The results from the experiments reveal that the shorter the trajectories, the better the performance in finding flock patterns. As can be seen in Table 3.2, for the two synthetic datasets the average trajectory size is low (40 and 37 points for datasets SJ25KT60 and SJ50KT55 respectively). On the other hand, the average size in real datasets depends on the interpolation rate and the time range selected for the study. For Icebergs06, it was decided to study a complete year with daily interpolation; in this case the average size per trajectory is around 329 points. For the Beijing dataset, previous preprocessing tasks were performed by the original authors. They separated the trajectories of the users daily. In addition, time periods greater than 20 minutes without position reports were used to mark a new trajectory. As a consequence, the average trajectory size is relatively small (103 points) compared to its fine time resolution (every minute) and 4-month time coverage.

5.1.2 Possible solutions

Different strategies could be used to limit the size of a trajectory: long periods without change of position, or abrupt jumps in time or location, can be used to split long trajectories into shorter segments without significant loss of spatial information. Similarly, the interpolation rate is another factor which directly impacts the size of trajectories. Depending on the context, it would be acceptable to use longer intervals to interpolate a dataset. For example, it is possible to obtain a suitable iceberg dataset with samples taken every week instead of at daily intervals. An important feature of the proposed framework is that it is independent of the frequent pattern algorithm. Any technique could use the transactional version of the trajectory dataset to retrieve maximal or closed frequent patterns. While the LCM algorithm gives good performance with short trajectories, other implementations could be more appropriate to deal with long trajectory datasets. The issue of mining long transactional datasets has been studied before in bioinformatics, where colossal patterns (from very long transactions) are usually found in huge biological databases. One approach to the problem is mining frequent patterns using a vertical data format, where a relation {item : TID set} is used instead of the traditional {TID : itemset} schema [38]. The CARPENTER [67] and COBBLER [68] algorithms are alternatives that follow this format. [55, 56] have proposed TD-Close to find the complete set of frequent closed patterns in this kind of high-dimensional data. Stream mining algorithms are another alternative to deal with the problem. A data stream is a massive unbounded sequence of data elements generated at a rapid rate [46]. The concept of a trajectory fits very well in the above definition. Indeed, a similar approach is followed in the BFE algorithm. In recent years, many stream mining algorithms [3, 51, 2, 15, 45] have been proposed to mine maximal and closed frequent patterns over online transactional data streams.
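The first strategy, splitting a trajectory at abrupt jumps in time, can be sketched as a small helper. The class and its parameters are hypothetical, not part of the implemented framework; the 20-minute gap mirrors the threshold used in the Beijing preprocessing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a trajectory (represented here only by its
// list of timestamps, one per recorded location) into sub-trajectories
// wherever the gap between consecutive position reports exceeds maxGap.
// Shorter segments keep transactions short, which benefits the frequent
// pattern mining stage.
public class TrajectorySplitter {
    public static List<List<Integer>> split(List<Integer> timestamps, int maxGap) {
        List<List<Integer>> segments = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int t : timestamps) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) > maxGap) {
                segments.add(current);            // close the segment at the gap
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) segments.add(current);
        return segments;
    }

    public static void main(String[] args) {
        // A gap larger than 20 minutes starts a new trajectory.
        List<Integer> ts = List.of(0, 1, 2, 30, 31, 32);
        System.out.println(split(ts, 20));
        // Prints [[0, 1, 2], [30, 31, 32]]
    }
}
```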

5.2 Interpretation Issues

5.2.1 Number of patterns and quality of the results

As mentioned in Section 2.4.2, the way in which BFE and the proposed framework report final flocks is different, and it has a direct impact on the number of patterns. The fact that BFE reports flocks as segments, depending on the δ parameter, considerably increases the number of patterns reported by this method. However, because of the overlapping problem, many of those patterns are duplicates, a situation that also affects the proposed framework. Tables 3.4 and 3.5 show a considerable reduction in the number of valid flocks after removing identical and redundant patterns. Although techniques to detect duplication were applied in the current research, high similarity among the final patterns still remains. This can be observed in the icebergs dataset (Table 4.2), where even a small number of moving objects produces a large number of flocks. Detailed visualization of these patterns showed that several of them shared trajectories and similar routes. The implementation of additional filters (minimum spatial distance) and the use of alternative definitions of flocks (leader and followers) reduce the number of final flocks and at the same time enhance their comprehension. Similarly, the introduction of additional postprocessing tasks would be important to improve the quality of the final results. The issue of dealing with large numbers of patterns in traditional data mining has been widely discussed by [8, 74, 37]. They proposed different metrics to rank the interestingness of the patterns with the objective of reporting just the top k of them. Frequent itemset mining naturally leads to the discovery of associations and correlations, usually expressed as rules of the form α ⇒ β [support, confidence, correlation] [33]. There are various correlation measures, including lift, χ², cosine and all_confidence [37], which can be used to filter the most significant patterns. Besides these measures, specific metrics related to moving flock patterns, such as the spatial length or coverage, could be applied to filter the most relevant patterns.

Table 4.5, showing the final results for the Beijing dataset, deserves a special discussion. It shows a particular behaviour in the number of flocks reported by the different methods. While the number of patterns reported by BFE increases with ε, the number of patterns reported by the proposed framework goes down. A detailed analysis of the results showed that the number of 'segmented' flocks in the proposed framework decreases with larger values of ε. Figure 5.1 illustrates the situation. Let μ = 3 and δ = 2. ε is represented by circumferences of different size at left and right. With the smaller value of ε the proposed framework reports two flocks ({T1, T2, T3} from t1 to t3 and then from t5 to t6) while BFE reports three (same members, but from timestamps t1 to t2, t2 to t3 and t5 to t6). With the larger value of ε the proposed framework will report just one flock (from t1 to t6) while BFE will need 5 flocks to represent the same pattern. This reduction in the number of reported flocks eases the interpretation of the final results.

Figure 5.1: Example of reported flocks with different values of ε.
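As an illustration of the correlation measures mentioned above, lift and all-confidence can be computed directly from raw occurrence counts. The sketch below is illustrative only: the class name, and the idea of treating co-membership of two trajectories in disks as item co-occurrence, are assumptions, not part of the implemented framework.

```java
// Sketch of two correlation measures usable to rank patterns.
// Counts are hypothetical: nAB = joint occurrences of A and B,
// nA and nB = individual occurrences, n = total transactions.
public class PatternRanking {
    // lift(A, B) = P(A and B) / (P(A) * P(B)); values > 1 indicate positive correlation.
    public static double lift(int nAB, int nA, int nB, int n) {
        return ((double) nAB / n) / (((double) nA / n) * ((double) nB / n));
    }

    // all_confidence(A, B) = sup(A and B) / max(sup(A), sup(B)).
    public static double allConfidence(int nAB, int nA, int nB) {
        return (double) nAB / Math.max(nA, nB);
    }

    public static void main(String[] args) {
        // Hypothetical counts: two trajectories co-occur in 80 of 100 disks,
        // appearing individually in 90 and 85 disks respectively.
        System.out.printf("lift = %.2f%n", lift(80, 90, 85, 100));            // lift = 1.05
        System.out.printf("all_confidence = %.2f%n", allConfidence(80, 90, 85)); // all_confidence = 0.89
    }
}
```

Patterns scoring below a chosen threshold on such measures could then be dropped in the postprocessing stage.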

5.2.2 Overlapping problem and alternatives

Although the process to get a final set of disks used by BFE and the proposed framework proved to be efficient, especially for the synthetic datasets, it also introduces serious problems due to overlapping. The use of a static value of ε will inevitably generate groups sharing trajectories. As discussed in the previous section, this has a negative impact by introducing duplicate patterns. In addition, overlapped disks lead to flocks which share many of their trajectories. As a result, many of the reported flocks are semantically similar. Formal definitions of flocks set ε as a fixed parameter [82, 10, 30]. However, the natural behaviour of moving objects leads groups to increase and decrease their membership, and the space which they occupy, over time. For example, people and vehicles have to move in constrained spaces which affect the size and shape of the flocks. It seems reasonable to think that flexible shapes and values for ε would model the interaction among moving entities in a better way. Spatial clustering algorithms are an option to discover sets of clusters, instead of disks, of arbitrary shape and size. Furthermore, the use of arbitrary shapes will ensure that trajectories belong to only one cluster per time interval. There are several methods in this field, which can be classified into partitioning, hierarchical, density-based or grid-based methods [34]. Recent and well-known algorithms in this area could be applied as alternatives to avoid the overlapping problem. DBScan is one of the most popular density-based spatial clustering algorithms [21]. Clusters are defined as sets of dense connected regions with irregular shape. It has similar parameters for the minimum number of points (MinPts) and a given radius (Eps), easily associated with μ and ε respectively. However, DBScan has problems handling large databases, and in the worst case its complexity reaches O(n²). Recently, parallel spatial clustering algorithms based on the use of Swarm Intelligence (SI) techniques [90] have been proposed. [22] propose SPARROW, an algorithm which combines an exploratory strategy based on biologically inspired agents with a density-based clustering algorithm to discover adaptive clusters in spatial data. This approach extends DBScan to deal with large spatial databases using a decentralized approach.

Grid and spatial index methods are another alternative to be considered. Spatial indexes have proved to be highly efficient to sort and manage point-set data. [18] present TrajStore, an adaptive storage system to manage very large trajectory datasets. They introduce spatial indexes based on adaptive quadtrees as the clusterer and determine the optimal cell size using cost functions. Grid-based methods have also been studied to perform spatial clustering in the presence of obstacles or constraints. This approach is important in the case of pedestrians or vehicles, which usually have to follow networks. [94] present a grid-based hierarchical spatial clustering algorithm which uses an obstacle grid and a hierarchical strategy to reduce the complexity of clustering in the presence of obstacles and constraints. [89] and [95] deal with the problem following a density-based approach.
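To make the density-based alternative concrete, a minimal, brute-force DBScan sketch is shown below, with eps and minPts playing the roles of ε and μ. This is an O(n²) illustration with a naive neighbour search, not the optimised algorithm of [21], and the class is hypothetical.

```java
import java.util.*;

// Minimal DBScan sketch over 2-D points. Labels: 0 = unvisited,
// -1 = noise, > 0 = cluster id. Clusters of arbitrary shape emerge from
// chaining density-reachable core points, avoiding fixed-shape disks.
public class MiniDBScan {
    public static int[] cluster(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        int[] label = new int[n];
        int cid = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] != 0) continue;
            List<Integer> seeds = neighbours(pts, i, eps);
            if (seeds.size() < minPts) { label[i] = -1; continue; }  // not a core point
            cid++;
            label[i] = cid;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (label[j] == -1) label[j] = cid;   // noise becomes a border point
                if (label[j] != 0) continue;          // already assigned
                label[j] = cid;
                List<Integer> nj = neighbours(pts, j, eps);
                if (nj.size() >= minPts) queue.addAll(nj);  // expand from core points only
            }
        }
        return label;
    }

    // Brute-force range query: all indices within eps of point i (including i).
    private static List<Integer> neighbours(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
            if (dx * dx + dy * dy <= eps * eps) out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}, {50, 50}};
        System.out.println(Arrays.toString(cluster(pts, 2.0, 3)));
        // Prints [1, 1, 1, 2, 2, 2, -1]: two dense groups and one noise point
    }
}
```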


Chapter 6

Conclusions and Recommendations

6.1 Summary of the Research

This research has defined an appropriate methodology to apply the frequent pattern mining approach to discover moving flock patterns in large spatio-temporal datasets. A new framework which integrates techniques to identify groups of moving entities and longest-duration flock patterns has been proposed and tested with synthetic and real datasets. The framework assumes that a moving flock pattern can be generalized as a typical frequent pattern. The framework converts a trajectory dataset into a transactional database based on the locations visited by each trajectory. Once a transactional version of the dataset is available, frequent pattern mining algorithms can be applied to it (Objective 1). Overall, the implementation of the proposed framework consists of four steps: identification of groups of moving objects per time interval, construction of a transactional version of the trajectory dataset, application of a frequent pattern mining algorithm, and postprocessing (Objective 2). The proposed framework was tested and compared with a current method (the BFE algorithm). Synthetic datasets simulating trajectories generated by large numbers of moving objects were used to test the scalability of the framework. Real datasets from different contexts and with different characteristics were used to assess the performance and analyse the discovered patterns. Compared with the current method (BFE), the proposed framework showed high performance given the characteristics of the synthetic datasets. With real datasets, the response time was still efficient and quite similar to BFE (Objective 3). The use of synthetic and real datasets provided valuable insights into which parameters and dataset characteristics are the most relevant in order to find moving flock patterns.
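The conversion into a transactional database can be sketched minimally as follows. The grid-cell encoding, cell size and item naming here are illustrative assumptions, not the exact scheme of the implemented framework.

```java
import java.util.*;

// Sketch of the transactional conversion: each trajectory becomes one
// transaction whose "items" are the grid cells it visits, so that a
// standard frequent pattern algorithm can mine the result.
public class TransactionBuilder {
    // Encode a location as a grid-cell item id for a given cell size.
    public static String cellOf(double x, double y, double cellSize) {
        return (int) Math.floor(x / cellSize) + ":" + (int) Math.floor(y / cellSize);
    }

    // One trajectory (sequence of x,y points) -> ordered set of visited cells.
    public static List<String> toTransaction(double[][] trajectory, double cellSize) {
        LinkedHashSet<String> cells = new LinkedHashSet<>();
        for (double[] p : trajectory) cells.add(cellOf(p[0], p[1], cellSize));
        return new ArrayList<>(cells);
    }

    public static void main(String[] args) {
        double[][] traj = {{0.2, 0.7}, {0.4, 0.9}, {1.6, 0.3}, {1.9, 0.1}};
        System.out.println(toTransaction(traj, 1.0));
        // Prints [0:0, 1:0]: consecutive points in the same cell collapse into one item
    }
}
```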
The size of the dataset had a high influence on the performance, as expected, but the length of the transactions also had a relevant impact on the response of the proposed framework. The frequent pattern mining approach proved useful to deal with the problems found in the BFE algorithm for datasets with large numbers of trajectories. The proposed framework handles the disk combinations efficiently by using scalable frequent pattern algorithms. Additionally, maximal pattern mining techniques are able to detect longest-duration flocks. Preliminary strategies to visualize the results were explored during this research. These include different interpretations of moving flock patterns and filters to retrieve the most relevant information. However, the large number of discovered patterns in many cases shows that this task requires additional treatment. A correct visualization of the results is still an open issue. The proposed framework is modular, and different techniques can be applied to improve the performance depending on the particular case. Although a specific frequent pattern algorithm was implemented in this framework, once a set of transactions is derived from the original trajectories, any frequent pattern algorithm can be used. Coupled with that, the initial stage to identify the most visited sites could be replaced by other methods. One important finding from this research was that the method is applicable to different types of phenomena. The framework proved to be useful to find moving flock patterns in diverse contexts such as human and iceberg movement. Results from the case studies reflected useful information and trends which coincided with previous literature and expected behaviour.

6.2 Recommendations

Some recommendations for improvement and further research are proposed as follows:

1. A fixed shape and distance was used to identify clusters in the first step of the framework. This introduces serious problems of overlapping and redundant results. The use of flexible shapes might define flocks which represent reality better. Methods for spatial clustering, especially density-based and grid-based methods, could be used to define flocks with different shapes.

2. Depending on the dimensionality of the datasets and specific parameter settings, the number of discovered patterns can be significantly large. Appropriate analysis of the results demands special techniques to be developed. Further research in visualization is required to extract and display the most valuable patterns. Aggregation, summarisation and simplification techniques based on spatial and temporal statistics could be used for this purpose.

3. In contrast to the BFE algorithm, the proposed framework is not real-time. It requires a time window to build the transactional version of the dataset. A viable alternative is stream mining algorithms, which deal with massive unbounded sequences of continuous data. Applications of these algorithms are usually found in the fields of bioinformatics and computer network traffic. The analysis and integration of such algorithms in the framework is an interesting area for further research.

4. The current research has shown the similarities between itemsets and moving flock patterns. A logical next step would be to mine association rules based on the discovered patterns. In a similar way in which traditional association rules find correlations between items, association rule learning applied to spatio-temporal datasets might find interesting correlations among the places which are visited by moving objects.

In general, the analysis of large spatio-temporal datasets has raised many challenging problems. Frequent pattern mining techniques have been shown to have important contributions to make in this area.
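As a minimal sketch of what the rule mining suggested in recommendation 4 involves, support and confidence of a rule "visits A ⇒ visits B" can be computed over a transactional view of visited places. The class, place names and transactions below are entirely hypothetical.

```java
import java.util.*;

// Sketch of association-rule measures over hypothetical transactions,
// where each transaction is the set of places visited by one moving object.
public class PlaceRules {
    // support(items) = fraction of transactions containing all the items.
    public static double support(List<Set<String>> db, Set<String> items) {
        long hits = db.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / db.size();
    }

    // confidence(A => B) = support(A and B) / support(A).
    public static double confidence(List<Set<String>> db, Set<String> a, Set<String> b) {
        Set<String> ab = new HashSet<>(a);
        ab.addAll(b);
        return support(db, ab) / support(db, a);
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("park", "station"), Set.of("park", "station", "mall"),
            Set.of("park"), Set.of("station", "mall"));
        System.out.printf("support = %.2f, confidence = %.2f%n",
            support(db, Set.of("park", "station")),
            confidence(db, Set.of("park"), Set.of("station")));
        // Prints support = 0.50, confidence = 0.67
    }
}
```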


References [1]

R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. Citeseer. 1994, pp. 487–499.

[2]

F. Ao, J. Du, Y. Yan, B. Liu, and K. Huang. “An efficient algorithm for mining closed frequent itemsets in data streams”. In: Computer and Information Technology Workshops, 2008. CIT Workshops 2008. IEEE 8th International Conference on. IEEE. 2008, pp. 37–42.

[3]

F. Ao, Y. Yan, J. Huang, and K. Huang. “Mining Maximal Frequent Itemsets in Data Streams Based on FP-Tree”. In: Machine Learning and Data Mining in Pattern Recognition (2007), pp. 479–489.

[4]

K.R. Arrigo, G.L. van Dijken, D.G. Ainley, M.A. Fahnestock, and T. Markus. “Ecological impact of a large Antarctic iceberg”. In: Geophysical Research Letters 29.7 (2002), p. 1104. issn: 0094-8276.

[5]

A. Atkinson, V. Siegel, E. Pakhomov, and P. Rothery. “Long-term decline in krill stock and increase in salps within the Southern Ocean”. In: Nature 432.7013 (2004), pp. 100–103. issn: 0028-0836.

[6]

J. Ballantyne and DG Long. “A Multidecadal Study of the Number of Antarctic Icebergs using Scatterometer Data”. In: International Geoscience and Remote Sensing Symposium. Vol. 5. 2002, pp. 3029–3031.

[7]

R.J. Bayardo Jr. “Efficiently mining long patterns from databases”. In: ACM Sigmod Record 27.2 (1998), pp. 85–93. issn: 0163-5808.

[8]

R.J. Bayardo Jr and R. Agrawal. “Mining the most interesting rules”. In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 1999, p. 154. isbn: 1581131437.

[9]

R.J. Bayardo Jr, B. Goethals, and M.J. Zaki. “FIMI 04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations”. In: vol. 126. 2004.

[10]

M. Benkert, J. Gudmundsson, F. Hubner, and T. Wolle. “Reporting flock patterns”. In: Computational Geometry 41.3 (2008), pp. 111–125.

[11]

T. Birkhoff. Network-based Generator of Moving Objects. 2010. url: http: //www.fh-oow.de/institute/iapg/personen/brinkhoff/generator/. 51

[12]

C. Borgelt. “An Implementation of the FP-growth Algorithm”. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. ACM. 2005, p. 5.

[13]

T. Brinkhoff. “A framework for generating network-based moving objects”. In: GeoInformatica 6.2 (2002), pp. 153–180.

[14]

T. Brinkhoff. “Generating traffic data”. In: IEEE Data Engineering Bulletin 26.2 (2003), pp. 19–25.

[15]

H. Chen. “Efficiently Mining the Recent Frequent Patterns over Online Data Streams”. In: Intelligent Systems and Applications (ISA), 2010 2nd International Workshop on. IEEE. 2010, pp. 1–4.

[16]

R. Chen, Q. Jiang, H. Yuan, and L. Gruenwald. “Mining association rules in analysis of transcription factors essential to gene expressions”. In: Atlantic Symposium on Computational Biology, and Genome Information Systems & Technology. Citeseer. 2001.

[17]

C. Creighton and S. Hanash. “Mining gene expression databases for association rules”. In: Bioinformatics 19.1 (2003), p. 79. issn: 1367-4803.

[18]

P. Cudre-Mauroux, E. Wu, and S. Madden. “TrajStore: An adaptive storage system for very large trajectory data sets”. In: Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE. 2010, pp. 109–120.

[19]

G. Deacon. The Antarctic circumpolar ocean. Vol. 180. Cambridge University Press Cambridge, 1984.

[20]

H. Dettki, G. Ericsson, and L. Edenius. “Real-time moose tracking: an internet based mapping application using GPS/GSM-collars in Sweden”. In: Alces 40 (2004), pp. 13–21.

[21]

M. Ester, H.P. Kriegel, J. Sander, and X. Xu. “A density-based algorithm for discovering clusters in large spatial databases with noise”. In: Proc. KDD. Vol. 96. 1996, pp. 226–231.

[22]

G. Folino and G. Spezzano. “An adaptive flocking algorithm for spatial clustering”. In: Parallel Problem Solving from Nature PPSN VII (2002), pp. 924– 933.

[23]

A.U. Frank, J. Raper, and J.P. Cheylan. Life and motion of socio-economic units. CRC, 2001.

[24]

P. Giudici and Ebooks Corporation. Applied data mining: Statistical methods for business and industry. Wiley New York, 2003. isbn: 047084678.

[25]

B. Goethals. “Survey on frequent pattern mining”. In: Manuscript (2003), pp. 1–43.

[26]

B. Goethals. The FIMI’04 Homepage. 2004. url: http : / / fimi . cs . helsinki.fi/.

[27]

B. Goethals and M.J. Zaki. “Advances in frequent itemset mining implementations: report on FIMI’03”. In: ACM SIGKDD Explorations Newsletter 6.1 (2004), pp. 109–117. issn: 1931-0145. 52

[28]

B. Goethals and M.J. Zaki. “FIMI 03: Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations”. In: vol. 90. 2003.

[29]

G. Grahne and J. Zhu. “High performance mining of maximal frequent itemsets”. In: 6th International Workshop on High Performance Data Mining. Citeseer. 2003.

[30]

J. Gudmundsson and M. van Kreveld. “Computing longest duration flocks in trajectory data”. In: Proceedings of the 14th annual ACM international symposium on Advances in geographic information systems. ACM. 2006, p. 42.

[31]

J. Gudmundsson, M. van Kreveld, and B. Speckmann. “Efficient detection of motion patterns in spatio-temporal data sets”. In: Proceedings of the 12th annual ACM international workshop on Geographic information systems. ACM. 2004, pp. 250–257.

[32]

J. Gyory, J. Cangialosi, I. Jo, A. Mariano, and E. Ryan. Surface Currents in the Southern Ocean: The Antarctic Coastal Current. 2003. url: http:// oceancurrents.rsmas.miami.edu/southern/antarctic-coastal.html.

[33]

J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann, 2006. isbn: 1558609016.

[34]

J. Han, M. Kamber, and A.K.H. Tung. “Spatial clustering methods in data mining: A survey”. In: Geographic Data Mining and Knowledge Discovery. Taylor and Francis 21 (2001).

[35]

J. Han, K. Koperski, and N. Stefanovic. “GeoMiner: a system prototype for spatial data mining”. In: Proceedings of the 1997 ACM SIGMOD international conference on Management of data. ACM. 1997, pp. 553–556. isbn: 0897919114.

[36]

J. Han and J. Pei. “Mining frequent patterns by pattern-growth: methodology and implications”. In: ACM SIGKDD Explorations Newsletter 2.2 (2000), pp. 14–20. issn: 1931-0145.

[37]

J. Han, H. Cheng, D. Xin, and X. Yan. “Frequent pattern mining: current status and future directions”. In: Data Mining and Knowledge Discovery 15.1 (2007), pp. 55–86. issn: 1384-5810.

[38]

J. Han, H. Cheng, D. Xin, and X. Yan. “Frequent pattern mining: current status and future directions”. In: Data Mining and Knowledge Discovery 15.1 (2007), pp. 55–86. issn: 1384-5810.

[39]

J. Han, J. Pei, Y. Yin, and R. Mao. “Mining frequent patterns without candidate generation: A frequent-pattern tree approach”. In: Data mining and knowledge discovery 8.1 (2004), pp. 53–87.

[40]

R.P. Hewitt, D.A. Demer, and J.H. Emery. “An 8-year cycle in krill biomass density inferred from acoustic surveys conducted in the vicinity of the South Shetland Islands during the austral summers of 1991-1992 through 20012002”. In: Aquatic Living Resources 16.3 (2003), pp. 205–213. issn: 09907440. 53

[41]

S. Iwase and H. Saito. “Tracking soccer player using multiple views”. In: Proceedings of the IAPR Workshop on Machine Vision Applications (MVA02). Citeseer. 2002, pp. 102–105.

[42]

C.S. Jensen, D. Lin, and B.C. Ooi. “Continuous clustering of moving objects”. In: IEEE Transactions on Knowledge and Data Engineering (2007), pp. 1161– 1174. issn: 1041-4347.

[43]

H. Jeung, M.L. Yiu, X. Zhou, C.S. Jensen, and H.T. Shen. “Discovery of convoys in trajectory databases”. In: Proceedings of the VLDB Endowment 1.1 (2008), pp. 1068–1080.

[44]

Hoyoung Jeung, Heng Tao Shen, and Xiaofang Zhou. “Convoy Queries in Spatio-Temporal Databases”. In: Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. 2008, pp. 1457–1459. doi: 10.1109/ICDE. 2008.4497588.

[45]

N. Jiang and L. Gruenwald. “CFI-Stream: mining closed frequent itemsets in data streams”. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2006, pp. 592– 597. isbn: 1595933395.

[46]

N. Jiang and L. Gruenwald. “Research issues in data stream association rule mining”. In: ACM Sigmod Record 35.1 (2006), pp. 14–19. issn: 0163-5808.

[47]

P. Kalnis, N. Mamoulis, and S. Bakiras. “On discovering moving clusters in spatio-temporal data”. In: Advances in Spatial and Temporal Databases (2005), pp. 364–381.

[48]

J. Kaufman, J. Myllymaki, and J. Jackson. “City Simulator V2. 0”. In: IBM alphaWorks, December (2001).

[49]

J. Krumm and E. Horvitz. “Predestination: Where do you want to go today?” In: Computer 40.4 (2007), pp. 105–107. issn: 0018-9162.

[50]

P. Laube, M. Kreveld, and S. Imfeld. “Finding REMO - detecting relative motion patterns in geospatial lifelines”. In: Developments in Spatial Data Handling (2005), pp. 201–215.

[51]

H.F. Li, S.Y. Lee, and M.K. Shan. “Online mining (recently) maximal frequent itemsets over data streams”. In: (2005). issn: 1097-8585.

[52]

Q. Li, Y. Zheng, X. Xie, Y. Chen, W. Liu, and W.Y. Ma. “Mining user similarity based on location history”. In: Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems. ACM. 2008, pp. 1–10.

[53]

Y. Li, J. Han, and J. Yang. “Clustering moving objects”. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2004, pp. 617–622. isbn: 1581138881.

[54]

Z. Li, M. Ji, J.G. Lee, L.A. Tang, Y. Yu, J. Han, and R. Kays. “MoveMine: mining moving object databases”. In: Proceedings of the 2010 international conference on Management of data. ACM. 2010, pp. 1203–1206. 54

[55]

H. Liu, J. Han, D. Xin, and Z. Shao. “Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach”. In: Proceeding of the 2006 SIAM international conference on data mining (SDM 06), Bethesda, MD. Citeseer. 2006, pp. 280–291.

[56]

H. Liu, J. Han, D. Xin, and Z. Shao. “Top-down mining of interesting patterns from very high dimensional data”. In: Data Engineering, 2006. ICDE’06. Proceedings of the 22nd International Conference on. IEEE. 2006, p. 114. isbn: 0769525709.

[57]

DG Long, J. Ballantyne, and C. Bertoia. “Is the number of Antarctic icebergs really increasing?” In: EOS Transactions 83 (2002), p. 469.

[58]

D. Makris and T. Ellis. “Path detection in video surveillance”. In: Image and Vision Computing 20.12 (2002), pp. 895–903. issn: 0262-8856.

[59]

Microsoft Research Asia. GeoLife GPS Trajectories. 2010. url: http : / / research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4 -daa38f2b2e13/default.aspx.

[60]

H.J. Miller and J. Han. Geographic data mining and knowledge discovery. Vol. 338. Wiley Online Library, 2001.

[61]

S.I. Minato, T. Uno, and H. Arimura. “LCM over ZBDDS: fast generation of very large-scale frequent itemsets using a compact graph-based representation”. In: Advances in Knowledge Discovery and Data Mining (2008), pp. 234–246.

[62]

S. Nicol. “Krill, currents, and sea ice: Euphausia superba and its changing environment”. In: BioScience 56.2 (2006), pp. 111–120.

[63]

S. Nicol and Y. Endo. “Krill fisheries: development, management and ecosystem implications”. In: Aquatic Living Resources 12.2 (1999), pp. 105–120. issn: 0990-7440.

[64]

S. Nicol and J. Foster. “Recent trends in the fishery for Antarctic krill”. In: Aquating Living Resources 16.1 (2003), pp. 42–45. issn: 0990-7440.

[65]

Open GIS Consortium (OGC). Standards and Specifications. 2010. url: http://www.opengeospatial.org/standards.

[66]

M. Oppenheimer and R.B. Alley. “The West Antarctic ice sheet and long term climate policy”. In: Climatic Change 64.1 (2004), pp. 1–10. issn: 01650009.

[67]

F. Pan, G. Cong, A.K.H. Tung, J. Yang, and M.J. Zaki. “CARPENTER: Finding closed patterns in long biological datasets”. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2003, pp. 637–642. isbn: 1581137370.

[68]

F. Pan, A.K.H. Tung, G. Cong, and X. Xu. “COBBLER: combining column and row enumeration for closed pattern discovery”. In: Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on. IEEE. 2004, pp. 21–30. isbn: 0769521460. 55

[69]

N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. “Discovering frequent closed itemsets for association rules”. In: Database Theory ICDT 99 (1999), pp. 398–416.

[70]

D. Pfoser and Y. Theodoridis. “Generating semantics-based trajectories of moving objects”. In: Computers, Environment and Urban Systems 27.3 (2003), pp. 243–263. issn: 0198-9715.

[71]

C. Piciarelli, GL Foresti, and L. Snidara. “Trajectory clustering and its applications for video surveillance”. In: Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE Conference on. IEEE. 2006, pp. 40–45. isbn: 0780393856.

[72]

Refractions Research. PostGIS (version 1.5.2). 2010. url: http://postgis. refractions.net/.

[73]

J.M. Saglio and J. Moreira. “Oporto: A realistic scenario generator for moving objects”. In: GeoInformatica 5.1 (2001), pp. 71–93. issn: 1384-6175.

[74]

T. Scheffer and S. Wrobel. “Finding the most interesting patterns in a database quickly by using sequential sampling”. In: The Journal of Machine Learning Research 3 (2003), pp. 833–862. issn: 1532-4435.

[75]

M.K. Shan and L.Y. Wei. “Algorithms for discovery of spatial co-orientation patterns from images”. In: Expert Systems with Applications (2010).

[76]

A. Stathopoulos, L. Dimitriou, and T. Tsekeris. “Fuzzy modeling approach for combined forecasting of urban traffic flow”. In: Computer-Aided Civil and Infrastructure Engineering 23.7 (2008), pp. 521–535. issn: 1467-8667.

[77]

C. Swithinbank, P. McClain, and P. Little. “Drift tracks of Antarctic icebergs”. In: Polar Record 18.116 (1977), pp. 495–501. issn: 0032-2474.

[78]

The Laboratory for Web Algorithmics - University of Milan. fastutils: Fast & compact type-specific collections for Java (version 6.0). 2010. url: http: //fastutil.dsi.unimi.it/.

[79]

D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal. “Query flocks: a generalization of association-rule mining”. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. ACM. 1998, pp. 1–12. isbn: 0897919955.

[80]

T. Uno, M. Kiyomi, and H. Arimura. “LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets”. In: IEEE ICDM04 Workshop FIMI04 (International Conference on Data Mining, Frequent Itemset Mining Implementations). Citeseer. 2004.

[81]

T. Uno, M. Kiyomi, and H. Arimura. “Lcm ver. 3: Collaboration of array, bitmap and prefix tree for frequent itemset mining”. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. ACM. 2005, pp. 77–86. isbn: 1595932100.

56

[82]

M.R. Vieira, P. Bakalov, and V.J. Tsotras. “On-line discovery of flock patterns in spatio-temporal data”. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM. 2009, pp. 286–295.

[83]

M.R. Vieira, E. Fr´ıas-Mart´ınez, P. Bakalov, V. Fr´ıas-Mart´ınez, and V.J. Tsotras. “Querying Spatio-Temporal Patterns in Mobile Phone-Call Databases”. In: Mobile Data Management (MDM), 2010 Eleventh International Conference on. IEEE. 2010, pp. 239–248.

[84]

T.E. Vinje. “Some satellite-tracked iceberg drifts in the Antarctic”. In: Annals of Glaciology 1 (1980), pp. 83–87.

[85]

[85] Vivid Solutions Inc. JTS Topology Suite (version 1.8). 2010. url: http://www.vividsolutions.com/jts/jtshome.htm.

[86] Vivid Solutions Inc. OpenJUMP GIS (version 1.3.1). 2010. url: http://www.openjump.org/.

[87] E.I. Vlahogianni, M.G. Karlaftis, and J.C. Golias. “Optimized and meta-optimized neural networks for short-term traffic flow prediction: A genetic approach”. In: Transportation Research Part C: Emerging Technologies 13.3 (2005), pp. 211–234. issn: 0968-090X.

[88] M. Wachowicz, R. Ong, C. Renso, and M. Nanni. Discovering Moving Flock Patterns Among Pedestrians Through Spatio-Temporal Coherence. Tech. rep. Istituto di Scienza e Tecnologie dell’Informazione, 2010. url: http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-027.html.

[89] X. Wang, C. Rostoker, and H.J. Hamilton. “Density-based spatial clustering in the presence of obstacles and facilitators”. In: Knowledge Discovery in Databases: PKDD 2004 (2004), pp. 446–458.

[90] B. Webb. “Swarm Intelligence: From Natural to Artificial Systems”. In: Connection Science 14.2 (2002), pp. 163–164. issn: 0954-0091.

[91] T. Wilson. OGC KML. Tech. rep. OGC Standard 07-147r2, 2008-04-14, 251 pp. 2008.

[92] Z. Wood and A. Galton. “A taxonomy of collective phenomena”. In: Applied Ontology 4.3 (2009), pp. 267–292.

[93] Woods Hole Oceanographic Institution. Antarctica’s Ocean Circulation. 2006. url: http://polardiscovery.whoi.edu/antarctica/circulation.html.

[94] Y. Yang, J. Zhang, and J. Yang. “Grid-Based Hierarchical Spatial Clustering Algorithm in Presence of Obstacle and Constraints”. In: 2008 International Conference on Internet Computing in Science and Engineering. IEEE. 2008, pp. 383–388.

[95] O.R. Zaiane and C.H. Lee. “Clustering spatial data in the presence of obstacles: a density-based approach”. In: Database Engineering and Applications Symposium, 2002. Proceedings. International. IEEE. 2002, pp. 214–223. isbn: 0769516386.

[96] C. Zhang and S. Zhang. Association Rule Mining: Models and Algorithms. 2002. isbn: 3540435336.

[97] Y. Zheng, X. Xie, and W.Y. Ma. “GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory”. In: Data Engineering (2010), p. 32.

[98] Y. Zheng, L. Zhang, X. Xie, and W.Y. Ma. “Mining interesting locations and travel sequences from GPS trajectories”. In: Proceedings of the 18th International Conference on World Wide Web. ACM. 2009, pp. 791–800.

[99] Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.Y. Ma. “Understanding mobility based on GPS data”. In: Proceedings of the 10th International Conference on Ubiquitous Computing. ACM. 2008, pp. 312–321.


Appendix A: Main source code of the framework implementation

import com.vividsolutions.jts.geom.*;
import edu.wlu.cs.levy.CG.KDTree;
import edu.wlu.cs.levy.CG.*;
import it.unimi.dsi.fastutil.*;
import it.unimi.dsi.fastutil.ints.*;
import it.unimi.dsi.fastutil.objects.*;
import java.io.*;
import java.util.*;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LCMFlock {

    private GeometryFactory factory;
    public static double epsilon;
    private static double r2;
    private static double r;
    public static int time;
    public static int mu;
    public static int delta;
    private static final double precision = 0.001;
    int numFlock = 0;
    int cid = 0;
    long timeMIF = 0;
    Int2ObjectAVLTreeMap<ArrayList<ArrayList<Integer>>> database;
    private int ntime = 0;
    Int2ObjectAVLTreeMap<DiskInfo> dbdisks = new Int2ObjectAVLTreeMap<DiskInfo>();

    public LCMFlock(double epsilon, int mu, int delta) {
        LCMFlock.epsilon = epsilon;
        LCMFlock.mu = mu;
        LCMFlock.delta = delta;
        factory = new GeometryFactory();
        LCMFlock.r = (epsilon / 2) + LCMFlock.precision;
        LCMFlock.r2 = Math.pow(epsilon / 2, 2);
        database = new Int2ObjectAVLTreeMap<ArrayList<ArrayList<Integer>>>();
    }

    // Partition the points of one timestamp into a grid of epsilon-sized cells.
    public Object2ObjectAVLTreeMap<Index, ObjectArrayList<Point>> getGrid(Pointset points) {
        ObjectIterator<Point> iterator = points.points.iterator();
        Object2ObjectAVLTreeMap<Index, ObjectArrayList<Point>> grid;
        grid = new Object2ObjectAVLTreeMap<Index, ObjectArrayList<Point>>();
        while (iterator.hasNext()) {
            Point point = iterator.next();
            int i = (int) (point.getX() / LCMFlock.epsilon);
            int j = (int) (point.getY() / LCMFlock.epsilon);
            Index index = new Index(i, j);
            if (grid.containsKey(index)) {
                grid.get(index).add(point);
            } else {
                ObjectArrayList<Point> aux = new ObjectArrayList<Point>();
                aux.add(point);
                grid.put(index, aux);
            }
        }
        return grid;
    }

    // Find the maximal disks of radius epsilon/2 covering at least mu points.
    public ObjectAVLTreeSet<Disk> getDisks(Object2ObjectAVLTreeMap<Index, ObjectArrayList<Point>> grid) {
        ObjectAVLTreeSet<Disk> maximalDisks = new ObjectAVLTreeSet<Disk>();
        ObjectIterator<Index> itKeys = grid.keySet().iterator();
        ObjectIterator<ObjectArrayList<Point>> itValues = grid.values().iterator();
        ObjectAVLTreeSet<Pair> computedPairs = new ObjectAVLTreeSet<Pair>();
        KDTree<Disk> kdtree = new KDTree<Disk>(2);
        ArrayList<double[]> diskCoordinates = new ArrayList<double[]>();
        while (itKeys.hasNext()) {
            Index index = itKeys.next();
            ObjectArrayList<Point> pointsInCell = itValues.next();
            ObjectIterator<Point> itPointsInCell = pointsInCell.iterator();
            // Collect the points located in the 3x3 neighbourhood of the current cell.
            ObjectArrayList<Point> pointsInSubgrid = new ObjectArrayList<Point>();
            for (int x = index.getX() - 1; x <= index.getX() + 1; x++) {
                for (int y = index.getY() - 1; y <= index.getY() + 1; y++) {
                    Index neighbour = new Index(x, y);
                    if (grid.containsKey(neighbour)) {
                        pointsInSubgrid.addAll(grid.get(neighbour));
                    }
                }
            }
            // For each pair of points closer than epsilon, compute the two candidate
            // disk centres and keep the disks covering at least mu points.
            while (itPointsInCell.hasNext()) {
                Point point1 = itPointsInCell.next();
                for (Point point2 : pointsInSubgrid) {
                    if (point1 == point2) {
                        continue;
                    }
                    Pair pair = new Pair(point1, point2);
                    if (computedPairs.contains(pair)) {
                        continue;
                    }
                    computedPairs.add(pair);
                    if (point1.distance(point2) > LCMFlock.epsilon) {
                        continue;
                    }
                    Point[] centres = calculateDisks(point1, point2);
                    if (centres == null) {
                        continue;
                    }
                    Disk disk1 = new Disk(centres[0]);
                    Disk disk2 = new Disk(centres[1]);
                    for (Point point : pointsInSubgrid) {
                        if (centres[0].distance(point) <= LCMFlock.r) {
                            disk1.add(point);
                        }
                        if (centres[1].distance(point) <= LCMFlock.r) {
                            disk2.add(point);
                        }
                    }
                    if (disk1.count >= LCMFlock.mu) {
                        try {
                            double[] key = {disk1.getX(), disk1.getY()};
                            kdtree.insert(key, disk1);
                            diskCoordinates.add(key);
                        } catch (KeySizeException ex) {
                            Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
                        } catch (KeyDuplicateException ex) {
                            // Prune identical centres
                        }
                    }
                    if (disk2.count >= LCMFlock.mu) {
                        try {
                            double[] key = {disk2.getX(), disk2.getY()};
                            kdtree.insert(key, disk2);
                            diskCoordinates.add(key);
                        } catch (KeySizeException ex) {
                            Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
                        } catch (KeyDuplicateException ex) {
                            // Prune identical centres
                        }
                    }
                }
            }
        }
        // Prune disks whose member sets are subsets of another disk's members.
        for (double[] key : diskCoordinates) {
            try {
                // Get the disks in range of each key
                double[] lo = {key[0] - LCMFlock.epsilon, key[1] - LCMFlock.epsilon};
                double[] hi = {key[0] + LCMFlock.epsilon, key[1] + LCMFlock.epsilon};
                List<Disk> disks = kdtree.range(lo, hi);
                int size = disks.size();
                double[] rkey = {0, 0};
                for (int i = 0; i < size - 1; i++) {
                    Disk disk1 = disks.get(i);
                    if (disk1.isSubset()) {
                        continue;
                    }
                    ArrayList<Integer> members1 = disk1.getPointsIDs();
                    for (int j = i + 1; j < size; j++) {
                        Disk disk2 = disks.get(j);
                        if (disk2.isSubset()) {
                            continue;
                        }
                        ArrayList<Integer> members2 = disk2.getPointsIDs();
                        if (members1.containsAll(members2)) {
                            disk2.setSubset(true);
                            maximalDisks.remove(disk2);
                            rkey[0] = disk2.getX();
                            rkey[1] = disk2.getY();
                            diskCoordinates.remove(rkey);
                            try {
                                kdtree.delete(rkey);
                            } catch (KeyMissingException ex) {
                            }
                        } else if (members2.containsAll(members1)) {
                            disk1.setSubset(true);
                            maximalDisks.remove(disk1);
                            rkey[0] = disk1.getX();
                            rkey[1] = disk1.getY();
                            diskCoordinates.remove(rkey);
                            try {
                                kdtree.delete(rkey);
                            } catch (KeyMissingException ex) {
                            }
                            break;
                        }
                    }
                }
                for (Disk disk : disks) {
                    if (!disk.isSubset()) {
                        maximalDisks.add(disk);
                    }
                }
            } catch (KeySizeException ex) {
                Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
        // Record every maximal disk: each member point gets the disk id (cid)
        // appended to its trajectory record in the transaction database.
        for (Disk disk : maximalDisks) {
            DiskInfo di = new DiskInfo();
            di.setTime(time);
            di.setPoints(disk.points);
            dbdisks.put(cid, di);
            for (Point point : disk.points) {
                int id = (Integer) point.getUserData();
                if (database.containsKey(id)) {
                    ArrayList<ArrayList<Integer>> aux = database.get(id);
                    aux.get(aux.size() - 1).add(cid);
                } else {
                    ArrayList<ArrayList<Integer>> aux = new ArrayList<ArrayList<Integer>>();
                    ArrayList<Integer> tag = new ArrayList<Integer>();
                    tag.add(cid);
                    aux.add(tag);
                    database.put(id, aux);
                }
            }
            cid++;
        }
        return maximalDisks;
    }

    // Write the disk-membership transactions to a file and run the external LCM
    // implementation to mine the closed frequent itemsets (candidate flocks).
    public void mineMIF() {
        BufferedWriter writer = null;
        String aux = "database.aux";
        String filename = "database.traj";
        try {
            StringBuilder stdin = new StringBuilder();
            IntBidirectionalIterator itDB = database.keySet().iterator();
            ObjectCollection<ArrayList<ArrayList<Integer>>> trajs = database.values();
            ObjectIterator<ArrayList<ArrayList<Integer>>> itTrajs = trajs.iterator();
            while (itDB.hasNext()) {
                itDB.nextInt();
                ArrayList<ArrayList<Integer>> traj = itTrajs.next();
                for (ArrayList<Integer> disks : traj) {
                    if (disks.size() == 1) {
                        continue;
                    }
                    for (Object disk : disks) {
                        stdin.append(disk).append(" ");
                    }
                    stdin.append("\n");
                }
            }
            File faux = new File(aux);
            File input = new File(filename);
            writer = new BufferedWriter(new FileWriter(faux));
            writer.write(stdin.toString());
            writer.close();
            faux.renameTo(input);

            String command = "/home/andress/Projects/lcm21/fimclosed "
                    + filename + " " + LCMFlock.mu + " output.mfi";
            long now = System.currentTimeMillis();
            Process p = Runtime.getRuntime().exec(command);
            p.waitFor();
            timeMIF = System.currentTimeMillis() - now;
        } catch (InterruptedException ex) {
            Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
        } finally {
            try {
                writer.close();
            } catch (IOException ex) {
                Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }

    // Read the itemsets reported by LCM and enforce the temporal constraint:
    // the disks of a candidate flock must span at least delta consecutive timestamps.
    public void checkFlocks() {
        BufferedReader reader = null;
        StringTokenizer st;
        try {
            String line;
            String scid;
            int fid = 1;
            int cid2 = 0;
            File input = new File("output.mfi");
            reader = new BufferedReader(new FileReader(input));
            int frequency = LCMFlock.mu;
            while ((line = reader.readLine()) != null) {
                st = new StringTokenizer(line, " ");
                if (st.countTokens() < 4) {
                    continue;
                }
                ArrayList<DiskInfo> aux = new ArrayList<DiskInfo>(st.countTokens() - 1);
                while (st.hasMoreElements()) {
                    scid = st.nextToken();
                    if (scid.charAt(0) == '(') {
                        frequency = Integer.parseInt(scid.substring(1, scid.length() - 1));
                    } else {
                        cid2 = Integer.parseInt(scid);
                        DiskInfo di = dbdisks.get(cid2);
                        aux.add(di);
                    }
                }
                Collections.sort(aux);
                ArrayList<Integer> finalPoints = aux.get(0).getPointIDs();
                int begin = aux.get(0).getTime();
                int end = begin;
                int now;
                int limit = aux.size();
                for (int i = 1; i < limit; i++) {
                    now = aux.get(i).getTime();
                    if (now == end + 1 || now == end) {
                        end = now;
                        if (finalPoints.size() != frequency) {
                            finalPoints.retainAll(aux.get(i).getPointIDs());
                        }
                    } else if (end - begin >= LCMFlock.delta - 1) {
                        System.out.println("\n" + fid + " (" + finalPoints.size() + ") :");
                        System.out.println("From time " + begin + " to " + end);
                        System.out.println("Members: " + finalPoints);
                        numFlock++;
                        begin = end = now;
                        finalPoints = aux.get(i).getPointIDs();
                    } else {
                        begin = end = now;
                        finalPoints = aux.get(i).getPointIDs();
                    }
                }
                if (end - begin >= LCMFlock.delta - 1) {
                    System.out.println("\n" + fid + " (" + finalPoints.size() + ") :");
                    System.out.println("From time " + begin + " to " + end);
                    System.out.println("Members: " + finalPoints);
                    numFlock++;
                }
                fid++;
            }
        } catch (IOException ex) {
            Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
        } finally {
            try {
                reader.close();
            } catch (IOException ex) {
                Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }

    // Given two points closer than epsilon, return the centres of the two circles
    // of radius r (epsilon/2 plus a precision margin) passing through both points.
    private Point[] calculateDisks(Point p1, Point p2) {
        Point[] disks = new Point[2];
        double p1x = p1.getX();
        double p1y = p1.getY();
        double p2x = p2.getX();
        double p2y = p2.getY();
        double k1, k2, h1, h2;
        double X, Y;
        double D2;

        X = p1x - p2x;
        Y = p1y - p2y;
        D2 = Math.pow(X, 2) + Math.pow(Y, 2);
        // The two points are the same (measure or resample errors)
        if (D2 == 0) {
            return null;
        }
        double expression = 4 * (r2 / D2) - 1;
        double root = Math.pow(expression, 0.5);
        h1 = ((X + Y * root) / 2) + p2x;
        h2 = ((X - Y * root) / 2) + p2x;
        k1 = ((Y - X * root) / 2) + p2y;
        k2 = ((Y + X * root) / 2) + p2y;

        disks[0] = factory.createPoint(new Coordinate(h1, k1));
        disks[1] = factory.createPoint(new Coordinate(h2, k2));

        return disks;
    }

    // Main loop: for every timestamp, grid the points and find the maximal disks;
    // then mine the resulting transaction database and report the flocks.
    public void flockFinder(String filename) {
        try {
            Loader loader = new Loader(filename);
            int points = loader.readPoints();
            numFlock = 0;
            Int2ObjectAVLTreeMap<Pointset> timestamps = loader.getTimestamps();
            IntSet times = timestamps.keySet();
            ObjectCollection<Pointset> pointsets = timestamps.values();
            ObjectIterator<Pointset> itPointsets = pointsets.iterator();
            IntIterator itTimes = times.iterator();
            while (itTimes.hasNext()) {
                LCMFlock.time = itTimes.next();
                Pointset pointset = itPointsets.next();
                Object2ObjectAVLTreeMap<Index, ObjectArrayList<Point>> grid = this.getGrid(pointset);
                ObjectAVLTreeSet<Disk> maximalDisks = this.getDisks(grid);
                ntime++;
            }
            this.mineMIF();
            this.checkFlocks();
        } catch (IOException ex) {
            Logger.getLogger(LCMFlock.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void main(String[] arg) {
        LCMFlock main = new LCMFlock(100, 3, 3);
        main.flockFinder("Iceberg06.dat");
    }
}
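The geometry behind calculateDisks can be checked in isolation: the two candidate centres of a radius-r circle passing through a pair of points lie on the perpendicular bisector of the segment joining them, each at distance r from both points. The following standalone sketch reproduces that formula with plain coordinates instead of JTS Point objects; the class and method names (DiskCentres, centres) are illustrative and not part of the framework:

```java
// Standalone sketch of the two-centre computation used by calculateDisks.
// DiskCentres and centres are illustrative names, not part of the framework.
public class DiskCentres {

    // Returns {h1, k1, h2, k2}: the centres of the two circles of radius r whose
    // borders pass through (x1, y1) and (x2, y2). Returns null if the points
    // coincide or are further apart than 2r (no such circle exists).
    public static double[] centres(double x1, double y1,
                                   double x2, double y2, double r) {
        double X = x1 - x2;
        double Y = y1 - y2;
        double D2 = X * X + Y * Y;  // squared distance between the points
        if (D2 == 0 || D2 > 4 * r * r) {
            return null;
        }
        double root = Math.sqrt(4 * (r * r) / D2 - 1);
        double h1 = (X + Y * root) / 2 + x2;
        double k1 = (Y - X * root) / 2 + y2;
        double h2 = (X - Y * root) / 2 + x2;
        double k2 = (Y + X * root) / 2 + y2;
        return new double[]{h1, k1, h2, k2};
    }

    public static void main(String[] args) {
        // Points (0,0) and (6,0) with r = 5: the centres are (3,4) and (3,-4),
        // the classic 3-4-5 right triangle on either side of the segment.
        double[] c = centres(0, 0, 6, 0, 5);
        System.out.println("(" + c[0] + ", " + c[1] + ") and (" + c[2] + ", " + c[3] + ")");
    }
}
```

Both centres are at distance r from both input points, so a disk drawn around either centre has the pair on its border; this is why the framework only needs to enumerate point pairs closer than epsilon to generate all candidate disks.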

