Comparing Models for Gesture Recognition of Children's Bullying Behaviors

2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)

Michael Tsang, Vadim Korolik

Stefan Scherer

Maja Matarić

Department of Computer Science, University of Southern California, Los Angeles, California 90089. Email: {tsangm,korolik}@usc.edu

Institute for Creative Technologies, University of Southern California, Los Angeles, California 90094. Email: [email protected]

Department of Computer Science, University of Southern California, Los Angeles, California 90089. Email: [email protected]

Abstract—We explored gesture recognition applied to the problem of classifying natural physical bullying behaviors by children. To capture natural bullying behavior data, we developed a humanoid robot that used hand-coded gesture recognition to identify basic physical bullying gestures and responded by explaining why the gestures were inappropriate. Children interacted with the robot by trying various bullying behaviors, thereby allowing us to collect a natural bullying behavior dataset for training the classifiers. We trained three different sequence classifiers using the collected data and compared their effectiveness at classifying different types of common physical bullying behaviors. Overall, Hidden Conditional Random Fields achieved the highest average F1 score (0.645) over all tested gesture classes.

Figure 1. Example of bullying-type gestures children spontaneously demonstrated in front of the robot to elicit its response

1. Introduction

Bullying among children is a global issue that can cause long-term negative psychological and behavioral problems for both the victim and the bully. Children who bully others tend to have higher instances of conduct problems and dislike of school, whereas victims of bullying show higher levels of insecurity, anxiety, depression, loneliness, and low self-esteem compared to their peers [7], [13]. Despite the prevalence and severity of bullying, to our knowledge no research has been conducted on the automatic recognition and classification of physical bullying behaviors. Such recognition and classification, especially if done in real time, can be useful for reporting and intervening to prevent negative effects. An example of an intervention could be a physical robot detecting a child's bullying behavior and advising the child on what inappropriate behavior the child engaged in and why bullying is wrong.

In the computer vision community, human action recognition is typically studied on datasets of behaviors acted by adults performing in front of a camera in a structured fashion [29]. For example, specific behaviors may be acted in sequence in the same order, with every actor performing the target behaviors multiple times. However, our goal was to acquire children's natural bullying behavior data. Nomura et al. [15] found that children in a shopping mall in Tokyo had a tendency to show abusive, bullying behaviors towards a robot. Informed by that work, we used a robot as the target of non-contact bullying behaviors by children for data collection purposes. Toward that end, we developed a humanoid anti-bullying robot that responds to perceived bullying behaviors by explaining why such behaviors are inappropriate. Children engaged in interacting with the robot tested out a variety of bullying behaviors to see how the robot would respond. We recorded a significant number of such playful, mock-bullying instances by children (see Figure 1) for use as a training set for the classification algorithm.

Bullying can manifest in a variety of forms, including physical (kicking, hitting), verbal (name-calling, intimidation), social (gossip), and cyber [23]. According to the literature, punching, kicking, showing or waving a fist, and pointing are prevalent in aggressive bullying and teasing, and are especially damaging [2], [4], [21]. Therefore, in the scope of this paper, we focus on the detection of those behaviors in one-on-one bullying. In group bullying, i.e., the cases where children bullied the robot in groups, we select the child closest to the robot as the bully.

We hypothesized that the bullying behaviors we examined lend themselves to effective classification using gesture recognition methods. We tested this hypothesis by training
and comparing the following models for gesture recognition on our dataset: Hidden Conditional Random Fields, Hidden Markov Models, and Dynamic Time Warping [25]. The results of our experiments support our hypothesis: even with a small number of children's free-form bullying behaviors as training and testing examples, Hidden Conditional Random Fields was able to discriminate among all four bullying gesture classes and a null class.

This paper contributes a novel process for obtaining natural bullying behavior data from children and comparatively demonstrates an effective gesture recognition approach for bullying behavior classification. With natural bullying data comes the challenge of accounting for a large null class, since bullying occurrences are much less common than non-bullying (null) events. We address this challenge by accounting for the naturally imbalanced, noisy, and limited data associated with children's misbehaviors, instead of relying on clean acted data. We focus on testing the hypothesis that gesture recognition is appropriate for bullying behavior classification. The full dynamics of bullying were not modeled, as the focus was on determining the effectiveness of gesture recognition as a prerequisite. Toward that goal, we show that gesture recognition methods that have traditionally been tested on data from adults can also work on child data and on the specific task of bullying detection. We describe our approach to obtaining natural behavior data on children's bullying behaviors using a humanoid robot as a bullying target and compare the performance of three validated classification models on those bullying behavior data.

2. Related Work

We briefly review existing literature relevant to physical bullying detection and behavior classification in child-robot interactions.

2.1. Bullying Detection

While little research has been conducted on the automatic detection of physical bullying behaviors specifically, related topics, such as human aggression and violence detection, have been explored. As a result, detection methods have been applied to a variety of domains, including the detection of aggressive behaviors by the elderly [5], violent human behavior in crowds [10], fight scenes in sports videos [14], violence in movies [1], [6], and physical bullying role-played by adults [26], [27].

Research in classifying aggression or violence has primarily focused on feature representations of RGB videos. For example, Chen et al. [5] studied binary motion descriptors in their classification of aggression in the elderly, Nievas et al. [14] used motion scale-invariant feature transform and space-time interest point features in classifying hockey fights, Hassner et al. [10] used flow vector descriptors in their classification of crowd violence, and Ye et al. [26], [27] used acceleration and gyroscope data for role-played bullying detection.

Classification of aggressive behavior has also been studied on depth camera (RGB-D) videos in various forms. As surveyed by Zhang et al. [29], many RGB-D datasets contain one or more aggressive behaviors, such as punching, kicking, pushing, and throwing. Furthermore, many of these datasets are widely used in action recognition experiments with novel feature representations or computational models; however, none of these datasets consist solely of aggressive behaviors, and none of them involve child actors.

In the context of our study, bullying detection is the detection of a subset of children's physical behaviors that are described as bullying in the social science literature. Specifically, studies have found that intentional physical or emotional abuse, such as hitting, kicking, and pointing, directed toward a victim by a person of greater power or strength is considered bullying [4], [16], [17], [19], [21]. These gestures can be observed from RGB-D data in the form of skeletons, making skeletal data the basis for our detection method.

2.2. Classification of Behaviors in Child-Robot Interactions

Limited research to date has been conducted on the data collection and automatic classification of children's behaviors in child-robot interactions. We describe several notable studies in this domain.

Leite et al. [12] studied and classified the nonverbal behaviors that children show when they disengage from social interactions with robots. The setting of that study is similar to ours, but we classify nonverbal behaviors when children bully a robot. In that study, data used to model disengagement were collected from child-robot interaction studies primarily in the form of videos, which were hand-annotated and processed with a facial tracking algorithm to extract features of children's behaviors. Support Vector Machines were used to classify and rank the most discriminative features of disengagement.

Strohkorb et al. [22] recorded a group of children interacting with robots through a tablet device to identify dominant behavior of one child over the others. Video data of the child-robot interactions were recorded and hand-annotated for features including gaze, utterances, and gestures, then used to model social dominance. Logistic Regression and Support Vector Machines were used for classification. The appearance of the robots was designed to engage the children in the child-robot interactions, similar to our choice of using a humanoid robot.

In contrast to those works, we study the recognition of children's bullying behaviors by taking into account the temporal component of human movement. Hence, we apply sequence classifiers to our task of detecting children's bullying behaviors.

Figure 2. Experimental setup: the robot responds to detected undesirable behaviors by explaining why they are inappropriate. The red tape marks the boundary children were not allowed to cross, to prevent them from touching and possibly harming the robot given the bullying context.

3. Methodology

3.1. Experimental Setup

To enable our data collection of children's bullying behaviors, we endowed a humanoid robot with the ability to respond to bullying poses by children. The robot was programmed to respond to poses that appeared to be preparing for hitting, kicking, and shoving, as well as the poses of showing a fist and sticking out the tongue. We deployed the robot at our University Robotics Open House event attended by children aged 6 to 18. All children participated in the Open House with parental consent provided to their schools, and many of the children interacted with the robot over a four-hour data collection period. A demonstrator showed the audience of children the poses that the robot was programmed to detect, and then the children, either individually or in a group, freely interacted with the robot from one meter away (Figures 1, 2). Child supervisors and the demonstrator were always present, and participation in the activity was voluntary and could be terminated at any point.

The robot's role in the interaction was to explain why certain behaviors are inappropriate. For example, when the robot detected someone pointing at it, it responded by saying "Stop it! Please don't point at people because they won't know why you are pointing at them." Likewise, when the robot detected someone showing their fist, it responded "Stop it! You are showing your fist. Please don't do that at people because it signals an intention to hurt them." Similar responses were made for every gesture the robot was programmed to detect. These capabilities naturally inspired children to attempt various bullying actions in order to elicit the robot's response. In this way, the robot was a target for children's real or mock bullying behaviors, and we were able to record those natural behaviors from the robot's perspective.
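For concreteness, this pose-to-response behavior can be pictured as a lookup from a detected pose label to a spoken explanation. The sketch below is illustrative only and is not the deployed robot code; only the two messages quoted above are taken from the text, and the label names and the say() text-to-speech helper are assumptions.

```python
# Illustrative sketch, not the deployed robot code. Only the two messages
# quoted in the paper are verbatim; the label names and the say() helper
# are hypothetical.
RESPONSES = {
    "pointing": ("Stop it! Please don't point at people because they won't "
                 "know why you are pointing at them."),
    "showing_fist": ("Stop it! You are showing your fist. Please don't do "
                     "that at people because it signals an intention to hurt them."),
    # hitting, kicking, shoving, and sticking out the tongue were handled
    # analogously, each with its own explanation (wording not reproduced here).
}

def respond_to_pose(pose_label, say):
    """Speak the explanation associated with a detected pose, if one exists."""
    message = RESPONSES.get(pose_label)
    if message is not None:
        say(message)

# Example: respond_to_pose("pointing", print)
```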

We used Bandit, an adolescent-sized humanoid robot torso mounted atop a Pioneer P3-DX mobile robot base. We collected data on children's behaviors using a Kinect One (http://www.xbox.com/en-US/xbox-one/accessories/kinect) mounted on top of the robot (which was 1.12 meters tall) at a height of 1.35 meters and angled downwards at 13.8 degrees. This setup allowed the Kinect to capture children and adolescents up to 185cm (6'1") tall. The system recorded skeletal features, depth data, and body segmentation information while simultaneously detecting poses in real time from tracked skeletons. At any point in time, we limited tracking and data collection to one child by only processing the closest skeleton in a 60-degree horizontal field of view of the Kinect sensor.

The real-time pose detector we used for data collection was heuristic-based. It normalized Kinect skeleton sizes, selected a set of representative skeletal joints for a pose, computed z-scores [8] of those joint positions while the person held the pose, and compared the average of those features against a manually set threshold. The threshold was based on estimating when the same pose is shown again by another person. This detector is not practical for general, reliable bullying detection because it captures neither the sequential nature of gestures nor the subtleties of children's bullying, since it is not data-driven. In our testing and data collection, the heuristic-based pose detector often failed, causing us to capture fewer poses overall. However, in spite of its limitations, the automatic approach was sufficiently effective for the data collection process, allowing us to capture a natural behavior dataset that was used for bullying gesture recognition training and evaluation.
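A minimal sketch of this kind of z-score heuristic follows. It is our reading of the description above rather than the authors' implementation; the joint selection, reference statistics, and threshold value are assumptions.

```python
import numpy as np

def detect_pose(skeleton, pose_joints, ref_mean, ref_std, threshold=1.0):
    """Heuristic pose test: average absolute z-score of selected joints
    against statistics gathered from people holding the target pose.

    skeleton:    dict of joint name -> np.array([x, y, z]), already
                 normalized for skeleton size.
    pose_joints: representative joints for this pose, e.g. ["HandRight", "ElbowRight"].
    ref_mean, ref_std: per-joint mean and std of the (normalized) positions
                 observed while the pose was held.
    threshold:   manually tuned cutoff (an assumed value here).
    """
    z_scores = [
        np.abs((skeleton[j] - ref_mean[j]) / (ref_std[j] + 1e-8)).mean()
        for j in pose_joints
    ]
    # The pose is "detected" when the averaged deviation stays under the threshold.
    return float(np.mean(z_scores)) < threshold
```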

3.2. Dataset

The dataset consists of Kinect One skeletal data of 49 boys and 17 girls performing bullying gestures. No identifiable data of these children were published or shared. Boys' heights ranged from 0.77m to 1.85m, with an average of 1.30m and a standard deviation of 0.17m. Girls' heights ranged from 1.04m to 1.45m, with an average of 1.26m and a standard deviation of 0.11m.

The bullying gestures collected in our dataset consist of hitting, pointing, kicking, showing a fist, and null classes shown by children in front of the demo robot. We only processed gestures that did not make physical contact with the robot or the Kinect and that were demonstrated using the child's right hand. A hitting gesture is represented by the onset and apex of a swipe or punch performed in front of the body. Likewise, pointing and kicking gestures are represented by their onset and apex, demonstrated in front of the body. Showing a fist is represented by raising the hand and displaying the fist at the robot as if preparing to punch it. Finally, the null class represents other behavioral sequences the children performed. These behaviors were diverse and included walking, leaning forward, hand waving, and standing idle.

There were a total of 91 pointing, 54 hitting, 40 showing a fist, 25 kicking, and 302 null gestures. The average sequence lengths of pointing, hitting, showing a fist, kicking, and null are 33.5, 9.2, 46.8, 8.8, and 32.4 frames, respectively, at 15 frames per second. (Because our demo robot and Kinect One configuration used an underpowered mini-PC, the Kinect skeletal data were not captured at 30Hz; to handle irregularity in the data time series, we sampled the data at 15Hz.) While every child showed a bullying gesture, not all children showed every gesture. Out of the 66 total children, 48 showed pointing, 24 showed hitting, 23 showed their fist, 16 showed kicking, and 60 (nearly everyone) showed the null sequence. Furthermore, gesture repetition rates varied greatly among children. For example, 25 children showed pointing only once, 13 children showed pointing 2 times, 5 children showed pointing 3 times, and so on. The repetition rates for all gesture classes can be seen in Table 1.

TABLE 1. The number of children that showed a gesture a specific number of times in our dataset.

Gesture class   0 times  1 time  2 times  3 times  4 times  5 times  6 times  7+ times
Pointing           18      25      13       5        1        3        1        0
Hitting            42      11       6       4        0        2        0        1
Showing Fist       43      16       3       1        2        0        0        1
Kicking            50      10       3       3        0        0        0        0
Null                6      11      17       1        3        7        7       14

3.3. Video Annotation

Because all recorded videos are of a protected participant class (children), we opted to annotate the videos ourselves. The lead author was the only data coder; all annotations were done using the Elan annotation software (https://tla.mpi.nl/tools/tla-tools/elan/). To avoid biasing the annotations, we set strict objective guidelines on the start (onset) and end (offset) times of each behavior: the onset is defined as the moment a child begins a specific behavior and the offset is when s/he completes that behavior. Each video consisted of one or several unique behaviors of various durations; we manually analyzed each video to determine the onset and offset of our restricted set of bullying behaviors. All remaining video segments, including false positive behaviors, were labeled as the null class; their high number is the reason for our dataset imbalance (Table 1).
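To make the data preparation concrete, the sketch below shows one way the onset/offset annotations can be turned into labeled training sequences. The tuple-based annotation format and variable names are assumptions, not the Elan export format or the authors' code.

```python
FPS = 15  # the skeletal data were sampled at 15Hz

def slice_segments(frames, annotations):
    """frames: array-like of shape (T, D), one skeleton feature vector per frame.
    annotations: list of (onset_seconds, offset_seconds, label) tuples.
    Returns (sequence, label) pairs; unannotated spans are handled elsewhere
    as the null class."""
    segments = []
    for onset, offset, label in annotations:
        start, end = int(round(onset * FPS)), int(round(offset * FPS))
        if end > start:
            segments.append((frames[start:end], label))
    return segments
```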

4. Predicting Children's Bullying Gestures

4.1. Procedure

A number of steps were needed to classify children's bullying behaviors in our dataset. To prepare features, we used two approaches to processing the raw skeletal data. First, we computed normalized 3D positions of the raw skeleton by converting all pairwise distances between neighboring joints to unit length while preserving all joint angles, and by fixing the head joint at the origin. Second, we computed the velocities of all joints by subtracting the joint positions of every other raw skeleton in time and dividing by their time difference. Our use of normalized joint positions and velocities was inspired by the findings of Zanfir et al. on improving skeletal action recognition using simple descriptors [28].

A comparison of classifier performance for different feature representations can be seen in Figure 3. Our data analysis indicates that gesture classes such as kicking and pointing require features that capture both the stationary positions in pointing and the fast movements in kicking. Figure 3 shows that, by themselves, velocity features outperform stationary features for the classification of hitting and kicking, whereas solely using stationary features, in particular normalized positions, outperforms velocity features in classifying pointing and showing the fist. We achieve a balance between representing fast and stationary gestures by concatenating normalized joint position features and joint velocity features. This combined representation results in 150 features; the x, y, and z positions of all 25 joints were used in both the normalization and velocity computations. To reduce feature dimensionality, we applied Principal Component Analysis (PCA) to capture 90% of the variance of the original features in training sets and used the same principal components to transform test sets. The performance of the combined representation in gesture classification can be seen in the Norm+Vel bars in Figure 3 and in a comparison of classifiers that use this representation in Figure 4.

Figure 3. F1 score comparisons of feature representations for each gesture. "Positions" indicates that features are raw skeletal joint positions, "NormPos" that features are normalized positions, "Velocities" that features are joint velocities, and "Norm+Vel" that features are a combination of joint velocities and normalized joint positions. For all feature representations, PCA was applied to the training data to capture 90% of feature variance, and HCRF was used for classification. Error bars indicate standard deviations of F1 scores from cross-validation.
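The feature construction described above can be sketched roughly as follows. This is a simplified reconstruction under stated assumptions (a Kinect One skeleton stored as a (25, 3) array, a PARENT map giving each joint's neighbor toward the head, joints ordered so that parents precede children, and a fixed frame step dt); it is not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def normalize_skeleton(joints, parent):
    """Fix the head (index 0) at the origin and rescale every bone to unit
    length while keeping its direction, which preserves all joint angles.
    Assumes joints are ordered so that parent[j] < j."""
    out = np.zeros_like(joints)
    for j in range(1, joints.shape[0]):
        bone = joints[j] - joints[parent[j]]
        out[j] = out[parent[j]] + bone / (np.linalg.norm(bone) + 1e-8)
    return out

def sequence_features(skeletons, parent, dt):
    """skeletons: (T, 25, 3) raw joint positions at 15Hz.
    Returns a (T, 150) matrix of normalized positions and joint velocities."""
    T = skeletons.shape[0]
    norm = np.stack([normalize_skeleton(s, parent) for s in skeletons])
    vel = np.zeros_like(skeletons)
    # velocities from every-other-frame differences of the raw skeletons
    vel[2:] = (skeletons[2:] - skeletons[:-2]) / (2.0 * dt)
    return np.concatenate([norm.reshape(T, -1), vel.reshape(T, -1)], axis=1)

# PCA keeping 90% of the variance is fit on training frames only and then
# reused to transform test frames.
pca = PCA(n_components=0.90)
# pca.fit(np.vstack([sequence_features(s, PARENT, 1 / 15.0) for s in train_sequences]))
```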

Using the combined joint velocity and normalized joint position feature representation, we compared the classification performance of Dynamic Time Warping (DTW), Discrete Hidden Markov Models (DHMMs), Continuous Hidden Markov Models (CHMMs), and Hidden Conditional Random Fields (HCRFs) on our data. Among sequence classifiers, DTW was chosen because it is a simple baseline, Hidden Markov Models (HMMs) because they are standard generative models, and HCRF because it is a discriminative model. HCRFs and HMMs both employ hidden states for the classification of sequences, but an HCRF needs only one trained model for multi-class classification, whereas HMMs need a separate model to be trained per class [24]. In contrast, DTW attempts to find an alignment between an input time series and a reference time series by computing the distance between them [3]. HCRF experiments were conducted using the HCRF library (http://sourceforge.net/projects/hcrf/), and HMM and DTW experiments were conducted using the Gesture Recognition Toolkit [9]. The DHMM had four states, and the HCRF had 10 states with a window size of 1 and an L2 regularization of 10. These parameters are consistent with the model comparison experiments performed by Wang et al. [24].

Figure 4. F1 score comparisons of classifiers for each gesture. Error bars indicate standard deviations of F1 scores from cross-validation.
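As a concrete picture of the DTW baseline idea (assigning a sequence the label of its nearest reference under DTW distance), a bare-bones sketch is shown below. The experiments themselves used the Gesture Recognition Toolkit, so this is only a conceptual illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW between feature sequences a (n, d) and b (m, d),
    using Euclidean distance between frames."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify_by_dtw(sequence, references):
    """references: list of (reference_sequence, label) pairs from training data.
    Returns the label of the closest reference under DTW distance."""
    return min(references, key=lambda r: dtw_distance(sequence, r[0]))[1]
```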

With these classifiers, we performed person-independent cross-validation experiments. Since our dataset contains 66 children, we used leave-11-children-out cross-validation, where the gesture data from 11 children were retained as validation data for testing a model and the remaining gesture data from 55 children were used to train the model. To account for random seeds in our models, we repeated the training of each model five times. The six rounds of classification experiments in the cross-validation, combined with the repeated training of our models, result in 30 classification experiments per model.
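The protocol above (6 folds of 11 children, each model retrained 5 times, 30 runs per model) can be sketched as follows. The train_and_score helper is hypothetical and stands in for training one classifier on the training indices and scoring it on the held-out indices.

```python
import numpy as np

def leave_11_children_out(features, labels, child_ids, train_and_score,
                          n_repeats=5, seed=0):
    """Person-independent evaluation sketch: the 66 children are split into
    6 folds of 11; each fold's gestures are held out for testing while the
    other 55 children's gestures are used for training. Each model is
    retrained n_repeats times to account for random seeds (6 x 5 = 30 runs)."""
    children = np.unique(child_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(children)
    folds = np.array_split(children, 6)          # 6 folds of 11 children each
    child_ids = np.asarray(child_ids)

    scores = []
    for held_out in folds:
        test_mask = np.isin(child_ids, held_out)
        train_idx = np.flatnonzero(~test_mask)
        test_idx = np.flatnonzero(test_mask)
        for run in range(n_repeats):
            scores.append(train_and_score(features, labels, train_idx, test_idx, run))
    return float(np.mean(scores)), float(np.std(scores))
```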

4.2. Evaluation

The primary evaluation metric we used was the F1 score [20], in order to capture classifier performance on both precision and recall of data with imbalanced class labels. We computed an F1 score for each gesture class to determine a classifier's performance on specific gestures. To obtain a holistic performance score for a classifier across all gesture classes, we averaged the F1 scores over the gesture classes to produce a macro-averaged F1 score. We also generated confusion matrices to examine how classifiers confuse predicted and actual class labels, discussed below.
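A small, self-contained illustration of the metric (per-class F1 and its unweighted macro average) is shown below with made-up labels; in the experiments, the true and predicted labels come from the cross-validation runs.

```python
from sklearn.metrics import f1_score

CLASSES = ["pointing", "hitting", "showing_fist", "kicking", "null"]

# Toy labels purely to make the snippet runnable; they are not from the dataset.
y_true = ["pointing", "null", "kicking", "null", "hitting", "null", "showing_fist"]
y_pred = ["pointing", "null", "null",    "null", "pointing", "null", "showing_fist"]

per_class_f1 = f1_score(y_true, y_pred, labels=CLASSES, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, labels=CLASSES, average="macro", zero_division=0)
print(dict(zip(CLASSES, per_class_f1)))
print("macro-averaged F1:", macro_f1)
```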

4.3. Results and Discussion

Comparing F1 scores between classifiers for each gesture class reveals that HCRF obtains the highest scores at classifying children's bullying gestures (Figure 4). The per-class F1 scores obtained by HCRF are 0.79, 0.54, 0.64, 0.68, and 0.90 for pointing, hitting, showing the fist, kicking, and null, respectively. The superior performance of HCRF can also be seen in Table 2, which shows cross-validation statistics on macro-averaged F1 scores. In Table 2, HCRF achieves the highest average and the lowest standard deviation of F1 scores in cross-validation.

TABLE 2. Comparison of cross-validation statistics for macro-averaged F1 scores.

Classifier   F1 score average   F1 score std. dev.
DTW               0.359              0.0923
DHMM              0.469              0.0597
CHMM              0.496              0.105
HCRF              0.645              0.0495

The confusion matrix of each classifier is given in Figures 5-8. A number of critical misclassifications can be seen in the confusion matrices. DTW confuses showing the fist with hitting, and all of the gestures with null. CHMM confuses most gestures with pointing and null. The common confusion with the null class is likely caused by the dataset's skewed class distribution towards that class, which results from capturing natural behavior data. The HCRF is the only model to demonstrate the capability of discriminating all gesture classes in our dataset even in the presence of class imbalance, since the HCRF predicts true classes at the highest rates compared to other classes for all gestures (see Figure 8).

Figure 5. Confusion matrix for Dynamic Time Warping
Figure 6. Confusion matrix for Continuous HMM
Figure 7. Confusion matrix for Discrete HMM
Figure 8. Confusion matrix for HCRF

Contrary to common assumptions about simple models such as DTW and generative models such as CHMM and DHMM, such models are not necessarily better than discriminative models, such as HCRF, at generalizing from and modeling small datasets [11]. In the case of our dataset, the HCRF discriminates all gesture classes while the HMMs and DTW mostly tend to misclassify gestures. We suspect that HCRF outperforms the HMMs because an HMM presumes independence of observations given its latent variables, while an HCRF makes no such assumption [18]. In addition, we believe that HCRF outperforms DTW because HCRF models hidden temporal dependency structure in its latent variables, whereas DTW does not have this capability.

The results of HCRF classification are not without problems. For example, the predicted label is often null when the actual label is another gesture. Showing the fist and hitting gestures are also often confused with pointing. The former problem may be attributed to the imbalanced dataset, and the latter may be due to the feature representations of hitting and pointing. There also seems to be a classification bias favoring long gesture sequences and classes with more training data, which may explain why pointing and null classification are the best not only for the HCRF, but also for CHMM and DTW. Despite these potential problems, HCRF still outperforms the other models, and its classification performance may improve with more training data and a more balanced training set.
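Confusion matrices like those in Figures 5-8 can be tabulated from predicted and actual labels along these lines (whether the paper pools predictions over folds is not stated; toy labels are used here):

```python
from sklearn.metrics import confusion_matrix

CLASSES = ["pointing", "hitting", "showing_fist", "kicking", "null"]

# Toy labels only; in practice y_true/y_pred come from the cross-validation runs.
y_true = ["pointing", "null", "kicking", "null", "hitting", "null"]
y_pred = ["pointing", "null", "null",    "null", "pointing", "null"]

# cm[i, j] counts sequences whose actual class is CLASSES[i] and whose
# predicted class is CLASSES[j]; a heavy "null" column reflects the class
# imbalance discussed above.
cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
print(cm)
```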

5. Lessons from the Child-Robot Interaction


In this section, we share our observations on the robot interaction task from the perspectives of the interaction itself and the affective computing technologies we used. We then explain how these observations inform future research on better understanding and addressing bullying.

One of the most notable lessons we learned from the child-robot interaction was that children were very engaged in interacting with the robot, sometimes revisiting the demo later in the day to mock-bully the robot again.

We also noticed that the responses made by the robot, correct or otherwise, encouraged more mock bullying. The dynamics of group interactions also played a role: children were more likely to mock-bully the robot when in a group than when alone. Finally, we noticed that children almost always looked at the eyes of the robot when mock-bullying it, rather than at the Kinect camera mounted above the robot.

The challenges associated with developing and using affective technologies for bullying detection are manifold. A significant challenge we faced involved minimizing the false positive rate of our classifiers, since a robot that falsely detects children's bullying could lead to undeserved accusations of bullying. We know that, given sufficient data, classification performance can reach higher levels of accuracy; however, the approach will likely never yield perfect results. Given this fact, an important question to address is how to develop technologies for bullying detection that minimize false positives. Another challenge we encountered was detecting group bullying. Although our work does not address group bullying, Kinect-type vision systems typically do not perceive more than 10 people, so it is difficult to track many individuals in a stable manner, and it is also non-trivial to maintain sustained tracking of a specific person.

In addition to these challenges, there are many other problems to be addressed in vision-based bullying detection. One of the most classic problems is collecting data at scale, so that state-of-the-art machine learning models can also be applied to bullying detection. More specific to bullying is the detection of nuanced behaviors, such as distinguishing between a thumbs down and showing a fist, or detecting behind-the-back bullying. Another important area of study is identifying the bully, to enable the detection of repeated bullying as well as to aid in bullying interventions.

6. Conclusions and Future Work

The ability to classify bullying gestures from children is necessary for automatically monitoring and intervening in cases of physical bullying at schools, playgrounds, and homes. To train effective recognizers, realistic data are needed, but socially sensitive behaviors are difficult to capture naturally. We present a novel method of collecting data on children's bullying gestures using an anti-bullying robot, which children engage with by naturally acting out bullying gestures in front of it. Using the collected data, we perform experiments with different sequence classifiers to compare their performance in discriminating the gestures in the dataset. For per-class gesture recognition, we show that HCRF outperforms other models such as HMM and DTW for every gesture based on F1 scores in our dataset, which features natural child bullying behaviors and is highly imbalanced. Neither HMM nor DTW performs comparably to HCRF across all gesture classes, suggesting that future domain-agnostic gesture recognition experiments should use the HCRF as a baseline model. Furthermore, this work offers a first glimpse at the classifier performance achievable in the domain of detecting children's bullying gestures with natural, noisy, and limited data.

There are multiple lines of future work to explore in the gesture recognition of bullying behaviors. Since collecting data from children is challenging, it would be interesting to examine whether models trained on acted bullying gestures by adults can be used to correctly classify test gestures by children. Furthermore, it is worth exploring whether examples of children's bullying behaviors are enough to train sequence classifiers for real-time bullying recognition. Finally, the detection of synchronous bullying behaviors among groups of children is another natural extension of this work.

Acknowledgments

This material is supported by the National Science Foundation under award number IIS-1117279 and the U.S. Army Research Laboratory under contract number W911NF-14-D0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Government, and no official endorsement should be inferred.

References

[1] E. Acar, F. Hopfgartner, and S. Albayrak. Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues. In Proceedings of the 21st ACM International Conference on Multimedia, pages 717–720. ACM, 2013.

[2] M. A. Barnett, S. R. Burns, F. W. Sanborn, J. S. Bartel, and S. J. Wilds. Antisocial and prosocial teasing among children: Perceptions and individual differences. Social Development, 13(2):292–310, 2004.

[3] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, 1994.

[4] K. Bjorkqvist, K. M. Lagerspetz, and A. Kaukiainen. Do girls manipulate and boys fight? Developmental trends in regard to direct and indirect aggression. Aggressive Behavior, 18(2):117–127, 1992.

[5] D. Chen, H. Wactlar, M.-y. Chen, C. Gao, A. Bharucha, and A. Hauptmann. Recognition of aggressive human behavior using binary local motion descriptors. In 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5238–5241. IEEE, 2008.

[6] L.-H. Chen, H.-W. Hsu, L.-Y. Wang, and C.-W. Su. Violence detection in movies. In Computer Graphics, Imaging and Visualization (CGIV), 2011 Eighth International Conference on, pages 119–124. IEEE, 2011.

[7] W. E. Copeland, D. Wolke, A. Angold, and E. J. Costello. Adult psychiatric outcomes of bullying and being bullied by peers in childhood and adolescence. JAMA Psychiatry, 70(4):419–426, 2013.

[8] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[9] N. E. Gillian and J. A. Paradiso. The gesture recognition toolkit. Journal of Machine Learning Research, 15(1):3483–3487, 2014.


[10] T. Hassner, Y. Itcher, and O. Kliper-Gross. Violent flows: Real-time detection of violent crowd behavior. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–6. IEEE, 2012.

[11] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.

[12] I. Leite, M. McCoy, D. Ullman, N. Salomons, and B. Scassellati. Comparing models of disengagement in individual and group interactions. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 99–105. ACM, 2015.

[13] T. R. Nansel, M. Overpeck, R. S. Pilla, W. J. Ruan, B. Simons-Morton, and P. Scheidt. Bullying behaviors among US youth: Prevalence and association with psychosocial adjustment. JAMA, 285(16):2094–2100, 2001.

[14] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar. Violence detection in video using computer vision techniques. In International Conference on Computer Analysis of Images and Patterns, pages 332–339. Springer, 2011.

[15] T. Nomura, T. Uratani, T. Kanda, K. Matsumoto, H. Kidokoro, Y. Suehiro, and S. Yamada. Why do children abuse robots? In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, pages 63–64. ACM, 2015.

[16] D. Olweus. The Revised Olweus Bully/Victim Questionnaire. University of Bergen, Research Center for Health Promotion, 1996.

[17] D. Olweus. Bully/victim problems in school: Facts and intervention. European Journal of Psychology of Education, 12(4):495–510, 1997.

[18] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007.

[19] J. P. Shapiro, R. F. Baumeister, and J. W. Kessler. A three-component model of children's teasing: Aggression, humor, and ambiguity. Journal of Social and Clinical Psychology, 10(4):459–472, 1991.

[20] M. Sokolova and G. Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437, 2009.

[21] M. E. Solberg and D. Olweus. Prevalence estimation of school bullying with the Olweus Bully/Victim Questionnaire. Aggressive Behavior, 29(3):239–268, 2003.

[22] S. Strohkorb, I. Leite, N. Warren, and B. Scassellati. Classification of children's social dominance in group interactions with robots. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 227–234. ACM, 2015.

[23] J. Wang, R. J. Iannotti, and T. R. Nansel. School bullying among adolescents in the United States: Physical, verbal, relational, and cyber. Journal of Adolescent Health, 45(4):368–375, 2009.

[24] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1521–1527. IEEE, 2006.

[25] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.

[26] L. Ye, H. Ferdinando, T. Seppänen, and E. Alasaarela. Physical violence detection for preventing school bullying. Advances in Artificial Intelligence, 2014:5, 2014.

[27] L. Ye, H. Ferdinando, T. Seppänen, T. Huuki, and E. Alasaarela. An instance-based physical violence detection algorithm for school bullying prevention. In 2015 International Wireless Communications and Mobile Computing Conference (IWCMC), pages 1384–1388. IEEE, 2015.

[28] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2752–2759, 2013.

[29] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang. RGB-D-based action recognition datasets: A survey. Pattern Recognition, 60:86–105, 2016.
