Entertainment Robot: Learning from Observation Paradigm for Humanoid Robot Dancing

Shunsuke Kudoh, The University of Tokyo, Japan ([email protected])
Takaaki Shiratori, The University of Tokyo, Japan ([email protected])
Shin’ichiro Nakaoka, National Institute of Advanced Industrial Science and Technology (AIST), Japan ([email protected])
Atsushi Nakazawa, Osaka University, Japan ([email protected])
Fumio Kanehiro, National Institute of Advanced Industrial Science and Technology (AIST), Japan ([email protected])
Katsushi Ikeuchi, The University of Tokyo, Japan ([email protected])

Abstract— This paper describes how to enable a humanoid robot to imitate human dance. Generating dance motions with a robotics-oriented approach is a challenging topic due to the dynamic and kinematic differences between humans and robots. In order to overcome these differences, we have been developing a paradigm, Learning from Observation (LFO), in which a robot observes human actions, recognizes what the human is doing, and maps the recognized actions to robot actions in order to mimic them. We extend this paradigm to enable a robot to perform classic Japanese folk dances. Our system recognizes human actions through abstract task models and then maps the recognized results to robot motions. Through this indirect mapping, we can overcome the physical differences between human and robot. Since the legs and the upper body serve different purposes when performing a dance, support by the legs and dance representation by the upper body, our design strategies for the task models reflect these differences. We use a top-down analytic approach for the leg task models and a bottom-up generative approach for the keypose task models. Human motions are recognized through these task models separately, and robot motions are also generated separately through them. We then concatenate the separately generated motions and adjust them according to the dynamic body balance of the robot. Finally, we succeed in having a humanoid robot dance with an expert dancer at the original music tempo.

I. INTRODUCTION

Recently, research on humanoid robots has progressed dramatically, and several application areas for humanoid robots have emerged. Among them, one of the most promising and unique is the use of humanoid robots for entertainment, in particular for dancing, exploiting the similarity in shape between the robot and a human dancer. Dance motions also contain various challenging actions, which require long training and practice periods even for human dancers, and which motivate the advancement of robotics technologies.

Making a robot dance is a challenging and emerging research area. Various computer graphics (CG) theories have been proposed to make avatars and characters dance. However, difficulties are encountered in applying those graphics-oriented theories to dancing robots due to factors such as the physical mass of

humanoid robots, friction between the floor and the robot's feet, and limitations on motor power. The robotics community has also accumulated research results on legged locomotion. However, few researchers have attempted actions more complicated than simple locomotion. Thus, developing systems that make humanoid robots dance should open up a new area that contributes to both the robotics and graphics communities.

One of the issues to be solved for achieving dancing robots is how to program such complicated actions on humanoid robots in a robot programming language. One simple solution is to program the robots manually; in fact, some companies do program their humanoid robots by hand. However, this requires long and tedious programming time and results in inflexibility in generating the desired dances.

We have been developing a paradigm, Learning from Observation (LFO), in which a robot observes human actions, recognizes what the human is doing, and maps the recognized actions to robot actions in order to mimic them. As shown in the attached video, direct mapping from human joint angles to robot joint angles does not work well because of differences in weight balance and in the lengths of arms and legs. The LFO paradigm prepares task models with which the robot recognizes what the human is doing; each recognized task model then generates appropriate robot motions to mimic the human actions. This indirect mapping can overcome the balance and dimensional differences. LFO has already been applied to various hand-eye operations such as parts assembly and knot tying [1], [2]. In this paper, we extend this paradigm to dancing robots.

We have chosen Japanese folk dance as the goal for our dancing robots to perform. Japanese folk dance is highly structured, so there is a good possibility that we can define clear task models for it. It is also true that some Japanese folk dances disappear over time due to the lack of successors who will continue to perform them. Once a robot learns such folk dances, we can preserve


Fig. 1: A humanoid robot reproducing dance performance based on the LFO paradigm.

them forever in the performances of humanoid robots.

We employed different strategies to apply LFO to leg and upper-body motions, because these motions serve two different purposes: the leg motions stably support the robot body, while the upper-body motions express the dance patterns. Thus, we needed different strategies for designing the leg and upper-body task models, and we generate the leg and upper-body actions separately. In the final stage, we concatenate and adjust the separately generated motions.

This paper is organized as follows. Section II reviews related work. Section III briefly explains our LFO paradigm and gives an overview of the system structure. Section IV and Section V explain how to generate leg and upper-body motions, respectively. Section VI describes the adjustment of the whole-body motion composed from the separately generated upper-body and leg motions. Section VII presents the dance demonstration by a humanoid robot generated with these techniques.

II. RELATED WORK

In the robotics field, many researchers have developed methods to adapt human motion to a humanoid robot. Riley et al. [3] produced dancing motion on a humanoid robot by converting human motion data obtained with a motion-capture system into joint trajectories of the robot. For the same purpose, Pollard et al. [4] proposed a method for constraining given joint trajectories within the mechanical limitations of the joints. However, these methods are insufficient for satisfying dynamic body balance because they basically focus on the constraints of each individual joint; in fact, the pelvis of their robot was fixed in space. For biped humanoid robots, Tamiya et al. [5] proposed a method that enables a robot to follow given motion trajectories while keeping its balance, but the method can only deal with motions in which the robot stands on one leg. Kagami et al. [6] extended the method so that it allows changes of the supporting leg. Yamane and Nakamura [7] proposed a dynamics filter, which converts a physically inconsistent motion into a consistent motion for a given body. Their dynamics filter makes a given motion follow the equations of motion, but the output motion may not keep the features

of the original motion. In these conventional methods, the adaptation basically consists of a single process that modifies given motion trajectories. In contrast, our framework consists of two processes: recognition is performed first, and then the motion for a humanoid robot is reconstructed. In this way, the problem of adapting motions is replaced with the problem of generating motions. With regard to dance performance, Kuroki et al. [8] enabled an actual biped humanoid to stably perform dance motions that include dynamic-style steps. However, this achievement differs from our goal, because the motions of their robot were created manually with their motion-editing tool [9].

The adaptation of human motion has also been studied actively in the computer animation community. Many researchers have developed methods to edit motion-capture data [10], [11], [12], [13], to seamlessly blend [14], [15] or connect data [16], [17], or to modify data according to kinematic constraints [4], [18]. Basically, it is not necessary to consider dynamic constraints, such as balance, in computer animation; however, some researchers do consider such constraints in order to create more realistic animation [19], [20], [21], [22]. In these studies, the motion sequence is synthesized based only on features of the motion itself. In contrast, we focus not only on motion aspects but also on several environmental and perceptual aspects, such as the dance music and the contact state between the foot and the floor, and we extract the meaning of motion as task models based on these aspects in order to generate the robot's motion.

Other researchers have also considered such aspects for synthesizing character animation. Peters et al. [23] proposed a method of human animation synthesis based on visual attention. Sakuma et al. [24] considered a psychological model for human crowd simulation in which neighboring computer graphics characters impose mental stress on each other. Stone et al. [25] proposed a method to synthesize speech performances that takes input sound signals into account. Kim et al. [26] proposed a rhythmic motion synthesis method using the results of motion rhythm analysis. Alankus et al. [27] and Lee et al. [28] also proposed methods to synthesize dance motion by considering the rhythm of the input music. However,

these approaches are essentially similar to methods that consider only features of motion, because they use the perceptual factor only to assist their motion-based method. They do not use the features of motion to extract its meaning, which is important in synthesizing expressive motion, as we do in our framework.

III. OVERVIEW

A. LFO Paradigm

LFO enables a robot to acquire the knowledge of what to do and how to do it from observation of a human demonstration. As shown in Figure 1, LFO generates robot actions through the following three steps:

1) The human demonstrates the actions to be imitated in front of the robot.
2) The robot recognizes the demonstrated actions based on pre-defined abstract task models and constructs a series of instantiated task models.
3) The robot converts those instantiated task models into physical robot actions.

Here, the abstract task models are pre-designed in a top-down manner using knowledge of the action domain, such as assembly actions or folk dance actions. In general, performing the same action does not require mimicking the observed motion in its entirety. It is difficult, if not impossible, to repeat the exact trajectories to be mimicked, because each person has different body dimensions. Instead, the characteristic or important features of the actions are extracted, and only those features are reproduced. That is, each action consists of essential and nonessential parts, and LFO introduces abstract task models to represent the essential parts. The merit of using abstract task models is that they enable the same actions to be performed by robots of different dimensions: since the abstract task models are common among different robots, we only need to prepare mapping routines from each abstract task model to each individual robot.

Each abstract task model consists of a task describing what to do and skill parameters explaining how to do it. From an input image, one abstract task model is first chosen as the one representing the current action. After this task recognition, the skill parameters that characterize the action are obtained from the input data.

B. Flowchart of Motion Generation

Figure 2 shows an overview of the system. First, the performance of a dancer is recorded using a motion-capture system. Next, the captured data are converted into abstract task models based on the LFO paradigm. Finally, the robot's motion, that is, the joint angle trajectories, is reconstructed from the task models. Our method handles upper-body and leg motions separately using different kinds of task models.
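To make this structure concrete, the following is a minimal Python sketch of how an abstract task model, separating what to do from how to do it, and the recognition step might be organized. The class and method names (TaskModel, matches, extract_skill_params, generate_robot_motion) are illustrative assumptions of ours, not interfaces from the actual system.

```python
import copy
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence

class TaskModel(ABC):
    """Abstract task model: a task type (what to do) plus skill parameters (how to do it)."""

    def __init__(self) -> None:
        self.skill_params: Dict[str, float] = {}  # filled in during recognition

    @abstractmethod
    def matches(self, segment: Sequence[dict]) -> bool:
        """Return True if this task model explains the observed segment."""

    @abstractmethod
    def extract_skill_params(self, segment: Sequence[dict]) -> None:
        """Fill self.skill_params from the observed segment (how to do it)."""

    @abstractmethod
    def generate_robot_motion(self) -> List[dict]:
        """Map the instantiated model to joint-angle trajectories for a specific robot."""

def recognize(segments: List[Sequence[dict]], models: List[TaskModel]) -> List[TaskModel]:
    """LFO step 2: for each observed segment, choose the abstract task model
    that explains it and instantiate a copy with the extracted skill parameters."""
    instantiated = []
    for seg in segments:
        model = copy.deepcopy(next(m for m in models if m.matches(seg)))
        model.extract_skill_params(seg)
        instantiated.append(model)
    return instantiated
```

In this view, a recognized dance is simply a time-ordered list of such instantiated task models, which the robot-specific mapping routines then turn into joint trajectories.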

Leg motion has to stably support the whole body. We designed the leg task models by considering foot-to-floor contact conditions. Each leg task model has a template trajectory for stable locomotion, to be modified according to the observation, along with the skill parameters required to complete the action. For leg motion, a continuous foot motion is segmented and recognized using the pre-defined abstract task models, and the skill parameters are obtained from the motion-capture data. Using the obtained skill parameters, the pre-defined foot trajectory is modified, and inverse kinematics provides all the joint angles of the robot's leg.

The upper body can move freely, without any constraints, to represent the characteristic features of the folk dance. We define such characteristics, which we refer to as keyposes, from the music beats and the brief pauses of the dancer in a motion sequence. The upper-body motion of the robot follows these keyposes exactly and achieves a smooth transition from one keypose to the next; the transitions are generated by a filter that takes motor and joint limits into account. By adjusting the leg and upper-body motions together, we produce the robot's entire motion.

IV. GENERATING LEG MOTION

In this section, we describe the method for generating leg motion. Generating leg motion for a humanoid robot involves major challenges because of the physical differences between a robot and a human dancer. The center of mass (CM) of a humanoid robot is located at a higher position in the body than that of a human dancer; the support area of the foot is smaller; and the foot-ground contact is less stable, because the feet of a humanoid robot are much harder and have fewer degrees of freedom than those of a human. For these reasons, the physical constraints for keeping the robot's balance become very strict, and generating leg motion directly from captured human motion is not practical. However, we found that we can overcome these physical challenges using the LFO paradigm. First, we design leg task models with skill parameters and template trajectories that guarantee the stability of the robot. Then, from the observed motion sequence of a human dancer, the system recognizes leg task models, obtains their skill parameters, and modifies the predetermined trajectories using those skill parameters.

A. Leg Task Model

We define the four tasks shown in Figure 3 by considering the contact state between the feet of a humanoid robot and the floor, following the top-down analytic approach. Two-foot contact is represented by the STAND and SQUAT tasks; the difference between STAND and SQUAT lies in the waist position. Here, the SQUAT task model involves a


Fig. 2: Overview of motion generation

Tasks (what to do): two-foot contact: STAND, SQUAT; one-foot contact: R-STEP, L-STEP.

Skill parameters (how to do):
- Common parameters: t0, beginning time; tf, finishing time.
- SQUAT, parameters of the mid-point: t1, time of the mid-point; d1, waist height distance.
- STEP, parameters of the swing foot: rf, Rf, horizontal position and orientation on Σs at tf.
- STEP, parameters of the waist: ψf, yaw orientation on Σs at tf.
- STEP, parameters of the mid-point (optional): t1, time of the mid-point; r1, R1, position and orientation of the swing foot on Σs at t1.

Fig. 3: Leg task model: each task model consists of a task and skill parameters. The former explains what to do, and the latter explains how to do it.

vertical movement of the waist, but does not involve horizontal waist movement; the horizontal movement is reserved for determining the dynamic balance of the body in Section VI-A. One-foot contact is represented by a STEP task: R-STEP represents one stepping motion by the right foot, and L-STEP represents one by the left foot. This task consists of a motion in which one foot, referred to as the swing foot, is lifted from the floor and lands again while the other foot, referred to as the support foot, maintains contact with the floor. Using STEP tasks, various leg motions including footfalls, side- or back-stepping, and kicks can be expressed.

Skill parameters describe the task-specific timing and spatial characteristics of each task motion, as shown in the bottom part of Figure 3. All tasks have a beginning time t0 and a finishing time tf. These values enable tasks to be arranged in a time sequence, and they are necessary for composing a choreography and producing the rhythm of a dance performance.
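As an illustration, the leg task models and their skill parameters from Figure 3 could be represented by simple data records, as in the following Python sketch. The field names follow Figure 3, but the representation (for example, reducing the orientations Rf and R1 to single yaw angles) is our own simplification, not the paper's data structure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec2 = Tuple[float, float]          # horizontal position on the floor
Vec3 = Tuple[float, float, float]   # 3-D position (used for the optional mid-point)

@dataclass
class StandTask:
    t0: float   # beginning time
    tf: float   # finishing time

@dataclass
class SquatTask:
    t0: float
    tf: float
    t1: float   # time of the mid-point (lowest waist position)
    d1: float   # waist height difference at the mid-point

@dataclass
class StepTask:
    t0: float
    tf: float
    side: str                     # "R" or "L": which foot is the swing foot
    r_f: Vec2                     # final horizontal swing-foot position on Sigma_s
    R_f: float                    # final swing-foot yaw on Sigma_s (simplified to one angle)
    psi_f: float                  # waist yaw on Sigma_s at tf
    mid: Optional[Tuple[float, Vec3, float]] = None  # optional mid-point (t1, r1, R1)
```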

The geometric skill parameters are represented in a relative coordinate system with respect to a basis coordinate system fixed at one foot. This allows us to locally modify tasks in a task sequence during the skill refinement described in Section VI-C. No geometric skill parameters represent positions at the beginning point, because those values are inherited from the end of the previous task.

A STAND task model has only the skill parameters describing its beginning, t0, and ending, tf. A SQUAT task model has the same timing parameters, plus skill parameters characterizing its mid-point: the lowest waist position with respect to the beginning position, d1, and its timing, t1.

All the position parameters of a STEP task are expressed with respect to Σs, the coordinate system of the support foot. Here, the support foot is assumed to remain still on the floor during a STEP task. The z-axis of Σs is aligned with the upward direction of the global coordinate system. The template trajectory of the swing foot is represented in this coordinate system, Σs. A STEP task model has a template trajectory and the parameters characterizing it, rf and Rf, the final configuration of the swing foot. The parameter rf does not include a z-axis element because we assume in this study that the floor is flat. A swing foot sometimes takes a characteristic pose during a step; this pose is represented by the mid-point of the swing-foot trajectory. In this case, the mid-point time, t1, and the position parameters r1 at t1 are also used. In contrast to rf, the parameter r1 includes a z-axis element. The STEP task model also has a skill parameter describing the waist rotation of the robot: ψf is the yaw angle (the orientation around the vertical axis) of the waist at tf. This parameter allows motions in which the waist orientation changes as the result of a step.

B. Recognizing Leg Tasks

A task sequence is recognized from the marker trajectories obtained by a motion-capture system. Temporal segments are


Fig. 5: Skill parameters for STEP and generated foot trajectory


Fig. 4: Detecting STEP tasks: A STEP task is detected from the speed of the swing foot. The center represents the graph of the swing foot speed, and the bottom represents the extracted task sequence. The figures at the top are the corresponding captured postures.

extracted and then recognized as tasks, and from each segment the values of the skill parameters corresponding to the task are obtained. In general, STEP tasks are recognized first from the motion sequence by considering the foot-floor contact relations, and then the remaining segments are classified as STAND or SQUAT depending on the waist position.

A STEP task is recognized by analyzing the trajectory of the swing foot. Let p(t) be the position of a foot marker at time t; the speed of the foot marker is given by v_p(t) = |ṗ(t)|. The middle row of Figure 4 shows an example of v_p(t). If v_p(t) has continuously positive values within an interval, that interval is a candidate for a STEP task, as shown in the bottom row of Figure 4. Further, in order to avoid erroneously small segments, we add a constraint on the moving distance. A segment that satisfies the following conditions is recognized as one STEP task:

\[
v_p(t) \ge v_{step} \quad (t_0 \le t \le t_f), \qquad \int_{t_0}^{t_f} v_p(t)\, dt \ge l_{step}, \tag{1}
\]

where v_step and l_step are threshold values for the velocity and the moving distance, respectively. The first condition stipulates that the speed of the swing foot be higher than a certain value, while the second guarantees that the swing foot moves more than a certain distance. Together, these two conditions eliminate noisy slipping motions that occur while a foot is acting as the support foot. By applying this analysis to the right and left feet, R-STEP and L-STEP tasks are recognized.
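A minimal, self-contained sketch of this STEP detection is given below, assuming uniformly sampled marker positions; the threshold values and function names are illustrative, not taken from the paper.

```python
import numpy as np

def detect_step_segments(p, dt, v_step=0.05, l_step=0.03):
    """Detect candidate STEP intervals from one foot-marker trajectory.

    p      : (N, 3) array of marker positions sampled every dt seconds
    v_step : speed threshold [m/s] (illustrative value)
    l_step : minimum travelled distance [m] (illustrative value)
    Returns a list of (i0, i1) frame-index pairs, one per recognized STEP.
    """
    v = np.linalg.norm(np.diff(p, axis=0), axis=1) / dt  # marker speed v_p(t)
    moving = v >= v_step                                  # condition 1 of Eq. (1)

    segments = []
    start = None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i
        elif not m and start is not None:
            # condition 2 of Eq. (1): integrated distance over the interval
            if np.sum(v[start:i]) * dt >= l_step:
                segments.append((start, i))
            start = None
    if start is not None and np.sum(v[start:]) * dt >= l_step:
        segments.append((start, len(moving)))
    return segments
```

Applying such a routine separately to the right- and left-foot markers yields the R-STEP and L-STEP intervals; the remaining frames are then classified as STAND or SQUAT, as described next.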

After the R-STEP and L-STEP intervals are removed from the input sequence, STAND and SQUAT tasks are recognized from the remaining segments. SQUAT tasks are recognized by analyzing the vertical trajectory of the waist to extract a motion of lowering and then rising again. Such a waist motion is detected as a segment from t0 to tf that satisfies

\[
v_h(t) < 0 \;\; (t_0 \le t < t_1), \qquad v_h(t) > 0 \;\; (t_1 < t \le t_f), \qquad \int_{t_0}^{t_f} |v_h(t)|\, dt \ge l_{squat}, \tag{2}
\]

where v_h(t) is the vertical velocity of the waist, t1 corresponds to the timing of the lowest waist position, and l_squat is a threshold on the vertical moving distance that eliminates slight vertical motions not regarded as a SQUAT.
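A corresponding sketch for SQUAT detection from the vertical waist trajectory, again with illustrative thresholds and our own function names:

```python
import numpy as np

def detect_squat_segments(waist_z, dt, l_squat=0.05, v_eps=1e-3):
    """Detect SQUAT intervals from the vertical waist trajectory (sketch of Eq. (2)).

    waist_z : (N,) array of waist heights sampled every dt seconds
    l_squat : minimum total vertical travel [m] (illustrative value)
    v_eps   : small dead band on the vertical velocity to suppress noise
    Returns a list of (i0, i_mid, i1) index triples: start, lowest point, end.
    """
    v_h = np.diff(waist_z) / dt  # vertical waist velocity
    segments = []
    start = None
    for i, v in enumerate(v_h):
        if v < -v_eps and start is None:
            start = i                      # waist begins to descend
        elif v > v_eps and start is not None:
            j = i                          # extend while the waist keeps rising
            while j < len(v_h) and v_h[j] > v_eps:
                j += 1
            travel = np.sum(np.abs(v_h[start:j])) * dt
            if travel >= l_squat:          # integral condition of Eq. (2)
                i_mid = start + int(np.argmin(waist_z[start:j + 1]))
                segments.append((start, i_mid, j))
            start = None
    return segments
```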

C. Determining Skill Parameters for Generating Leg Motion

Finally, we extract the skill parameters for each task model, generate the foot trajectory from the obtained skill parameters, and calculate the joint angles of the entire leg from the foot motion by solving inverse kinematics. In each task, the timing parameters t0 and tf are obtained from the beginning and ending times of the recognized segment. The values of the positional parameters, such as rf and Rf, are calculated from the positions of a few related markers at these times.

In a STAND task, all the joint angles of the leg are set from the posture at t0. A STAND task has only the two skill parameters t0 and tf, which are the beginning and ending times obtained from the motion sequence.

In a SQUAT task, the mid-point time t1 is obtained in the task recognition stage. The vertical waist positions at t0 and t1 are extracted from the waist markers, and the difference in waist height between these two instants is set as the skill parameter d1. The sequence of leg joint angles over time is then calculated by solving inverse kinematics for the waist positions, so that the trajectory of the waist satisfies these skill parameters.

For a STEP task, the end point and the mid-point of the swing foot are extracted as the skill parameters, and the foot trajectory is calculated from these parameters as shown in Figure 5. In practice, the mid-point parameters are extracted only when necessary. The position and orientation of the feet are obtained from several markers attached to the dancer's legs; which markers are used depends on the body-marker model of the motion-capture system. In our system, the position of each foot is obtained as the center of the two markers attached to the

toe and heel, while the orientation of each foot is obtained from the markers attached to the toe, heel, and knee. The base coordinate system Σs is determined from the position and orientation of the support foot with respect to the world coordinate system at time t1; the z-axis of Σs is aligned with the z-axis of the world coordinate system. The positional skill values r1, rf and R1, Rf are obtained by converting the positions and orientations at t1 and tf into those expressed in Σs. The waist orientation ψf is extracted from several markers attached to the waist in the same way.

In a STEP task, it must be determined whether the mid-point is valid or not. First, a model trajectory of the swing foot is generated by interpolation from the beginning point to the finishing point. If the distance between the model trajectory and the actual trajectory is larger than a certain threshold, the mid-point is appended to express that trajectory. Here we define the interpolating function fn, which passes through n (≥ 2) points whose times, values, and velocities are ti, yi, and ẏi, respectively; each segment between two adjacent points is expressed by a third-order polynomial. This function is written

\[
f_n\big[(t_1, y_1, \dot{y}_1), \ldots, (t_n, y_n, \dot{y}_n)\big](t). \tag{3}
\]

When ẏi is omitted, it is assumed to be 0. Using this function, the interpolated trajectory is generated as

\[
p'(t) = f_2\big[(t_0, p(t_0)), (t_f, p(t_f))\big](t),
\]

and the difference between the two trajectories is defined as d(t) = |p'(t) − p(t)|. The mid-point is determined to be valid at the time t1 for which

\[
d(t_1) = \max_{t_0 \le t \le t_f} d(t),
\]

provided that this maximum distance exceeds the threshold mentioned above.
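To illustrate the mid-point test, the following self-contained sketch uses cubic Hermite segments as a stand-in for the third-order polynomial interpolation fn described above; the threshold value and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hermite_segment(t, t0, t1, y0, y1, v0=0.0, v1=0.0):
    """Cubic Hermite interpolation on [t0, t1] with end values y0, y1 and velocities v0, v1."""
    h = t1 - t0
    s = (t - t0) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * y0 + h10 * h * v0 + h01 * y1 + h11 * h * v1

def needs_midpoint(times, p, d_thresh=0.02):
    """Decide whether a STEP needs an explicit mid-point (sketch of the d(t) test).

    times : (N,) sample times of the observed swing-foot trajectory
    p     : (N, 3) observed swing-foot positions
    Returns (True, t1) with the time of maximum deviation, or (False, None).
    """
    t0, tf = times[0], times[-1]
    # Model trajectory: f2 through the start and end points with zero end velocities.
    model = np.stack(
        [hermite_segment(times, t0, tf, p[0, k], p[-1, k]) for k in range(3)], axis=1
    )
    d = np.linalg.norm(model - p, axis=1)   # d(t) = |p'(t) - p(t)|
    i = int(np.argmax(d))
    return (True, times[i]) if d[i] > d_thresh else (False, None)
```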
