Palco: A Multisensor Realtime 3D Cartoon Production System

Luís Torrão∗, Sandro F. Queirós†, Pedro M. Teixeira‡, João L. Vilaça§, Nuno F. Rodrigues¶
Digital Games Research Center - Polytechnic Institute of Cávado and Ave, Barcelos, Portugal
HASLab/INESC TEC - University of Minho, Braga, Portugal
ICVS/3B's - PT Government Associate Laboratory, Braga/Guimarães, Portugal

Figure 1: Palco makes use of a Microsoft Kinect or Asus/Primesense WAVI Xtion sensor for simultaneous capture of both body motion and facial expression, introduces a cartoonification transform applied to the actor's captured movements to reproduce exaggerated animations in cartoon characters, and combines these factors into real-time production software.

Abstract

This paper presents Palco, a prototype system specifically designed for the production of 3D cartoon animations. The system addresses the specific problems of producing cartoon animations, where the main objective is not to reproduce realistic movements, but rather to animate cartoon characters with predefined and characteristic body movements and facial expressions. The techniques employed in Palco are simple and easy to use, not requiring any invasive or complicated motion capture system, as both the body motion and the facial expression of actors are captured simultaneously, using an infrared motion detection sensor, a regular camera and a pair of electronically instrumented gloves. The animation process is completely actor-driven, with the actor controlling the character's movements, gestures, facial expression and voice, all in realtime. The actor-controlled cartoonification of the captured facial and body motion is a key functionality of Palco, and one that makes it specifically suited for the production of cartoon animations.

CR Categories: I.3.6 [Computer Graphics]: Methodology and Techniques—Interaction techniques; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation; I.3.8 [Computer Graphics]: Applications;

Keywords: cost-saving, time-saving, realtime, 3D cartoon rendering, animation, television, kinect, primesense

∗ e-mail:[email protected] † e-mail:[email protected] ‡ e-mail:[email protected] § e-mail:[email protected] ¶ e-mail:[email protected]

1 Introduction

The rise of realtime digital animation has had a great impact on the media and gaming industries, becoming increasingly standard in the production of successful movies, games and television series. Answering this demand, several software solutions have emerged that integrate motion capture into the digital animation workflow, such as MotionBuilder or Messiah Studio. These systems have proven to meet the requirements of mainstream, strongly-budgeted movie and game productions. However, they require motion capture systems that are often expensive and intrusive, and that demand particular studio conditions and actor preparation. Furthermore, current systems demand significant manual effort to adapt the realistic captured movement to the predefined and characteristic body movements and facial expressions of cartoon characters. The photo-realistic 3D realtime digital animation industry has already demonstrated its great potential, even though there is still demand for better and more credible results. On the other hand, non-photo-realistic realtime independent cartoon productions still rely heavily on alternative solutions and, more surprisingly, there are currently no available solutions when it comes to 3D.

The authors became well aware of these limitations when challenged to produce a 3D version of the popular Portuguese daily cartoon show Cafe Central, from Radio Televisao Portuguesa (RTP), a Portuguese public television channel. The 2D production of Cafe Central is based on the "GO Realtime Animation" software solution [FlickerLab 2012], a system that delivers considerable cost and time savings due to its standardized and simplified 2D animation functionalities. In brief, this system can produce cartoon animations from previously loaded 2D characters (in a specific format), animate the mouths of characters according to the actors' speech, and switch between close and open camera angles. All animations are triggered by a gamepad, whose buttons are associated with particular, previously defined character animations. The production of an episode starts with loading each actor's speech into the system, which in turn produces all mouth animations on the corresponding cartoon characters. Afterwards, an animation editor uses the gamepad to set all character animations (from a previously defined set for each character), one at a time, according to the specific situation and lines of speech of each cartoon character. For a 6-minute episode, the entire production process takes about one day to complete. Although all animation is made of prefabricated smaller animations, the result is still captivating (it can be seen on the RTP website [RTP 2012]) and the series has experienced considerable success, mainly because of the themes treated, often based on events from the same day the episode airs.

Contributions. Our main contributions are the use of a Microsoft Kinect or Asus/Primesense WAVI Xtion sensor for simultaneous capture of both body motion and facial expression, the introduction of a cartoonification transform applied to the actor's captured movements to reproduce exaggerated animations in cartoon characters, and the combination of these factors into real-time production software.

Outline. The remainder of this paper is organised as follows. Section 2 describes the work related to this research; Section 3 presents the creation of the software prototype; Sections 4 and 5 present and discuss the results and limitations; and Section 6 addresses future work and conclusions.

2 Related Work

Following the same route towards improved efficiency in cartoon production, [Morishima et al. 2007] presented MoCaToon and AniFace, two well-focused solutions for data distillation and lip-syncing. Our goal here is rather different: with Palco we seek to reduce cartoon episode production time to almost the equivalent of real-time acting, and we do not target high-standard Anime, but rather daily-produced cartoon episodes. Some solutions based on Microsoft Kinect or Asus/Primesense WAVI Xtion realtime body tracking and character control/animation can be found, namely the iClone Mocap Device Plugin [Reallusion 2012], on the commercial side, and FAAST [Suma et al. 2011], from a more academic perspective. Both iClone and FAAST provide low-cost motion capture based on a skeleton that lacks head orientation. Palco addresses this problem, as the captured skeleton is enriched with head rotation data. Furthermore, both of these systems omit simultaneous facial tracking, a key aspect of Palco.

Figure 2: Cafe Central, broadcast daily by Radio Televisao Portuguesa (RTP), and the 3D model of the "Silva" character.

The challenge put to our team was to upgrade the series to 3D without increasing the production cost of each episode, while maintaining the main benefits of the previous system: in particular, keeping production time low (so that episodes can refer to recent events) and allowing exaggeration (cartoonification) of character expressions and movements. Moreover, the producers also wanted to improve the final quality of episodes by providing more animation options for each character, instead of the limited set of animations used in previous episodes. Since current animation systems alone do not answer these requirements, we embarked on the construction of Palco, a system designed for the rapid and cost-effective production of 3D cartoon series. Palco uses a Microsoft Kinect or Asus/Primesense WAVI Xtion sensor to capture both the body motion and the facial expression of actors, which are then used to animate cartoon characters. The animation process is completely actor-driven, with the actor controlling the character's movements, gestures, facial expression and voice, all in realtime. Actor movement capture is improved by a pair of electronically instrumented gloves, capable of capturing not just all finger and wrist movements but also the hands' relative position. The actor-controlled cartoonification of the captured facial and body motion is a key functionality of Palco, and one that makes it specifically suited for the production of independent cartoon animations.

In this matter, [Weise et al. 2011] explore the idiosyncrasies of using the Kinect sensor for facial tracking and achieve a notable realtime solution. Their work is based on the optimization of the sensor's low-quality depth data, mapping it to realistic facial expressions, something that works well at short distances but does not satisfy our need for facial tracking at a distance that includes both face and body in the same capture plane. To overcome this, we opted to take advantage of the simultaneous capture of depth and color image data provided by the Microsoft Kinect and Asus/Primesense WAVI Xtion. [Bleiweiss et al. 2010] observe that in most games the movements of the animated avatar are expected to be more expressive than the players' actual movements, and propose a model for blending a set of predefined animations with the Kinect's skeleton-tracking animation. We introduce dynamicity into that set, in the sense that the process is actor-driven; moreover, it is applied to both body and facial animation. Along the same line, [Wang et al. 2006] and [Ju and Lee 2008] propose, respectively, a cartoon animation filter that can take an arbitrary motion signal and make it more "animated", and a solution for more expressive facial gestures from motion capture. Our work is close to these constructions, but important differences stand: Palco lets the 3D artist decide how the cartoonification turns out, as it relies on pre-designed expressions (rather than prerecorded patterns [Ju and Lee 2008]) that are then applied dynamically at the actor's will, rather than in a generalized way [Wang et al. 2006].

3 Methods

The Palco system comprises four main modules: a body motion capture module, a facial expression recognition module, a hand and finger movement recognition module, and a cartoonification module. The first three modules are assembled into a unified tracker capable of sending all data via UDP to the Palco interface, which applies the cartoonification module to the received input. The Palco interface was built on top of the Unity3D game engine, which integrates reasonably well with Blender, the tool chosen for 3D content creation. The following sections introduce each Palco module, presenting the main problems addressed and the details of the technical solutions developed to overcome them.

Figure 3: Palco diagram overview.

3.1 3D Modelling

For the purpose of this research, a specific character and scenario were created, in a workflow that had its own drawing methodology and had to comply with certain rules in order to work harmoniously with Palco. Even though the character is finely modelled, the artistry of character design and modelling had to accommodate the technology and accept some limitations imposed by the system. The main idea of Palco is that characters should act like cartoon animation, without complex skeleton controls, realistic movements or human face recognition. The goal is not to achieve realistic results, but rather communicative expression and believable emotion in acting and posture in a 3D environment.

3.1.1 Modeling Specificities for Palco

The 3D model was created with a polygonal modeling technique and a limited number of polygons. Subsurface modeling, which allows a smoother surface, has to be applied natively by the modelling tool. The character model should be kept simple, with a cartoonish look. As in classical animation, lip-sync and facial expressions are controlled by shape keys rather than bone controllers, which allows classical animators to achieve interesting results. Combining several expressions, such as happiness, sadness and rage, using shape keys for eyebrow, cheek, mouth and chin animation, is useful for better cartoonification results. Finally, the character skeleton should match the Kinect skeleton reference as closely as possible, so that body acting can be more accurate.

3.2 Body Movement Tracking

The initial problem was to find an adequate solution for capturing body tracking data using widespread 3D sensor technology. In addition, it was our goal to keep Palco open to further developments, namely the simultaneous use of more than one sensor, so that tracking quality could be improved in the future. Choosing the right application programming interface (API) was a fundamental decision. There are several solutions available, but the one that best fit our project's needs was 2RealKinectWrapper, currently being developed by CADET, since we also wanted to preserve the project's openness to different devices, software development kits (SDKs) and drivers. This wrapper allows the use of both the Microsoft Kinect and the Asus/Primesense WAVI Xtion, retrieves data from more than one device simultaneously, and supports both the Microsoft Kinect SDK and OpenNI, making it the most adequate interface for our purposes. Given the chosen API, the next problem was to identify which SDK would best fit our tracking data needs. We experimented with the Kinect SDK first, but found that its then-current beta version did not support skeleton joint orientation, a feature necessary for the character to rotate in three-dimensional space. For this reason, we chose OpenNI for retrieving joint position and orientation. As OpenNI data comes with associated confidence values, this also made our tracker more reliable, allowing a threshold for joint data acceptance by Palco to be defined.
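To make the confidence-based acceptance concrete, the following sketch shows one way a per-joint threshold could be applied. It is a minimal illustration rather than Palco's actual code: the TrackedJoint structure, the 0.5 threshold and the keep-last-accepted-value policy are assumptions; OpenNI exposes analogous per-joint confidence values alongside joint positions and orientations.

```cpp
#include <array>
#include <cstddef>

// Minimal stand-ins for the per-joint data OpenNI-style trackers expose:
// a position, an orientation (quaternion) and a confidence in [0, 1].
struct Vec3 { float x, y, z; };
struct Quat { float w, x, y, z; };

struct TrackedJoint {
    Vec3  position{};
    Quat  orientation{1.0f, 0.0f, 0.0f, 0.0f};
    float confidence{0.0f};
};

constexpr std::size_t kNumJoints = 15;        // OpenNI full-body skeletons expose 15 joints
constexpr float kConfidenceThreshold = 0.5f;  // assumed acceptance threshold

// Keep the last accepted pose for each joint; low-confidence readings are ignored
// so the character does not snap to spurious positions.
void filterSkeleton(const std::array<TrackedJoint, kNumJoints>& raw,
                    std::array<TrackedJoint, kNumJoints>& accepted)
{
    for (std::size_t i = 0; i < kNumJoints; ++i) {
        if (raw[i].confidence >= kConfidenceThreshold) {
            accepted[i] = raw[i];             // trust this reading
        }
        // else: keep the previously accepted value for joint i
    }
}
```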

3.3 Facial Expressions Tracking

The second problem was to obtain facial tracking data in a way that could be easily combined with the motion tracking technique employed and that did not demand complicated calibration procedures. Our first choice was to follow the work of [Weise et al. 2011] to implement the facial module of our tracker, but the sensor data quality achieved at distances that include both face and body in the same capture plane was far too low. We also considered using separate sensors for facial and body tracking, but such an option would compromise the simplicity we seek: using a single sensor for both body and face tracking. The solution found was to take advantage of the different sensors that the Microsoft Kinect and the Asus/Primesense WAVI Xtion provide, using the depth sensor for body capture and the RGB camera for optical facial orientation and expression tracking. Another crucial decision was choosing the right API for this task. With the objective of keeping the project open-source, we opted for the [Saragih et al. 2011] tracker, as used by Kyle McDonald in his ofxFaceTracker experiment, which we adapted for the Microsoft Kinect and Asus/Primesense WAVI Xtion. The adaptation comprised matching a deformable mask over the sensor's color image and retrieving the head rotation and the 3D positions of the mask's facial points.
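The paper does not detail how the mask's 2D facial points are turned into 3D positions; a common approach with Kinect-class sensors, sketched below under that assumption, is to read each landmark's depth from the registered depth map and back-project it with the pinhole camera model. The intrinsic values and function names here are illustrative only.

```cpp
#include <cstdint>

// A tracked 2D facial landmark (pixel coordinates in the color image).
struct Landmark2D { float u, v; };
struct Point3D    { float x, y, z; };

// Assumed pinhole intrinsics for a 640x480 Kinect-class sensor; real values
// come from device calibration and are not given in the paper.
constexpr float kFx = 525.0f, kFy = 525.0f;   // focal lengths (pixels)
constexpr float kCx = 319.5f, kCy = 239.5f;   // principal point

// Back-project a landmark into camera space using the registered depth map
// (depth in millimetres, 0 meaning "no reading").
Point3D liftLandmark(const Landmark2D& lm, const uint16_t* depthMap,
                     int width, int height)
{
    const int px = static_cast<int>(lm.u);
    const int py = static_cast<int>(lm.v);
    if (px < 0 || py < 0 || px >= width || py >= height)
        return {0.0f, 0.0f, 0.0f};            // outside the image: no 3D point

    const float z = depthMap[py * width + px] * 0.001f;  // mm -> metres
    if (z <= 0.0f)
        return {0.0f, 0.0f, 0.0f};            // hole in the depth map

    return { (lm.u - kCx) * z / kFx,
             (lm.v - kCy) * z / kFy,
             z };
}
```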

3.4 Hand Expression with Gloves

In order to track hand and finger movements, we conceived a glove using an appropriate combination of sensors. Since it is fundamental to detect the hand's spatial orientation, a nine degrees-of-freedom (DOF) Inertial Measurement Unit (IMU) was used. The IMU integrates (a) a tri-axis accelerometer and tri-axis gyroscope (LSM330DL), which allow acceleration and angular velocity measurements in any direction in space, and (b) a magnetometer (HMC5883L), used to correct the gyroscope and accelerometer data and serving primarily as a reference for yaw measurement. An Attitude and Heading Reference System (AHRS) was developed, which also integrates RF (radio frequency) communication. By incorporating an STM32F103TBU7 microcontroller operating at 72 MHz, an on-board implementation of a Direction Cosine Matrix (DCM) algorithm fuses the accelerometer, gyroscope and magnetometer outputs and yields accurate information about spatial orientation. Even though each finger has at least 3 DOFs, fingers move partially together, so a good approximation can be achieved with fewer DOFs. Exploiting this fact, we used 10 flex sensors (FSL0095103ST), two for each finger, to assess the finger positions. The flex sensor allows the determination of a finger's opening angle through the conversion of a variable resistance to a voltage output, which in turn can be read by our microcontroller and interpreted. After the data correction methods are applied, the resulting values are transmitted to the computer by RF in a single frame, which encapsulates a set of commands specifically designed to allow easy data transmission to the computer as well as configuration of the board. Furthermore, an error detection method based on a checksum was used, which allows the integrity of the received frame to be confirmed. Our solution requires a configuration process before each virtual session. This calibration is intended for the flex sensors and requires two simple steps, the first with the hand open and the second with the hand completely closed, which together provide a relationship between the flex sensors' output and the finger movements.
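The two-step flex-sensor calibration described above can be illustrated with a simple linear mapping: one reading stored with the hand open, one with it closed, and subsequent readings normalised between the two. The structure, names and clamping policy below are assumptions for illustration; the actual processing runs in firmware on the STM32 board.

```cpp
#include <cstdint>

// Two-point calibration of one flex-sensor channel: an ADC reading is stored with
// the hand fully open, another with the hand fully closed, and later readings are
// mapped linearly onto a normalised bend value in [0, 1].
struct FlexChannel {
    uint16_t adcOpen   = 0;   // reading captured during the "hand open" step
    uint16_t adcClosed = 0;   // reading captured during the "hand closed" step
};

float normalisedBend(const FlexChannel& ch, uint16_t adcNow)
{
    // Avoid division by zero if calibration was skipped or failed.
    const int span = static_cast<int>(ch.adcClosed) - static_cast<int>(ch.adcOpen);
    if (span == 0) return 0.0f;

    float t = static_cast<float>(static_cast<int>(adcNow) - ch.adcOpen) / span;
    if (t < 0.0f) t = 0.0f;   // clamp readings outside the calibrated range
    if (t > 1.0f) t = 1.0f;
    return t;                 // 0 = finger straight, 1 = finger fully bent
}
```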

3.5 Assembled Tracker

Figure 4: The assembled tracker.

The body, facial and hand tracking modules were then assembled into a single program, generically called "the tracker". This piece of software is based on OpenFrameworks (C++) and its core functionality is to send all body, facial and hand data via UDP to Palco for processing. First, it combines the skeleton head position with the face tracker, cropping the mask-matching area to a small rectangle around that skeleton joint, which optimizes tracking time; the facial data is then positioned and scaled relative to that joint. The process is similar for the gloves: the wrist joint is used to calculate, position and scale the coordinates of the finger joints and the wrist rotation. In order to process all data smoothly, the assembled tracker was built with a three-part multi-threading strategy, keeping the different modules' processes independent from each other.
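As a rough illustration of the tracker-to-Palco link, the sketch below sends one combined data frame over UDP. The real tracker is an OpenFrameworks (C++) application and its frame layout is not specified in the paper, so the plain-text format, host and port used here are assumptions.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <string>

// Illustrative only: one frame is assumed to be a plain-text line carrying the
// combined body, face and hand data produced by the three tracking modules.
int main()
{
    const char* kPalcoHost = "127.0.0.1";   // assumed: Palco runs on the same machine
    const int   kPalcoPort = 9000;          // assumed port

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port   = htons(kPalcoPort);
    inet_pton(AF_INET, kPalcoHost, &dest.sin_addr);

    // In the real system this string would be filled from the body, face and
    // glove modules on every tracking update.
    std::string frame = "BODY 0.1 1.5 2.3 | FACE 0.0 0.1 0.0 | HANDS 0.2 0.8";

    sendto(sock, frame.data(), frame.size(), 0,
           reinterpret_cast<sockaddr*>(&dest), sizeof(dest));

    close(sock);
    return 0;
}
```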

3.6 Game Engine and Cartoonification

Figure 5: Data visualized in Palco.

Figure 6: Animation scene visualized in Palco.

Another problem to tackle in the construction of Palco concerns the application of the tracker's data to previously modelled 3D cartoon characters, in a way that makes their movements and expressions seem natural – from a cartoon perspective, of course. To this end, Unity3D was chosen to implement Palco's interface and cartoonification module, mainly because of its good graphics quality, collision detection and other game engine features – in particular those available for realtime animation.

The mapping between the tracking data and the character's 3D model was implemented as follows. First, all position data is scaled and applied to floating game objects (spheres). For the face and hands, the head and wrist rotations are applied to the corresponding game objects, and the positions captured by the assembled tracker are then applied to them as well. All game object positions and rotations are smoothed with time-based interpolation, in order to obtain a stable, though realtime-animated, basic structure to be mapped to the 3D model. Secondly, this structure is mapped to the model. For the body, vectors are calculated by subtracting sphere positions, and their rotation is applied to the corresponding model skeleton bones. The root bone also has its rotation (quaternion) applied directly from the assembled tracker reading, and its position determines the overall position of the entire model. For the face, expression vectors are calculated by subtracting the positions of the expression characteristic spheres. These vectors then have to be calibrated by measuring their rest-position magnitude, which is used as a switch for activating facial expressions: whenever a captured vector exceeds this value, the surplus is converted to a scale from 0% to 100%, and this percentage is applied to the model's pre-designed facial expressions (keyframes). In this conversion, a multiplication by a delta value occurs, which exaggerates the keyframe animation.

Palco thus maps the data given by the tracker to the character's body, hands and face using mainly vector and quaternion mathematics. The resulting motion is interesting by itself, but not satisfactory in terms of cartoon expression. For that reason, a dual logic of cartoonification was applied. On the one hand, a set of exaggerated animations was created; these animations are interpolated with the tracking data in real time. On the other, Unity's Mega-Fiers plugin was used to create cartoon-like effects that are dynamically triggered by certain actions, such as jumping (shrinking and stretching the character vertically) or frowning (spherifying the character's head and switching on the pre-designed exaggerated facial expressions).
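The sketch below, written in C++ for consistency with the tracker code (Palco's interface itself runs on Unity3D), illustrates the expression-activation and exaggeration logic described above. The delta and normalisation values are assumptions; the rest-magnitude thresholding and the blend between tracked and pre-designed animation follow the description in the text.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static float length(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

constexpr float kDelta = 1.8f;   // exaggeration factor (assumed value)

// Convert a captured expression vector into a keyframe weight in [0, 1].
// Only the amount by which the vector exceeds its calibrated rest magnitude
// activates the pre-designed expression; the delta factor exaggerates it.
float expressionWeight(const Vec3& captured, float restMagnitude, float maxSurplus)
{
    const float surplus = length(captured) - restMagnitude;
    if (surplus <= 0.0f) return 0.0f;            // below the activation switch

    const float t = (surplus / maxSurplus) * kDelta;  // normalise, then exaggerate
    return std::clamp(t, 0.0f, 1.0f);            // weight applied to the keyframe
}

// Blend a tracked pose value with a pre-designed exaggerated animation, the
// "dual logic" of cartoonification: w = 0 keeps pure tracking, w = 1 plays the
// exaggerated keyframe only.
float blendCartoon(float trackedValue, float exaggeratedValue, float w)
{
    return (1.0f - w) * trackedValue + w * exaggeratedValue;
}
```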

4 Discussion

The system's response time is very close to real-time, as shown in the companion video. During trial tests, the most relevant factor contributing to an increase in delay was the amount of interpolation/smoothing applied. Overall, a good correlation between real and virtual movements was observed, and a wide range of movements is possible with precise and accurate tracking. As an advantage, the Microsoft Kinect or Asus/Primesense WAVI Xtion, together with the developed gloves, constitute a low-cost solution to the body, facial and hand acquisition problem. Concerning errors in body tracking, as long as the Z-axis constraints discussed in Section 5 are respected, the system responds well. However, the rotation of the character might not work properly if a full twist is performed by the actor at great speed, something that is related both to OpenNI and to the smoothing/interpolation value chosen.
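The trade-off between smoothing and delay mentioned above can be illustrated with a simple time-based filter. The exponential form, parameter names and units below are assumptions (the paper describes the smoothing only as time-based interpolation); larger smoothing constants yield steadier motion at the cost of added latency.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Exponential, time-based smoothing of a tracked position. A larger "smoothing"
// constant gives a more stable pose but also a longer delay between the actor's
// movement and the character's response, matching what was observed in trials.
Vec3 smoothPosition(const Vec3& previous, const Vec3& measured,
                    float smoothing /* seconds */, float dt /* seconds */)
{
    // alpha in (0, 1]: how much of the new measurement is taken this frame.
    const float alpha = 1.0f - std::exp(-dt / smoothing);
    return { previous.x + alpha * (measured.x - previous.x),
             previous.y + alpha * (measured.y - previous.y),
             previous.z + alpha * (measured.z - previous.z) };
}
```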

5 Limitations

Like any other infrared or optical motion detection system, Palco is unable to track movement without a line of sight to the actor – or to the specific body parts – being tracked. This aspect is particularly problematic when capturing scenes involving close and prolonged physical contact between actors, such as close-contact fighting. In order for the 3D sensor to successfully capture the entire body, a minimal distance to the target of around 3 meters is required; facial capture operates at this distance plus about half a meter. This imposes some movement constraints on the actor along the Z axis, though without compromising the performance, since Palco's intended use in daily episodes does not demand great movement along that axis. Trials also revealed that the facial tracker module works best with a frontal light pointing at the target, and that different lighting conditions can compromise its effectiveness.

6 Conclusion and Future Work

We have presented Palco, a system specifically designed to fill a gap in the available systems for rapid production of independent 3D cartoons, an area gaining significant importance and gradually establishing its space in television series. The methods employed in the construction of Palco essentially amount to the composition of available, isolated software and hardware solutions (with minor modifications), together with an integration module responsible not just for combining the functionality of all components but also for transforming the collected data in order to produce 3D cartoon animations of fair quality. Since Palco was intended, from the beginning, to answer the rapid production of independent 3D cartoons, factors like simplicity, ease of use, production cost and production time were given higher importance than output graphics quality, realism and movement accuracy. This, of course, leaves plenty of room to improve Palco in those areas. From the experience gained during trial tests, future work on this project will essentially focus on improving the system's automatic facial calibration features, achieving better matching levels in facial recognition, and expanding the available types of cartoon effects. Currently, Palco has reached its beta version, making it capable of producing small scenes of 3D cartoons; even though considerable manual effort is still required to produce television-quality results, it is already a very important aid in such a process. This paper presents some of the main steps towards the accomplishment of Palco's final goal, i.e., the complete production of 3D cartoon animations by artists with little or no knowledge of computer graphics animation. Furthermore, we believe that future applications of Palco go far beyond the production of 3D cartoon animations: the system will probably be suitable for realtime animation of cartoonish avatars in virtual worlds, or to provide useful feedback in actor training.

References

Bleiweiss, A., Eshar, D., Kutliroff, G., Lerner, A., Oshrat, Y., and Yanai, Y. 2010. Enhanced interactive gaming by blending full-body tracking and gesture animation. In ACM SIGGRAPH ASIA 2010 Sketches, ACM, New York, NY, USA, SA '10, 34:1–34:2.

FlickerLab, 2012. GO Realtime Animation, January. http://realtime.flickerlab.com/real-time-animation.html.

Ju, E., and Lee, J. 2008. Expressive facial gestures from motion capture data. Computer Graphics Forum 27, 2 (Apr.), 381–388.

Morishima, S., Kuriyama, S., Kawamoto, S., Suzuki, T., Taira, M., Yotsukura, T., and Nakamura, S. 2007. Data-driven efficient production of cartoon character animation. In ACM SIGGRAPH 2007 Sketches, ACM, New York, NY, USA, SIGGRAPH '07.

Reallusion, 2012. iClone Mocap Device Plugin, January. http://www.reallusion.com/iclone/.

RTP, 2012. Cafe Central, January. http://www.rtp.pt/blogs/programas/cafecentral/.

Saragih, J., Lucey, S., and Cohn, J. 2011. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91, 200–215.

Suma, E. A., Lange, B., Rizzo, A. S., Krum, D. M., and Bolas, M. 2011. FAAST: The Flexible Action and Articulated Skeleton Toolkit. In 2011 IEEE Virtual Reality Conference (VR), IEEE, 247–248.

Wang, J., Drucker, S. M., Agrawala, M., and Cohen, M. F. 2006. The cartoon animation filter. ACM Trans. Graph. 25 (July), 1169–1173.

Weise, T., Bouaziz, S., Li, H., and Pauly, M. 2011. Realtime performance-based facial animation. ACM Trans. Graph. 30 (Aug.), 77:1–77:10.