System Overview

The system components of MMDAgent-EX and how it works in general.

About MMDAgent

We have developed an open toolkit, named “MMDAgent”, for building interactive speech systems and exploring a wide variety of aspects of SDSs and SIs. It incorporates fast, low-latency speech recognition, flexible HMM-based speech synthesis, embodied 3-D agent rendering with simulated physics, and dialog management based on a finite state transducer (FST). The system is carefully designed to offer rich real-time interaction, lightweight rendering at high frame rates, and flexible control of emotional expression. Open formats are adopted for most files for interoperability with other software. The system has been made public to push the SDS field forward, including related research and the development of practical applications.

System architecture

[Figure: System architecture]

The figure above illustrates the main architecture of MMDAgent. All modules run in separate threads and share a global message queue. A module is driven either by I/O events (speech input, sensor signals, etc.) or by command messages put on the queue (speech synthesis requests, 3-D model handling, motion control, etc.). It can also output event messages to the queue (speech recognition trigger, end of speech synthesis, etc.). A dialog manager listens for the event messages and sends command messages according to a dialog scenario.

The messaging scheme is hub-and-spoke: all modules share a single message queue and send their messages to it. Each message is then broadcast to all connected modules. There is no routing function, so each module must read every message and pick out the ones it is responsible for processing.
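The broadcast scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MMDAgent's implementation: the `MessageHub` class, the `module` function, the prefix-based filtering, and the `QUIT` sentinel are all hypothetical names introduced here.

```python
import queue
import threading


class MessageHub:
    """Single shared queue whose messages are broadcast to every module."""

    def __init__(self):
        self.inbox = queue.Queue()   # the one global queue that modules post to
        self.subscribers = []        # one private delivery queue per module

    def connect(self):
        q = queue.Queue()
        self.subscribers.append(q)
        return q

    def post(self, message):
        self.inbox.put(message)

    def run(self):
        # Broadcast loop: no routing, every message goes to every module.
        while True:
            message = self.inbox.get()
            for q in self.subscribers:
                q.put(message)
            if message == "QUIT":    # hypothetical shutdown sentinel
                break


def module(name, prefix, deliveries, handled):
    # Each module reads everything and keeps only the messages it handles.
    while True:
        message = deliveries.get()
        if message == "QUIT":
            break
        if message.startswith(prefix):
            handled.append((name, message))


hub = MessageHub()
handled = []
workers = [
    threading.Thread(target=module, args=("synth", "SYNTH_START|", hub.connect(), handled)),
    threading.Thread(target=module, args=("motion", "MOTION_ADD|", hub.connect(), handled)),
]
hub_thread = threading.Thread(target=hub.run)
for t in workers + [hub_thread]:
    t.start()

hub.post("MOTION_ADD|mei|greet|greet.vmd")
hub.post("SYNTH_START|mei|normal|hi")
hub.post("QUIT")
for t in workers + [hub_thread]:
    t.join()
```

Both modules receive both messages, but each one acts only on the message type it recognizes, which is the cost of having no routing in the hub.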

3-D agent rendering engine

The 3-D graphics rendering and control module for embodied agents was built from scratch on OpenGL and tailored to lightweight, richly expressive voice interaction systems:

  • Toon rendering
  • Expressive movement through bones, morphs, inverse kinematics, skinning, alpha textures, environment mapping, depth shadowing, etc.
  • Physics simulation to express natural movements of objects.

MMDAgent is fully compatible with the 3-D software “MikuMikuDance” (a.k.a. “MMD”). MikuMikuDance is free, lightweight software that lets users create 3-D animated movies. The original MMDAgent supports only PMD models, whereas MMDAgent-EX can also render PMX models. All models and motions, including the physics simulation feature, are playable in MMDAgent. A bone or a morph in a model can also be controlled directly from a dialogue scenario or another plugin.

Speech recognition

The speech recognition engine is Julius, a general-purpose, open-source large-vocabulary continuous speech recognition engine. The full version of Julius is incorporated directly, so all of its features are available in MMDAgent.

Speech synthesis

Open JTalk, a Japanese TTS system, is used for speech synthesis. Detailed emotions can be expressed in the synthesized speech by using multiple acoustic models and presetting their parameters (interpolation weights, speaking speed, volume, average F0, etc.). Furthermore, voices for multiple characters in different speaking styles can be synthesized simultaneously by using multiple acoustic models.

Lip motion is generated internally in real time from the phoneme durations used at synthesis and a phoneme-to-morph mapping rule, so mouth movement is guaranteed to stay completely synchronized with the synthesized speech.
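As a rough illustration of how phoneme durations and a mapping rule can yield a lip track, here is a minimal Python sketch. The mapping table, the morph names, the phoneme labels, and the durations are all made up for the example; the actual rule used by MMDAgent is not given in this document.

```python
# Hypothetical phoneme-to-morph mapping rule (not the real table).
PHONEME_TO_MORPH = {
    "a": "mouth_a",
    "i": "mouth_i",
    "u": "mouth_u",
    "e": "mouth_e",
    "o": "mouth_o",
}


def lip_keyframes(phonemes):
    """Turn (phoneme, duration_ms) pairs into (start_ms, end_ms, morph) spans.

    Phonemes without a mapping (consonants, pauses) map to None,
    which stands for a closed mouth in this sketch.
    """
    spans = []
    t = 0
    for phoneme, duration_ms in phonemes:
        spans.append((t, t + duration_ms, PHONEME_TO_MORPH.get(phoneme)))
        t += duration_ms
    return spans


# A made-up phoneme sequence with made-up durations.
spans = lip_keyframes([("h", 40), ("a", 120), ("i", 150), ("pau", 60)])
```

Because the spans are derived from the same durations that drive synthesis, the lip track ends exactly when the audio does, which is where the synchronization guarantee comes from.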

Dialogue / Interaction manager

The default dialogue management module of MMDAgent is a simple graph-based dialogue manager whose scenarios are written in OpenFST format.

The dialogue / interaction FST is a low-level scenario description, as shown below. It is a finite state transducer (FST): the messages from all modules are fed into it one at a time, and the resulting output is put on the queue.

 0     10    RECOG_EVENT_STOP|hello   <eps>
10     11    <eps>                    MOTION_ADD|mei|greet|greet.vmd
11     12    <eps>                    SYNTH_START|mei|normal|hi
12      0    SYNTH_EVENT_STOP|mei     <eps>
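To make the semantics of these four lines concrete, the following Python sketch parses the scenario and steps through it. It is an illustration under assumptions, not MMDAgent's actual interpreter: the `parse` function and `Fst` class are hypothetical, input matching is assumed to be exact string equality, and fields are assumed to contain no whitespace.

```python
EPS = "<eps>"

SCENARIO = """\
 0     10    RECOG_EVENT_STOP|hello   <eps>
10     11    <eps>                    MOTION_ADD|mei|greet|greet.vmd
11     12    <eps>                    SYNTH_START|mei|normal|hi
12      0    SYNTH_EVENT_STOP|mei     <eps>
"""


def parse(text):
    """Map each state to its list of (input, output, next_state) arcs."""
    arcs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        src, dst, inp, out = line.split()
        arcs.setdefault(src, []).append((inp, out, dst))
    return arcs


class Fst:
    def __init__(self, arcs, start="0"):
        self.arcs = arcs
        self.state = start

    def _take(self, wanted):
        # Follow the first arc from the current state whose input matches.
        for inp, out, dst in self.arcs.get(self.state, []):
            if inp == wanted:
                self.state = dst
                return [] if out == EPS else [out]
        return None

    def feed(self, message):
        """Consume one event message and return the command messages to emit."""
        outputs = self._take(message)
        if outputs is None:
            return []  # no matching arc: the message is ignored
        # Arcs with <eps> input fire immediately, without waiting for a message.
        # (A cycle of such arcs would loop forever; this sketch does not guard it.)
        while True:
            more = self._take(EPS)
            if more is None:
                return outputs
            outputs += more


fst = Fst(parse(SCENARIO))
first = fst.feed("RECOG_EVENT_STOP|hello")
state_after_first = fst.state
second = fst.feed("SYNTH_EVENT_STOP|mei")
```

Recognizing "hello" moves the machine from state 0 through states 10 and 11 to state 12, emitting the motion and synthesis commands on the way; the end-of-synthesis event then returns it to state 0.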

As the example above shows, MMDAgent offers only a low-level description for dialogue scenarios. We expect users and other developers to write a scenario converter that translates a high-level dialogue scenario description into the FST format, or to connect another text conversation manager via a network socket.
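A scenario converter of the kind mentioned here could be as small as the following Python sketch. The high-level rule format (trigger word, motion alias, motion file, reply text), the `to_fst` function, and the state-numbering scheme are assumptions made for illustration; only the message names and the four-column layout come from the example above.

```python
def to_fst(rules, model="mei", style="normal"):
    """Expand (trigger word, motion alias, motion file, reply) rules into FST lines."""
    lines = []
    next_state = 10
    for word, alias, motion_file, reply in rules:
        s = next_state
        next_state += 10  # reserve a block of states for each rule
        lines += [
            f"0 {s} RECOG_EVENT_STOP|{word} <eps>",
            f"{s} {s + 1} <eps> MOTION_ADD|{model}|{alias}|{motion_file}",
            f"{s + 1} {s + 2} <eps> SYNTH_START|{model}|{style}|{reply}",
            f"{s + 2} 0 SYNTH_EVENT_STOP|{model} <eps>",
        ]
    return "\n".join(lines)


fst_text = to_fst([("hello", "greet", "greet.vmd", "hi")])
```

For the single rule shown, the generated text reproduces the four transitions of the example scenario, apart from column alignment.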


MMDAgent-EX has several networking features in addition to original MMDAgent:

  • Socket connection that allows other processes to connect to the message queue.
  • Apache Kafka module to produce / consume messages to / from a cloud Kafka server.
  • REST-based file uploading function to send logs to a content server.
  • Access to a specified URL to invoke some Web API.


MMDAgent-EX itself contains only the 3-D rendering engine and the message queue. Most modules, including speech recognition, speech synthesis, and the dialogue manager, are implemented as plug-in DLLs that simply share the message queue internally. On desktop operating systems, you can easily swap the ASR / TTS engines or add new functions by building a plug-in DLL for MMDAgent-EX or by connecting via a socket. (This feature is not available on iOS/Android, since for security reasons they do not allow user-level code to load dynamic libraries.)

Last modified January 6, 2021: re-organized document (984d397)