Speech Recognition Setup #
MMDAgent-EX includes Julius as the default speech recognition engine. Julius is a lightweight, fast speech recognition engine that runs locally on CPU only.
Using speech recognition requires the following preparation and configuration. It will not work right after building. Be sure to perform the setup below for each content package.
Downloading Models #
Speech recognition models are provided for Japanese and English. They are not bundled in the repository, so download them separately from the link below. The download size is about 791 MB, and it uses about 1.7 GB of disk space when unpacked.
Download Recent: Julius_Models_20231015.zipList of older versions
- Julius_Models_20231015.zip - 2023.10.15
Unpack the archive and place the contents in the Release/AppData/Julius folder. For example:
Release/
└── AppData/
└── Julius/
├── phoneseq/
├── jconf_phone.txt
├── jconf_gmm_ja.txt
├── jconf_dnn_ja.txt
├── jconf_dnn_en.txt
├── dictation_kit_ja/
└── ENVR-v5.4.Dnn.Bin/
Setup #
You need to specify the models and language used by Julius in the .mdf file. To use English speech recognition, open the main.mdf file in the Example folder with a text editor and add the following two lines at the end:
Plugin_Julius_lang=en
Plugin_Julius_conf=dnn
- The first line specifies the language: choose
ja(Japanese) oren(English). - The second line specifies the configuration: for
jayou can choosednnordmm; forenonlydnnis available.
Preparing an Audio Input Device #
The speech recognition module opens the system default sound input device to perform recognition. Prepare an audio input device and set it as the default input device.
If no audio input device is available, an error will occur and the application will not start.
Test Run #
Start the Example content with the .mdf file configured as above.
After starting, if a circular meter like the one below appears in the lower-left corner of the screen, the engine has started successfully. The circle size represents the input volume.

Speak toward the audio input device. When recognition starts, the circular meter will look like this:

Recognition results are shown as captions on the screen.

How it Works #
The speech recognition module sends its internal activity as messages. When recognition starts, the module outputs the following message:
You can view these messages by enabling log output.
Recognition results are output as a message when recognition stops, for example:
If you want the result to be separated by words, you can specify the following in the .mdf:
Plugin_Julius_wordspacing=yes
no: Concatenate words without separators (default forja).yes: Insert spaces between words (default for non-ja).comma: Insert commas between words (compatible with older MMDAgent).
You can turn off caption display. To disable captions, add the following line to main.mdf:
show_caption=false
Using Other Engines #
Julius is a compact open-source speech recognition engine, but it was developed with older techniques; model quality and noise robustness – especially in noisy environments – can be inferior to modern speech recognition engines.
If you build a system using cloud STT engines like Google STT or local models like Whisper via Python, you can integrate them with MMDAgent-EX in two ways:
- Run them as a submodule of MMDAgent-EX using Plugin_AnyScript
- Integrate with a separate MMDAgent-EX process via the WebSocket feature