Custom Models#
Model Deployment#
Like all Riva models, Riva TTS requires the following steps:

1. Create .riva files for each model from a .nemo file as outlined in the NeMo section.
2. Create .rmir files for each Riva Speech AI Skill (for example, ASR, NLP, and TTS) using riva-build.
3. Create model directories using riva-deploy.
4. Deploy the model directory using riva_server.
The following sections provide examples for steps 1 and 2 as outlined above. For steps 3 and 4, refer to Using riva-deploy and Riva Speech Container (Advanced).
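For reference, step 3 typically looks like the following when run from within the ServiceMaker container (the paths and key here are illustrative):

riva-deploy /servicemaker-dev/<rmir_filename>:<encryption_key> /data/models

Step 4 then starts the Riva server against the resulting /data/models directory, as described in Riva Speech Container (Advanced).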
Creating Riva Files#
Riva files can be created from .nemo files. As mentioned before in the NeMo section, the generation of Riva files from .nemo files must be done on a Linux x86_64 workstation only.
The following is an example of how a HiFi-GAN model can be converted from a .nemo file to a .riva file.

1. Download the .nemo file from NGC onto the host system.
2. Run the NeMo container and share the .nemo file with the container using the -v option.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
-v $(pwd):/NeMo \
--shm-size=8g \
-p 8888:8888 \
-p 6006:6006 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/snd \
nvcr.io/nvidia/nemo:22.08
After the container has launched, use nemo2riva to convert .nemo to .riva.
pip3 install nvidia-pyindex
ngc registry resource download-version "nvidia/riva/riva_quickstart:2.17.0"
pip3 install "riva_quickstart_v2.17.0/nemo2riva-2.17.0-py3-none-any.whl"
nemo2riva --key encryption_key --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo
Repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. Ensure that you are using the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva-2.17.0-py3-none-any.whl version when performing the above steps.
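A sketch of the same flow for a FastPitch checkpoint, assuming the tts_en_fastpitch model on NGC (the version string below is illustrative; check the NGC model page for the latest):

# On the host: download and extract the FastPitch checkpoint.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/zip -O tts_en_fastpitch_1.8.1.zip
unzip tts_en_fastpitch_1.8.1.zip
# Inside the NeMo container from the previous step:
nemo2riva --key encryption_key --out /NeMo/fastpitch.riva /NeMo/tts_en_fastpitch.nemo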
Customization#
After creating the .riva file and prior to running riva-build, there are a few customization options that can be adjusted. These are optional; if you are only interested in building the default Riva pipeline, skip ahead to Riva-build Pipeline Instructions.
Custom Pronunciations#
Speech synthesis models deployed in Riva are configured with a language-specific pronunciation
dictionary mapping a large vocabulary of words from their written form, graphemes, to a sequence
of perceptually distinct sounds, phonemes. In cases where pronunciation is ambiguous, for example
with heteronyms like bass
(the fish) and bass
(the musical instrument), the dictionary is
ignored and the synthesis model uses context clues from the sentence to predict an appropriate
pronunciation.
Modern speech synthesis algorithms are surprisingly capable of accurately predicting pronunciations of new and novel words. Sometimes, however, it is desirable or necessary to provide extra context to the model.
While custom pronunciations can be supplied at request time using SSML, request-time overrides are best suited for one-off adjustments. For domain-specific terms with fixed pronunciations, configure Riva with these pronunciations when deploying the server.
There are a few key parameters that can be configured through riva-build or in the preprocessor configuration that affect the phoneme path:

--phone_dictionary_file is the path to the pronunciation dictionary. To start with, leave this parameter empty. If the .riva file was created from a .nemo model that contained a dictionary artifact and this argument is not set, Riva uses the NeMo dictionary file that the model was trained with. To add custom entries and modify pronunciations, modify the NeMo dictionary artifact, save it to another file, and pass that file path to riva-build with this argument (see the sketch after this list).

--preprocessor.g2p_ignore_ambiguous: if True, words that have more than one phonetic representation in the pronunciation dictionary, such as "read", are not converted to phonemes. Defaults to True.

--upper_case_chars should be set to True if ipa is used. This affects grapheme inputs, as the ipa phone set includes lower-cased English characters.

--phone_set can be used to specify whether the model was trained with arpabet or ipa. If this flag is not used, Riva attempts to auto-detect the correct phone set.
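As a minimal sketch of extending a dictionary, assuming a CMUdict-style ARPABET file (one upper-cased grapheme per line followed by its phone sequence; the entries below are hypothetical):

# Copy the dictionary artifact and append custom entries.
cp cmudict-0.7b_nv22.08 custom_dict.txt
cat >> custom_dict.txt <<'EOF'
RIVA  R IY1 V AH0
NEMO  N IY1 M OW0
EOF
# Then pass the new file to riva-build with:
#   --phone_dictionary_file=/servicemaker-dev/custom_dict.txt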
Note
--arpabet_file is deprecated as of Riva 2.8.0 and replaced by --phone_dictionary_file.
Note
Riva supports both arpabet and ipa, depending on what the acoustic model was trained on. For more information on ARPABET, refer to the ARPABET Wikipedia page. For more information on IPA, refer to the TTS Phoneme Support page.
To determine the appropriate phoneme sequence, use the SSML API to experiment with phone sequences and evaluate the quality. Once the mapping sounds correct, add the discovered mapping to a new line in the dictionary.
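For example, a request-time experiment might use the SSML phoneme tag (described on the TTS SSML page) to try a candidate sequence; the IPA shown here is illustrative:

<speak>She plays the <phoneme ph="beɪs">bass</phoneme> guitar.</speak>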
Multi-Speaker Models#
Riva supports models with multiple speakers.
To enable this feature, specify the following parameters before building the model.
--voice_name is the name of the model. Defaults to English-US.Female-1.

--subvoices is a comma-separated list of names for each subvoice, with the length equal to the number of subvoices in the FastPitch model. For example, for a model with a "male" subvoice in the 0th speaker embedding and a "female" subvoice in the first embedding, include the option --subvoices=Male:0,Female:1. If not provided, the desired embedding can be requested by integer index.

The voice name and subvoices are maintained in the generated .rmir file and carried into the generated Triton repositories. During inference, select a subvoice by appending to the voice name a period followed by a valid subvoice, for example, <voice_name>.<subvoice>.
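As an illustration, a request for the Male-1 subvoice of the default English-US voice using the sample riva_tts_client (a sketch; flag names are taken from the C++ sample client and may differ between client versions):

riva_tts_client --voice_name=English-US.Male-1 \
    --text="Hello, this is a multi-speaker request." \
    --audio_file=/tmp/output.wav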
Custom Voice#
Riva is voice agnostic and can be run with any English-US TTS voice. In order to train a custom voice model, data must first be collected. We recommend at least 30 minutes of high-quality data. For collecting the data, refer to the Riva custom voice recoder. After the data has been collected, the FastPitch and HiFi-GAN models need to be fine-tuned on this dataset. Refer to the Riva fine-tuning tutorial for how to train these models. A Riva pipeline using these models can be built according to the instructions on this page.
Custom Text Normalization#
Riva supports custom text normalization rules built with NeMo's WFST-based text normalization (TN) tool. For details on customizing TN, refer to the NeMo WFST tutorial. After the WFST has been customized, export it with NeMo's export_grammar script. Refer to the NeMo documentation for more information.
This produces two files: tokenize_and_classify.far and verbalize.far. These are passed to the riva-build step using the --wfst_tokenizer_model and --wfst_verbalizer_model arguments, as sketched below.
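A minimal sketch of the hand-off, assuming the exported .far files have been copied into /servicemaker-dev (paths are illustrative; other riva-build arguments are elided):

riva-build speech_synthesis ... \
    --wfst_tokenizer_model=/servicemaker-dev/tokenize_and_classify.far \
    --wfst_verbalizer_model=/servicemaker-dev/verbalize.far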
Riva-build Pipeline Instructions#
FastPitch and HiFi-GAN#
Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
/servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file> \
--wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
--wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
--sample_rate=<sample_rate> \
--subvoices=<subvoices>
Where:

<rmir_filename> is the Riva rmir file that is generated
<encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode, unless specified otherwise under a specific model in the list of pretrained quick start pipelines.
<pipeline_name> is an optional user-defined name for the components in the model repository
<fastpitch_riva_filename> is the name of the riva file for FastPitch
<hifigan_riva_filename> is the name of the riva file for HiFi-GAN
<abbr_file> is the name of the file containing abbreviations and their corresponding expansions
<dictionary_file> is the name of the file containing the pronunciation dictionary mapping words to their phonetic representation in ARPABET
<voice_name> is the name of the model
<subvoices> is a comma-separated list of names for each subvoice. Defaults to naming by integer index. This is needed by, and only used for, multi-speaker models.
<wfst_tokenizer_model> is the location of the tokenize_and_classify.far file generated by running the NeMo Text Processing export_grammar.sh script
<wfst_verbalizer_model> is the location of the verbalize.far file generated by running the NeMo Text Processing export_grammar.sh script
<sample_rate> is the sample rate of the audio that the models were trained on
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR and riva filenames; otherwise, this is unnecessary.
For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the Riva-build Optional Parameters section and set the various min_batch_size
, max_batch_size
, and opt_batch_size
parameters to 1 while executing the riva-build
command.
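For instance, a batch-size-1 build for a FastPitch + HiFi-GAN pipeline might add the following flags (file names and key are illustrative; the component names match the help output below):

riva-build speech_synthesis \
    /servicemaker-dev/tts.rmir:tlt_encode \
    /servicemaker-dev/fastpitch.riva:tlt_encode \
    /servicemaker-dev/hifigan.riva:tlt_encode \
    --max_batch_size=1 \
    --preprocessor.max_batch_size=1 --preprocessor.min_batch_size=1 --preprocessor.opt_batch_size=1 \
    --encoderFastPitch.max_batch_size=1 --encoderFastPitch.min_batch_size=1 --encoderFastPitch.opt_batch_size=1 \
    --chunkerFastPitch.max_batch_size=1 --chunkerFastPitch.min_batch_size=1 --chunkerFastPitch.opt_batch_size=1 \
    --hifigan.max_batch_size=1 --hifigan.min_batch_size=1 --hifigan.opt_batch_size=1 \
    --postprocessor.max_batch_size=1 --postprocessor.min_batch_size=1 --postprocessor.opt_batch_size=1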
Pretrained Quick Start Pipelines#
The riva-build commands for the pretrained pipelines available from NGC are listed below, one per pipeline.

FastPitch + HiFi-GAN IPA (en-US Multi-Speaker):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--language_code=en-US \
--num_speakers=12 \
--phone_set=ipa \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name English-US \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--upper_case_chars=True \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']' \
--abbreviations_file=<txt_abbreviations_file>
FastPitch + HiFi-GAN IPA (zh-CN Multi-Speaker):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--language_code=zh-CN \
--num_speakers=10 \
--phone_set=ipa \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name Mandarin-CN \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Male-Sad:15 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']'
FastPitch + HiFi-GAN IPA (es-ES Female):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:BSzv7YAjcH4nJS \
<riva_hifigan_file>:BSzv7YAjcH4nJS \
--language_code=es-ES \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Spanish-ES-Female-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (es-ES Male):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:PPihyG3Moru5in \
<riva_hifigan_file>:PPihyG3Moru5in \
--language_code=es-ES \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Spanish-ES-Male-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (es-US Multi-Speaker):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--language_code=es-US \
--num_speakers=12 \
--phone_set=ipa \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name Spanish-US \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Female-Sad:14,Male-Sad:15 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']'
FastPitch + HiFi-GAN IPA (it-IT Female):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:R62srgxeXBgVxg \
<riva_hifigan_file>:R62srgxeXBgVxg \
--language_code=it-IT \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Italian-IT-Female-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (it-IT Male):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:dVRvg47ZqCdQrR \
<riva_hifigan_file>:dVRvg47ZqCdQrR \
--language_code=it-IT \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Italian-IT-Male-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (de-DE Male):
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:ZzZjce65zzGZ9o \
<riva_hifigan_file>:ZzZjce65zzGZ9o \
--language_code=de-DE \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name German-DE-Male-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
RadTTS + HiFi-GAN IPA:
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_radtts_file>:<key> \
<riva_hifigan_file>:<key> \
--num_speakers=12 \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name English-US-RadTTS \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
--phone_set=ipa \
--upper_case_chars=True \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']' \
--abbreviations_file=<txt_abbreviations_file>
FastPitch + HiFi-GAN ARPABET:
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--arpabet_file=cmudict-0.7b_nv22.08 \
--sample_rate 44100 \
--voice_name English-US \
--subvoices Male-1:0,Female-1:1 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']' \
--abbreviations_file=<txt_file>
FastPitch + HiFi-GAN LJSpeech:
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--arpabet_file=cmudict-0.7b_nv22.08 \
--voice_name ljspeech \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
All text normalization .far
files are in NGC on the Riva TTS English Normalization Grammar page. All other auxiliary files that are not .riva
files (such as pronunciation dictionaries) are in NGC on the Riva TTS English US Auxiliary Files page.
Riva-build Optional Parameters#
For details about the parameters passed to riva-build
to customize the TTS pipeline, issue:
riva-build speech_synthesis -h
The following list includes descriptions for all optional parameters currently recognized by riva-build
:
usage: riva-build speech_synthesis [-h] [-f] [-v]
[--language_code LANGUAGE_CODE]
[--instance_group_count INSTANCE_GROUP_COUNT]
[--kind KIND]
[--max_batch_size MAX_BATCH_SIZE]
[--max_queue_delay_microseconds MAX_QUEUE_DELAY_MICROSECONDS]
[--batching_type BATCHING_TYPE]
[--voice_name VOICE_NAME]
[--num_speakers NUM_SPEAKERS]
[--subvoices SUBVOICES]
[--sample_rate SAMPLE_RATE]
[--chunk_length CHUNK_LENGTH]
[--overlap_length OVERLAP_LENGTH]
[--num_mels NUM_MELS]
[--num_samples_per_frame NUM_SAMPLES_PER_FRAME]
[--abbreviations_file ABBREVIATIONS_FILE]
[--has_mapping_file HAS_MAPPING_FILE]
[--mapping_file MAPPING_FILE]
[--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
[--wfst_verbalizer_model WFST_VERBALIZER_MODEL]
[--arpabet_file ARPABET_FILE]
[--phone_dictionary_file PHONE_DICTIONARY_FILE]
[--phone_set PHONE_SET]
[--upper_case_chars UPPER_CASE_CHARS]
[--upper_case_g2p UPPER_CASE_G2P]
[--mel_basis_file_path MEL_BASIS_FILE_PATH]
[--voice_map_file VOICE_MAP_FILE]
[--postprocessor.max_sequence_idle_microseconds POSTPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--postprocessor.max_batch_size POSTPROCESSOR.MAX_BATCH_SIZE]
[--postprocessor.min_batch_size POSTPROCESSOR.MIN_BATCH_SIZE]
[--postprocessor.opt_batch_size POSTPROCESSOR.OPT_BATCH_SIZE]
[--postprocessor.preferred_batch_size POSTPROCESSOR.PREFERRED_BATCH_SIZE]
[--postprocessor.batching_type POSTPROCESSOR.BATCHING_TYPE]
[--postprocessor.preserve_ordering POSTPROCESSOR.PRESERVE_ORDERING]
[--postprocessor.instance_group_count POSTPROCESSOR.INSTANCE_GROUP_COUNT]
[--postprocessor.max_queue_delay_microseconds POSTPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
[--postprocessor.optimization_graph_level POSTPROCESSOR.OPTIMIZATION_GRAPH_LEVEL]
[--postprocessor.fade_length POSTPROCESSOR.FADE_LENGTH]
[--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE]
[--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE]
[--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE]
[--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE]
[--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE]
[--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING]
[--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT]
[--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
[--preprocessor.optimization_graph_level PREPROCESSOR.OPTIMIZATION_GRAPH_LEVEL]
[--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH]
[--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS]
[--preprocessor.language PREPROCESSOR.LANGUAGE]
[--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH]
[--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH]
[--preprocessor.mapping PREPROCESSOR.MAPPING]
[--preprocessor.tolower PREPROCESSOR.TOLOWER]
[--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE]
[--preprocessor.enable_emphasis_tag PREPROCESSOR.ENABLE_EMPHASIS_TAG]
[--preprocessor.start_of_emphasis_token PREPROCESSOR.START_OF_EMPHASIS_TOKEN]
[--preprocessor.end_of_emphasis_token PREPROCESSOR.END_OF_EMPHASIS_TOKEN]
[--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE]
[--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE]
[--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE]
[--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE]
[--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE]
[--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING]
[--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT]
[--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoderFastPitch.optimization_graph_level ENCODERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL]
[--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE]
[--encoderFastPitch.use_onnx_runtime]
[--encoderFastPitch.use_torchscript]
[--encoderFastPitch.use_trt_fp32]
[--encoderFastPitch.fp16_needs_obey_precision_pass]
[--encoderRadTTS.max_sequence_idle_microseconds ENCODERRADTTS.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoderRadTTS.max_batch_size ENCODERRADTTS.MAX_BATCH_SIZE]
[--encoderRadTTS.min_batch_size ENCODERRADTTS.MIN_BATCH_SIZE]
[--encoderRadTTS.opt_batch_size ENCODERRADTTS.OPT_BATCH_SIZE]
[--encoderRadTTS.preferred_batch_size ENCODERRADTTS.PREFERRED_BATCH_SIZE]
[--encoderRadTTS.batching_type ENCODERRADTTS.BATCHING_TYPE]
[--encoderRadTTS.preserve_ordering ENCODERRADTTS.PRESERVE_ORDERING]
[--encoderRadTTS.instance_group_count ENCODERRADTTS.INSTANCE_GROUP_COUNT]
[--encoderRadTTS.max_queue_delay_microseconds ENCODERRADTTS.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoderRadTTS.optimization_graph_level ENCODERRADTTS.OPTIMIZATION_GRAPH_LEVEL]
[--encoderRadTTS.trt_max_workspace_size ENCODERRADTTS.TRT_MAX_WORKSPACE_SIZE]
[--encoderRadTTS.use_onnx_runtime]
[--encoderRadTTS.use_torchscript]
[--encoderRadTTS.use_trt_fp32]
[--encoderRadTTS.fp16_needs_obey_precision_pass]
[--encoderPflow.max_sequence_idle_microseconds ENCODERPFLOW.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoderPflow.max_batch_size ENCODERPFLOW.MAX_BATCH_SIZE]
[--encoderPflow.min_batch_size ENCODERPFLOW.MIN_BATCH_SIZE]
[--encoderPflow.opt_batch_size ENCODERPFLOW.OPT_BATCH_SIZE]
[--encoderPflow.preferred_batch_size ENCODERPFLOW.PREFERRED_BATCH_SIZE]
[--encoderPflow.batching_type ENCODERPFLOW.BATCHING_TYPE]
[--encoderPflow.preserve_ordering ENCODERPFLOW.PRESERVE_ORDERING]
[--encoderPflow.instance_group_count ENCODERPFLOW.INSTANCE_GROUP_COUNT]
[--encoderPflow.max_queue_delay_microseconds ENCODERPFLOW.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoderPflow.optimization_graph_level ENCODERPFLOW.OPTIMIZATION_GRAPH_LEVEL]
[--encoderPflow.trt_max_workspace_size ENCODERPFLOW.TRT_MAX_WORKSPACE_SIZE]
[--encoderPflow.use_onnx_runtime]
[--encoderPflow.use_torchscript]
[--encoderPflow.use_trt_fp32]
[--encoderPflow.fp16_needs_obey_precision_pass]
[--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE]
[--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE]
[--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE]
[--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE]
[--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE]
[--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING]
[--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT]
[--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--chunkerFastPitch.optimization_graph_level CHUNKERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL]
[--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE]
[--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE]
[--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE]
[--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE]
[--hifigan.batching_type HIFIGAN.BATCHING_TYPE]
[--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING]
[--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT]
[--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS]
[--hifigan.optimization_graph_level HIFIGAN.OPTIMIZATION_GRAPH_LEVEL]
[--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE]
[--hifigan.use_onnx_runtime]
[--hifigan.use_torchscript]
[--hifigan.use_trt_fp32]
[--hifigan.fp16_needs_obey_precision_pass]
[--neuralg2p.max_sequence_idle_microseconds NEURALG2P.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--neuralg2p.max_batch_size NEURALG2P.MAX_BATCH_SIZE]
[--neuralg2p.min_batch_size NEURALG2P.MIN_BATCH_SIZE]
[--neuralg2p.opt_batch_size NEURALG2P.OPT_BATCH_SIZE]
[--neuralg2p.preferred_batch_size NEURALG2P.PREFERRED_BATCH_SIZE]
[--neuralg2p.batching_type NEURALG2P.BATCHING_TYPE]
[--neuralg2p.preserve_ordering NEURALG2P.PRESERVE_ORDERING]
[--neuralg2p.instance_group_count NEURALG2P.INSTANCE_GROUP_COUNT]
[--neuralg2p.max_queue_delay_microseconds NEURALG2P.MAX_QUEUE_DELAY_MICROSECONDS]
[--neuralg2p.optimization_graph_level NEURALG2P.OPTIMIZATION_GRAPH_LEVEL]
[--neuralg2p.trt_max_workspace_size NEURALG2P.TRT_MAX_WORKSPACE_SIZE]
[--neuralg2p.use_onnx_runtime]
[--neuralg2p.use_torchscript]
[--neuralg2p.use_trt_fp32]
[--neuralg2p.fp16_needs_obey_precision_pass]
output_path source_path [source_path ...]
Generate a Riva Model from a speech_synthesis model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
options:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
-v, --verbose Verbose log outputs
--language_code LANGUAGE_CODE
Language of the model
--instance_group_count INSTANCE_GROUP_COUNT
How many instances in a group
--kind KIND Backend runs on CPU or GPU
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--max_queue_delay_microseconds MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--batching_type BATCHING_TYPE
--voice_name VOICE_NAME
Set the voice name for speech synthesis
--num_speakers NUM_SPEAKERS
Number of unique speakers.
--subvoices SUBVOICES
Comma-separated list of subvoices (no whitespace).
--sample_rate SAMPLE_RATE
Sample rate of the output signal
--chunk_length CHUNK_LENGTH
Chunk length in mel frames to synthesize at one time
--overlap_length OVERLAP_LENGTH
Chunk length in mel frames to overlap neighboring
chunks
--num_mels NUM_MELS number of mels
--num_samples_per_frame NUM_SAMPLES_PER_FRAME
number of samples per frame
--abbreviations_file ABBREVIATIONS_FILE
Path to file with list of abbreviations and
corresponding expansions
--has_mapping_file HAS_MAPPING_FILE
--mapping_file MAPPING_FILE
Path to phoneme mapping file
--wfst_tokenizer_model WFST_TOKENIZER_MODEL
Sparrowhawk model to use for tokenization and
classification, must be in .far format
--wfst_verbalizer_model WFST_VERBALIZER_MODEL
Sparrowhawk model to use for verbalizer, must be in
.far format.
--arpabet_file ARPABET_FILE
Path to pronunciation dictionary (deprecated)
--phone_dictionary_file PHONE_DICTIONARY_FILE
Path to pronunciation dictionary
--phone_set PHONE_SET
Phonetic set that the model was trained on. An unset
value will attempt to auto-detect the phone set used
during training. Supports either "arpabet", "ipa",
"none".
--upper_case_chars UPPER_CASE_CHARS
Whether character representations for this model are
upper case or lower case.
--upper_case_g2p UPPER_CASE_G2P
Whether character representations for this model are
upper case or lower case.
--mel_basis_file_path MEL_BASIS_FILE_PATH
Pre calculated Mel basis file for Audio to Mel
--voice_map_file VOICE_MAP_FILE
Default voice name to filepath map
postprocessor:
--postprocessor.max_sequence_idle_microseconds POSTPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--postprocessor.max_batch_size POSTPROCESSOR.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--postprocessor.min_batch_size POSTPROCESSOR.MIN_BATCH_SIZE
--postprocessor.opt_batch_size POSTPROCESSOR.OPT_BATCH_SIZE
--postprocessor.preferred_batch_size POSTPROCESSOR.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--postprocessor.batching_type POSTPROCESSOR.BATCHING_TYPE
--postprocessor.preserve_ordering POSTPROCESSOR.PRESERVE_ORDERING
Preserve ordering
--postprocessor.instance_group_count POSTPROCESSOR.INSTANCE_GROUP_COUNT
How many instances in a group
--postprocessor.max_queue_delay_microseconds POSTPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--postprocessor.optimization_graph_level POSTPROCESSOR.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--postprocessor.fade_length POSTPROCESSOR.FADE_LENGTH
Cross fade length in samples used in between audio
chunks
preprocessor:
--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE
Use Batched Forward calls
--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE
--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE
--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE
--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING
Preserve ordering
--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT
How many instances in a group
--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--preprocessor.optimization_graph_level PREPROCESSOR.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH
--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS
--preprocessor.language PREPROCESSOR.LANGUAGE
--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH
maximum length of every emitted sequence
--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH
maximum length of input string
--preprocessor.mapping PREPROCESSOR.MAPPING
--preprocessor.tolower PREPROCESSOR.TOLOWER
--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE
--preprocessor.enable_emphasis_tag PREPROCESSOR.ENABLE_EMPHASIS_TAG
Boolean flag that controls if the emphasis tag should
be parsed or not during pre-processing
--preprocessor.start_of_emphasis_token PREPROCESSOR.START_OF_EMPHASIS_TOKEN
field to indicate start of emphasis in the given text
--preprocessor.end_of_emphasis_token PREPROCESSOR.END_OF_EMPHASIS_TOKEN
field to indicate end of emphasis in the given text
encoderFastPitch:
--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE
--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE
--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE
--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--encoderFastPitch.optimization_graph_level ENCODERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--encoderFastPitch.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--encoderFastPitch.use_torchscript
Use TorchScript instead of TensorRT
--encoderFastPitch.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--encoderFastPitch.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
encoderRadTTS:
--encoderRadTTS.max_sequence_idle_microseconds ENCODERRADTTS.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoderRadTTS.max_batch_size ENCODERRADTTS.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoderRadTTS.min_batch_size ENCODERRADTTS.MIN_BATCH_SIZE
--encoderRadTTS.opt_batch_size ENCODERRADTTS.OPT_BATCH_SIZE
--encoderRadTTS.preferred_batch_size ENCODERRADTTS.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoderRadTTS.batching_type ENCODERRADTTS.BATCHING_TYPE
--encoderRadTTS.preserve_ordering ENCODERRADTTS.PRESERVE_ORDERING
Preserve ordering
--encoderRadTTS.instance_group_count ENCODERRADTTS.INSTANCE_GROUP_COUNT
How many instances in a group
--encoderRadTTS.max_queue_delay_microseconds ENCODERRADTTS.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--encoderRadTTS.optimization_graph_level ENCODERRADTTS.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--encoderRadTTS.trt_max_workspace_size ENCODERRADTTS.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--encoderRadTTS.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--encoderRadTTS.use_torchscript
Use TorchScript instead of TensorRT
--encoderRadTTS.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--encoderRadTTS.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
encoderPflow:
--encoderPflow.max_sequence_idle_microseconds ENCODERPFLOW.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoderPflow.max_batch_size ENCODERPFLOW.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoderPflow.min_batch_size ENCODERPFLOW.MIN_BATCH_SIZE
--encoderPflow.opt_batch_size ENCODERPFLOW.OPT_BATCH_SIZE
--encoderPflow.preferred_batch_size ENCODERPFLOW.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoderPflow.batching_type ENCODERPFLOW.BATCHING_TYPE
--encoderPflow.preserve_ordering ENCODERPFLOW.PRESERVE_ORDERING
Preserve ordering
--encoderPflow.instance_group_count ENCODERPFLOW.INSTANCE_GROUP_COUNT
How many instances in a group
--encoderPflow.max_queue_delay_microseconds ENCODERPFLOW.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--encoderPflow.optimization_graph_level ENCODERPFLOW.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--encoderPflow.trt_max_workspace_size ENCODERPFLOW.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--encoderPflow.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--encoderPflow.use_torchscript
Use TorchScript instead of TensorRT
--encoderPflow.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--encoderPflow.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
chunkerFastPitch:
--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE
--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE
--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE
--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--chunkerFastPitch.optimization_graph_level CHUNKERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
hifigan:
--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE
--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE
--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--hifigan.batching_type HIFIGAN.BATCHING_TYPE
--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING
Preserve ordering
--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT
How many instances in a group
--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--hifigan.optimization_graph_level HIFIGAN.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--hifigan.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--hifigan.use_torchscript
Use TorchScript instead of TensorRT
--hifigan.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--hifigan.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
neuralg2p:
--neuralg2p.max_sequence_idle_microseconds NEURALG2P.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--neuralg2p.max_batch_size NEURALG2P.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--neuralg2p.min_batch_size NEURALG2P.MIN_BATCH_SIZE
--neuralg2p.opt_batch_size NEURALG2P.OPT_BATCH_SIZE
--neuralg2p.preferred_batch_size NEURALG2P.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--neuralg2p.batching_type NEURALG2P.BATCHING_TYPE
--neuralg2p.preserve_ordering NEURALG2P.PRESERVE_ORDERING
Preserve ordering
--neuralg2p.instance_group_count NEURALG2P.INSTANCE_GROUP_COUNT
How many instances in a group
--neuralg2p.max_queue_delay_microseconds NEURALG2P.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--neuralg2p.optimization_graph_level NEURALG2P.OPTIMIZATION_GRAPH_LEVEL
The Graph optimization level to use in Triton model
configuration
--neuralg2p.trt_max_workspace_size NEURALG2P.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--neuralg2p.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--neuralg2p.use_torchscript
Use TorchScript instead of TensorRT
--neuralg2p.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--neuralg2p.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network