I am planning to train Arabic ASR. What should be the best way? #8106

Answered by titu1994
mesut92 asked this question in Q&A

A response from a Riva team member

Use UTF-8. There are already UTF-8 characters in the n-gram, vocab, and lexicon files, even for English :)

I recommend training on Unicode-normalized text to reduce what the network has to learn. You can think of normalization as a tool to ensure the same glyph always gets the same encoding. In particular, NFC merges a diacritic with its base character wherever Unicode defines a precomposed form, which gives a shorter, more direct encoding. For example, آ is one code point (U+0622, two bytes in UTF-8) in NFC and two code points in NFD (the base letter U+0627 followed by the combining madda U+0653). Marks with no precomposed form, such as the tanwin in عً, stay as separate code points in both NFC and NFD.

import unicodedata
print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))  # 'a' + combining acute -> prints '\xe1'
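
To see what NFC does to Arabic text specifically, here is a minimal sketch using only Python's standard `unicodedata` module. The helper name `normalize_transcript` is hypothetical (it is not part of NeMo or Riva), and applying it to transcripts before building the vocabulary is an assumption about where this would sit in your pipeline.

```python
import unicodedata


def normalize_transcript(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: return `text` in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


# ALEF (U+0627) followed by combining MADDA ABOVE (U+0653).
decomposed = "\u0627\u0653"
composed = normalize_transcript(decomposed)

# NFC composes the pair into ALEF WITH MADDA ABOVE (U+0622).
print(ascii(composed), len(composed))

# NFD splits the precomposed letter back into base + combining mark.
print(ascii(normalize_transcript("\u0622", "NFD")))

# AIN (U+0639) + FATHATAN (U+064B) has no precomposed form,
# so the sequence is still two code points after NFC.
print(len(normalize_transcript("\u0639\u064b")))
```

Whichever form you pick, the point is to apply the same one to the training transcripts, the lexicon, and the language model text so that all of them agree on the encoding.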

There are plenty of examples of language models trained dir…

Answer selected by titu1994