I am planning to train Arabic ASR. What should be the best way? #8106

mesut92 · 2024-01-01T19:03:31Z

mesut92
Jan 1, 2024

The title explains my issue. What should I do? Can the model learn with Arabic letters? Any suggestions or experience with this?

Answered by titu1994

Jan 2, 2024

titu1994 · 2024-01-01T19:33:49Z

titu1994
Jan 1, 2024
Maintainer

You can fine-tune one of the sal or English models by following the fine-tuning tutorial. Let us know if you run into issues.

1 reply

mesut92 Jan 1, 2024
Author

I have fine-tuned Turkish models before, but Arabic does not use the Latin alphabet. Will it still work and reach good performance if I just prepare the manifests and fine-tune? Should I apply a Buckwalter transliteration first? Has anyone here trained an Arabic model?

titu1994 · 2024-01-02T21:07:25Z

titu1994
Jan 2, 2024
Maintainer

A response from Riva Team member

Use UTF-8. There are already UTF-8 characters in the n-gram, vocab, and lexicon files, even for English :)

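I recommend training on Unicode-normalized text to reduce the number of distinct encodings the network has to learn. You can think of normalization as a tool to ensure the same glyph always gets the same encoding. In particular, NFC merges a combining mark with its base character wherever a precomposed form exists (a shorter, more direct encoding). For example, آ (alef with madda) is one code point, U+0622, in NFC (two bytes in UTF-8), and two code points in NFD (the base alef U+0627 and the combining madda U+0653, separate). Note that most Arabic diacritics, such as the tanwin in عً, have no precomposed form and remain separate code points in both NFC and NFD.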

import unicodedata
print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))  # 'a' + combining acute accent -> prints '\xe1'

There are plenty of examples of language models trained directly on even un-normalized UTF-8.
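To make the normalization advice concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It assumes transcripts are plain Python strings. The helper names `normalize_transcript` and `char_vocab` are hypothetical and not part of NeMo or Riva.

```python
import unicodedata


def normalize_transcript(text: str, form: str = "NFC") -> str:
    """Return the transcript in a single Unicode normalization form."""
    return unicodedata.normalize(form, text)


def char_vocab(transcripts):
    """Collect the sorted set of code points seen after NFC normalization.

    Hypothetical helper: a quick way to inspect which characters a
    character-level model or tokenizer would have to cover.
    """
    return sorted({ch for t in transcripts for ch in normalize_transcript(t)})


# Alef with madda can be spelled two ways:
composed = "\u0622"          # precomposed ALEF WITH MADDA ABOVE
decomposed = "\u0627\u0653"  # ALEF + combining MADDAH ABOVE

# NFC maps both spellings to the single precomposed code point.
nfc_a = normalize_transcript(composed)
nfc_b = normalize_transcript(decomposed)

# NFD splits the precomposed form back into base + combining mark.
nfd = normalize_transcript(composed, "NFD")

# Ain + fathatan has no precomposed form, so NFC leaves two code points.
ain_tanwin = normalize_transcript("\u0639\u064b")

# Without normalization the two spellings would look like different symbols;
# after normalization the vocabulary holds one entry for them.
vocab = char_vocab([composed, decomposed])
```

Running the same normalization over both the training manifests and any language-model text keeps the two consistent. Whether to keep or strip diacritics is a separate decision that normalization does not make for you.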

0 replies
