Speech Processor

Speech-to-text from videos and audios (including YouTube and TikTok links).

📗 This project fetches videos and audios from the internet, transcribes them with AI speech-to-text, and saves the results.

πŸ§‘β€πŸ’» This documentation used to be mostly addressed to non-developers. Now, it is addressed only to developers. Otherwise, it would get crazy how much I would have to explain.

⚠️ The main branch is used for development and is not stable for usage. Please use a release instead, preferably the latest.

🐞 When there is an error for any item being processed (a non-system-exiting exception), the program just logs the information of the error/exception and carries on to the next item.

Table of Contents

  • Install dependencies
  • Expected input
  • Expected output

Install dependencies

  1. Install Python version 3.11 or newer.
  2. Install the dependencies (the external libraries this program needs to work) with this command in the terminal: pip3 install --no-cache-dir -r requirements.txt --user

You can also use Docker. The requirements for development/test are in requirements-dev.txt.

Expected input

Origin of videos/audios

The input data can come from either a file or Kafka. If the environment variable INPUT_FILE is provided, the input is assumed to come from a file; otherwise, it is expected to come from Kafka. For now, the project assumes that you pass this data correctly.
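For illustration, a minimal sketch of that dispatch in Python (the function name here is hypothetical, not the project's actual API):

import os

def input_source():
    # INPUT_FILE set -> read items from that file; otherwise consume from Kafka.
    return "file" if os.getenv("INPUT_FILE") else "kafka"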

Fetch videos/audios from a file

The environment variable INPUT_FILE must be the path of the file on your disk. For now, the project assumes that the file is actually there and that the program has read permission for it.
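Since the input is JSON (see Input Format below), reading it could look like this minimal sketch (assuming the file holds a JSON array of items):

import json
import os

with open(os.environ["INPUT_FILE"], encoding="utf-8") as input_file:
    items = json.load(input_file)  # a list of items in the Input Format below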

Fetch videos/audios from Kafka

It uses the environment variable KAFKA_URL (defaulting to localhost:9092 if not provided) and requires KAFKA_RESOURCE_TOPIC. It reads the Kafka messages from that URI and that topic, with partition/group_id number 1. If Kafka is not running at the provided URI, the program just sleeps until it is available.
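A minimal consumer sketch with the kafka-python package (the choice of client library here is an assumption) following the defaults described above:

import json
import os

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    os.environ["KAFKA_RESOURCE_TOPIC"],  # required
    bootstrap_servers=os.getenv("KAFKA_URL", "localhost:9092"),
    group_id="1",  # group_id number 1, as described above
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    item = message.value  # one video/audio item in the Input Format below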

Additional input information through ENV vars

  • MAX_THREADS is how many items are processed at once. It is optional and defaults to 5.
  • SPEECH_ENV is the environment the project is running in. It is required and must be one of production, test or development. I'm assuming that it is being passed correctly (see the sketch below).
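Illustratively, reading these could look like this (a sketch, not the project's actual code):

import os
from concurrent.futures import ThreadPoolExecutor

max_threads = int(os.getenv("MAX_THREADS", "5"))  # optional, defaults to 5
speech_env = os.environ["SPEECH_ENV"]  # required: production, test or development

# Process at most `max_threads` items at once.
executor = ThreadPoolExecutor(max_workers=max_threads)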

Input Format

For now, the project assumes that you pass this data correctly. Here is an example JSON with multiple items covering all possible inputs:

[
  {
    "integration": "youtube",
    "id": "zWQJqt_D-vo",
    "language_code": "ar",
    "resource_id": 1,
    "recognizer": "google",
    "captions": true
  },
  {
    "integration": "youtube",
    "id": "CNHe4qXqsck",
    "language_code": "ar",
    "resource_id": 2
  },
  {
    "integration": "tiktok",
    "id": "7105531486224370946",
    "language_code": "en-au",
    "resource_id": 3
  },
  {
    "integration": "hosted",
    "url": "https://scontent-mad1-1.cdninstagram.com/v/t50.16885-16/10000000_4897336923689152_6953647669213471758_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjEyODAuaWd0di5iYXNlbGluZSIsInFlX2dyb3VwcyI6IltcImlnX3dlYl9kZWxpdmVyeV92dHNfb3RmXCJdIn0&_nc_ht=scontent-mad1-1.cdninstagram.com&_nc_cat=104&_nc_ohc=OfiUjon4e6AAX8fa1iX&edm=ALQROFkBAAAA&vs=504042498033080_1629363706&_nc_vs=HBksFQAYJEdJQ1dtQURBa0s0YkdtWVJBQTRrc1pDMlVZQmdidlZCQUFBRhUAAsgBABUAGCRHSS1IaXhDdlJKbUlTdHdLQUNYaDgzbUpqb1JWYnZWQkFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMRUAACbwmrGErMDmPxUCKAJDMywXQFeRBiTdLxsYEmRhc2hfYmFzZWxpbmVfMV92MREAdewHAA%3D%3D&ccb=7-5&oe=62AC1A5F&oh=00_AT9ijqEfW1SCDHUqt3KK79FNnZmlzE9lqGMEegg35y58VQ&_nc_sid=30a2ef",
    "language_code": "en-US",
    "resource_id": 4
  },
  {
    "integration": "hosted",
    "url": "https://lang_src.s3.amazonaws.com/7a.mp3",
    "language_code": "en-US",
    "resource_id": 5
  },
  {
    "integration": "local",
    "path": "tests/fixtures/example.mp3",
    "language_code": "ar",
    "resource_id": 6
  },
  {
    "integration": "youtube",
    "id": "E6h1HUaDbAk",
    "language_code": "ar",
    "resource_id": 7,
    "recognizer": "google",
    "captions": "try"
  }
]

Here is what each of those parameters means for an item (some are required, others optional, as noted below):

  • integration must be one of these options: youtube, tiktok, hosted or local. hosted means that the item is directly downloadable from the given link, and it is either a video (with audio in the video) or an audio.

  • id is used for items located on either tiktok or youtube. It is the id of the video on those websites. For example:

    • For tiktok, given a URL like https://www.tiktok.com/@robertirwin/video/7105531486224370946, the id would be just 7105531486224370946.
    • For youtube, given a URL like https://www.youtube.com/watch?v=zWQJqt_D-vo, this would be just zWQJqt_D-vo.
  • url is provided only for hosted items. It's the URL from which the item can be downloaded directly; no scraping involved. The supported formats are those supported by pydub's AudioSegment.from_file, like mp4, mp3, wav, m4a and webm.

  • path is provided only for local items. It is the location of the resource on the same machine where you are running this program.

  • language_code must be a language code from the list in the Language Support documentation of Google Cloud Speech-To-Text. Also, the code for Arabic fus7a is just "ar".

  • resource_id is an optional parameter that should not matter to you unless you want the output to be saved in MongoDB. In that case, it must be an integer.

  • recognizer is an optional parameter. The default value is "google". It can be any of these options: "assemblyai", "gladia", "google", "microsoft", "openai". For each of these speech-to-text services, it is assumed that the credentials are in ENV vars:

    • For AssemblyAI, it needs the ASSEMBLYAI_API_KEY.
    • For Gladia, it needs the GLADIA_API_KEY.
    • For Google, no credentials are required.
    • For Microsoft, it needs the MS_AZURE_SPEECH_API_KEY.
  • captions is an optional parameter, only valid for YouTube integrations. If it is present and true, it fetches the subtitles from YouTube captions instead of transcribing with AI; in that case, if the captions in that language are not available on YouTube, it does nothing. If it is present and "try", it tries to fetch the captions from YouTube and, if it can't, it falls back to the AI (see the sketch after this list).
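A sketch of the captions decision described above (fetch_captions and transcribe_with_ai are hypothetical placeholders for the real fetching and speech-to-text steps):

def process_youtube_item(item, fetch_captions, transcribe_with_ai):
    captions = item.get("captions")
    if captions is True:
        # Captions only: if YouTube has none in this language, do nothing.
        return fetch_captions(item)
    if captions == "try":
        # Captions first; fall back to AI speech-to-text if unavailable.
        return fetch_captions(item) or transcribe_with_ai(item)
    # No captions parameter: straight to AI speech-to-text.
    return transcribe_with_ai(item)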

Expected output

Destination

The output data can be saved in either a file or MongoDB. The environment variable SUBS_LOCATION can be either mongodb or file; by default, it assumes mongodb.

Save in a file

It saves the output inside this same project, under resources/subtitles, in one of its three subfolders (development, test or production) depending on which environment you are in, which it takes from the environment variable SPEECH_ENV.
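For example, with SPEECH_ENV=development, a subtitle file would land somewhere like this (the file name itself is an assumption, since it is not specified here):

import os

subtitles_dir = os.path.join("resources", "subtitles", os.environ["SPEECH_ENV"])
output_path = os.path.join(subtitles_dir, "1.srt")  # hypothetical name from resource_id 1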

Save in MongoDB

It uses the environment variable MONGO_URL (defaulting to localhost:27017 if not provided) and requires MONGO_DB for the database collection name. For MongoDB, it assumes that you pass all the data correctly, that the server is running, and that this project can actually connect and write to it.
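A minimal sketch of such a write using pymongo (assuming pymongo as the client; the collection name here is also an assumption), producing a document in the Output Format below:

import os
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient(os.getenv("MONGO_URL", "localhost:27017"))
database = client[os.environ["MONGO_DB"]]  # MONGO_DB is required

database["subtitles"].insert_one({
    "resource_id": 1,
    "lines": [{"timestamp": "00:00:00,000 --> 00:00:03,000", "text": "Hello!"}],
    "language_code": "en-US",
    "created_at": datetime.now(timezone.utc),
})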

Output Format

  • The file output is saved in the SRT format, which includes timestamps, like this:
1
00:00:03,400 --> 00:00:06,177
In this lesson, we're going to be talking about finance. And

2
00:00:06,177 --> 00:00:10,009
one of the most important aspects of finance is interest.
  • The MongoDB output is saved in the following format (see the sketch after this list for how the lines map to SRT):
{
  'resource_id': 1, // The resource_id provided in the input of the item. If not provided, it defaults to -1.
  'lines': [
    {'timestamp': '00:00:00,000 --> 00:00:03,000', 'text': 'Hello!'},
    {'timestamp': '00:00:05,000 --> 00:00:08,000', 'text': 'My name is Marta'}
  ], // The processed text itself, saved as an array of lines: for each line, its text and its timestamps (start and end of the line). The lines are sorted by timestamp (the same order as in the resource).
  'language_code': 'en-US', // The same value as the 'language_code' of the input given for this item.
  'created_at': 06/11/2022, 18:54:36 // A datetime value: the current datetime (in UTC) at the moment the text is saved.
}
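For reference, a minimal sketch of how lines in the MongoDB format map to the SRT text shown above (illustrative only, not the project's actual code):

def to_srt(lines):
    # Each SRT block: a 1-based index, the timestamp line, then the text.
    blocks = [
        f"{index}\n{line['timestamp']}\n{line['text']}"
        for index, line in enumerate(lines, start=1)
    ]
    return "\n\n".join(blocks)

print(to_srt([
    {'timestamp': '00:00:00,000 --> 00:00:03,000', 'text': 'Hello!'},
    {'timestamp': '00:00:05,000 --> 00:00:08,000', 'text': 'My name is Marta'}
]))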