phubbard/tgn-whisperer

Introduction

When I discovered the whisper.cpp project, I had the idea of transcribing a podcast made by some friends of mine, The Grey Nato, and later also the 40 and 20 podcast, which I also enjoy.

It's running on my Raspberry Pi 5 and the results (static websites) are deployed to

Take a look! This code and the sites are provided free of charge as a public service to fellow fans, listeners and those who find the results useful.

After I got whisper.cpp working, an acquaintance on the TGN Slack pinged me to try OctoAI's paid/hosted version with speaker diarization, and I've rewritten the code to use that. Diarization works well; the next step was naming each speaker via a combination of heuristics and an LLM. Brad and I solved that with Claude 3.5 Sonnet, and got episode summaries as well.

This repo contains the code and some notes for myself and others. As of 10/9/2023, the code handles two podcasts and is working well.

Acknowledgements

[JetBrains logo]

Thank you to JetBrains for an open-source license for their developer tools. As of December 2023, I have a free year of their IDEs to use on this project. I've been using the excellent and free PyCharm Community Edition, and I'm looking forward to the full PyCharm and their other tools. Super cool of them.

Goals

  1. As simple as possible - use existing tools whenever possible
  2. Incremental - add new episodes easily, without reworking previous ones

Workflow and requirements

  1. Download the RSS file (process.py, using Requests)
  2. Parse it for the episode MP3 files (xmltodict)
  3. Call WhisperX on each episode (command line, passing files by reference)
  4. Speaker attribution (attribute.py, using Claude 3.5 Sonnet)
  5. Episode synopsis (attribute.py, as part of the same Claude call)
  6. LLM retries using the Tenacity library (Claude sometimes claims copyright and refuses to work)
  7. Export text into markdown files (to_markdown.py)
  8. Generate a site with mkdocs
  9. Publish (rsync)
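
Roughly, the Python side of steps 1-6 looks like the sketch below. The function names, prompt, and WhisperX invocation here are illustrative assumptions, not the exact contents of process.py or attribute.py.

 import subprocess

 import anthropic
 import requests
 import xmltodict
 from tenacity import retry, stop_after_attempt, wait_exponential

 def fetch_episodes(rss_url):
     """Steps 1-2: download the RSS feed and pull out each episode's MP3 enclosure."""
     feed = xmltodict.parse(requests.get(rss_url, timeout=30).text)
     return [{"title": item["title"], "mp3_url": item["enclosure"]["@url"]}
             for item in feed["rss"]["channel"]["item"]]

 def transcribe(mp3_path):
     """Step 3: hand the downloaded audio file to the WhisperX command-line tool.
     The real invocation also sets model, diarization and output-format flags."""
     subprocess.run(["whisperx", mp3_path], check=True)

 @retry(stop=stop_after_attempt(5), wait=wait_exponential(min=2, max=60))
 def attribute_and_summarize(transcript, client):
     """Steps 4-6: ask Claude to name each speaker and write a synopsis.
     Tenacity retries the call when Claude balks (e.g. a spurious copyright refusal)."""
     response = client.messages.create(
         model="claude-3-5-sonnet-20240620",
         max_tokens=4096,
         messages=[{"role": "user",
                    "content": "Name each speaker and summarize this episode:\n" + transcript}],
     )
     return response.content[0].text

 # Usage, roughly: attribute_and_summarize(transcript_text, anthropic.Anthropic())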

All of these steps are run and orchestrated by two Makefiles: robust, portable, able to delete partial outputs if interrupted, and working pretty well.

Makefiles are tricky to write and debug; I might need remake at some point. The Makefile tutorial here was essential at several points - suffix rewriting, the basename built-in, phony targets, etc. You can do a lot with a Makefile very concisely, and the result is robust, portable, durable and fast.

Another good tutorial (via Lobste.rs) https://makefiletutorial.com/#top

Directory list from StackOverflow ... as one does.

The curse of URL shorteners and bit.ly in particular

For a while, the TGN podcast shared episode URLs via bit.ly. There are good reasons for this, but now when I want to retrieve pages sequentially, bit.ly throws rate limits, and I see no reason to risk errors for readers. So I've built a manual process:

  • Grep the RSS file for bit.ly URLs
  • Save them into a text file called bitly
  • Run the unwrap-bitly.py script to build a JSON dictionary that resolves them
  • process.py then uses the lookup dictionary and saves the canonical URLs
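
unwrap-bitly.py boils down to following redirects. Something like this sketch (the output file name is an assumption; the real script may differ):

 import json

 import requests

 def resolve_short_urls(infile="bitly", outfile="bitly.json"):
     """Map each bit.ly URL (one per line) to its final destination."""
     with open(infile) as f:
         short_urls = [line.strip() for line in f if line.strip()]
     lookup = {}
     for url in short_urls:
         # Let requests follow the redirect chain; the final URL is the canonical one.
         lookup[url] = requests.head(url, allow_redirects=True, timeout=30).url
     with open(outfile, "w") as f:
         json.dump(lookup, f, indent=2)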

Episode numbers and URLs

For a project like this, you want a primary index / key / way to refer to an episode. The natural choice is "episode number". This is a field in the RSS XML:

itunes:episode

However! TGN was bad and didn't include this. What's more, they had episodes in between episodes. The episode_number function in process.py handles this with a combination of techniques:

  1. Try the itunes:episode key
  2. Check the list of exceptions, keyed by string title
  3. Try to parse an integer from the title
  4. As a last resort, assign sequential numbers starting at 2100
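
In sketch form, that lookup order is something like the code below (the exception table entry is made up; the real episode_number has more cases):

 import re

 EXCEPTIONS = {"A Very Special Episode": 2101}  # hand-keyed oddballs; illustrative only
 fallback = 2100

 def episode_number(item):
     global fallback
     # 1. The well-behaved case: the feed provides itunes:episode.
     if "itunes:episode" in item:
         return int(item["itunes:episode"])
     title = item.get("title", "")
     # 2. Known exceptions, keyed by exact title string.
     if title in EXCEPTIONS:
         return EXCEPTIONS[title]
     # 3. Try to pull an integer out of the title ("TGN 123: ...").
     match = re.search(r"\d+", title)
     if match:
         return int(match.group())
     # 4. Last resort: hand out sequential numbers starting at 2100.
     fallback += 1
     return fallback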

The story is very similar for per-episode URLs: they should be there, are often missing, and can sometimes be parsed out of the description.
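
When the URL has to come out of the description, a simple regex is usually enough; roughly:

 import re

 def url_from_description(description):
     """Grab the first http(s) link out of an episode description, if there is one."""
     match = re.search(r"https?://[^\s\"'<>]+", description)
     return match.group() if match else None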

40 & 20 has clean metadata, so this was a ton easier for their feed.

Optional - wordcloud

I was curious how this would look, so I used the Python wordcloud tool. It was a bit fussy to get working with my Python 3.11 install:

 python -m pip install -e git+https://github.com/amueller/word_cloud#egg=wordcloud
 cat tgn/*.txt > alltext
 wordcloud_cli --text alltext --imagefile wordcloud.png --width 1600 --height 1200

[TGN wordcloud]

40 & 20, run Sep 24 2023 - fun to see the overlaps.

[40 & 20 wordcloud]
