phubbard/tgn-whisperer

Introduction

When I discovered the whisper.cpp project, I had the idea of transcribing a podcast made by some friends of mine, The Grey Nato, and later also the 40 and 20 podcast, which I also enjoy.

It's running on my Raspberry Pi 5 and the results (static websites) are deployed to

Take a look! This code and the sites are provided free of charge as a public service to fellow fans, listeners and those who find the results useful.

After I got whisper.cpp working, an acquaintance on the TGN Slack pinged me to try OctoAI's paid/hosted version with speaker diarization, and I've rewritten the code to use that. Diarization works well; the next step was naming each speaker via a combination of heuristics and an LLM. Brad and I solved that with Claude 3.5 Sonnet, and got episode summaries as well.

This repo contains the code and some notes for myself and others. As of 10/9/2023, the code handles two podcasts and is working well.

Acknowledgements

[JetBrains logo]

Thank you to JetBrains for an open-source license for their developer tools. As of December 2023, I have a free year of their IDEs to use on this project. I've been using the excellent and free PyCharm Community Edition, and I'm looking forward to the full PyCharm and their other tools. Super cool of them.

Goals

  1. As simple as possible - use existing tools whenever possible
  2. Incremental - add new episodes easily, without reworking previous ones

Workflow and requirements

  1. Download the RSS file (process.py, using Requests)
  2. Parse it for the episode MP3 files (xmltodict)
  3. Call WhisperX on each episode (command line, passing files by reference)
  4. Speaker attribution (attribute.py, using Claude 3.5 Sonnet)
  5. Episode synopsis (attribute.py, as part of the same Claude call)
  6. LLM retries using the Tenacity library (Claude sometimes claims copyright and refuses to work)
  7. Export text into markdown files (to_markdown.py)
  8. Generate a site with mkdocs
  9. Publish (rsync)
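
Roughly, the Python side of steps 1-6 looks like the sketch below. The function names, prompt, and WhisperX invocation here are illustrative assumptions, not the exact contents of process.py or attribute.py.

 import subprocess

 import anthropic
 import requests
 import xmltodict
 from tenacity import retry, stop_after_attempt, wait_exponential

 def fetch_episodes(rss_url):
     """Steps 1-2: download the RSS feed and pull out each episode's MP3 enclosure."""
     feed = xmltodict.parse(requests.get(rss_url, timeout=30).text)
     return [{"title": item["title"], "mp3_url": item["enclosure"]["@url"]}
             for item in feed["rss"]["channel"]["item"]]

 def transcribe(mp3_path):
     """Step 3: hand the downloaded audio file to the WhisperX command-line tool.
     The real invocation also sets model, diarization and output-format flags."""
     subprocess.run(["whisperx", mp3_path], check=True)

 @retry(stop=stop_after_attempt(5), wait=wait_exponential(min=2, max=60))
 def attribute_and_summarize(transcript, client):
     """Steps 4-6: ask Claude to name each speaker and write a synopsis.
     Tenacity retries the call when Claude balks (e.g. a spurious copyright refusal)."""
     response = client.messages.create(
         model="claude-3-5-sonnet-20240620",
         max_tokens=4096,
         messages=[{"role": "user",
                    "content": "Name each speaker and summarize this episode:\n" + transcript}],
     )
     return response.content[0].text

 # Usage, roughly: attribute_and_summarize(transcript_text, anthropic.Anthropic())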

All of these steps are run and orchestrated by two Makefiles: robust, portable, able to delete partial outputs if interrupted, and working pretty well.

Makefiles are tricky to write and debug; I might need remake at some point. The Makefile tutorial here was essential at several points - suffix rewriting, the basename built-in, phony targets, etc. You can do a lot with a Makefile very concisely, and the result is robust, portable, durable and fast.

Another good tutorial (via Lobste.rs) https://makefiletutorial.com/#top

Directory list from StackOverflow ... as one does.

The curse of URL shorteners and bit.ly in particular

For a while, the TGN podcast shared episode URLs via bit.ly. There are good reasons for this, but now when I want to retrieve pages sequentially, bit.ly throws rate limits, and I see no reason to risk errors for readers. So I've built a manual process:

  • Grep the RSS file for bit.ly URLs
  • Save them into a text file called bitly
  • Run the unwrap-bitly.py script to build a JSON dictionary that resolves them
  • process.py then uses the lookup dictionary and saves the canonical URLs
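
unwrap-bitly.py boils down to following redirects. Something like this sketch (the output file name is an assumption; the real script may differ):

 import json

 import requests

 def resolve_short_urls(infile="bitly", outfile="bitly.json"):
     """Map each bit.ly URL (one per line) to its final destination."""
     with open(infile) as f:
         short_urls = [line.strip() for line in f if line.strip()]
     lookup = {}
     for url in short_urls:
         # Let requests follow the redirect chain; the final URL is the canonical one.
         lookup[url] = requests.head(url, allow_redirects=True, timeout=30).url
     with open(outfile, "w") as f:
         json.dump(lookup, f, indent=2)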

Episode numbers and URLs

For a project like this, you want a primary index / key / way to refer to an episode. The natural choice is "episode number". This is a field in the RSS XML:

itunes:episode

However! TGN was bad and didn't include this. What's more, they had episodes in between episodes. The episode_number function in process.py handles this with a combination of techniques:

  1. Try the itunes:episode key
  2. Check the list of exceptions, keyed by string title
  3. Try to parse an integer from the title
  4. As a last resort, assign sequential numbers starting at 2100
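
In sketch form, that lookup order is something like the code below (the exception table entry is made up; the real episode_number has more cases):

 import re

 EXCEPTIONS = {"A Very Special Episode": 2101}  # hand-keyed oddballs; illustrative only
 fallback = 2100

 def episode_number(item):
     global fallback
     # 1. The well-behaved case: the feed provides itunes:episode.
     if "itunes:episode" in item:
         return int(item["itunes:episode"])
     title = item.get("title", "")
     # 2. Known exceptions, keyed by exact title string.
     if title in EXCEPTIONS:
         return EXCEPTIONS[title]
     # 3. Try to pull an integer out of the title ("TGN 123: ...").
     match = re.search(r"\d+", title)
     if match:
         return int(match.group())
     # 4. Last resort: hand out sequential numbers starting at 2100.
     fallback += 1
     return fallback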

The story is very similar for per-episode URLs: they should be there, are often missing, and can sometimes be parsed out of the description.
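
When the URL has to come out of the description, a simple regex is usually enough; roughly:

 import re

 def url_from_description(description):
     """Grab the first http(s) link out of an episode description, if there is one."""
     match = re.search(r"https?://[^\s\"'<>]+", description)
     return match.group() if match else None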

40 & 20 has clean metadata, so this was a ton easier for their feed.

Optional - wordcloud

I was curious how this would look, so I used the Python wordcloud tool. It was a bit fussy to get working with my Python 3.11 install:

 python -m pip install -e git+https://github.com/amueller/word_cloud#egg=wordcloud
 cat tgn/*.txt > alltext
 wordcloud_cli --text alltext --imagefile wordcloud.png --width 1600 --height 1200

[TGN wordcloud]

40 & 20, run Sep 24 2023 - fun to see the overlaps.

[40 & 20 wordcloud]
