- Run `python BotScript.py` for the Tweetbot (this script only works with the `GPT2-rap-recommended` model).
- Run `python src/test_generation.py` with the proper parameters to test language generation.
- To quit the program, use `CTRL+C`.
- Install the requirements: `pip install -r requirements.txt`
- Alternatively, run `./install.sh`
- Apply for a Twitter Developer Account with elevated access
- Create an `.env` file including the variables `CONSUMER_API_KEY`, `CONSUMER_API_KEY_SECRET`, `ACCESS_TOKEN`, and `ACCESS_TOKEN_SECRET`, and provide the necessary credentials for each variable.
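For reference, the `.env` file could look like this (the values below are placeholders, not real credentials):

```
CONSUMER_API_KEY=your_consumer_api_key
CONSUMER_API_KEY_SECRET=your_consumer_api_key_secret
ACCESS_TOKEN=your_access_token
ACCESS_TOKEN_SECRET=your_access_token_secret
```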
- Download fasttext's language identification model and place it in the same folder as this file.
- Create a folder called `.model` in the same folder as this file and place the proper finetuned GPT-2 model (see the Models section) inside it (`.model/GPT2-rap-recommended/config.json pytorch...`). The model is available here.
- Hardware capable of running GPT-2.
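Assuming a standard Hugging Face checkpoint layout (file names beyond `config.json` depend on the downloaded model, so treat this as an illustration), the folder would look roughly like:

```
.model/
└── GPT2-rap-recommended/
    ├── config.json
    ├── pytorch_model.bin
    └── ...
```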
- We gathered raps from genius.com, ohhla.com, and battlerap.com. For genius.com, we used the official API (GeniusLyrics and GetRankings repos), while ohhla.com and battlerap.com were scraped using a specifically tailored scrapy scraper. In total we gathered ~70k raps, which we used for finetuning. GPT-2 was finetuned on one large concatenated text, while T5 was finetuned on prompts of the form `KEYWORDS: <keywords> RAP-LYRICS: <rap text>`, which proved to be insufficient for our task. Eventually we chose to use the finetuned GPT-2 model. Experimental and successful scripts can be found in `./preprocessing/finetuning`. Additionally, a RoBERTa model was finetuned on data from the English Wikipedia, hate-speech tweets, the CNN/DailyMail dataset, and 4k rap lyrics (data can be found under `Data`) to classify the quality of the generated raps.
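As an illustration, the T5 prompts described above could be assembled like this (`build_prompt` is a hypothetical helper for this sketch, not a function from the repo):

```python
# Hypothetical helper illustrating the KEYWORDS/RAP-LYRICS prompt format;
# the actual finetuning scripts may construct prompts differently.
def build_prompt(keywords, rap_text):
    """Join a list of keywords and a rap text into one training prompt."""
    return f"KEYWORDS: {', '.join(keywords)} RAP-LYRICS: {rap_text}"

prompt = build_prompt(["hustle", "city"], "Started from the block...")
print(prompt)
# KEYWORDS: hustle, city RAP-LYRICS: Started from the block...
```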
finetuning
- `FineTuneRapMachineExp.ipynb`: Experimental script
- `FineTuneRapMachineGPT2.ipynb`: GPT2 finetuning script
- `T5.ipynb`: Finetuning script for T5 on a key2text approach
- `keytotext.ipynb`: Using the keytotext library for finetuning
- `FineTuneRapMachineExp2.ipynb`: Another experimental script, in which GPT-J and GPT-Neo were used, yet didn't succeed
data_analysis
- `CreateAdvData.ipynb`: Script to create a balanced dataset to train the ranker model
- `LyricsAnalyze.ipynb`: Script to analyze the scraped data
lyrics_spider
- Includes a scrapy program to obtain lyrics
cleaning_and_keywords
- `data_cleaner`: Script for removing noise from the 70k scraped rap corpus
- `kw_extraction`: Script that builds a TF-IDF model, either from scratch or from an existing model, to generate keywords for the rap corpus
- `tf_idf`: TF-IDF model script
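To illustrate the idea behind TF-IDF keyword extraction, here is a minimal stdlib sketch; the repo's scripts use their own trained model, and all names and the toy corpus below are illustrative only:

```python
import math
from collections import Counter

def tf_idf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF score."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    # TF-IDF = term frequency * log(inverse document frequency).
    scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

docs = [
    "mic check one two on the mic",
    "flow so cold flow so bold",
    "one two three to the beat",
]
print(tf_idf_keywords(docs, 1))  # -> ['flow', 'so', 'cold']
```

Terms unique to a document and frequent within it (like "flow" above) score highest, which is what makes TF-IDF a reasonable keyword generator for a rap corpus.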
ranker
- `roberta_ranker.ipynb`: RoBERTa finetuning script
- ohhla.com - Scraped
- BattleRap.com - Scraped
- Genius.com - Accessed through the API; GeniusLyrics and GetRankings were used.
- To obtain lyrics from genius.com, two programs were implemented, each based on a different, yet outdated, repository.
- Both programs are part of this project.
- GPT2-rap-recommended Download (Necessary to use BotScript.py)
- GPT2-small-key2text Download (Approach did not work out, trained on 4k corpus)
- RoBERTa Ranker Download (Ranker trained on 8k data with 4k rap corpus and 4k non-rap corpus)
- T5-large-key2text Download (Approach did not work out, trained on 70k corpus)
- T5-small-key2text Download (Approach did not work out, trained on 4k corpus)
- tf-idf pickle Download (Approach did not work out, trained on 70k corpus)
- Our data can be downloaded here