A multi-threaded GitHub scraper that collects Python code with docstrings from public repositories to build a well-documented dataset for the JaraConverse LLM.
nlp
scraper
script
python3
dataset
dataset-generation
nlp-machine-learning
data-scraping
github-scraper
python-code
docstring-generator
llm
causal-language-modeling
dataset-scripts
python-dataset
llm-training
docst
-
Updated Jul 21, 2024 - Python
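
The description mentions collecting Python code with docstrings using multiple threads. The repository's actual implementation is not shown here, so the following is only a minimal sketch of that idea, assuming the source files have already been downloaded as strings. The function names `extract_documented_functions` and `build_dataset` are hypothetical and not taken from the project; network access to GitHub is deliberately left out.

```python
import ast
from concurrent.futures import ThreadPoolExecutor


def extract_documented_functions(source):
    """Return one record per function in `source` that has a docstring."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # skip files that do not parse as Python 3

    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only documented functions
                records.append({
                    "name": node.name,
                    "docstring": doc,
                    "code": ast.get_source_segment(source, node),
                })
    return records


def build_dataset(sources, max_workers=4):
    """Process many source strings in a thread pool and flatten the results.

    Hypothetical helper: in a real scraper the threads would mostly help
    with network-bound downloads rather than with parsing.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = list(pool.map(extract_documented_functions, sources))
    return [record for records in per_file for record in records]


if __name__ == "__main__":
    sample = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
    for record in build_dataset([sample]):
        print(record["name"], "-", record["docstring"])
```

Parsing with `ast` rather than regular expressions keeps the docstring detection robust to formatting differences, and files that fail to parse are skipped instead of aborting the run.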