πŸ“š tsundoku

tsundoku is a Python toolkit to analyze Twitter data, following the methodology published in:

Graells-Garrido, E., Baeza-Yates, R., & Lalmas, M. (2020, July). Every colour you are: Stance prediction and turnaround in controversial issues. In 12th ACM Conference on Web Science (pp. 174-183).

About the name: tsundoku is a Japanese word (積んθͺ­) that means "to pile books without reading them" (see more in Wikipedia). It is common to crawl data continuously and do nothing with them later. So, tsundoku provides a way to work with all those piled-up datasets (mainly in the form of tweets).

Development Setup

We use conda to install all necessary packages.

# Clone repository
git clone http://github.com/zorzalerrante/tsundoku

# Move into folder
cd tsundoku

# Create the conda environment and install dependencies
make conda-create-env

# Activate the environment 
conda activate tsundoku

# make the tsundoku module available in your environment
make install-package

Optionally, you can analyze the data generated by tsundoku in a Jupyter environment. In that case, you will need to install a kernel:

# install kernel for use within Jupyter
make install-kernel

Lastly, you may want to estimate embeddings for textual content:

# Optional: install PyTorch if you want to use deep learning models.
# This installs the GPU version (use install-torch-cpu for the CPU version)
make install-torch-gpu

Environment Configuration

Create a .env file in the root of this repository with the following structure:

TSUNDOKU_PROJECT_PATH=./example_project
INCOMING_PATH=/home/egraells/data/tweets/incoming
TSUNDOKU_LANGUAGES="es|und"
TWEET_PATH=/home/egraells/data/tweets/2021_flattened
JSON_TWEET_PATH=/mnt/c/Users/nicol/Escritorio/2022_flattened

This is the meaning of each option:

  • TSUNDOKU_PROJECT_PATH: path to your project configuration (this is explained below).
  • INCOMING_PATH: directory where you store the crawled tweets. This code assumes that you crawl tweets using the Streaming API, storing them in JSON format, one tweet per line, in gzip-compressed files. In particular, we assume that each file contains 10 minutes of tweets.
  • TSUNDOKU_LANGUAGES: a list of languages to be studied. In the example, es is a language, and und is what Twitter defines as undetermined. Some und tweets are also relevant for studies, such as those with emojis and images.
  • JSON_TWEET_PATH: folder where the system stores a first pre-processed version of tweets from INCOMING_PATH. In this first step, tsundoku does two things: first, it keeps tweets in the specified languages; second, it flattens the tweet structure and removes some unused attributes. It does this through the following command:
$ python -m tsundoku.data.filter_and_flatten

Note that this operation deletes the original files in INCOMING_PATH.

  • TWEET_PATH: folder where the system stores tweets in Parquet format, using the following command:
$ python -m tsundoku.data.filter_and_flatten
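As a sketch of how these variables are consumed, the snippet below parses a minimal .env fragment by hand. Note that tsundoku may well read the file through a library such as python-dotenv; the `parse_env` helper here is hypothetical and only illustrates the KEY=VALUE layout and the pipe-separated language list.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines (as in the .env above) into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes, as in TSUNDOKU_LANGUAGES="es|und"
        env[key.strip()] = value.strip().strip('"')
    return env

env = parse_env("""
TSUNDOKU_PROJECT_PATH=./example_project
TSUNDOKU_LANGUAGES="es|und"
""")

# The language list is pipe-separated:
languages = env["TSUNDOKU_LANGUAGES"].split("|")
# languages -> ["es", "und"]
```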

Data Formats

tsundoku assumes that the original tweet files follow a specific naming schema, although this is not strictly required. An example filename is the following:

auroracl_202112271620.data.gz

Where:

  • auroracl_ is an optional prefix. In this case, it's the codename of the project that started this repository a few years ago.
  • The rest is the date of the file: 2021 (year), 12 (month), 27 (day), 1620 (time of day). The time of day means that the file starts at 16:20:00 (and, potentially, ends at 16:30, although this is not enforced).
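The naming schema above can be decoded with a few lines of standard-library Python. The `parse_tweet_filename` helper below is an illustration, not part of tsundoku, and assumes the `auroracl_` prefix and `.data.gz` suffix shown in the example:

```python
from datetime import datetime

def parse_tweet_filename(name, prefix="auroracl_"):
    """Recover the starting datetime encoded in a tweet file name."""
    stem = name.removeprefix(prefix).removesuffix(".data.gz")
    # The remaining stem is YYYYMMDDHHMM, e.g., 202112271620
    return datetime.strptime(stem, "%Y%m%d%H%M")

parse_tweet_filename("auroracl_202112271620.data.gz")
# -> datetime.datetime(2021, 12, 27, 16, 20)
```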

The code I used to crawl tweets from the Twitter Streaming API v1.1 generates these files every 10 minutes. It is available in this repository.

Project Configuration

The TSUNDOKU_PROJECT_PATH folder defines a project. It contains the following files and folders:

  • config.toml: project configuration.
  • groups/*.toml: classifier configuration for several groups of users. This is arbitrary; you can define your own groups. The only mandatory one is relevant.toml.
  • experiments.toml: experiment definition and classifier hyper-parameters. Experiments enable analyses over different periods (for instance, the first and second rounds of a presidential election).
  • keywords.txt (optional): set of keywords to filter tweets. For instance, presidential candidate names, relevant hashtags, etc.
  • stopwords.txt (optional): list of stop words.

Please see the example in the example_project folder.

In config.toml there are two important paths to configure:

[project.path]
config = "/home/egraells/repositories/tsundoku/example_project"
data = "/home/egraells/repositories/tsundoku/example_project/data"

The first path, config, states where the project lies. The second path, data, states where the imported data will be stored. This includes the raw data and the results from processing.

Import data into your project

tsundoku has three folders within the project data folder: raw, interim, and processed.

The raw folder contains a subfolder named json, and within raw/json there is one folder per day, named YYYY-MM-DD. The name of each folder within raw/json can actually be anything, but by convention I use dates, as this makes it easier to organize different experiments.

Currently, there are two ways of importing data: by specifying a set of tweet files to be imported into one folder within raw/json (A), or by importing files whose names encode a datetime structure (B). Both are described next.

If neither of these options works for you, you will have to craft your own importer. Fortunately, the tsundoku.data.importer module contains the TweetImporter class that will help you do so.
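A custom importer might start by mapping each incoming file to its raw/json/YYYY-MM-DD target folder based on the datetime encoded in the filename. The `group_files_by_day` helper below is hypothetical (it is not part of tsundoku, which would do the actual import through TweetImporter) and assumes the naming schema from the Data Formats section:

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path

def group_files_by_day(filenames, prefix="auroracl_"):
    """Map each tweet file to a raw/json/YYYY-MM-DD folder name."""
    targets = defaultdict(list)
    for name in filenames:
        stem = Path(name).name.removeprefix(prefix).removesuffix(".data.gz")
        # Decode YYYYMMDDHHMM and keep only the day component
        day = datetime.strptime(stem, "%Y%m%d%H%M").strftime("%Y-%m-%d")
        targets[day].append(name)
    return dict(targets)

groups = group_files_by_day([
    "auroracl_202112271620.data.gz",
    "auroracl_202112280000.data.gz",
])
# keys: "2021-12-27" and "2021-12-28"
```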

A. Import a set of files into a specific target

The following command imports a set of files into a specific target folder:

$ python -m tsundoku.data.import_files /mnt/storage/tweets/*.gz --target 2021-12-12

This command takes all files matched by the wildcard (you can also list specific files), filters the tweets relevant to the project, and saves them in a folder named 2021-12-12 within the project. The files do not need to be inside TWEET_PATH. However, they do need to be flattened according to the tsundoku.data.filter_and_flatten script.

B. Import by date when files have a specific naming structure

The following command imports a specific date from TWEET_PATH:

$ python -m tsundoku.data.import_date 20211219

Run your project

Let's assume you have already imported the data, and that you have defined at least one experiment. We will run the following commands to perform the experiments:

  1. $ python -m tsundoku.features.compute_features: this will estimate features (such as document-term matrices) for every day in your project.
  2. $ python -m tsundoku.features.prepare_experiment --experiment experiment_name: this will prepare the features for the specified experiment. For instance, an experiment has start/end dates, so this step consolidates the data between those dates only.
  3. $ python -m tsundoku.models.predict_groups --experiment experiment_name --group relevance: this command predicts whether a user profile is relevant or not (noise) for the experiment. It uses an XGBoost classifier.
  4. $ python -m tsundoku.models.predict_groups --experiment experiment_name --group another_group: this command predicts groups within users. Current sample configurations include stance (which candidate does this profile support?), person (sex or institutional account), and location (the different regions of Chile). You can define as many groups as you want. Note that for each group you must define categories in the corresponding .toml file. If a category is called noise, users who fall into it will be discarded when consolidating results.
  5. $ python -m tsundoku.analysis.analyze_groups --experiment experiment_name --group reference_group: this command takes the result of the classification and consolidates the analysis with respect to interaction networks, vocabulary, and other features. It requires a reference group to base the analysis on (for instance, stance lets you characterize the supporters of each presidential candidate).

After this, in your project data folder data/processed/experiment_name/consolidated you will find several files with the results of the analysis.
