
Twitter-Sentiment-Analysis

A weekday Twitter sentiment analysis on Tesla

Presentation

Dashboard

Project Overview

The purpose of this project was to build a Twitter sentiment monitoring dashboard. Using NLP, a dataset of over 40K pre-labeled tweets (positive, negative, and neutral) was used to train and test different machine learning models. New tweets were then collected via the Twitter API v2 and passed to the selected model to predict their sentiment. The dashboard presents the sentiment analysis by weekday, with Tesla as the chosen topic.

Resources

Method

A. Data Gathering

  1. SemEval Dataset - The dataset contains tweets categorized into three labels: positive, negative, and neutral. The data comes from the International Workshop on Semantic Evaluation (SemEval) task on message polarity classification.
  2. API Tweets
  • New tweets were collected via the standard Twitter API v2 access.
  • The endpoints used were recent search and recent tweet counts, which return public tweets and tweet counts for a query over the last 7 days (see the request sketch below).
  • Data was gathered from Nov. 1-14, 2021 (two weeks), for a total of 67,200 tweets (200 tweets per hour).
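
As a hedged illustration of this step, the sketch below calls the Twitter API v2 recent search endpoint with the requests library. The bearer-token handling, query string, and field selection are assumptions for illustration, not the repository's exact code.

```python
# Minimal sketch: fetch one page of recent tweets from the Twitter API v2
# recent search endpoint. Environment variable name and query are assumptions.
import os
import requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # assumed to be set beforehand
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def search_recent_tweets(query: str, max_results: int = 100) -> list[dict]:
    """Fetch one page of public tweets from the last 7 days matching the query."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {
        "query": query,                 # e.g. "tesla lang:en -is:retweet"
        "max_results": max_results,     # 10-100 per request
        "tweet.fields": "created_at,lang",
    }
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()
    return response.json().get("data", [])


if __name__ == "__main__":
    tweets = search_recent_tweets("tesla lang:en -is:retweet")
    print(f"Fetched {len(tweets)} tweets")
```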

B. Data Cleaning

  • A function was created to clean the tweets, using regular expressions to remove retweets, hyperlinks, hashtags, numbers, emojis, and mentions. NLP was then applied with the NLTK library by tokenizing the tweets, removing stop words, and lemmatizing the resulting list of words (a sketch follows below).
  • A second function applied the cleaning function, built the dataframe, and assigned a score (1: positive, 0: neutral, -1: negative) to each tweet.
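
The sketch below shows a simplified version of the cleaning step described above, assuming the regex patterns and function names used here for illustration; the repository's actual implementation may differ in detail.

```python
# Simplified cleaning sketch: regex removal of retweets, links, hashtags,
# numbers, and mentions, then NLTK tokenization, stop-word removal, and
# lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def clean_tweet(text: str) -> list[str]:
    """Return a cleaned, lemmatized list of words for one tweet."""
    text = re.sub(r"^RT\s+", "", text)        # retweet marker
    text = re.sub(r"https?://\S+", "", text)  # hyperlinks
    text = re.sub(r"[@#]\w+", "", text)       # mentions and hashtags
    text = re.sub(r"\d+", "", text)           # numbers
    text = re.sub(r"[^\w\s]", "", text)       # punctuation, emojis, symbols
    tokens = word_tokenize(text.lower())
    return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP_WORDS]
```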

Tweets Before Cleaning

Tweets After Cleaning

C. Machine Learning

  • Preprocessing of data: a .py script applied the cleaning function to the tweets newly downloaded from the Twitter API so they could be fed to the model for sentiment prediction.
  • Modeling: three algorithms were trained and compared to see which performed best (a sketch of the comparison follows the table below). As part of the optimization, neutral tweets were dropped to improve the model scores. Based on the results, LinearSVC was selected, with an accuracy score of 0.83.
Model | Accuracy Score | Precision / Recall / F1 Score
Bernoulli Naive Bayes | 0.72 | (reported as image)
Logistic Regression | 0.82 | (reported as image)
Linear Support Vector Classification | 0.83 | (reported as image)
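
A hedged sketch of the comparison is shown below using scikit-learn. The use of TF-IDF features, the column names, and the hyperparameters are assumptions for illustration; the source only states which three classifiers were compared and their accuracy scores.

```python
# Sketch: train three scikit-learn classifiers on vectorized tweet text and
# compare held-out accuracy. Feature choice (TF-IDF) is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC


def compare_models(texts, labels):
    """Fit each candidate model and print its accuracy on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42
    )
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    models = {
        "Bernoulli Naive Bayes": BernoulliNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "LinearSVC": LinearSVC(),
    }
    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        preds = model.predict(X_test_vec)
        print(f"{name}: accuracy = {accuracy_score(y_test, preds):.2f}")
    return vectorizer, models["LinearSVC"]
```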

D. Cleaning and Scoring API tweets

  • The same cleaning function was applied to the tweets downloaded via the API to turn each tweet into a list of words as a preprocessing step.
  • The model and vectorizer from the machine learning section were saved as pickle files so they could be imported and reused on the new set of tweets.
  • The downloaded data and the predicted scores were merged into dataframes, stored in the database, and also saved as a JSON file for the visualization (see the sketch below).
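
A minimal sketch of this step is shown below, assuming the pickle file names, dataframe columns, and output file name used here; they are placeholders, not the repository's actual names.

```python
# Sketch: load the pickled vectorizer and model, score newly cleaned tweets,
# and save the merged result as JSON for the visualization.
import pickle

import pandas as pd

with open("vectorizer.pkl", "rb") as f:       # placeholder file name
    vectorizer = pickle.load(f)
with open("model.pkl", "rb") as f:            # placeholder file name
    model = pickle.load(f)


def score_tweets(tweets_df: pd.DataFrame) -> pd.DataFrame:
    """Predict a sentiment score for each cleaned tweet and append it as a column."""
    features = vectorizer.transform(tweets_df["clean_text"])
    tweets_df["sentiment"] = model.predict(features)
    return tweets_df


if __name__ == "__main__":
    df = pd.read_json("new_tweets.json")      # placeholder input
    scored = score_tweets(df)
    scored.to_json("scored_tweets.json", orient="records")
```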

E. Database

  • An AWS RDS instance was created and connected to pgAdmin where a new database was created.
  • Tables were generated using a schema.
  • Data files were uploaded to an S3 bucket where PySpark was used to extract and transform the data to match the tables in pgAdmin (see ETL process).
  • A connection was made to the AWS RDS instance to write the resulting dataframes to the corresponding tables (see the ETL sketch below).
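
The sketch below illustrates the ETL flow under stated assumptions: the S3 path, schema columns, JDBC endpoint, and credentials are placeholders, and the Spark package versions are examples rather than the project's configuration.

```python
# Sketch: PySpark reads data from S3, trims it to the schema columns, and
# writes it to the PostgreSQL tables in AWS RDS via JDBC.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("TwitterSentimentETL")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.2.0,org.postgresql:postgresql:42.2.16",
    )
    .getOrCreate()
)

# Extract: read the raw file from S3 (bucket/key are placeholders; AWS
# credentials are assumed to be configured for the s3a connector).
tweets_df = spark.read.json("s3a://example-bucket/scored_tweets.json")

# Transform: keep only the columns defined in the pgAdmin schema (assumed names).
tweets_df = tweets_df.select("tweet_id", "created_at", "clean_text", "sentiment")

# Load: write the dataframe to the RDS table over JDBC (placeholder endpoint).
jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/twitter_db"
tweets_df.write.jdbc(
    url=jdbc_url,
    table="tweets",
    mode="append",
    properties={
        "user": "postgres",
        "password": "<password>",
        "driver": "org.postgresql.Driver",
    },
)
```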

Sample Join Query Result

F. Interactive App/Dashboard

Filter By Week

- Date range of the week
- Total tweets mentioning @tesla
- Average tweets per day
- Tweets per weekday

Filter By Weekday

- Accumulated weekday tweets
- Number of weeks so far
- Overall sentiment distribution 
- Positive rate
- Five positive tweets and five negative tweets
- Top 10 keywords
- Top 100 keywords cloud
- Line chart of sentiment distribution by hour

Dashboard Showcase

Tesla.Twitter.Sentiment.Dashboard.mp4

References

Other useful articles: