Disaster-response-analysis


Text is one of the most widely exchanged kinds of data on the internet: people send billions of text messages every day.

During a global crisis, social media plays a crucial role in transmitting information between people. Some of these messages are very useful for governments and other entities responsible for decision making.

Classifying this textual data therefore makes the transmission of information easier, faster, and more efficient.


Table of Contents


  1. Overview
  2. Installation
  3. File Description
  4. Dataset
  5. Modeling Process
  6. Model evaluation
  7. Screenshots
  8. Effect of Imbalance
  9. Acknowledgments

Overview:


In this project, I analyzed text messages exchanged during disasters, built a model that classifies them into 36 categories so that information reaches the right entity, and implemented a web application that showcases visualizations of the data and the multi-label classification results.

Installation:


  1. First, clone the repository to your local machine using this command:

    git clone git@github.com:aminebennaji19/Disaster-sentiment-analysis.git
    
  2. After cloning the repository, install the project requirements.

  • PS: This project was developed using Python 3.7.

    pip3 install -r requirements.txt
    

File Description



Dataset


The disaster data comes from Appen. It consists of two main files, messages.csv and categories.csv, containing the exchanged messages and their category annotations.

Data Cleaning


The data cleaning process consists of the following steps:

  1. Merge the messages and categories dataframes based on id.
  2. Split the categories into separate columns.
  3. Convert the category values to binary (0/1).
  4. Drop the raw categories column and use the separate category columns instead.
  5. Drop duplicate messages.
  6. Export the final clean dataframe and save it as a table in a SQL database.
  • To run the ETL pipeline that extracts, cleans, and saves the data into a SQL database, use this command:

    python3 data/process_data.py data/messages.csv data/categories.csv data/disaster_response.db    
    

where the system arguments correspond to:

  • messages_path: Path of the CSV file containing the text messages.
  • categories_path: Path of the CSV file containing the message categories.
  • database_filepath: Path of the output database to save.

After the ETL pipeline finishes successfully, you'll find a file named disaster_response.db at the path you specified.
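
As a rough illustration, the cleaning steps above might look like the following pandas sketch. This is a minimal sketch, not the actual implementation in data/process_data.py; the semicolon-separated "name-value" category format and the messages table name are assumptions based on the standard Appen dataset and the modeling section below.

    import pandas as pd
    from sqlalchemy import create_engine

    def clean_and_save(messages_path, categories_path, database_filepath):
        """Minimal ETL sketch of the six steps above; names are illustrative."""
        # 1. Merge the messages and categories dataframes on "id".
        messages = pd.read_csv(messages_path)
        categories = pd.read_csv(categories_path)
        df = messages.merge(categories, on="id")

        # 2. Split the raw "categories" string (assumed format, e.g.
        #    "related-1;request-0;...") into one column per category.
        expanded = df["categories"].str.split(";", expand=True)
        expanded.columns = [value.split("-")[0] for value in expanded.iloc[0]]

        # 3. Keep only the trailing digit and clip values to binary {0, 1}.
        for column in expanded:
            expanded[column] = expanded[column].str[-1].astype(int).clip(0, 1)

        # 4. Replace the raw "categories" column with the expanded ones.
        df = df.drop(columns=["categories"]).join(expanded)

        # 5. Drop duplicate messages.
        df = df.drop_duplicates()

        # 6. Save the clean dataframe as a table in a SQLite database.
        engine = create_engine(f"sqlite:///{database_filepath}")
        df.to_sql("messages", engine, index=False, if_exists="replace")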


Modeling Process


  1. Load the messages table from the SQL database and select the training features and the target category columns.
  2. Clean and tokenize the text messages: remove punctuation, lowercase the text, drop stop words, and lemmatize each token to get the root of the word.
  3. Build a machine learning pipeline using TfidfVectorizer and XGBClassifier, with optional hyperparameter fine-tuning via GridSearchCV.
  4. Split the data into training and test sets.
  5. Train, evaluate, and save the classifier.
  6. Use hyperparameter tuning with 5-fold cross-validation (40 fitted models) to find the best XGBoost model for predicting disaster response categories. These are the best parameters:

    {'multioutput_classifier__estimator__learning_rate': [0.1], 'multioutput_classifier__estimator__max_depth': [5], 'multioutput_classifier__estimator__n_estimators': [200]}
  • To run the machine learning pipeline that prepares the data, trains, evaluates, and saves the classifier into a pickle file, run:

    python3 models/train_classifier.py data/disaster_response.db models/classifier.pkl True
    

Where:

  • database_path: Path of the SQL database containing the transmitted messages.
  • model_filepath: Path of the output model to save.
  • fine_tune: A boolean parameter controlling model fine-tuning.

PS: You can set the fine_tune argument to False to reuse the best parameters and avoid fine-tuning again.

A file called classifier.pkl will be saved to the specified path when the ML pipeline runs.
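
A minimal sketch of the pipeline described above (the tokenize implementation is an assumption based on step 2; the multioutput_classifier step name is taken from the parameter grid reported earlier, and train_classifier.py remains the reference implementation):

    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    nltk.download(["punkt", "stopwords", "wordnet"], quiet=True)

    def tokenize(text):
        """Lowercase, strip punctuation, drop stop words, and lemmatize."""
        text = re.sub(r"[^a-z0-9]", " ", text.lower())
        lemmatizer = WordNetLemmatizer()
        stop_words = set(stopwords.words("english"))
        return [
            lemmatizer.lemmatize(token)
            for token in nltk.word_tokenize(text)
            if token not in stop_words
        ]

    # TF-IDF features feeding one XGBoost classifier per category.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=tokenize)),
        ("multioutput_classifier", MultiOutputClassifier(XGBClassifier())),
    ])

    # Grid spanning the best parameters reported above; GridSearchCV
    # performs the 5-fold cross-validated fine-tuning.
    param_grid = {
        "multioutput_classifier__estimator__learning_rate": [0.1],
        "multioutput_classifier__estimator__max_depth": [5],
        "multioutput_classifier__estimator__n_estimators": [200],
    }
    model = GridSearchCV(pipeline, param_grid, cv=5)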


Model evaluation


We evaluated the model, trained with 5-fold cross-validation, using a classification report for each category.

This figure illustrates the evaluation results for the column named related.
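
The per-category evaluation boils down to slicing the prediction matrix column by column. A hedged sketch, where Y_true, Y_pred, and category_names stand in for the arrays produced by the training script:

    import numpy as np
    from sklearn.metrics import classification_report

    def evaluate_per_category(Y_true, Y_pred, category_names):
        """Print a classification report for each of the 36 categories."""
        Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
        for i, category in enumerate(category_names):
            print(f"--- {category} ---")
            print(classification_report(Y_true[:, i], Y_pred[:, i], zero_division=0))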


Screenshots


  • Use this command to run the web app showcasing the classifier and visualizations of the data:

    python3 app/run.py
    
  • Go to http://0.0.0.0:3001/ in your local browser to access the app.

These figures illustrate the home page of the web application alongside a multi-label classification result.
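
For reference, the classification route in app/run.py might look roughly like the sketch below. This assumes a Flask app (as in the standard Udacity template); the route, template name, and column layout are illustrative assumptions, not this repo's exact code.

    import joblib
    import pandas as pd
    from flask import Flask, render_template, request
    from sqlalchemy import create_engine

    app = Flask(__name__)

    # Load the trained model and the category names from the database.
    model = joblib.load("models/classifier.pkl")
    engine = create_engine("sqlite:///data/disaster_response.db")
    df = pd.read_sql_table("messages", engine)
    category_names = df.columns[4:]  # assumes id/message/original/genre come first

    @app.route("/go")
    def go():
        # Classify the user's message and map predictions to category names.
        query = request.args.get("query", "")
        labels = model.predict([query])[0]
        results = dict(zip(category_names, labels))
        return render_template("go.html", query=query, classification_result=results)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=3001)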





Effect of Imbalance:


The dataset is highly imbalanced across all the categories, which affects the behaviour of the classifier and causes it to miss predictions for rare classes.

The model does not generalize well for classes with few samples. We can reduce this effect with techniques that address dataset imbalance, such as resampling or class weighting; see the sketch below.

For some categories we should focus on recall, since precision is similar across the categories.
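
One possible mitigation (not implemented in this repo): up-weight the rare positive class per category via XGBoost's scale_pos_weight, fitting one weighted classifier per column. A minimal sketch, assuming X holds the TF-IDF features and Y the binary category matrix:

    import numpy as np
    from xgboost import XGBClassifier

    def fit_weighted_per_category(X, Y, category_names):
        """Fit one XGBoost classifier per category, up-weighting positives."""
        models = {}
        for i, category in enumerate(category_names):
            y = np.asarray(Y)[:, i]
            positives = max(np.count_nonzero(y), 1)
            # scale_pos_weight ~ (#negatives / #positives) counters imbalance.
            weight = (len(y) - positives) / positives
            models[category] = XGBClassifier(scale_pos_weight=weight).fit(X, y)
        return models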


Acknowledgments:


This project is part of the Udacity Data Science Nanodegree program, which is really helpful and practical. Many thanks to Appen for providing the data needed to perform this case study.

Feel free to use this code, share it with the community, and contact me about anything related to this project.
