This Project is about predicting recommendation Rating based on Reviews data of customers.
Each record in the dataset is a customer review which consists of the review title, text description and a rating (ranging from 1 - 5) for a product amongst other features. I have converted this into a binary classification problem such that a customer recommends a product (label 1:good) if the rating is > 3 else they do not recommend the product (label 0:not good).
- Merge all review text attributes (title, text description) into one attribute.
- Convert the 5-star rating system into a binary recommendation rating of 1 or 0. 3.Removed all records without review text.
1.Checked class imbalance. 2.Build train and test data
- Word Count: total number of words in the documents
- Character Count: total number of characters in the documents
- Average Word Density: average length of the words used in the documents
- Puncutation Count: total number of punctuation marks in the documents
- Upper Case Count: total number of upper count words in the documents
- Title Word Count: total number of proper case (title) words in the documents
Here we saaw problem that our model was not learning anything and wa just classifying every product as label 1 that is good product
Reviews are pretty subjective, opinionated and people often express stong emotions, feelings. This makes it a classic case where the text documents are a good candidate for extracting sentiment as a feature.TextBlob is an excellent open-source library for performing NLP tasks with ease, including sentiment analysis. It also an a sentiment lexicon which it leverages to give both polarity and subjectivity scores. 1.The polarity score is a float within the range [-1.0,1.0]. 2.The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
Step 6 : Trained Logistic Regression Model on these new set of features and Evaluated Model Performance
Now our model is able to classify bad products also and is doing good but to make it a little more good we are gonna introduce some other features also to increase performance
1.Text Lowercasing 2.Removal of contractions 3.Removing unnecessary characters, numbers and symbols 4.Stemming 5.Stopword removal
Now our model is doing much better than before two cases. f1 score for bad reviews are 72% and good reviews are 92%. Overall f1 score is 88%.