Sentiment Analysis Model.Rmd

---
title: "Sentiment Analysis"
output:
  html_document:
    df_print: paged
---


# Developing Sentiment Analysis Model

The dataset is provided by the R package 'janeaustenR'.

tidytext package comprises of sentiment lexicons that are present in the dataset of 'sentiments'.

```{r}
#install.packages("tidytext")
library(tidytext)
```

```{r}
# view dictionary
sentiments
```

We will make user of three general purpose lexicons like:

- AFINN
- bing
- loughran

These three lexicons make use of the unigrams.

Unigrams are a type of n-gram model that consists of a sequence of 1 item, that is, a word collected from a given textual data.

**AFINN** lexicon model scores the words in a rangefrom - 5 to 5. The increase in negativity corresponds the negative sentiment whereas an increase in positivity corresponds the positive one.

**Bing** lexicon model classifies the sentiment into a binary category of negative or positive.

The **loughran** model performs analysis of the shareholder's reports. 

Following analysis will make a use of bing lexicons to extract the sentiments out of the data.

```{r}
get_sentiments("bing")
```

# getting the data

```{r}
library(janeaustenr)
library(stringr)
library(tidytext)
library(dplyr)
```

We will convert the text of the books into a tidy format using unnest_tokens() function.

```{r}
tidy_data <- austen_books() %>% group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                          ignore_case = TRUE)))) %>%
# adds row number and relevant chapter number, each row represents a sentense from book
ungroup() %>%
unnest_tokens(word, text)
```

We have performed the tidy operation on our text such that each row contains a single word. We will now make use of the “bing” lexicon to and implement filter() over the words that correspond to joy. We will use the book Sense and Sensibility and derive its words to implement out sentiment analysis model.

```{r}
positive_sentiment <- get_sentiments("bing") %>%
  filter(sentiment == "positive")

tidy_data %>%
  filter(book == "Emma") %>%
  semi_join(positive_sentiment) %>%
  count(word, sord = TRUE)
# semi_join will return all words from table tidy_data which has match in positive_sentiment; joined by word
```

We could examine how sentiment changes throughout each novel. We can do this with just few lines of mostly dplyr functions. First, we find a sentiment score for each word using the Bing lexicon and inner join.

Next, we count up how many positive and negative words there are in defined sections of each book. We define an index here to keep track of where we are in the narative. This index (using integer division) counts up sections of 80 lines of text.

The %/% operator does integer division (x %/% y is equivalent to floor(x/y) so the index keeps track of which 80-line section of text we are counting up negative and positive sentiment in).

Small sections of text may not have enough words in them to get a good estimate of sentiment, while really large sections can wash out narative structure. For these books, using 80 lines works well, but this can vary depending on individual texts, how long the lines were to start with, etc. We then use spread() so that we have negative and postive sentiment in separate colums, and lastly calculate a net sentiment (positive - negative).

```{r}
library(tidyr)
bing <- get_sentiments("bing")
Emma_sentiment <- tidy_data %>%
  inner_join(bing) %>%
  count(book = "Emma", index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% # this will show index and number of positive and negative words
mutate(sentiment = positive - negative)
```

Next step is to visualise thee words present in the book Emma based on their corresponding positive and negative scores.

```{r}
library(ggplot2)

ggplot(Emma_sentiment, aes(index, sentiment, fill = book)) +
 geom_bar(stat = "identity", show.legend = TRUE) +
 facet_wrap(~book, ncol = 2, scales = "free_x")
```

Next step is to count the most common positive and negative words that are present in the book Emma.

```{r}
counting_words <- tidy_data %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)

head(counting_words, 100)
```

In the next step, we will perform visualisation of the sentiment score. We will plot the scores along the axis that is labeled with both positive as well as negative words. We will use ggplot() function to visualise our data based on their scores.

```{r}
counting_words %>%
 filter(n > 150) %>%
 mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
 mutate(word = reorder(word, n)) %>%
 ggplot(aes(word, n, fill = sentiment))+
 geom_col() +
 coord_flip() +
 labs(y = "Sentiment Score")
````

In the final part of visualisation, we will crease a wordcloud that will delineate the most recurring positive and negative words. We will use the comparision.cloud() fuction to plot both negative and positive words in a single wordcloud.

```{r}
library(reshape2)
library(wordcloud)

tidy_data %>%
 inner_join(bing) %>%
 count(word, sentiment, sort = TRUE) %>%
 acast(word ~ sentiment, value.var = "n", fill = 0) %>%
 comparison.cloud(colors = c("red", "dark green"),
          max.words = 100)
```

The wordcloud enables us to efficiently visualize the negative as well as postive groups of data.


# Sentiment of individual books

```{r}
janeaustensentiment <- tidy_data %>%
  inner_join(get_sentiments("bing")) %>%
  count(book , index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% # this will show index and number of positive and negative words
  mutate(sentiment = positive - negative)
```

```{r}
ggplot(janeaustensentiment, aes(index, sentiment, fill=book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

```

We can now see how the plot of each novel changes toward more positive or negative sentiment over the trajectory of the story.

More on Text Mining with R: https://books.google.co.jp/books?id=qNcnDwAAQBAJ&pg=PA18&lpg=PA18&dq=linenumber+%25/%25+80&source=bl&ots=Q0CY6mGZtX&sig=ACfU3U1wUbx31ixGfacMFKAUrAiCEp3-jQ&hl=en&sa=X&ved=2ahUKEwjj1qKy6anpAhWWMN4KHRJEBUwQ6AEwAHoECAUQAQ#v=onepage&q=linenumber%20%25%2F%25%2080&f=false