Incorporation of pre-trained word embeddings functionality #171

Open
amatsuo opened this issue May 27, 2019 · 9 comments

@amatsuo
Collaborator

amatsuo commented May 27, 2019

spaCy now has this:
https://spacy.io/usage/vectors-similarity

Maybe we want to make this functionality available in spacyr.

Any feedback/suggestions from users are welcome.

@amatsuo amatsuo self-assigned this May 27, 2019
@amatsuo
Collaborator Author

amatsuo commented Jun 30, 2019

I have made a first attempt at this option. Install the issue-171 branch and try the following:

library(spacyr)
# spacy_download_langmodel("en_core_web_md")
spacy_initialize("en_core_web_md") # or spacy_initialize("en_core_web_ld") 
txt <- "To make them compact and fast, spaCy’s small models (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. "
out <- spacy_parse(txt, embedding = TRUE)
attr(out, "embedding")
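
As a quick check of what the branch returns (assuming it attaches a dense numeric matrix of vectors; en_core_web_md vectors have 300 dimensions):

emb <- attr(out, "embedding")
dim(emb)            # assumption: one row per looked-up token, 300 columns for en_core_web_md
head(rownames(emb)) # assumption: rows keyed by token label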

@malickpaye

Hi,
Just tested this feature; it works fine with my configuration.
I think it will definitely be useful for users who need to leverage state-of-the-art NLP approaches while sticking to their favorite [R] language.
Malick.

@kbenoit
Collaborator

kbenoit commented Jul 4, 2019

Just experimented with this. A few comments on the branch.

Since this is looking up the tokens from the language model using Token.vector(), we don't really need to do this at the parsing stage. Instead, we could create a set of functions such as wordvectors_lookup() that look up the word vectors for a spacy_parsed object, but store them with one vector per type.

wordvectors_apply(x, wordvectors) would then apply the vectors created by wordvectors_lookup() to a spacy_parsed object x, or to a quanteda::tokens() object. This means we could make this lookup functionality available to any package that works with tokens or words.

We could create similar functions to weight or replace tokens with their L2-normed vector scores, similar to Token.vector_norm().
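
A rough usage sketch of that proposed pair (the function names come from this comment and are not implemented; the shapes and the quanteda step are assumptions):

library(spacyr)
spacy_initialize("en_core_web_md")

parsed <- spacy_parse("A short example sentence for word vectors.")

# hypothetical: look up one vector per type (unique token), giving an ntype x d matrix
wv <- wordvectors_lookup(parsed)

# hypothetical: apply the type-level vectors to the parsed object (or to a
# quanteda::tokens() object), recovering one vector per token
emb <- wordvectors_apply(parsed, wv)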

@cainesap

cainesap commented Oct 1, 2019

+1 for this. It would be great to access the embeddings and then, for example, use the similarity() function described on the page you first linked to (https://spacy.io/usage/vectors-similarity).

@amatsuo
Collaborator Author

amatsuo commented Oct 24, 2019

@kbenoit and all

I've implemented the very first version of two functions (spacy_wordvectors_lookup() and spacy_wordvectors_apply()). Please test them and give some feedback.

The following is one of the expected use cases: calculating the similarity of short texts.

## devtools::install_github("quanteda/spacyr", ref = "issue-171")


library(quanteda)
library(tidyverse)
library(spacyr)
library(DBI)

spacy_initialize(model = "en_core_web_md")

# data from here
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
db <- dbConnect(RSQLite::SQLite(), "~/Downloads/database.sqlite")

set.seed(20191024)
corpus_tw <- tbl(db, "Tweets") %>% as_tibble() %>% sample_n(1000) %>% 
    distinct(text, .keep_all = TRUE) %>%
    corpus(docid_field = "tweet_id")

twitter_parsed <- spacy_parse(corpus_tw, additional_attributes = "is_stop") 

wordvectors <- spacy_wordvectors_lookup(twitter_parsed)

wordvec_matrix <- spacy_wordvectors_apply(twitter_parsed, wordvectors)

# convert the matrix to tibble for the further manipulation
wordvec_tb <- wordvec_matrix %>% 
    as_tibble(.name_repair = "universal") %>%
    rename_all(str_replace, "\\D+", "D") %>% 
    bind_cols(twitter_parsed)

# calculate the average of the wordvector in the text
doc_vec_avg <- wordvec_tb %>% 
    filter(!is_stop) %>%
    group_by(doc_id) %>%
    summarise_at(1:300, mean) %>% ungroup()

# convert it to a dfm for similarity calculation (since the matrix is dense, other packages might be faster for this)
temp <- doc_vec_avg %>%
    select(-1) %>% 
    as.matrix() %>% as.dfm()

rownames(temp) <- paste(doc_vec_avg$doc_id)

simil_stat <- textstat_simil(temp, method = "cosine") %>% as.data.frame() %>% 
    sample_n(1000) %>%
    arrange(-cosine) %>% 
    mutate_at(1:2, as.character)

# print the output
for(i in seq(10)){
    cat(paste0("similarity: ", simil_stat$cosine[i], "\n",
               "doc1: ", corpus_tw[simil_stat$document1[i]], "\n",
               "doc2: ", corpus_tw[simil_stat$document2[i]], "\n\n"))
}

@cainesap

@amatsuo tested and working fine!
Thank you for implementing and updating us :)

@kbenoit
Collaborator

kbenoit commented Feb 29, 2020

How about

# works on a spacyr parsed object
wordvectors_get.spacyr_parsed(x, model)

# works on a named list of characters, such as from spacy_tokenize()
wordvectors_get.list(x, model)

to return a v x d matrix, where v is the number of types (unique tokens) and d is the number of dimensions. This is a dense matrix.

# attaches special attribute of wordvectors to the object
wordvectors_put.spacyr_parsed(x, wordvectors)

We don't do this for a list, since we can do that instead in quanteda::as.tokens().

The important points here are:

  • when we get the word vectors from a language model, it's not ntoken x d, but rather ntype x d, so more efficient (and they can be linked later by using the token label as a key); and
  • we can "put" any word vectors from any source, not just those taken from a spaCy language model. So the infrastructure is more general than spaCy, allowing us to take pre-trained word vectors from other sources, such as fastText, BERT, ELMo, etc. (see the sketch below).
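
For illustration, a sketch of how the get/put pair could fit together (hypothetical throughout: the generics, the attribute name, and the simulated external vectors are not part of spacyr):

parsed <- spacy_parse("Word vectors can come from spaCy or from any external source.")

# ntype x d dense matrix from the spaCy language model, with token types as rownames
wv_spacy <- wordvectors_get(parsed, model = "en_core_web_md")

# "put" vectors from any source: here a simulated 100-dimensional matrix standing in
# for pre-trained fastText/BERT/ELMo vectors, keyed by the same token types
wv_other <- matrix(rnorm(nrow(wv_spacy) * 100), ncol = 100,
                   dimnames = list(rownames(wv_spacy), NULL))
parsed_wv <- wordvectors_put(parsed, wv_other)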

@gary-mu

gary-mu commented Apr 21, 2023

Has the pre-trained embedding functionality been added?

@kbenoit
Collaborator

kbenoit commented Apr 29, 2023

No, not yet, but we are working on adding some of the predictive functions.
