prodriguezsosa / context

An R package for estimating and doing statistical inference on context-specific word embeddings.

context's Issues

Error with stratify_by

Hi all -- thanks for putting together such a helpful package. I'm running the code exactly as laid out in the Quick Start Guide and I get the following error when I run the conText regression:

Error: Column .[, stratify_by] is of unsupported class data.frame

Error when trying to run conText with vector of multiple keywords

Hi @prodriguezsosa,
Thanks first of all for this great package - it's immensely useful for my research.

I run into an issue when trying to run the embeddings regression code for a vector and two variables.

If I run the code for a single phrase (in this case "asylum"), then it works fine. However, if I try to run it with a vector containing multiple keywords (in this case "asylum" and "immigration") then I get an error message. Do you have any thoughts on what the problem could be? I've included the code below.
Thanks in advance.
Luke


set.seed(2021L)
model_asylum <- conText(formula = asylum ~ party + year,
                     data = toks_feats_uk,
                     pre_trained = glove, 
                     transform = TRUE, transform_matrix = khodakA,
                     bootstrap = TRUE, num_bootstraps = 100, 
                     permute = TRUE, num_permutations = 100,
                     window = 10, case_insensitive = TRUE, 
                     verbose = FALSE)

   coefficient normed.estimate    std.error p.value
1    party_Lab     0.007491883 0.0006946093    0.00
2    year_1990     0.036427528 0.0058590948    0.00
3    year_1991     0.027542954 0.0029915331    0.01
4    year_1992     0.027062653 0.0031936243    0.01
5    year_1993     0.031330110 0.0030365470    0.00
6    year_1994     0.045332238 0.0059094438    0.00
7    year_1995     0.034009423 0.0039580333    0.00
8    year_1996     0.035373382 0.0037510402    0.00
9    year_1997     0.031067062 0.0034364145    0.00
10   year_1998     0.038391955 0.0039954072    0.00
11   year_1999     0.031251822 0.0029052327    0.00
12   year_2000     0.032218889 0.0030857129    0.00
13   year_2001     0.030436571 0.0028447632    0.00
14   year_2002     0.029558698 0.0030507649    0.00
15   year_2003     0.030426156 0.0026885509    0.00
16   year_2004     0.032758523 0.0030800205    0.00
17   year_2005     0.031767648 0.0029334574    0.00
18   year_2006     0.032996619 0.0032032519    0.00
19   year_2007     0.032024228 0.0034042820    0.00
20   year_2008     0.038336509 0.0038364726    0.00
21   year_2009     0.035864975 0.0039709854    0.00
22   year_2010     0.034893928 0.0041557936    0.00
23   year_2011     0.036788933 0.0036153164    0.00
24   year_2012     0.035026392 0.0040438055    0.03
25   year_2013     0.033690885 0.0037755951    0.00
26   year_2014     0.032883334 0.0036183347    0.00
27   year_2015     0.031222210 0.0029255373    0.00
28   year_2016     0.033563602 0.0032734454    0.00
29   year_2017     0.034259558 0.0034920834    0.00
30   year_2018     0.032179317 0.0032749506    0.00
31   year_2019     0.033100115 0.0035517084    0.00

set.seed(2021L)
model_mkw <- conText(formula = c("asylum", "immigration") ~ party + year,
                  data = toks_feats_uk,
                  pre_trained = glove, 
                  transform = TRUE, transform_matrix = khodakA,
                  bootstrap = TRUE, num_bootstraps = 100, 
                  permute = TRUE, num_permutations = 100,
                  window = 10, case_insensitive = TRUE, 
                  verbose = FALSE)
Error: It seems you are using factor() in "formula" to create a factor a variable. Please create it directly in "data" and re-run conText.
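
One thing that may help narrow this down is how R itself parses the two formulas. The sketch below uses only base R and makes no claim about how conText processes the left-hand side internally; it just shows that a bare target word is stored as a symbol, while c("asylum", "immigration") is stored as a function call, which a formula parser could plausibly mistake for something like factor().

```r
# Base R only: inspect the left-hand side of each formula.
f_single <- asylum ~ party + year
f_multi  <- c("asylum", "immigration") ~ party + year

class(f_single[[2]])   # the left-hand side is a bare symbol ("name")
class(f_multi[[2]])    # the left-hand side is a function call ("call")

# Strings inside c() are not variables, so the targets vanish from all.vars()
all.vars(f_single)
all.vars(f_multi)
```

The package documentation for the formula argument is the place to check which left-hand-side forms are accepted.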

get_context function does not work with quanteda 3.0

kwic(): As of version 3, only tokens objects are supported as inputs to kwic(). Calling kwic() for character or corpus objects is still functional, but issues a warning. Passing arguments to tokens() via ... in kwic() is now disabled. Users should now create a tokens object (using tokens() from character or corpus inputs) before calling kwic().
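
The pattern described in that note can be sketched as follows. This is a minimal illustration with a made-up one-sentence text, not the body of get_context; it only shows tokenization options moving into an explicit tokens() call before kwic().

```r
library(quanteda)

# Illustrative input (not from the package data)
txt <- c(d1 = "Debates on immigration policy continued.")

# Tokenize first; options such as remove_punct belong here, not in kwic()
toks <- tokens(txt, remove_punct = TRUE)

# Then pass the tokens object to kwic()
kw <- kwic(toks, pattern = "immigration", window = 2)
kw
```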

.rmd v md

@prodriguezsosa - the vignette is rmd, but do we want it as .md to render directly in the browser, or is the idea that users will know to open in R directly?

Error while computing local transformation matrix

Hello, and thank you for this great package!

I followed your vignette to estimate local GloVe embeddings. This worked so far, but when I try to estimate the transformation matrix with compute_transform(x = toks_fcm, pre_trained = local_glove, weighting = 'log'), I get the error:
.subscript.2ary(x, i, , drop = TRUE) : subscript out of bounds

This does not happen when I leave out weighting = 'log'.

Can someone point me to why this might happen? weighting = 'log' is the recommended option here.

Best,
k

CI for get_grouped_similarity?

Hi all — thanks for all your work on this package and documentation. I’m just getting into word embeddings and all of your resources have been incredibly helpful.

I was excited to see the new “get_grouped_similarity” function you recently added to conText. Are you planning on integrating this function with the wrapper function for bootstrapping confidence intervals any time in the near future?

Thanks!

get_nns_ratio() not working for numeric binary variables

The minor issue

Using the wrapper function get_nns_ratio() with a 0/1-coded group-variable returns the error:

Error in nnsdfs[[numerator]] : 
  attempt to select less than one element in get1index <real>

The used code is:

myNnsRatio <- get_nns_ratio(x = myToks, 
                            N = 10,
                            groups = docvars(myToks, 'binaryVar'),
                            numerator = 0, [...]

It does not matter whether I choose 1 or 0 as the numerator, as the denominator will always be the opposite and thus the same error occurs.

Possible Solution

During or after their creation, the numerator and denominator could be converted with as.character(), which seems to solve the problem.
Alternatively, if it is intended that variables cannot be numeric 0/1, then an appropriate error message would be helpful.

if (is.numeric(numerator) || is.numeric(denominator))
  stop("The numerator and denominator in the group variable must be character-types.")
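
The error message is consistent with plain R list indexing. The sketch below uses a hypothetical named list (nns_by_group stands in for the internal nnsdfs object, whose real structure is an assumption here) to show why a numeric 0 fails, why a numeric 1 silently selects by position, and why converting to character selects by name.

```r
# Hypothetical stand-in for a per-group results list, named by group value
nns_by_group <- list("0" = c("border", "visa"), "1" = c("reform", "law"))

# Numeric 0 is treated as a position, and position 0 does not exist
numerator <- 0
errored <- inherits(try(nns_by_group[[numerator]], silent = TRUE), "try-error")
errored

# Numeric 1 is also positional: it returns the FIRST element, i.e. group "0"
nns_by_group[[1]]

# Converting to character selects by name, which is what is intended
nns_by_group[[as.character(1)]]
```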

I really like this package and it is incredibly useful, thank you for your work!

Code Inconsistency in Vignette

In:

https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart_local_transform.md#3-estimate-local-transformation-matrix

transform_matrix <- compute_transform(context_fcm = fcm_cr, pre_trained = pre_trained,
vocab = vocab_pruned, weighting = 1000)

should be:

transform_matrix <- compute_transform(context_fcm = fcm_cr, pre_trained = word_vectors,
vocab = vocab_pruned, weighting = 1000)

Interaction effects in embedding regression?

Hi, this is not exactly an issue, so I hope it is okay to post this here. I was wondering whether it is possible to include interaction effects in models using embedding regression.
If we have the below example from your quick start guide using party and gender as covariates - is it possible to include interaction effects between gender and party in the model? Thanks in advance!

Warm regards,
Luke

# two factor covariates
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                  data = toks_nostop_feats,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE, num_bootstraps = 100,
                  permute = TRUE, num_permutations = 100,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)
                  
# D-dimensional beta coefficients
# the intercept in this case is the ALC embedding for female Democrats
# beta coefficients can be combined to get each group's ALC embedding
DF_wv <- model1['(Intercept)',] # (D)emocrat - (F)emale 
DM_wv <- model1['(Intercept)',] + model1['gender_M',] # (D)emocrat - (M)ale 
RF_wv <- model1['(Intercept)',] + model1['party_R',]  # (R)epublican - (F)emale 
RM_wv <- model1['(Intercept)',] + model1['party_R',] + model1['gender_M',] # (R)epublican - (M)ale
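
Whether the formula interface accepts party * gender is not something this thread confirms. One workaround that relies only on base R is to construct the interaction as its own column in the data before fitting. The data frame below is a hypothetical stand-in for the document variables; with a quanteda tokens object the same columns could be assigned through docvars().

```r
# Hypothetical document variables, one row per context
dv <- data.frame(
  party  = c("D", "D", "R", "R"),
  gender = c("F", "M", "F", "M")
)

# Option 1: a single factor with one level per party-gender cell
dv$party_gender <- interaction(dv$party, dv$gender, sep = "_")

# Option 2: a 0/1 indicator equal to the product of the two dummies
dv$R_x_M <- as.integer(dv$party == "R" & dv$gender == "M")

dv
```

With option 2, the indicator would enter the formula as a third covariate alongside party and gender, so its coefficient plays the role of the interaction term.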

object 'CsparseMatrix_validate' not found

I'm running into an error creating a DFM when I load the conText package. The DFM function for the same code works when I do not have the conText library loaded, but once I load it the code gives me the following error: Error in validityMethod(as(object, superClass)) : object 'CsparseMatrix_validate' not found

default naming convention for `dimnames()$doc` in `dem` object

The default values of dimnames()$doc on a dem object are currently text1, text2, etc. For example:

toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
immig_dfm <- dfm(immig_toks)
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
dimnames(immig_dem)$doc

returns

[1] "text1"    "text2"    "text3"    "text4"    "text5"    "text6"    "text7"

etc

I wonder if this should be something other than text because it potentially gives the impression that each incidence is a new "text" (a separate document). But, of course, the whole point here is that one can have many instantiations (and thus many embeddings of the same term) in the same document.

Perhaps we could change it to instance (or occurrence or observation or incidence)? Open to not doing anything, but just want to avoid confusion for end-users.

Issue with ncs function

Hi @prodriguezsosa,

I just realized a small issue with the ncs function. It happens that the same context is assigned to both groups. So, for example, the context appearing in a Republican speech might be assigned as a top Democrat context. This is because the ncs function calculates cosine similarities between the group-specific embedding for a term and all contexts. I guess you would want to subset the eligible contexts for each group to only those actually appearing in a group-specific text? Here is a suggestion on how to change the code:

cos_sim <- text2vec::sim2(x = as.matrix(contexts_dem), y = as.matrix(x), method = "cosine", norm = "l2") %>% data.frame()
contexts_df <- data.frame(docid = quanteda::docid(contexts), context = sapply(contexts, function(i) paste(i, collapse = " ")),
                          docgroup = docvars(contexts)[,2])
cos_sim <- cos_sim %>% dplyr::mutate(docid = rownames(cos_sim)) %>% dplyr::left_join(contexts_df, by = 'docid')

result <- tidyr::pivot_longer(cos_sim, -c(docid, context, docgroup), names_to = "target") %>%
  dplyr::filter(docgroup==target) %>% 
  dplyr::group_by(target) %>%
  dplyr::slice_max(order_by = value, n = N) %>%
  dplyr::mutate(rank = 1:dplyr::n()) %>%
  dplyr::arrange(dplyr::desc(value)) %>%
  dplyr::ungroup() %>%
  dplyr::select('target', 'context', 'rank', 'value')

Adding the argument/option to choose a language for stemming

Problem

There is no possibility of changing the stemming language since "porter" (-> English) is hard-coded as the language in SnowballC::wordStem().
It would be handy to have the option to choose the language for stemming as this opens this package fully to non-English research. There are 26 languages already supported by the wordStem-function (cf. getStemLanguages()).

Quick Workaround

Create a customWordStem-function only changing the language-argument to the desired language and assign it to the SnowballC-package as described here.

customWordStem <- function (words, language = "YOUR_LANGUAGE") {
  words <- as.character(words)
  language <- as.character(language[1])
  .Call("R_stemWords", words, language, PACKAGE = "SnowballC")
}

environment(customWordStem) <- asNamespace('SnowballC')
assignInNamespace("wordStem", customWordStem, ns = "SnowballC")

Possible Fix

Create an additional stem_language-argument for the nns and cos_sim functions. As I understand, only these two functions call wordStem() directly.
However, nns_ratio and the more complex wrapper functions get_nns, get_cos_sim and get_nns_ratio use both functions too. Thus, the stem_language-argument would have to be available on the uppermost level and passed down through the intermediate functions like nns_boostrap to end up in the nns and cos_sim functions. Maybe there is an easier way, but this seems to be the most straightforward.
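
The proposed argument could look roughly like this. stem_tokens and stem_language are hypothetical names used for illustration and are not part of conText; only wordStem() and getStemLanguages() are real SnowballC functions.

```r
library(SnowballC)

# Hypothetical helper: the language is an argument instead of being hard-coded
stem_tokens <- function(words, stem_language = "porter") {
  # Fail early with a clear message if the language is not supported
  stopifnot(stem_language %in% getStemLanguages())
  wordStem(as.character(words), language = stem_language)
}

stem_tokens(c("running", "runs"))                           # default: English (porter)
stem_tokens(c("laufen", "gelaufen"), stem_language = "german")
```

Keeping "porter" as the default would leave existing English workflows unchanged while letting wrappers pass the argument down.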

Obtaining transformation matrix (for other languages)

Hi,

First of all, I would like to thank you for this great package and inspiring paper.

I'm just following your Quick Start Guide (https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md) to get familiar with the procedure of applying embedding regression.

I have mainly 2 questions right now.

Do you provide a formula to learn the transformation matrix? In the tutorial you use cr_transform, while you mention that this is based on an estimation by Khodak et al. (2018) (khodakA.rds). However, is the formula below (from your 2023 paper) somehow implemented in conText?

This becomes even more important when moving to another language. If I were analyzing the semantics of English texts, the data provided in the Dropbox would probably work; however, I would like to analyze German text with respect to political ideology.

While German word embeddings are available (https://www.deepset.ai/german-word-embeddings), I haven't found corresponding transformation matrices for ALC so far. So I thought I could either rely on a function within your package or learn the transformation matrix on my own with the above formula. Are there any further obstacles when moving to another language beyond the pre-trained embeddings and the (yet to be estimated) transformation matrix?
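
For orientation, the issue "Error while computing local transformation matrix" earlier on this page calls compute_transform(x = toks_fcm, pre_trained = local_glove, weighting = 'log'), which suggests the package does expose an estimator. The sketch below is not that implementation. It is a base-R illustration on simulated data of the weighted least-squares idea from Khodak et al. (2018), as I understand it: find the matrix A that best maps each word's averaged context embedding u_w onto its pre-trained embedding v_w, with weights that grow with word frequency. All object names are made up for the example.

```r
set.seed(1)
n_words <- 200
D <- 5

U <- matrix(rnorm(n_words * D), n_words, D)       # row w: averaged context embedding u_w
A_true <- matrix(rnorm(D * D), D, D)              # transform used to simulate the data
V <- U %*% t(A_true)                              # row w: "pre-trained" embedding v_w = A u_w
counts <- sample(5:500, n_words, replace = TRUE)  # simulated word frequencies
w <- log(counts)                                  # log weighting of each word

# Weighted least squares: minimise sum_w w_w * || v_w - A u_w ||^2
# (w * U scales row w of U by w_w, so crossprod(U, w * U) is U' W U)
A_hat <- t(solve(crossprod(U, w * U), crossprod(U, w * V)))

max(abs(A_hat - A_true))   # close to zero because the simulated data are noise-free
```

For German, the same recipe would need German pre-trained embeddings and a German corpus large enough to compute the averaged context embeddings.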

Thanks in advance!

Best,
Lukas

Type of norm used

Hi @prodriguezsosa ! Thanks for a fantastic package.

In the run_ols() function, I was wondering why R's default one norm, norm(x, type='O'), is used. From p. 24 of the paper, I would have expected the Frobenius (Euclidean) norm, norm(x, type='F'). I might be very wrong here and it might be on purpose, but I was wondering.
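
A small base-R check makes the difference concrete. The shape of x inside run_ols() is not shown in this thread, so both orientations are included; the one norm depends on whether the vector is stored as a column or a row, while the Frobenius norm does not.

```r
x <- c(3, -4)

# One norm = maximum absolute column sum
norm(matrix(x, ncol = 1), type = "O")  # single column: |3| + |-4| = 7
norm(matrix(x, nrow = 1), type = "O")  # two columns:   max(|3|, |-4|) = 4

# Frobenius norm = square root of the sum of squares, orientation-independent
norm(matrix(x, ncol = 1), type = "F")  # 5
norm(matrix(x, nrow = 1), type = "F")  # 5
sqrt(sum(x^2))                         # 5, the usual Euclidean length
```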