prodriguezsosa / conText
An R package for estimating and doing statistical inference on context-specific word embeddings.
Hi @prodriguezsosa,
Thanks first of all for this great package - it's immensely useful for my research.
I run into an issue when trying to run the embeddings regression code for a vector and two variables.
If I run the code for a single phrase (in this case "asylum"), then it works fine. However, if I try to run it with a vector containing multiple keywords (in this case "asylum" and "immigration") then I get an error message. Do you have any thoughts on what the problem could be? I've included the code below.
Thanks in advance.
Luke
set.seed(2021L)
model_asylum <- conText(formula = asylum ~ party + year,
                        data = toks_feats_uk,
                        pre_trained = glove,
                        transform = TRUE, transform_matrix = khodakA,
                        bootstrap = TRUE, num_bootstraps = 100,
                        permute = TRUE, num_permutations = 100,
                        window = 10, case_insensitive = TRUE,
                        verbose = FALSE)
coefficient normed.estimate std.error p.value
1 party_Lab 0.007491883 0.0006946093 0.00
2 year_1990 0.036427528 0.0058590948 0.00
3 year_1991 0.027542954 0.0029915331 0.01
4 year_1992 0.027062653 0.0031936243 0.01
5 year_1993 0.031330110 0.0030365470 0.00
6 year_1994 0.045332238 0.0059094438 0.00
7 year_1995 0.034009423 0.0039580333 0.00
8 year_1996 0.035373382 0.0037510402 0.00
9 year_1997 0.031067062 0.0034364145 0.00
10 year_1998 0.038391955 0.0039954072 0.00
11 year_1999 0.031251822 0.0029052327 0.00
12 year_2000 0.032218889 0.0030857129 0.00
13 year_2001 0.030436571 0.0028447632 0.00
14 year_2002 0.029558698 0.0030507649 0.00
15 year_2003 0.030426156 0.0026885509 0.00
16 year_2004 0.032758523 0.0030800205 0.00
17 year_2005 0.031767648 0.0029334574 0.00
18 year_2006 0.032996619 0.0032032519 0.00
19 year_2007 0.032024228 0.0034042820 0.00
20 year_2008 0.038336509 0.0038364726 0.00
21 year_2009 0.035864975 0.0039709854 0.00
22 year_2010 0.034893928 0.0041557936 0.00
23 year_2011 0.036788933 0.0036153164 0.00
24 year_2012 0.035026392 0.0040438055 0.03
25 year_2013 0.033690885 0.0037755951 0.00
26 year_2014 0.032883334 0.0036183347 0.00
27 year_2015 0.031222210 0.0029255373 0.00
28 year_2016 0.033563602 0.0032734454 0.00
29 year_2017 0.034259558 0.0034920834 0.00
30 year_2018 0.032179317 0.0032749506 0.00
31 year_2019 0.033100115 0.0035517084 0.00
set.seed(2021L)
model_mkw <- conText(formula = c("asylum", "immigration") ~ party + year,
                     data = toks_feats_uk,
                     pre_trained = glove,
                     transform = TRUE, transform_matrix = khodakA,
                     bootstrap = TRUE, num_bootstraps = 100,
                     permute = TRUE, num_permutations = 100,
                     window = 10, case_insensitive = TRUE,
                     verbose = FALSE)
Error: It seems you are using factor() in "formula" to create a factor a variable. Please create it directly in "data" and re-run conText.
kwic(): As of version 3, only tokens objects are supported as inputs to kwic(). Calling kwic() for character or corpus objects is still functional, but issues a warning. Passing arguments to tokens() via ... in kwic() is now disabled. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling kwic().
@prodriguezsosa - the vignette is rmd, but do we want it as .md to render directly in the browser, or is the idea that users will know to open in R directly?
Hello, and thank you for this great package!
I followed your vignette to estimate local GloVe embeddings. This worked so far, but when I try to estimate the transformation matrix with compute_transform(x = toks_fcm, pre_trained = local_glove, weighting = 'log'), I get the error:
.subscript.2ary(x, i, , drop = TRUE) : subscript out of bounds
This does not happen when I do not set weighting = 'log'.
Can someone point me to why this might happen, since weighting = 'log' is the recommended option here?
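One generic source of a "subscript out of bounds" error is indexing a matrix by row names it does not contain. Whether that is what happens inside compute_transform() is an assumption, not a diagnosis, but the overlap between the fcm features and the embedding vocabulary is cheap to check. The sketch below uses toy objects (pre_trained_toy and fcm_features are hypothetical stand-ins, not conText objects):

```r
# Toy embedding matrix with three words as row names (hypothetical stand-in).
set.seed(1)
pre_trained_toy <- matrix(rnorm(6), nrow = 3,
                          dimnames = list(c("asylum", "border", "visa"), NULL))

# Toy feature names, as they might come from featnames() of an fcm.
fcm_features <- c("asylum", "border", "migrant")

# Features with no matching row in the embedding matrix.
missing_feats <- setdiff(fcm_features, rownames(pre_trained_toy))

# Indexing with a missing name reproduces the base R error.
err <- tryCatch(pre_trained_toy[fcm_features, ],
                error = function(e) conditionMessage(e))

# Restricting to the shared vocabulary avoids it.
shared <- intersect(fcm_features, rownames(pre_trained_toy))
subset_ok <- pre_trained_toy[shared, , drop = FALSE]
```

If missing_feats is non-empty on the real objects, subsetting both to the shared vocabulary before calling compute_transform() may be worth trying.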
Best,
k
Hi all — thanks for all your work on this package and documentation. I’m just getting into word embeddings and all of your resources have been incredibly helpful.
I was excited to see the new “get_grouped_similarity” function you recently added to conText. Are you planning on integrating this function with the wrapper function for bootstrapping confidence intervals any time in the near future?
Thanks!
Using the wrapper function get_nns_ratio() with a 0/1-coded group variable returns the error:
Error in nnsdfs[[numerator]] :
attempt to select less than one element in get1index <real>
The used code is:
myNnsRatio <- get_nns_ratio(x = myToks,
                            N = 10,
                            groups = docvars(myToks, 'binaryVar'),
                            numerator = 0, [...]
It does not matter whether I choose 1 or 0 as the numerator, as the denominator will always be the opposite and thus the same error occurs.
During or after their creation, the numerator and denominator could be converted with as.character(), which seems to solve the problem.
Alternatively, if it is intended that variables cannot be numeric 0/1, then an appropriate error message would be helpful.
if (is.numeric(numerator) || is.numeric(denominator))
  stop("The numerator and denominator in the group variable must be character-types.")
I really like this package and it is incredibly useful, thank you for your work!
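The error message itself comes from base R list indexing, which can be reproduced without conText. In the minimal sketch below, nns_by_group is a hypothetical stand-in for the internal nnsdfs list, assumed here to be named by group label:

```r
# Hypothetical stand-in for the internal list of per-group results,
# named by the group labels "0" and "1".
nns_by_group <- list("0" = c("border", "visa"),
                     "1" = c("refugee", "asylum"))

numerator <- 0

# A numeric 0 is treated as a position, and position 0 does not exist.
err <- tryCatch(nns_by_group[[numerator]],
                error = function(e) conditionMessage(e))

# A numeric 1 silently selects the FIRST element, which is group "0".
by_position <- nns_by_group[[1]]

# Converting to character selects by name, which is what is intended.
by_name <- nns_by_group[[as.character(numerator)]]
```

This also shows why choosing 1 as the numerator does not help: the denominator is then 0 and hits the same positional lookup.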
In:
transform_matrix <- compute_transform(context_fcm = fcm_cr, pre_trained = pre_trained,
                                      vocab = vocab_pruned, weighting = 1000)
should be:
transform_matrix <- compute_transform(context_fcm = fcm_cr, pre_trained = word_vectors,
                                      vocab = vocab_pruned, weighting = 1000)
Hi, this is not exactly an issue, so I hope it is okay to post this here. I was wondering whether it is possible to include interaction effects in models using embedding regression.
If we have the below example from your quick start guide using party and gender as covariates - is it possible to include interaction effects between gender and party in the model? Thanks in advance!
Warm regards,
Luke
# two factor covariates
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                  data = toks_nostop_feats,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = TRUE, num_bootstraps = 100,
                  permute = TRUE, num_permutations = 100,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)
# D-dimensional beta coefficients
# the intercept in this case is the ALC embedding for female Democrats
# beta coefficients can be combined to get each group's ALC embedding
DF_wv <- model1['(Intercept)',] # (D)emocrat - (F)emale
DM_wv <- model1['(Intercept)',] + model1['gender_M',] # (D)emocrat - (M)ale
RF_wv <- model1['(Intercept)',] + model1['party_R',] # (R)epublican - (F)emale
RM_wv <- model1['(Intercept)',] + model1['party_R',] + model1['gender_M',] # (R)epublican - (M)ale
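Whether conText() accepts party * gender in its formula is not confirmed here. The base R sketch below only illustrates what an interaction term would add to the design matrix, plus one possible workaround that assumes only additive covariates are supported: build the combined factor directly in the data. The data frame docvars_df is a hypothetical stand-in for the document variables.

```r
# Hypothetical document variables, one row per group combination.
docvars_df <- data.frame(
  party  = factor(c("D", "D", "R", "R")),
  gender = factor(c("F", "M", "F", "M"))
)

# Design matrix with main effects and their interaction.
X <- model.matrix(~ party * gender, data = docvars_df)
colnames(X)

# Row 4 is a Republican male: intercept, both main effects and the
# interaction column are all switched on.
rm_row <- X[4, ]

# Possible workaround: a single four-level factor created in the data.
docvars_df$party_gender <- interaction(docvars_df$party,
                                       docvars_df$gender, sep = "_")
nlevels(docvars_df$party_gender)
```

With the interaction included, the Republican-male embedding would be the sum of four coefficients rather than the three used for RM_wv in the additive model above.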
I'm running into an error creating a DFM when I load the conText package. The DFM function for the same code works when I do not have the conText library loaded, but once I load it the code gives me the following error: Error in validityMethod(as(object, superClass)) : object 'CsparseMatrix_validate' not found
The default values of dimnames()$doc on a dem object are currently text1, text2, etc. For example:
toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
immig_dfm <- dfm(immig_toks)
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
dimnames(immig_dem)$doc
returns
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7"
etc
I wonder if this should be something other than text, because it potentially gives the impression that each incidence is a new "text" (a separate document). But, of course, the whole point here is that one can have many instantiations (and thus many embeddings of the same term) in the same document.
Perhaps we could change it to instance (or occurrence or observation or incidence)? Open to not doing anything, but just want to avoid confusion for end-users.
Hi @prodriguezsosa,
I just realized a small issue with the ncs function. It happens that the same context is assigned to both groups. So, for example, the context appearing in a Republican speech might be assigned as a top Democrat context. This is because the ncs function calculates cosine similarities between the group-specific embedding for a term and all contexts. I guess you would want to subset the eligible contexts for each group to only those actually appearing in a group-specific text? Here is a suggestion on how to change the code:
cos_sim <- text2vec::sim2(x = as.matrix(contexts_dem), y = as.matrix(x), method = "cosine", norm = "l2") %>% data.frame()
contexts_df <- data.frame(docid = quanteda::docid(contexts),
                          context = sapply(contexts, function(i) paste(i, collapse = " ")),
                          docgroup = docvars(contexts)[,2])
cos_sim <- cos_sim %>% dplyr::mutate(docid = rownames(cos_sim)) %>% dplyr::left_join(contexts_df, by = 'docid')
result <- tidyr::pivot_longer(cos_sim, -c(docid, context, docgroup), names_to = "target") %>%
  dplyr::filter(docgroup==target) %>%
  dplyr::group_by(target) %>%
  dplyr::slice_max(order_by = value, n = N) %>%
  dplyr::mutate(rank = 1:dplyr::n()) %>%
  dplyr::arrange(dplyr::desc(value)) %>%
  dplyr::ungroup() %>%
  dplyr::select('target', 'context', 'rank', 'value')
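The effect of the docgroup == target filter can be seen on a toy example in base R. All objects below are made up for illustration (three contexts, two groups); top_context is a hypothetical helper, not part of conText.

```r
# Toy cosine similarities: rows are contexts, columns are group embeddings.
cos_sim_toy <- matrix(c(0.90, 0.20,
                        0.80, 0.70,
                        0.10, 0.95),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("ctx1", "ctx2", "ctx3"), c("D", "R")))

# Group of the speech each context actually came from.
docgroup <- c(ctx1 = "R", ctx2 = "D", ctx3 = "R")

# Top context for a group, optionally restricted to that group's own contexts.
top_context <- function(group, restrict) {
  sims <- cos_sim_toy[, group]
  if (restrict) sims <- sims[docgroup[names(sims)] == group]
  names(sims)[which.max(sims)]
}

unrestricted <- top_context("D", restrict = FALSE)  # context from an R speech
restricted   <- top_context("D", restrict = TRUE)   # context from a D speech
```

Without the restriction, the top "Democrat" context is ctx1, which was spoken by a Republican; with it, the result is ctx2.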
There is no possibility of changing the stemming language since "porter" (-> English) is hard-coded as the language in SnowballC::wordStem().
It would be handy to have the option to choose the language for stemming as this opens this package fully to non-English research. There are 26 languages already supported by the wordStem function (cf. getStemLanguages()).
Create a customWordStem function that only changes the language argument to the desired language, and assign it to the SnowballC package as described here.
customWordStem <- function (words, language = "YOUR_LANGUAGE") {
  words <- as.character(words)
  language <- as.character(language[1])
  .Call("R_stemWords", words, language, PACKAGE = "SnowballC")
}
environment(customWordStem) <- asNamespace('SnowballC')
assignInNamespace("wordStem", customWordStem, ns = "SnowballC")
Create an additional stem_language argument for the nns and cos_sim functions. As I understand, only these two functions call wordStem() directly.
However, nns_ratio and the more complex wrapper functions get_nns, get_cos_sim and get_nns_ratio use both functions too. Thus, the stem_language argument would have to be available on the uppermost level and passed down through the intermediate functions like nns_boostrap to end up in the nns and cos_sim functions. Maybe there is an easier way, but this seems to be the most straightforward.
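One way to thread the argument through the call chain is to forward it via `...`, so that intermediate functions need no signature change. The sketch below is only an illustration of that pattern: every function in it is hypothetical (the *_sketch names mirror, but are not, the package functions), and toy_stem is a crude stand-in for SnowballC::wordStem() so the example runs without extra packages.

```r
# Hypothetical stand-in for SnowballC::wordStem().
toy_stem <- function(words, language = "porter") {
  suffix <- switch(language,
                   porter = "s$",   # crude English plural stripping
                   german = "en$",  # crude German suffix stripping
                   stop("unsupported language: ", language))
  sub(suffix, "", words)
}

# Innermost function: the only place that calls the stemmer.
nns_sketch <- function(words, stem = FALSE, stem_language = "porter") {
  if (stem) words <- toy_stem(words, language = stem_language)
  unique(words)
}

# Intermediate and top-level wrappers forward extra arguments via `...`.
nns_bootstrap_sketch <- function(words, ...) nns_sketch(words, ...)
get_nns_sketch <- function(words, ...) nns_bootstrap_sketch(words, ...)

# The default keeps the current English behaviour; the language is opt-in.
english <- get_nns_sketch(c("borders", "visas"), stem = TRUE)
german  <- get_nns_sketch(c("laufen", "kaufen"), stem = TRUE,
                          stem_language = "german")
```

Keeping "porter" as the default would leave existing code unaffected.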
Hi,
First of all, I would like to thank you for this great package and inspiring paper.
I'm just following your Quick Start Guide (https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md) to get familiar with the procedure of applying embedding regression.
I have mainly 2 questions right now.
cr_transform, while you mention that this is based on an estimation by Khodak et al. (2018) (khodakA.rds). However, is the formula below (from your 2023 paper) somehow implemented in conText?
While German word embeddings are available (https://www.deepset.ai/german-word-embeddings), I haven't found corresponding transformation matrices for ALC so far. So I thought I could maybe rely on a function within your package, or would I have to learn the transformation matrix on my own with the above formula? Are there any further obstacles when moving to another language beyond the pre-trained embeddings and the (yet to be estimated) transformation matrix?
Thanks in advance!
Best,
Lukas
Hi @prodriguezsosa ! Thanks for a fantastic package.
In the run_ols() function, I was wondering why the (R's default) one norm is used: norm(x, type='O'). From the paper on p.24, I would have expected the Frobenius-Euclidean norm: norm(x, type='F'). I might be very wrong here, it might be on purpose, but I was wondering.
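The two norm types give different values in base R, and the one norm also depends on the orientation of the input. A small check on a toy vector (how run_ols() stores its coefficients is not assumed here):

```r
# A toy coefficient vector stored as a column matrix.
x <- matrix(c(3, -4), ncol = 1)

# One norm: maximum absolute column sum.
one_norm_col <- norm(x, type = "O")     # |3| + |-4| = 7

# Frobenius norm: square root of the sum of squares (Euclidean for a vector).
frob_norm <- norm(x, type = "F")        # sqrt(9 + 16) = 5

# Stored as a row instead, the one norm becomes the largest absolute entry.
one_norm_row <- norm(t(x), type = "O")  # max(|3|, |-4|) = 4
```

The Frobenius norm is the same for either orientation, so it matches the Euclidean length of the vector regardless of how it is stored.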