bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Home Page: https://bnosac.github.io/udpipe/en

License: Mozilla Public License 2.0

R 14.01% C++ 66.33% HTML 16.02% CSS 1.81% JavaScript 0.60% Python 1.24%
udpipe nlp natural-language-processing r r-package pos-tagging dependency-parser tokenizer rcpp r-pkg

udpipe's Introduction

udpipe - R package for Tokenization, Tagging, Lemmatization and Dependency Parsing Based on UDPipe

This repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).

  • UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part of natural language processing.
  • The techniques used are explained in detail in the paper: "Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe", available at https://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. In that paper, you'll also find accuracies for different languages and processing speed (measured in words per second).

General

The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

  • Give R users simple access in order to easily tokenize, tag, lemmatize or perform dependency parsing on text in any language
  • Provide easy access to pre-trained annotation models
  • Allow R users to easily construct their own annotation models based on data in CONLL-U format as provided in more than 100 treebanks available at http://universaldependencies.org
  • Don't rely on Python or Java so that R users can easily install this package without configuration hassle
  • No external R package dependencies except what is strictly necessary (Rcpp and data.table, no tidyverse)

Installation & License

The package is available under the Mozilla Public License Version 2.0. Installation can be done as follows. Please visit the package documentation at https://bnosac.github.io/udpipe/en and look at the R package vignettes for further details.

install.packages("udpipe")
vignette("udpipe-tryitout", package = "udpipe")
vignette("udpipe-annotation", package = "udpipe")
vignette("udpipe-universe", package = "udpipe")
vignette("udpipe-usecase-postagging-lemmatisation", package = "udpipe")
# An overview of keyword extraction techniques: https://bnosac.github.io/udpipe/docs/doc7.html
vignette("udpipe-usecase-topicmodelling", package = "udpipe")
vignette("udpipe-parallel", package = "udpipe")
vignette("udpipe-train", package = "udpipe")

For installing the development version of this package: remotes::install_github("bnosac/udpipe", build_vignettes = TRUE)

Example

Currently the package allows you to do tokenisation, tagging, lemmatization and dependency parsing with one convenient function called udpipe().

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel

    language                                                                             file_model
dutch-alpino C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-alpino-ud-2.5-191206.udpipe

x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
            object = udmodel)
x
 doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                                        xpos                               feats head_token_id      dep_rel            misc
   doc1            1           1     1   2       1        1        Ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             2        nsubj            <NA>
   doc1            1           1     4   7       2        2      ging      gaan  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             0         root            <NA>
   doc1            1           1     9  10       3        3        op        op   ADP                                     VZ|init                                <NA>             4         case            <NA>
   doc1            1           1    12  15       4        4      reis      reis  NOUN                  N|soort|ev|basis|zijd|stan              Gender=Com|Number=Sing             2          obl            <NA>
   doc1            1           1    17  18       5        5        en        en CCONJ                                    VG|neven                                <NA>             7           cc            <NA>
   doc1            1           1    20  21       6        6        ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             7        nsubj            <NA>
   doc1            1           1    23  25       7        7       nam     nemen  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             2         conj            <NA>
   doc1            1           1    27  29       8        8       mee       mee   ADP                                      VZ|fin                                <NA>             7 compound:prt   SpaceAfter=No
   doc1            1           1    30  30       9        9         :         : PUNCT                                         LET                                <NA>             7        punct            <NA>
...

Pre-trained models

Pre-trained models built on Universal Dependencies treebanks are made available for more than 65 languages based on 101 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.

These have been made easily available to users of the package through udpipe_download_model.

How good are these models?

Train your own models based on CONLL-U data

The package also allows you to build your own annotation model. For this, you need to provide data in CONLL-U format. These are provided for many languages at https://universaldependencies.org, mostly under the CC-BY-SA license. How this is done is detailed in the package vignette.

vignette("udpipe-train", package = "udpipe")

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

udpipe's People

Contributors

dependabot[bot], dselivanov, jwijffels


udpipe's Issues

Error in keywords_rake: length(relevant) == nrow(x) is not TRUE

Hello,

I am using the function keywords_rake but it keeps throwing the following error. What does it mean?

UDtext <-as.data.table(udpipe_annotate(tagger, sometext$text))
kws <- keywords_rake(UDtext, term = "lemma", group = "doc_id", 
                          relevant = x$xpos %in% c("NN", "JJ"))

Here is the error:

Error in keywords_rake(UDtext, term = "lemma", group = "doc_id", relevant = x$xpos %in%  : 
  length(relevant) == nrow(x) is not TRUE

Is there any way to fix it?
Thank you :)
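
The assertion means that relevant must be a logical vector with exactly one value per row of the annotated data you pass in. A minimal sketch, assuming the intent was to filter on UDtext itself rather than on some other object x:

kws <- keywords_rake(UDtext, term = "lemma", group = "doc_id",
                     relevant = UDtext$xpos %in% c("NN", "JJ"))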

Setting preferences for rake keyword extraction

Hi

deg(w)/freq(w) favours longer keywords and therefore results in extracted keywords that occur in fewer documents.
I wish to extract keywords that are also referenced more often within the set of documents. This is more relevant to the business problem when analysing customer feedback.
So how can we set RAKE to score by deg(w) in order to favour shorter keywords that occur across more feedbacks, i.e. more people are talking about it?
Ideally, I want to capture all references to the extracted keywords. For example, can we get the referenced document frequency rdf(k), the number of feedbacks in which the keyword occurred as a candidate keyword, and the extracted document frequency edf(k), the number of feedbacks from which the keyword was extracted?
We can then find out whether a keyword is exclusive or essential for that set of feedbacks, to inform the business to take action, via edf(k) / rdf(k). Is there a way to get this included?
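
One way to obtain such document frequencies, sketched under assumptions: anno is the annotated data.frame and kw the keywords_rake() output; txt_recode_ngram() recodes the lemma sequence to the extracted keywords before counting the feedbacks (doc_id) they occur in.

library(udpipe)
library(data.table)
anno <- as.data.table(anno)
anno$term <- txt_recode_ngram(anno$lemma, compound = kw$keyword, ngram = kw$ngram, sep = " ")
rdf <- anno[term %in% kw$keyword, list(rdf = uniqueN(doc_id)), by = "term"]
head(rdf[order(-rdf)])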

Words splitted by tokenizer

I have realized that when using for example the Polish model (https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/polish-ud-2.0-170801.udpipe), the tokenizer splits certain complex verbs. For example, the token "Chciałbym" is split into three parts, i.e. "Chciał", "by" and "m", each of which is identified as a separate token with its own token ID, lemma and POS information. The original word token as it appears in the text, in this example "Chciałbym", however, receives no lemma or POS information. For clarity, I pasted the annotated data frame and attached a screenshot.

paragraph_id sentence_id token_id token lemma upos xpos
1: 1 2 1-3 Chciałbym NA NA NA
2: 1 2 1 Chciał chcieć VERB praet:sg:m1:imperf
3: 1 2 2 by być AUX qub
4: 1 2 3 m być AUX aglt:sg:pri:imperf:nwok
5: 1 2 4 w w ADP prep:acc:nwok
6: 1 2 5 sposób sposób NOUN subst:sg:acc:m3
7: 1 2 6 bardzo bardzo ADV adv:pos
8: 1 2 7 jednoznaczny jednoznaczny ADJ adj:sg:nom:m3:pos
(screenshot attached)

Is there a way to suppress this behaviour, thus preventing the tokenizer from splitting such verbs? I am only interested in the original form of such words (i.e. "Chciałbym"), without the suffixes being split off from the verb and tagged/lemmatised independently.
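
A sketch of a post-processing workaround, assuming only the original surface forms are needed: rows whose token_id contains a dash (e.g. "1-3") are the surface forms of multi-word tokens, so keep those and drop the sub-token rows they cover.

library(data.table)
anno <- as.data.table(anno)   # `anno` is the annotated data.frame, hypothetical object
drop_subtokens <- function(d) {
  mwt <- grep("-", d$token_id, value = TRUE)
  if (!length(mwt)) return(d)
  covered <- unlist(lapply(strsplit(mwt, "-"),
                           function(r) seq(as.integer(r[1]), as.integer(r[2]))))
  d[!(d$token_id %in% as.character(covered)), ]
}
surface <- anno[, drop_subtokens(.SD), by = list(doc_id, paragraph_id, sentence_id)]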

Ignore underscore when annotating

Hi there,

Thanks for the package - it's great!

I'm using the package to annotate upos. However, I'm pre-processing the text by replacing specific terms with placeholder tokens. They're marked with an underscore so we know they're not the original word.
e.g. I love Nike > i love brand

However, when I run the annotation function, it processes the underscore as a symbol, rather than as a noun. Is there a way to make it ignore the underscores? I've read through the documentation, but couldn't find anything.
Many thanks
Alan
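
A sketch of a possible workaround, assuming the placeholders stay in one piece after tokenisation (e.g. a token like "brand_x"): annotate first, then recode the POS tag of the placeholder tokens afterwards. All object names here are hypothetical.

x <- as.data.frame(udpipe_annotate(udmodel_english, x = txt))
is_placeholder <- grepl("_", x$token, fixed = TRUE)
x$upos[is_placeholder]  <- "NOUN"
x$lemma[is_placeholder] <- x$token[is_placeholder]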

note to myself

add in docs that cooccurrence.data.frame works in a group-by fashion which does not take a sequence into account,
does not return self-occurrences, and as there is no order (bag of terms) in the output term1 is always smaller than term2; need to formulate this more concisely
while cooccurrence.character goes left to right; maybe need a right-to-left option as well
Note in Biterm Topic Modelling (https://github.com/bnosac/BTM) cooccurrences occur in window which is a bit different
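
A small example illustrating the two behaviours described above, using the built-in brussels_reviews_anno data:

library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr" & upos %in% c("NOUN", "ADJ"))
## group-by fashion: co-occurrence within the same sentence, order-independent, no self-occurrences
cooccurrence(x, term = "lemma", group = c("doc_id", "sentence_id"))
## sequential fashion: terms following one another from left to right
cooccurrence(x$lemma, relevant = rep(TRUE, length(x$lemma)), skipgram = 1)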

external pointer is not valid - newly installed udpipe

R 3.6.0, R Studio, udpipe 0.8.2

Getting "external pointer is not valid" when running udpipe. Model file path is absolute, verified model file exists and is the correct size, 16.7MB.

modelfile <-"/users/cspenn/code/textminingdicts/english-ewt-ud-2.3-181115.udpipe"

## load english model
if (!exists("udmodel_en")) {
  udmodel_en <- udpipe_load_model(modelfile)
}

## first tokenize with udpipe

textdf$doc_id <- seq.int(nrow(textdf))

## split annotation function

# returns a data.table
annotate_splits <- function(x) {
  x <- as.data.table(udpipe_annotate(udmodel_en,
                                     x = x$content,
                                     doc_id = x$doc_id))
  return(x)
}

## run the splits
corpus_splitted <- split(textdf, seq(1, nrow(textdf), by = 100))

## run the multicore
annotation <- future_lapply(corpus_splitted, annotate_splits)

Error text is:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,  : 
  external pointer is not valid

Had no issues prior to installing udpipe fresh from CRAN after R3.6.0. Verified the model exists in the environment:

(screenshot attached)
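
A sketch of a workaround, assuming the loaded model (an external pointer) cannot be shipped to the future workers: pass the model file path and load the model inside each worker instead.

annotate_splits <- function(x, file) {
  model <- udpipe_load_model(file)
  as.data.table(udpipe_annotate(model, x = x$content, doc_id = x$doc_id))
}
annotation <- future_lapply(corpus_splitted, annotate_splits, file = modelfile)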

Why not use data.table?

Not as much an issue as a question. Since you already use data.table in your package, why not go all the way and use it more effectively in your functions?

I have tried doing a txt_freq using data.tables and even on a small dataset as mine, the speed difference is notable:

> system.time({
+   test1 <- txt_freq_DT(udpipe_test_data)
+ })
  bruger   system forløbet 
    0.00     0.00     0.03 
> system.time({
+   test2 <- txt_freq(udpipe_test_data$token)
+ })
  bruger   system forløbet 
    0.15     0.00     0.15 

The custom function I have used looks like this:

library(data.table)

txt_freq_DT <- function(x, exclude = c(NA, NaN), order = TRUE){
  # frequency table of the 'token' column, computed with data.table
  x <- as.data.table(x)
  x <- x[, list(freq = .N), by = list("key" = token)]
  x <- x[!(key %in% exclude)]
  x[, freq_pct := 100 * freq / sum(freq)]
  if(order) x <- x[order(-freq)]
  return(x)
}

Add sentiments tags

Hello guys.
Really enjoyed udpipe for R (and taskscheduleR as well). I have been playing with udpipe's tools for about 3 hours now on Spanish texts. But I've got a request (if I may): is it possible to add whether each word is positive, negative or neutral, or its positivity score in [-1, 1], or which of the 6 sentiments fits best? That way we could really do a much more extensive study of our texts!
Glad guys like you exist.
Cheers.

Improvement of word network visualisation?

First of all: thank you for making udpipe available in R! It's a great package.

I was looking at your example network visualisations and I was wondering if they could be improved by not only showing the different edge sizes but also by showing different word (node) sizes depending on the sum of all edge sizes (of the edges linked to the node). In my opinion this would result in a wordcloud 2.0.
I am curious to hear your opinion on this.
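
A rough sketch of that idea, assuming cooc is the output of cooccurrence() and using igraph/ggraph: node size is set to the sum of the co-occurrence weights of the edges attached to it.

library(igraph)
library(ggraph)
library(ggplot2)
wordnetwork <- head(cooc, 50)
g <- graph_from_data_frame(wordnetwork)
V(g)$weight <- strength(g, weights = E(g)$cooc)   # sum of linked edge weights
ggraph(g, layout = "fr") +
  geom_edge_link(aes(edge_width = cooc), edge_colour = "grey70") +
  geom_node_point(aes(size = weight), colour = "darkred") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()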

udpipe_annotate in parallel

Hi!

I am trying to get udpipe_annotate to work in parallel with the following code:

library(udpipe)
library(data.table)
library(future.apply)
library(janeaustenr)

ud_english <- udpipe_load_model("~/udpipe/english-ewt-ud-2.3-181115.udpipe")

plan(multiprocess, workers = 4L)
x <- janeaustenr::emma[1:1000]
anno <- split(x, seq(1, length(x), by = 50))
anno <- future_lapply(anno, FUN=function(z) udpipe_annotate(object = ud_english, z))

However, it gives me the following error:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,  : 
  external pointer is not valid 

If I do it with lapply, which would only use 1 core, it gives no such error and returns all the CoNLL-U formatted parses. Does the udp_tokenise_tag_parse() function not work when called in parallel?
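
A sketch of a workaround, assuming the loaded model cannot cross the process boundary of the workers: load the model inside each future rather than once in the main session.

anno <- future_lapply(anno, FUN = function(z) {
  m <- udpipe_load_model("~/udpipe/english-ewt-ud-2.3-181115.udpipe")
  udpipe_annotate(object = m, x = z)
})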

In the keywords_phrases() function the is_regex=T option is broken in 0.5

I am running side by side the same code, same data on two machines.

One is on udpipe 0.4 and the other on udpipe 0.5 version.

The keywords_phrases() function is broken on 0.5 if we use is_regex=T

Consider the sample example in your help document.

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
np <- keywords_phrases(x$xpos, pattern = c("DT", "NN", "VB", "RB", "JJ"), sep = "-")
head(np)

The above should work in both 0.4 & 0.5.

Now consider the same example but with the function executed with is_regex=T

np <- keywords_phrases(x$xpos, pattern = c("DTNNVBRBJJ"), term = x$token,is_regex=T)
head(np)
# [1] keyword ngram   pattern start   end    
# <0 rows> (or 0-length row.names)

I tried many regexes, even one as simple as pattern = "DTJJ", but none work. It seems the regex option does not work.

I have also tested that regex works on the machine (an Ubuntu server) by trying out the grep family of commands in R. So regex fails only in the udpipe function.
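
Not sure this explains the 0.4 to 0.5 difference, but for reference: with is_regex = TRUE the pattern is matched against a string of one-letter codes, so the tags are usually recoded first with as_phrasemachine(). A sketch on the same sample data:

x$phrase_tag <- as_phrasemachine(x$upos, type = "upos")
np <- keywords_phrases(x = x$phrase_tag, term = x$token,
                       pattern = "(A|N)*N(P+D*(A|N)*N)*",
                       is_regex = TRUE, detailed = TRUE)
head(np)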

Fix \n in misc

misc output in as.data.frame/udpipe_annotate always ends with \n due to use of std::istringstream
I don't believe this is a big deal but at least it should be fixed in order to correctly reconstruct the original text based on SpaceAfter=No/SpacesAfter/SpacesBefore/SpacesInToken so that a from/to can be added. To be used alongside crfsuite.

how to use the dependency parsing?

Hello,

Thanks again for this great lightweight package.

Something I don't quite get is how to obtain the dependency tree with it.
For instance, consider this simple example

library(udpipe)
dl <- udpipe_download_model(language = "english")
udmodel_en <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")

x <- udpipe_annotate(udmodel_en, 
                     x = "the economy is weak but the outloook is bright")
as.data.frame(x)

> as.data.frame(x)
  doc_id paragraph_id sentence_id                                       sentence token_id    token    lemma  upos xpos
1   doc1            1           1 the economy is weak but the outloook is bright        1      the      the   DET   DT
2   doc1            1           1 the economy is weak but the outloook is bright        2  economy  economy  NOUN   NN
3   doc1            1           1 the economy is weak but the outloook is bright        3       is       be   AUX  VBZ
4   doc1            1           1 the economy is weak but the outloook is bright        4     weak     weak   ADJ   JJ
5   doc1            1           1 the economy is weak but the outloook is bright        5      but      but CCONJ   CC
6   doc1            1           1 the economy is weak but the outloook is bright        6      the      the   DET   DT
7   doc1            1           1 the economy is weak but the outloook is bright        7 outloook outloook  NOUN   NN
8   doc1            1           1 the economy is weak but the outloook is bright        8       is       be   AUX  VBZ
9   doc1            1           1 the economy is weak but the outloook is bright        9   bright   bright   ADJ   JJ
                                                  feats head_token_id dep_rel deps            misc
1                             Definite=Def|PronType=Art             2     det <NA>            <NA>
2                                           Number=Sing             4   nsubj <NA>            <NA>
3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4     cop <NA>            <NA>
4                                            Degree=Pos             0    root <NA>            <NA>
5                                                  <NA>             9      cc <NA>            <NA>
6                             Definite=Def|PronType=Art             7     det <NA>            <NA>
7                                           Number=Sing             9   nsubj <NA>            <NA>
8 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             9     cop <NA>            <NA>
9                                            Degree=Pos             4    conj <NA> SpacesAfter=\\n

From this output I do not see how I can associate weak to economy and bright to outlook. Am I missing something with this package?

thanks!!
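
A sketch of how the head relation can be used: join each token with its head token on head_token_id within the sentence; the nsubj rows then link 'economy' to 'weak' and 'outloook' to 'bright'.

x <- as.data.frame(x)
deps <- merge(x, x,
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
              all.x = TRUE, all.y = FALSE, suffixes = c("", "_parent"), sort = FALSE)
subset(deps, dep_rel %in% "nsubj", select = c("token", "dep_rel", "token_parent"))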

no issue, just congratulations!

amazing job, amazing package. I love the as.data.frame function (can you use dplyr::data_frame as well?). I hope the package will get better and better. what are the next updates?

Thanks!!

Space optimization

Do you have a recommendation on how to manage the memory taken by executing as.data.frame(udpipe_annotate(udmodel, x=documents))? Is there anything similar to database storage, like (I think) tm does?

When I run that line on a collection of 10 documents, that are quite normal blog articles, the resulting data frame takes about 2.5MB in memory. Of course, I am planning to do it on many more documents and am unsure whether memory will be an issue or not.

P.S. For this small sample where the model is 2.5MB for 20748 obs. of 14 variables, the computation takes approx 20 seconds. Seems a lot to me, but is that in line with your benchmarks?

Thanks!

Preserving document order

When document_term_matrix() is called after document_term_frequencies(), the resulting dtm is ordered differently from the original character vector. Although it is possible to correctly order the dtm using the document ids, this behavior seems undesirable.
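
A sketch of the reordering, assuming doc_ids is a hypothetical character vector holding the document identifiers in their original order: match them against the dtm rownames.

dtm <- dtm[match(doc_ids, rownames(dtm)), , drop = FALSE]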

How to use textrank with tfidf

Hi,

I have been following your tutorial here and interested in finding out how to use your [textRank](https://CRAN.R-project.org/package=textrank) algorithm when done building LDA models as you mention in Use Case II?

Do you have any example code? That would be ideal.

Best,

ambiguities in the documentation?

Hi @jwijffels ,

I was looking at your improved documentation and it looks really great.

Just a quick question if you have 2 min. In the keywords_phrases() function you are using the regex "(A|N)*N(P+D*(A|N)*N)*" without going into too much detail.

Can you just explain what does that mean exactly?

Error in udpipe_annotate: "external pointer is not valid"

Problem description
I try to annotate the text using R example code from the guidance below.
(https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html#udpipe_the_c++_library)
A week ago, the code ran very well and no errors occurred.

But suddenly, error output appeared in the R console. I don't know why the code stops.
I have reinstalled R and RStudio several times to delete the hidden files etc. in many directories.

Sometimes, the RStudio console even stops displaying any output after executing udpipe_annotate. No error output is produced in the R console.

Other R codes run very well. Only the UDPIPE code doesn't work.
How can I solve the problem?
(A week ago, I used R 3.3.2, I'm now using R 3.5.1)

First Error: Error output appear in R console

library(udpipe)
dl <- udpipe_download_model(language = "dutch")

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/dutch-ud-2.0-170801.udpipe to C:/Analysis/R_Analysis/2018년/20181101_udpipe/dutch-ud-2.0-170801.udpipe
trying URL 'https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/dutch-ud-2.0-170801.udpipe'
Content type 'application/octet-stream' length 19992491 bytes (19.1 MB)
downloaded 19.1 MB

str(dl)

'data.frame': 1 obs. of 3 variables:
$ language : chr "dutch"
$ file_model: chr "./20181101_udpipe/dutch-ud-2.0-170801.udpipe"
$ url : chr "https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/dutch-ud-2.0-170801.udpipe"

#Either give a file in the current working directory
udmodel_dutch <- udpipe_load_model(file = "dutch-ud-2.0-170801.udpipe")
#Or give the full path to the file
udmodel_dutch <- udpipe_load_model(file = dl$file_model)
dl$file_model

[1] "./20181101_udpipe/dutch-ud-2.0-170801.udpipe"

txt <- c("Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt? Jazeker meneer",

  •      "Het gaat vooruit, het gaat verbazend goed vooruit")
    

x <- udpipe_annotate(udmodel_dutch, x = txt)
Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, :
external pointer is not valid

  1. stop(structure(list(message = "external pointer is not valid",
    call = udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer,
    tagger, parser, log_every, log_now), cppstack = structure(list(
    file = "", line = -1L, stack = "C++ stack not available on this system"), class = "Rcpp_stack_trace")), class = c("Rcpp::exception", ...
  2. udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger,
    parser, log_every, log_now)
  3. udpipe_annotate(udmodel_dutch, x = txt)

Second Error: R studio console even stops displaying any output after executing udpipe_annotate

ud_model <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
x <- udpipe_annotate(ud_model, x = comments$comments)
dd
dd
dd

Expected results or ordinary results are below.

ddd
Error: object 'ddd' not found
dd
Error: object 'dd' not found
dd
Error: object 'dd' not found
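
A sketch of a first thing to try, assuming the loaded model did not survive an R restart or a saved/restored workspace (the model object is an external pointer): reload it in every fresh session before annotating.

udmodel_dutch <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_dutch, x = txt)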

txt_nextgram leaves NA values

If you use the txt_nextgram function depending on the n-gram you choose you get n-1 NA values.

x <- sprintf("%s%s", LETTERS, 1:26)
txt_nextgram(x, n = 2)
[1] "A1 B2"   "B2 C3"   "C3 D4"   "D4 E5"   "E5 F6"   "F6 G7"   "G7 H8"   "H8 I9"   "I9 J10"  "J10 K11" "K11 L12" "L12 M13" "M13 N14"
[14] "N14 O15" "O15 P16" "P16 Q17" "Q17 R18" "R18 S19" "S19 T20" "T20 U21" "U21 V22" "V22 W23" "W23 X24" "X24 Y25" "Y25 Z26" NA    

Since the NA's are not meaningful when creating n-grams like this, removing them might be the best option.
Just adding out <- out[!is.na(out)] to the function would do the trick.

sentence demarcation in POS tagging fails with a period at times and not always.

I have been trying to POS tag legal documents, but in many places the udpipe R package breaks a sentence into 2 sentences when it encounters a period that is not actually an end-of-sentence marker. For example, if a sentence is

  1. In Moti Laminates Pvt. Ltd. v. Collector of Central Excise, Ahmedabad 1995(76) E.L.T.241(SC) we get a clue of an important principle, namely, principle of equivalence .

we get the sentence broken after the 4th PUNCT token ".", not the first. This means there is some logic to handle false end-of-sentence detection: it detected that the PUNCT after token=18 is not an end of the sentence, and it also did not end the sentence after the token 'Pvt'. Then why did it end after the token 'v'?

I suspected the udpipe English model checks the capitalisation of the next token to decide on sentence ending, but that doesn't happen here.

Here is the document term matrix subset with the above text of two sentences (which is actually one).

Could you suggest a way to avoid false detections like these?

Thank you very much.

xx.txt

keywords_phrases broken. Alternatives to updating gcc?

Hi @jwijffels, I have the same issue as #20, however I don't have the possibility to update gcc. Are there other possible solutions? Can't you use the stringr package in the udpipe code instead? It's really the first time ever that something breaks because of Linux. What do you think?

Thanks!

xpos meaning in different languages

Hi,

I would like to know where I can find the corresponding meaning of each acronym in the field xpos when using different languages. At the current stage I am working on Italian text.

thank you

Bylee as a lemma

I see the word bylee coming up as a lemma for some pieces of text. Example:

> library(udpipe)
> udmodel = udpipe_download_model(language = "english")
> x <- udpipe(x = "bylaw 1234 - a bylaw is a thing", object = udmodel)
> x[1,]
  doc_id paragraph_id sentence_id                        sentence start end
1   doc1            1           1 bylaw 1234 - a bylaw is a thing     1   5
  term_id token_id token lemma upos xpos       feats head_token_id dep_rel deps
1       1        1 bylaw bylee NOUN   NN Number=Sing             8   nsubj <NA>
  misc
1 <NA>

keywords_phrases output

This isn't an issue, rather just a question! I've been exploring the package and love it, and perhaps this is an oversight by me, but is there a way to output the doc_id when using keywords_phrases?

For example, the code below returns phrases from 'x', but ideally I'd like to associate each phrase with the doc_id it can be found in. I'm not an R expert - so apologies if I'm overlooking something.
s <- udpipe_annotate(udmodel_english, 'text data frame')

x <- data.frame(s)

Phrases <- keywords_phrases(x = x$phrase_tag, term = tolower(x$token), sep = " ",
                            pattern = "(A|N)*N(P+D*(A|N)*N)*",
                            is_regex = TRUE, detailed = TRUE)
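
A sketch of one way to keep the doc_id, assuming x is the annotated data.frame with a phrase_tag column as in the snippet above: run the extraction per document and bind the results.

phrases <- lapply(split(x, x$doc_id), function(d) {
  keywords_phrases(x = d$phrase_tag, term = tolower(d$token), sep = " ",
                   pattern = "(A|N)*N(P+D*(A|N)*N)*",
                   is_regex = TRUE, detailed = TRUE)
})
phrases <- data.table::rbindlist(phrases, idcol = "doc_id")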

Error: Character string limit

I am trying to tokenise, lemmatise and parts-of-speech tag a large corpus of English texts. There are approximately 160,000 texts in the corpus, totaling approximately 46 million tokens, which means that on average the individual texts are relatively short (approx. 290 tokens). Following the example of the brussels_reviews dataset, the corpus is stored in a data table with individual raw texts in one column (see attached screenshot of corpus sample).
(screenshot of corpus sample attached)

As usual, I try to parse the corpus by calling

model <- udpipe_load_model(file = model_path)
anno.dt <- as.data.table(udpipe_annotate(model, x = corpus.dt$text,
                                                 doc_id = corpus.dt$doc_id,
                                                 tagger = "default",
                                                 parser = "none"))

This works like a charm for a Polish corpus of approximately 13 million tokens. However, for English I get the following error message:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, : R character strings are limited to 2^31-1 bytes

I am puzzled by this error, because the longest of all texts in the corpus is only 32,458 characters long, as evidenced by the column textlen from the corpus data table. The sum of all characters across all texts is 274,025,244, which is less than the limit stipulated by the error message. As a further sanity check, I split the whole corpus on white spaces by corpus.dt[, .(word = unlist(strsplit(text, "[[:space:]]")), doc_nr = .GRP), by = .(doc_id)] in order to assess the length of the longest token in the corpus, and this method revealed that the longest token in the corpus (obtained by the simple white-space-splitting procedure) is only 114 characters long. Therefore, I have no idea what could have caused the error.

Do you have any idea or suggestion as to what could have gone wrong? As mentioned before, everything worked perfectly fine with a smaller Polish corpus. I will appreciate any comment in this regard, because otherwise your UDPipe implementation does everything I need without any problems.
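
A sketch of a possible workaround, under the assumption that udpipe_annotate() builds one CONLL-U string for the whole call, which for 46 million tokens can exceed the 2^31-1 byte limit even if every individual text is short: annotate in chunks so that no single string gets that large.

library(data.table)
chunks <- split(corpus.dt, ceiling(seq_len(nrow(corpus.dt)) / 10000))
anno.dt <- rbindlist(lapply(chunks, function(d) {
  as.data.table(udpipe_annotate(model, x = d$text, doc_id = d$doc_id,
                                tagger = "default", parser = "none"))
}))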

Foreign symbols are not parsed well

I have Dutch texts with words like "Carrière". Annotating these texts was never a problem. Until now. I am not sure why or what I did wrong, but suddenly udpipe_annotate changes e.g. Carrière into the token Carri<U+653C><U+3E38>. Any ideas where to look or how to solve this? Many thanks in advance! I checked RStudio and the encoding settings seem right (UTF-8).
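
A sketch of a first thing to try, assuming the texts are not declared as UTF-8 in R (udpipe expects UTF-8 encoded input); txt is a hypothetical character vector holding the Dutch texts.

txt <- enc2utf8(txt)
x <- udpipe_annotate(udmodel, x = txt)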

cooccurrence method error

Hi,

I am trying to find co-occurrence of Nouns and Adjectives.
Below line executes fine:

cooc <- cooccurrence(x = subset(verbatim_tokens, upos %in% c("NOUN", "ADJ")), term = "lemma", group = c("doc_id", "paragraph_id", "sentence_id"))

When I type cooc in the console, it prints out fine and View(cooc) also works fine. But when I do head(cooc) within an Rmd document it prints this weird error: Error in [.data.table(data, is_list) : i is not found in calling scope and it is not a column of type logical. When the first argument inside DT[...] is a single symbol, data.table looks for it in calling scope.

English tokenizer issues

Hello,

Firstly, thank you for the great resource! I've found it very useful in my work.

The reason I'm writing is some strange tokenizing errors I'm seeing with the UD model for English (english-ud-2.0-170801.udpipe)

For instance, "gets" and "figures" are being split as if it's apostrophe-s possessive. (get + s, figure + s)

I know I can build my own models, so I will look at the documentation in order to do that. But I wondered if this error type rings any bells: do you think I should be looking at the training data or parameter settings first?

thank you! Andrew

udpipe_download_model overwrite=TRUE ?

Hello !
It seems to me that the default for the overwrite parameter should be FALSE; this way the model will only be downloaded if it is not already there.

  if(overwrite || !file.exists(to)){
    message(sprintf("Downloading udpipe model from %s to %s", url, to))
    utils::download.file(url = url, destfile = to, mode = "wb")  
  }

Cheers !

Choosing the right keyword detection technique through udpipe.

Hi @jwijffels ,

Hope you are doing good. Thanks for building udpipe in R which is really useful for POS tagging and keyword detection.

Below is a rather lengthy query, I have encountered while using udpipe's English model on a column of text.

Using the link: https://bnosac.github.io/udpipe/docs/doc7.html, I was testing 2 such approaches a. RAKE and b. Dependency Parsing for keyword detection technique. During this comparison, I found that the dependency parsing approach results in reporting a lower count of certain phrases as compared to the RAKE approach.

The input data is attached here:
Book1.xlsx

The R Code (in .txt format) is attached here:
POS_Viz_and_Dep_Parsing_v1.txt

Line # 125 to 127 in this code runs the RAKE algorithm and identifies the keyword phrases containing only the Nouns or Adjectives. One of the keyword phrases is "good product", which has a count of 139, as seen in the object top_phrases_noun_adj.

Line # 133 to 151 computes the nominal subjects through dependency parsing. It uses the dep_rel as "nsubj" & upos %in% c("NOUN") & upos_parent %in% c("ADJ") to identify such phrases. Looking at the same keyword phrase i.e. "good product" in object dep_parse_nsubj3 gives a count of 18.

This is definitely lower than RAKE as there are a couple of extra filter conditions related to the POS of the parent word and dep_rel = nsubj, which seems to be the expected behavior.

Next, I modified the dependency parsing code: Line # 154 to 171 computes the nominal subjects in a different way than Line # 133 to 151, this time including only the filter condition upos %in% c("NOUN", "ADJ"). This is the same condition as used in the lines which compute the keywords through the RAKE approach.

The same keyword "good product" as seen in the rewritten object dep_parse_nsubj3 now has a count of 21 while its reverse term i.e. "product good" has a count of 156.

I am unsure why the 2nd variation of the dependency parsing code gives a lower count of this keyword term as compared to the RAKE approach.

Moreover, could you suggest which approach is better in terms of identifying keywords as these can be further used for other tasks like low level theme detection/bucketing of sentences and so on.

Thank you for your help in advance and best regards.

as.data.frame error

May you please help me with this problem?

After running

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel, x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)

I have the following error:

Error in data.table::setDF(out) :
setDF only accepts data.table, data.frame or list of equal length as input

"x" looks like in the following:

$x
[1] "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur."

$conllu
[1] "# newdoc id = doc1\n# newpar\n# sent_id = 1\n# text = Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.\n1\tIk\tik\tPRON\tPron|per|1|ev|nom\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t2\tnsubj\t_\t_\n2\tging\tga\tVERB\tV|intrans|ovt|1of2of3|ev\tAspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n3\top\top\tADP\tPrep|voor\tAdpType=Prep\t4\tcase\t_\t_\n4\treis\treis\tNOUN\tN|soort|ev|neut\tNumber=Sing\t2\tobj\t_\t_\n5\ten\ten\tCCONJ\tConj|neven\t_\t7\tcc\t_\t_\n6\tik\tik\tPRON\tPron|per|1|ev|nom\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t7\tnsubj\t_\t_\n7\tnam\tneem\tVERB\tV|trans|ovt|1of2of3|ev\tAspect=Imp|Mood=Ind|Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Fin\t2\tconj\t_\t_\n8\tmee\tmee\tADV\tAdv|deelv\tPartType=Vbp\t7\tcompound:prt\t_\tSpaceAfter=No\n9\t:\t:\tPUNCT\tPunc|dubbpunt\tPunctType=Colo\t2\tpunct\t_\t_\n10\tmijn\tmijn\tPRON\tPron|bez|1|ev|neut|attr\tNumber=Sing|Person=1|Poss=Yes|PronType=Prs\t11\tnmod\t_\t_\n11\tlaptop\tlaptop\tNOUN\tN|soort|ev|neut\tNumber=Sing\t2\tnsubj\t_\tSpaceAfter=No\n12\t,\t,\tPUNCT\tPunc|komma\tPunctType=Comm\t11\tpunct\t_\t_\n13\tmijn\tmijn\tPRON\tPron|bez|1|ev|neut|attr\tNumber=Sing|Person=1|Poss=Yes|PronType=Prs\t14\tnmod\t_\t_\n14\tzonnebril\tzonnebril\tNOUN\tN|soort|ev|neut\tNumber=Sing\t11\tappos\t_\t_\n15\ten\teen\tCCONJ\tConj|neven\t_\t17\tcc\t_\t_\n16\tgoed\tgoed\tADJ\tAdj|attr|stell|onverv\tDegree=Pos\t17\tamod\t_\t_\n17\thumeur\thumeur\tNOUN\tN|soort|ev|neut\tNumber=Sing\t14\tconj\t_\tSpaceAfter=No\n18\t.\t.\tPUNCT\tPunc|punt\tPunctType=Peri\t2\tpunct\t_\tSpacesAfter=\n\n\n"

$errors
[1] ""

attr(,"class")
[1] "udpipe_connlu"

Thank you.
Best,

Marco

dtm_remove_terms: Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions

Hi,

I have created a dtm and removed the sparse terms.

library(tm)
library(dplyr)

samp = datsub %>%
  select(Reviews) %>%
  sample_n(2)

dtm = corpus = Corpus(VectorSource(samp$Reviews)) 

dtm = DocumentTermMatrix(corpus)

dtm = removeSparseTerms(dtm, 0.98)

However, some terms are still useless, so I tried:

library(udpipe)

useless_terms = c("buy")

dtm_remove_terms(dtm = dtm, terms = useless_terms)

But I get this error:

`Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions`
dput(samp)
structure(list(Reviews = c("problem electric connector appropriate fot phone", 
"great phone even good pricebe sure buy sim small sim model large type"
)), row.names = c(352L, 4907L), class = "data.frame")

How to deal with paragraphs?

The annotation output contains a column for identifying the paragraph. In what format should paragraph boundaries be encoded in the input character vectors? I have not found any information on this in the documentation. Therefore, all multi-sentence character vectors are always being parsed as one paragraph only.
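
A sketch, under the assumption that the tokenizer starts a new paragraph at an empty line, i.e. a "\n\n" inside the input string (udmodel is a downloaded/loaded model as in the examples above):

x <- udpipe("First paragraph, first sentence. Still the first paragraph.\n\nSecond paragraph.",
            object = udmodel)
table(x$paragraph_id)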

Predicting topics seems not consistent

Thanks again for this brilliant package!

I ran into something peculiar. Maybe you know why this happens.
When I try to predict topics of new documents using a trained LDA model, I don't get the same predictions every time when running the prediction script several times.

Following your example I tried this script:


library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")

ud_model <- udpipe_download_model(language = "french")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)
x$topic_level_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id"))
dtf <- subset(x, upos %in% c("NOUN"))
dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 5)
dtm_clean <- dtm_remove_terms(dtm_clean, terms = c("appartement", "appart", "eter"))
dtm_clean <- dtm_remove_tfidf(dtm_clean, top = 50)
library(topicmodels)
training <- dtm_clean[1:400,]
newdata <- dtm_clean[401:403,]
m <- LDA(training, k = 4, method = "Gibbs", 
         control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
scoreslist <- list()
for (i in 1:10){
scores <- predict(m, newdata = newdata, type = "topics", 
                  labels = c("labela", "labelb", "labelc", "xyz"))
scoreslist[[i]] <- scores
}
scoreslist

Most of the probabilities are the same over and over again. However, you will also notice that sometimes the probabilities differ. Isn't that peculiar? Wouldn't you expect exactly the same outcome every time you run the script?

When I follow the solution in this SO topic, using the posterior function directly instead of the udpipe prediction function, there seems to be no change in the outcome when I run it several times:

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100,]
test <- AssociatedPress[149:150,]

train.lda <- LDA(train,5)
scoreslist <- list()
for (i in 1:10){
test.topics <- posterior(train.lda,test)
scoreslist[[i]] <- test.topics[[2]]
}
scoreslist

Is there a difference in the way udpipe makes a document term matrix that causes this problem? Are you familiar with this problem, and do you know how to solve it?
Most appreciated!

Kaggle Kernel could not find function "keywords_rake"

The function to use RAKE works very well on my local machine, but when I try to run it on a Kaggle Kernel it just doesn't work:

  could not find function "keywords_rake"
Calls: render ... handle -> withCallingHandlers -> withVisible -> eval -> eval

Any possible suggestion?

comparing noun chunks with Spacy

Hello,

I am trying to extract noun chunks using Spacy and Udpipe and I start realizing how much easier udpipe is to use.

However, I was not able to replicate the noun chunk extraction that I get using Spacy


import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward

Can we get these noun chunks with udpipe as well?

Thanks!

Fatal error: " external pointer is not valid"

When running the example code for udpipe I get the following error:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, :
external pointer is not valid

Steps to reproduce:
library(udpipe)
dl <- udpipe_download_model(language = "dutch")
dl
udmodel_dutch <- udpipe_load_model(file = "dutch-ud-2.0-170801.udpipe")
x <- udpipe_annotate(udmodel_dutch,
x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.")
x <- as.data.frame(x)
x

I'm using Microsoft R Open - R version 3.4.0 (2017-04-21).
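
A sketch of a first thing to check, assuming the model file is not in the current working directory: load it via the path returned by udpipe_download_model() instead of a bare file name.

udmodel_dutch <- udpipe_load_model(file = dl$file_model)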
