bnosac / rdrpostagger
R package for Ripple Down Rules-based Part-Of-Speech Tagging (RDRPOS). On more than 45 languages.
Hello, I'm a newbie and I'm trying to get RStudio working for my final project. I got an error while doing POS tagging: I want to turn the rows of my .csv data into sentences for tagging. The code:
# set the working directory
setwd("F:/SKRIPSI/POS TAGGING/RDRPOSTagger-master")
# clear the workspace
rm(list = ls())
#install.packages("rJava")
#install.packages("data.table")
#install.packages("RDRPOSTagger", repos = "http://www.datatailor.be/rcube", type = "source")
#install.packages("remotes")
#remotes::install_github("bnosac/RDRPOSTagger")
library(tm)
library(NLP)
library(RDRPOSTagger)
library(tokenizers)
models <- rdr_available_models()
models$POS$language
models$MORPH$language
models$UniversalPOS$language
#x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
#tagger <- rdr_model(language = "English", annotation = "POS")
#rdr_pos(tagger, x = x)
x <- c("aku mau makan ku ingat kamu aku mau tidur juga ku ingat kamu",
"aku sedang bosan ku ingat kamu oh tuhan mungkin kah ku jatuh cinta",
" ", "", NA)
#tagger <- rdr_model(language = "Indonesian", annotation = "MORPH")
#rdr_pos(tagger, x = x)
tagger <- rdr_model(language = "Indonesian", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)
x <- read.csv("dataset_twitter_stopword.csv", stringsAsFactors = FALSE, header = FALSE)
tagger <- rdr_model(language = "Indonesian", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)
This produces the following output (the tokens from the .csv file do not show up, and every POS is written only as PUNCT):
# set the working directory
> setwd("F:/SKRIPSI/POS TAGGING/RDRPOSTagger-master")
>
> # clear the workspace
> rm(list = ls())
>
> #install.packages("rJava")
> #install.packages("data.table")
> #install.packages("RDRPOSTagger", repos = "http://www.datatailor.be/rcube", type = "source")
> #install.packages("remotes")
> #remotes::install_github("bnosac/RDRPOSTagger")
>
> library(tm)
> library(NLP)
> library(RDRPOSTagger)
> library(tokenizers)
>
> models <- rdr_available_models()
> models$POS$language
[1] "English" "French" "German" "Hindi" "Italian" "Thai" "Vietnamese"
> models$MORPH$language
[1] "Bulgarian" "Czech" "Dutch" "French" "German" "Portuguese" "Spanish" "Swedish"
> models$UniversalPOS$language
[1] "Ancient_Greek-PROIEL" "Ancient_Greek" "Arabic" "Basque" "Belarusian" "Bulgarian" "Catalan" "Chinese"
[9] "Coptic" "Croatian" "Czech-CAC" "Czech-CLTT" "Czech" "Danish" "Dutch-LassySmall" "Dutch"
[17] "English-LinES" "English-ParTUT" "English" "Estonian" "Finnish-FTB" "Finnish" "French-ParTUT" "French-Sequoia"
[25] "French" "Galician-TreeGal" "Galician" "German" "Gothic" "Greek" "Hebrew" "Hindi"
[33] "Hungarian" "Indonesian" "Irish" "Italian-ParTUT" "Italian" "Japanese" "Korean" "Latin-ITTB"
[41] "Latin-PROIEL" "Latin" "Latvian" "Lithuanian" "Norwegian-Bokmaal" "Norwegian-Nynorsk" "Old_Church_Slavonic" "Persian"
[49] "Polish" "Portuguese-BR" "Portuguese" "Romanian" "Russian-SynTagRus" "Russian" "Slovak" "Slovenian-SST"
[57] "Slovenian" "Spanish-AnCora" "Spanish" "Swedish-LinES" "Swedish" "Tamil" "Turkish" "Urdu"
[65] "Vietnamese"
>
> #x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
> #tagger <- rdr_model(language = "English", annotation = "POS")
> #rdr_pos(tagger, x = x)
>
> x <- c("aku mau makan ku ingat kamu aku mau tidur juga ku ingat kamu",
+ "aku sedang bosan ku ingat kamu oh tuhan mungkin kah ku jatuh cinta",
+ " ", "", NA)
> #tagger <- rdr_model(language = "Indonesian", annotation = "MORPH")
> #rdr_pos(tagger, x = x)
>
> tagger <- rdr_model(language = "Indonesian", annotation = "UniversalPOS")
> rdr_pos(tagger, x = x)
Column 1 ['doc.id'] of item 3 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
doc_id token_id token pos
1 d1 1 aku PRON
2 d1 2 mau ADV
3 d1 3 makan VERB
4 d1 4 ku PRON
5 d1 5 ingat VERB
6 d1 6 kamu PRON
7 d1 7 aku PRON
8 d1 8 mau ADV
9 d1 9 tidur VERB
10 d1 10 juga ADV
11 d1 11 ku PRON
12 d1 12 ingat VERB
13 d1 13 kamu PRON
14 d2 1 aku PRON
15 d2 2 sedang ADV
16 d2 3 bosan NOUN
17 d2 4 ku PRON
18 d2 5 ingat VERB
19 d2 6 kamu PRON
20 d2 7 oh NOUN
21 d2 8 tuhan NOUN
22 d2 9 mungkin ADV
23 d2 10 kah NOUN
24 d2 11 ku PRON
25 d2 12 jatuh VERB
26 d2 13 cinta NOUN
27 d3 0 <NA> <NA>
28 d4 0 <NA> <NA>
29 d5 0 <NA> <NA>
>
> x <- read.csv("dataset_twitter_stopword.csv", stringsAsFactors = FALSE, header = FALSE)
> tagger <- rdr_model(language = "Indonesian", annotation = "UniversalPOS")
> rdr_pos(tagger, x = x)
doc_id token_id token pos
1 d1 1 '' PUNCT
2 d1 2 , PUNCT
3 d1 3 '' PUNCT
4 d1 4 , PUNCT
5 d1 5 '' PUNCT
6 d1 6 , PUNCT
7 d1 7 '' PUNCT
8 d1 8 , PUNCT
9 d1 9 '' PUNCT
10 d1 10 , PUNCT
11 d1 11 '' PUNCT
12 d1 12 , PUNCT
13 d1 13 '' PUNCT
14 d1 14 , PUNCT
15 d1 15 '' PUNCT
16 d1 16 , PUNCT
17 d1 17 '' PUNCT
18 d1 18 , PUNCT
19 d1 19 '' PUNCT
20 d1 20 , PUNCT
21 d1 21 '' PUNCT
22 d1 22 , PUNCT
23 d1 23 '' PUNCT
24 d1 24 , PUNCT
25 d1 25 '' PUNCT
26 d1 26 , PUNCT
27 d1 27 '' PUNCT
28 d1 28 , PUNCT
29 d1 29 '' PUNCT
30 d1 30 , PUNCT
31 d1 31 '' PUNCT
32 d1 32 , PUNCT
33 d1 33 '' PUNCT
34 d1 34 , PUNCT
35 d1 35 '' PUNCT
36 d1 36 , PUNCT
37 d1 37 '' PUNCT
38 d1 38 , PUNCT
39 d1 39 '' PUNCT
40 d1 40 , PUNCT
41 d1 41 '' PUNCT
42 d1 42 , PUNCT
43 d1 43 '' PUNCT
44 d1 44 , PUNCT
45 d1 45 '' PUNCT
46 d1 46 , PUNCT
47 d1 47 '' PUNCT
48 d1 48 , PUNCT
49 d1 49 '' PUNCT
50 d1 50 , PUNCT
51 d1 51 '' PUNCT
52 d1 52 , PUNCT
53 d1 53 '' PUNCT
54 d1 54 , PUNCT
55 d1 55 '' PUNCT
56 d1 56 , PUNCT
57 d1 57 '' PUNCT
58 d1 58 , PUNCT
59 d1 59 '' PUNCT
60 d1 60 , PUNCT
61 d1 61 '' PUNCT
62 d1 62 , PUNCT
63 d1 63 '' PUNCT
64 d1 64 , PUNCT
65 d1 65 '' PUNCT
66 d1 66 , PUNCT
67 d1 67 '' PUNCT
68 d1 68 , PUNCT
69 d1 69 '' PUNCT
70 d1 70 , PUNCT
71 d1 71 '' PUNCT
72 d1 72 , PUNCT
73 d1 73 '' PUNCT
74 d1 74 , PUNCT
75 d1 75 '' PUNCT
76 d1 76 , PUNCT
77 d1 77 '' PUNCT
78 d1 78 , PUNCT
79 d1 79 '' PUNCT
80 d1 80 , PUNCT
81 d1 81 '' PUNCT
82 d1 82 , PUNCT
83 d1 83 '' PUNCT
84 d1 84 , PUNCT
85 d1 85 '' PUNCT
86 d1 86 , PUNCT
87 d1 87 '' PUNCT
88 d1 88 , PUNCT
89 d1 89 '' PUNCT
90 d1 90 , PUNCT
91 d1 91 '' PUNCT
92 d1 92 , PUNCT
93 d1 93 '' PUNCT
94 d1 94 , PUNCT
95 d1 95 '' PUNCT
96 d1 96 , PUNCT
97 d1 97 '' PUNCT
98 d1 98 , PUNCT
99 d1 99 '' PUNCT
100 d1 100 , PUNCT
101 d1 101 '' PUNCT
102 d1 102 , PUNCT
103 d1 103 '' PUNCT
104 d1 104 , PUNCT
105 d1 105 '' PUNCT
106 d1 106 , PUNCT
107 d1 107 '' PUNCT
108 d1 108 , PUNCT
109 d1 109 '' PUNCT
110 d1 110 , PUNCT
111 d1 111 '' PUNCT
112 d1 112 , PUNCT
113 d1 113 '' PUNCT
114 d1 114 , PUNCT
115 d1 115 '' PUNCT
116 d1 116 , PUNCT
117 d1 117 '' PUNCT
118 d1 118 , PUNCT
119 d1 119 '' PUNCT
120 d1 120 , PUNCT
121 d1 121 '' PUNCT
122 d1 122 , PUNCT
123 d1 123 '' PUNCT
124 d1 124 , PUNCT
125 d1 125 '' PUNCT
126 d1 126 , PUNCT
127 d1 127 '' PUNCT
128 d1 128 , PUNCT
129 d1 129 '' PUNCT
130 d1 130 , PUNCT
131 d1 131 '' PUNCT
132 d1 132 , PUNCT
133 d1 133 '' PUNCT
134 d1 134 , PUNCT
135 d1 135 '' PUNCT
136 d1 136 , PUNCT
137 d1 137 '' PUNCT
138 d1 138 , PUNCT
139 d1 139 '' PUNCT
140 d1 140 , PUNCT
141 d1 141 '' PUNCT
142 d1 142 , PUNCT
143 d1 143 '' PUNCT
144 d1 144 , PUNCT
145 d1 145 '' PUNCT
146 d1 146 , PUNCT
147 d1 147 '' PUNCT
148 d1 148 , PUNCT
149 d1 149 '' PUNCT
150 d1 150 , PUNCT
151 d1 151 '' PUNCT
152 d1 152 , PUNCT
153 d1 153 '' PUNCT
154 d1 154 , PUNCT
155 d1 155 '' PUNCT
156 d1 156 , PUNCT
157 d1 157 '' PUNCT
158 d1 158 , PUNCT
159 d1 159 '' PUNCT
160 d1 160 , PUNCT
161 d1 161 '' PUNCT
162 d1 162 , PUNCT
163 d1 163 '' PUNCT
164 d1 164 , PUNCT
165 d1 165 '' PUNCT
166 d1 166 , PUNCT
167 d1 167 '' PUNCT
168 d1 168 , PUNCT
169 d1 169 '' PUNCT
170 d1 170 , PUNCT
171 d1 171 '' PUNCT
172 d1 172 , PUNCT
173 d1 173 '' PUNCT
174 d1 174 , PUNCT
175 d1 175 '' PUNCT
176 d1 176 , PUNCT
177 d1 177 '' PUNCT
178 d1 178 , PUNCT
179 d1 179 '' PUNCT
180 d1 180 , PUNCT
181 d1 181 '' PUNCT
182 d1 182 , PUNCT
183 d1 183 '' PUNCT
184 d1 184 , PUNCT
185 d1 185 '' PUNCT
186 d1 186 , PUNCT
187 d1 187 '' PUNCT
188 d1 188 , PUNCT
189 d1 189 '' PUNCT
190 d1 190 , PUNCT
191 d1 191 '' PUNCT
192 d1 192 , PUNCT
193 d1 193 '' PUNCT
194 d1 194 , PUNCT
195 d1 195 '' PUNCT
196 d1 196 , PUNCT
197 d1 197 '' PUNCT
198 d1 198 , PUNCT
199 d1 199 '' PUNCT
200 d1 200 , PUNCT
201 d1 201 '' PUNCT
202 d1 202 , PUNCT
203 d1 203 '' PUNCT
204 d1 204 , PUNCT
205 d1 205 '' PUNCT
206 d1 206 , PUNCT
207 d1 207 '' PUNCT
208 d1 208 , PUNCT
209 d1 209 '' PUNCT
210 d1 210 , PUNCT
211 d1 211 '' PUNCT
212 d1 212 , PUNCT
213 d1 213 '' PUNCT
214 d1 214 , PUNCT
215 d1 215 '' PUNCT
216 d1 216 , PUNCT
217 d1 217 '' PUNCT
218 d1 218 , PUNCT
219 d1 219 '' PUNCT
220 d1 220 , PUNCT
221 d1 221 '' PUNCT
222 d1 222 , PUNCT
223 d1 223 '' PUNCT
224 d1 224 , PUNCT
225 d1 225 '' PUNCT
226 d1 226 , PUNCT
227 d1 227 '' PUNCT
228 d1 228 , PUNCT
229 d1 229 '' PUNCT
230 d1 230 , PUNCT
231 d1 231 '' PUNCT
232 d1 232 , PUNCT
233 d1 233 '' PUNCT
234 d1 234 , PUNCT
235 d1 235 '' PUNCT
236 d1 236 , PUNCT
237 d1 237 '' PUNCT
238 d1 238 , PUNCT
239 d1 239 '' PUNCT
240 d1 240 , PUNCT
241 d1 241 '' PUNCT
242 d1 242 , PUNCT
243 d1 243 '' PUNCT
244 d1 244 , PUNCT
245 d1 245 '' PUNCT
246 d1 246 , PUNCT
247 d1 247 '' PUNCT
248 d1 248 , PUNCT
249 d1 249 '' PUNCT
250 d1 250 , PUNCT
[ reached 'max' / getOption("max.print") -- omitted 5831 rows ]
The .csv data:
https://drive.google.com/file/d/10x_mhVQWSsFM6zkCV9acSm2YJc4jnMoD/view?usp=sharing
I hope you can help me fix my problem. Thank you in advance!
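One possible explanation, not confirmed in this thread: read.csv() returns a data.frame, whereas the working example above passes a character vector, and in the output every row carries doc_id d1, so the whole file appears to be treated as a single document. The sketch below assumes the tweet text sits in the first column (V1, the name read.csv() assigns when header = FALSE) and pulls that column out as a character vector before tagging. The demo file and its contents are made up for illustration, and the tagging step only runs if RDRPOSTagger can be loaded.

```r
# Hypothetical stand-in for dataset_twitter_stopword.csv: one text per row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("aku mau makan ku ingat kamu",
             "aku sedang bosan ku ingat kamu"), csv_path)

# read.csv() returns a data.frame, not a character vector.
df <- read.csv(csv_path, stringsAsFactors = FALSE, header = FALSE)

# Assumption: the text is in the first column, named V1 by read.csv().
texts <- df$V1

# Drop missing and whitespace-only entries before tagging.
texts <- texts[!is.na(texts) & nzchar(trimws(texts))]
str(texts)

# Tag only when the package (and its Java dependency) loads.
if (requireNamespace("RDRPOSTagger", quietly = TRUE)) {
  tagger <- RDRPOSTagger::rdr_model(language = "Indonesian",
                                    annotation = "UniversalPOS")
  print(RDRPOSTagger::rdr_pos(tagger, x = texts))
}
```

If the real file contains commas inside the tweets, read.csv() may split a tweet over several columns, in which case the column choice above would need adjusting.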
My installation failed. Note that the command 'library(rJava)' works fine, and the R package 'wordnet', which imports rJava, also works fine. So I don't understand what exactly the problem is, because Java doesn't seem to cause problems with other packages.
devtools::install_github("bnosac/RDRPOSTagger")
Downloading GitHub repo bnosac/RDRPOSTagger@master
√ checking for file 'C:\Users\Ludovic\AppData\Local\Temp\RtmpGqPYPs\remotes4de47ddc3d23\bnosac-RDRPOSTagger-af51e38/DESCRIPTION' ...
Installing package into ‘C:/Users/Ludovic/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
Thanks for writing this great package!
I am trying to parse tweets and came across an issue with the tagger when passing it quoted text. The tokens after and before the quotation marks are deleted in the tagging process.
Here is an example:
rdr_pos(rdr_model(language = "English", annotation = "UniversalPOS"), "Some guy asked -\"what is the issue\"")
The returned object is missing "what" and "issue".
For the time being, I am simply gsub'ing the \" characters away, but this would obviously be better addressed inside the function.
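The gsub workaround mentioned above could look roughly like this. It is a minimal sketch: strip_quotes is a hypothetical helper name, not part of RDRPOSTagger, and it only handles straight double quotes. Whether the tagger then keeps every token has not been verified here.

```r
# Hypothetical helper: replace straight double quotes with a space,
# collapse runs of whitespace, and trim the ends.
strip_quotes <- function(x) {
  x <- gsub('"', " ", x, fixed = TRUE)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

x <- "Some guy asked -\"what is the issue\""
cleaned <- strip_quotes(x)
print(cleaned)
```

The cleaned string can then be passed as x to rdr_pos() in place of the original text.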
Using the UniversalPOS annotation on the phrase "supervise correctional procedures", the word "supervise" was tagged as a noun.
Hi,
I am trying to install the RDRPOSTagger package. If I run the installation code you provide, I get the following result.
devtools::install_github("bnosac/RDRPOSTagger", build_vignettes = TRUE)
Downloading GitHub repo bnosac/RDRPOSTagger@master
from URL https://api.github.com/repos/bnosac/RDRPOSTagger/zipball/master
Installing RDRPOSTagger
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save \
--no-restore --quiet CMD build \
"C:\Users\X1\AppData\Local\Temp\RtmpMjuZpO\devtools2f7c59287379\bnosac-RDRPOSTagger-af51e38" \
--no-resave-data --no-manual
* checking for file 'C:\Users\X1\AppData\Local\Temp\RtmpMjuZpO\devtools2f7c59287379\bnosac-RDRPOSTagger-af51e38/DESCRIPTION' ... OK
* preparing 'RDRPOSTagger':
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'RDRPOSTagger_1.1.tar.gz'
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save \
--no-restore --quiet CMD INSTALL \
"C:/Users/X1/AppData/Local/Temp/RtmpMjuZpO/RDRPOSTagger_1.1.tar.gz" \
--library="C:/Users/X1/Documents/R/win-library/3.5" --install-tests
* installing *source* package 'RDRPOSTagger' ...
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
Warning: package 'rJava' was built under R version 3.5.2
** help
*** installing help indices
converting help for package 'RDRPOSTagger'
finding HTML links ... done
rdr_add_space_around_punctuations html
rdr_available_models html
rdr_model html
rdr_pos html
** building package indices
** installing vignettes
** testing if installed package can be loaded
*** arch - i386
Warning: package 'rJava' was built under R version 3.5.2
Error: package or namespace load failed for 'rJava':
.onLoad failed in loadNamespace() for 'rJava', details:
call: fun(libname, pkgname)
error: No CurrentVersion entry in Software/JavaSoft registry! Try re-installing Java and make sure R and Java have matching architectures.
Error : package 'rJava' could not be loaded
Error: loading failed
Execution halted
*** arch - x64
Warning: package 'rJava' was built under R version 3.5.2
ERROR: loading failed for 'i386'
* removing 'C:/Users/X1/Documents/R/win-library/3.5/RDRPOSTagger'
In R CMD INSTALL
Installation failed: Command failed (1)
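In the log above, the load test fails only for the i386 architecture, and the rJava message asks for R and Java with matching architectures. The base-R sketch below just reports which architecture the running R session uses and where JAVA_HOME points; it diagnoses and does not fix anything. The commented-out install call is an untested suggestion: it assumes that install_github() forwards INSTALL_opts to install.packages(), so that the real R CMD INSTALL flag --no-multiarch skips the i386 build.

```r
# Report the architecture of the running R session.
arch <- R.version$arch                  # e.g. "x86_64" or "i386"
bits <- 8L * .Machine$sizeof.pointer    # pointer size in bits
java_home <- Sys.getenv("JAVA_HOME", unset = NA)

cat("R architecture:", arch, "-", bits, "bit\n")
cat("JAVA_HOME:", if (is.na(java_home)) "<not set>" else java_home, "\n")

# Untested suggestion (assumes INSTALL_opts is passed through):
# remotes::install_github("bnosac/RDRPOSTagger",
#                         INSTALL_opts = "--no-multiarch")
```

If only a 64-bit Java is installed, a 32-bit R session would be expected to fail in the way the log shows.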
I came across an error when passing the tagger sentences that have a leading symbol like - or ?.
Here is an example:
rdr_pos(rdr_model(language = "English", annotation = "UniversalPOS"), "- what is wrong?")
Returns the following error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.StringIndexOutOfBoundsException: String index out of range: 0
It seems like this is an rJava error but I thought I'd post it here first.
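Until the cause is found, one stopgap is to strip leading symbols before tagging. The sketch below is a workaround under assumptions, not a fix: strip_leading_symbols is a hypothetical helper name, and whether it avoids the exception in every case has not been tested here.

```r
# Hypothetical helper: drop punctuation and whitespace at the start
# of each string, leaving the rest of the text untouched.
strip_leading_symbols <- function(x) {
  sub("^[[:punct:][:space:]]+", "", x)
}

x <- c("- what is wrong?", "? really", "no change here")
cleaned <- strip_leading_symbols(x)
print(cleaned)
```

Strings that consist only of symbols become empty strings, so it may be worth filtering those out as well before calling rdr_pos().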