Comments (1)
The sentence and word break rules can be tweaked (see https://github.com/quanteda/quanteda/tree/master/inst/breakrules), but they largely follow the current ICU defaults, and those clearly do not handle abbreviations such as "Dr. Someone". A better solution would be to use spacyr:
library("spacyr")
library("quanteda")
#> Package version: 4.0.2
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
txt <- "Dr. Einstein wrote about general relativity. Prof. Bohr about quantum physics."
spacy_tokenize(txt, what = "sentence") |>
  unlist() |>
  corpus()
#> successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)
#> Corpus consisting of 2 documents.
#> text11 :
#> "Dr. Einstein wrote about general relativity."
#>
#> text12 :
#> "Prof. Bohr about quantum physics."
Created on 2024-04-30 with reprex v2.1.0
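One side effect of `unlist()` above is that the sentence names run together ("text11", "text12"), which becomes ambiguous once there are ten or more documents. A minimal sketch of one way around that, assuming `spacy_tokenize()` returns a named list with one character vector per input text (its default list output). The helpers `sentence_ids()` and `sentence_corpus()` and the `source` docvar are hypothetical names for illustration, not part of spacyr or quanteda:

```r
# Hypothetical helper: build "<docname>.<sentence number>" ids from a named
# list holding one character vector of sentences per document.
sentence_ids <- function(sents) {
  n <- lengths(sents)
  paste0(rep(names(sents), n), ".", sequence(n))
}

# Hypothetical wrapper: segment with spaCy, then build a corpus whose docnames
# are unambiguous and whose docvars record the originating document.
# Assumes spacyr, quanteda, and a spaCy language model are installed.
sentence_corpus <- function(txt) {
  sents <- spacyr::spacy_tokenize(txt, what = "sentence")
  quanteda::corpus(
    unlist(sents, use.names = FALSE),
    docnames = sentence_ids(sents),
    docvars = data.frame(source = rep(names(sents), lengths(sents)))
  )
}
```

With the `txt` from the example above, `sentence_corpus(txt)` should name the two sentences `text1.1` and `text1.2` instead of `text11` and `text12`, and `docvars()` on the result keeps track of which input each sentence came from.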