Comments (3)
Ken's notes:
Proposed the design of a NEW corpus object
(1) The corpus object is an S3 class defined as a special class of list
(2) Corpus list elements:
a) data.frame of documents, called attribs (as now) consisting of:
i. texts
vector of the texts in the corpus, with an Encoding() flag set on
each element
ii. user-defined variables associated with each document
iii. row.names(attribs) will be a unique key of document names
b) data.frame of document-level metadata
automatically defined or defined by the user. row.names correspond to those
in the documents ("attribs") data.frame
- original file name
- source (disk, assignment, etc.)
- notes
- LANGUAGE
- optional info from the Dublin Core metadata scheme
c) list of corpus-level meta-data, including
- notes
- citation information
- creation details
d) user-supplied variables-level meta-data
- details on each user-defined "attribute"
e) collocations. List of word sequences that will be treated as single
types when extracting word-based features
f) dictionar(y/ies). named list of dictionaries associated with the corpus
g) stopwords. list of character elements giving the stopwords for the corpus.
h) stemming. TRUE or FALSE depending on whether to use stemming with this corpus.
i) cleaning rules? such as punctuation/number/case handling
(3) Index flag (TRUE or FALSE) - gets reset depending on the operation
(4) Note: all options can be overridden when using specific commands (dfm, kwic),
but the corpus settings determine the defaults. This aids replication and is
convenient when a user wants a "standard" set of option settings for a corpus.
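A minimal sketch of how the list layout in (2) could be assembled. The function name new_corpus(), its argument names, and the element names other than attribs are assumptions made for illustration; they are not taken from the package.

```r
# Hypothetical sketch only: new_corpus() and its argument names are
# illustrative assumptions, not the package's API.
new_corpus <- function(texts, docnames = paste0("text", seq_along(texts)),
                       docvars = NULL, notes = NULL, stemming = FALSE) {
  stopifnot(is.character(texts), length(texts) > 0,
            length(docnames) == length(texts), !anyDuplicated(docnames))

  # a) i. convert to UTF-8 so non-ASCII elements carry an Encoding() mark
  texts <- enc2utf8(texts)
  attribs <- data.frame(texts = texts, stringsAsFactors = FALSE)
  # a) ii. user-defined variables, one row per document
  if (!is.null(docvars)) attribs <- cbind(attribs, docvars)
  # a) iii. document names act as the unique key
  row.names(attribs) <- docnames

  # b) document-level meta-data, keyed by the same row names
  n <- length(texts)
  docmeta <- data.frame(source = rep(NA_character_, n),
                        language = rep(NA_character_, n),
                        notes = rep(NA_character_, n),
                        row.names = docnames, stringsAsFactors = FALSE)

  structure(
    list(attribs = attribs,
         docmeta = docmeta,
         # c) corpus-level meta-data
         metadata = list(notes = notes, citation = NULL, created = Sys.time()),
         varmeta = list(),        # d) details on each user-defined variable
         collocations = list(),   # e)
         dictionaries = list(),   # f)
         stopwords = list(),      # g)
         stemming = stemming,     # h)
         indexed = FALSE),        # (3) index flag, reset by operations
    class = c("corpus", "list"))
}

crp <- new_corpus(c("First text.", "Second text."),
                  docnames = c("doc1", "doc2"),
                  docvars = data.frame(year = c(2001, 2002)))
```

Keeping attribs and docmeta as separate data.frames with shared row names means either can be subset by document name without touching the other.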
Methods:
corpus(texts, ...) <- replaces corpusCreate. Similar to data.frame which
creates a data.frame. Would be nice to combine existing
functions into options (for reading from file= or
directory= etc options)
print.corpus(corpus) displays summary information on a corpus, esp. metadata
and citation information and current settings for things
like collocations, stemming, dictionaries, etc.
summary.corpus(corpus) details of the texts in a corpus
'+' corpus concatenate texts in two corpus objects
union of meta-data, first gets priority
index.corpus(corpus) recompiles the corpus index. Could include counts,
word syllable counts, document, paragraph, and sentence
locations. Or POS for each word.
subset.corpus() as it now exists
sample.corpus(corpus, level=c("sentence", "document", "word", "paragraph"), size, replace=TRUE, prob=NULL)
for producing a sample of texts and meta-data from a corpus where the resampling
of the texts is performed at the "level" option. Meta-data is matched to the
sampled document units.
sample.character(characterVector) core engine of sample.corpus
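A rough sketch of the document-level case of the sampling idea above, showing how meta-data can stay matched to the sampled units. The helper name sample_documents() is hypothetical, and it works on a plain attribs data.frame rather than a full corpus object. The other levels (sentence, word, paragraph) would first need a segmenter and are not covered here.

```r
# Hypothetical helper: resample whole documents, keeping each sampled
# text attached to its own row of document variables.
sample_documents <- function(attribs, size = nrow(attribs),
                             replace = TRUE, prob = NULL) {
  stopifnot(is.data.frame(attribs), "texts" %in% names(attribs))
  # draw row positions, then take texts and variables together
  idx <- sample.int(nrow(attribs), size = size, replace = replace, prob = prob)
  out <- attribs[idx, , drop = FALSE]
  # repeated draws would repeat the key, so make the row names unique again
  row.names(out) <- make.unique(row.names(attribs)[idx])
  out
}

attribs <- data.frame(texts = c("a b", "c d", "e f"), year = 1:3,
                      row.names = c("d1", "d2", "d3"),
                      stringsAsFactors = FALSE)
set.seed(42)
boot <- sample_documents(attribs, size = 5)
```

Sampling row positions once and indexing the whole data.frame is what keeps the variables aligned with the texts.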
Extractor/Assignment functions for corpus slots:
documents.corpus(corpus)
extracts or assigns the texts (same as current getTexts())
metadata.corpus(corpus, level=c("documents", "corpus"))
extracts or assigns document-level or corpus-level meta-data
stopwords.corpus(corpus) extracts or assigns stopwords associated with corpus
collocations.corpus(corpus)
extracts or assigns collocations to be treated as "features"
when extracting features from the corpus
stemming.corpus(corpus) TRUE or FALSE flag to be set with corpus
trim.corpus(corpus) min doc and min word trimming features
encoding(corpus) set or extract encodings of attribs$texts
dictionary(corpus, name="dictionaryname")
to extract or set the dictionaries associated with corpus
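The "extracts or assigns" pairs above map naturally onto R's replacement-function convention. A minimal sketch for the stopwords slot follows. The generics defined here are illustrative assumptions, and the slot is stored as a character vector for simplicity, which may differ from the list described in the design.

```r
# Extractor generic and its corpus method (names are assumptions)
stopwords <- function(x) UseMethod("stopwords")
stopwords.corpus <- function(x) x$stopwords

# Assignment generic: called as stopwords(x) <- value
`stopwords<-` <- function(x, value) UseMethod("stopwords<-")
`stopwords<-.corpus` <- function(x, value) {
  stopifnot(is.character(value))
  x$stopwords <- unique(value)   # drop repeated entries
  x                              # replacement functions must return x
}

# Toy object with only the slot needed for the demonstration
crp <- structure(list(stopwords = character(0)),
                 class = c("corpus", "list"))
stopwords(crp) <- c("the", "a", "the")
stopwords(crp)
```

The same pattern would apply to the collocations, stemming, and dictionary slots.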
Extractor only (no assignment):
sentences.corpus(corpus) extract sentence list from a corpus
words/vocabulary.corpus(corpus) extract list of word types from a corpus (given settings)
Analysis of corpus directly: (also defined for .character whenever applicable)
readability.corpus(corpus, [options])
kwic.corpus(corpus, [options])
collocations.corpus(corpus, [options])
Manipulation/conversion of corpus
export.corpus(to=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), from=c("quanteda"), [options])
import.corpus(from=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), to=c("quanteda"))
Constructor for corpus as outlined here: http://adv-r.had.co.nz/OO-essentials.html#s3
The constructor should be a generic function named "corpus". If no arguments are passed, getTextsGui can be run.
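A minimal sketch of what a generic constructor could look like, assuming dispatch on the class of the first argument. The methods shown and the reduced object layout are illustrative only. The call to getTextsGui is left as a placeholder error, since its interface is not described here.

```r
# Generic constructor: dispatches on the class of x
corpus <- function(x, ...) {
  if (missing(x)) {
    # placeholder: an interactive chooser such as getTextsGui could run here
    stop("no texts supplied")
  }
  UseMethod("corpus")
}

# Build from a character vector; names(x), if any, become document names
corpus.character <- function(x, ...) {
  docnames <- names(x)
  if (is.null(docnames)) docnames <- paste0("text", seq_along(x))
  attribs <- data.frame(texts = unname(x), row.names = docnames,
                        stringsAsFactors = FALSE)
  structure(list(attribs = attribs), class = c("corpus", "list"))
}

# An existing corpus is returned unchanged
corpus.corpus <- function(x, ...) x

# Anything else is rejected with an informative message
corpus.default <- function(x, ...) {
  stop("no corpus() method for objects of class ", class(x)[1])
}

crp <- corpus(c(doc1 = "First text.", doc2 = "Second text."))
```

New input types (a directory path object, a data.frame, a tm corpus) could then be supported by adding methods without changing the generic.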
Some issues resolved by last hackathon, others distributed into new issues.