Targets for Refactoring (quanteda, closed)

quanteda commented on May 24, 2024

Targets for Refactoring

from quanteda.

Comments (3)

pnulty commented on May 24, 2024

Ken's notes:

Proposed design of a NEW corpus object

(1) The corpus object is an S3 class defined as a special class of list

(2) Corpus list elements:
    a) data.frame of documents, called attribs (as now), consisting of:
       i.   texts: vector of the texts in the corpus, with an Encoding() flag
            set on each element
       ii.  user-defined variables associated with each document
       iii. row.names(attribs) will be a unique key of document names
    b) data.frame of document-level metadata, automatically defined or defined
       by the user. row.names correspond to those in the documents ("attribs")
       data.frame
       - original file name
       - source (disk, assignment, etc.)
       - notes
       - LANGUAGE
       - optional info from the "Dublin Scheme"
    c) list of corpus-level metadata, including
       - notes
       - citation information
       - creation details
    d) user-supplied variable-level metadata
       - details on each user-defined "attribute"
    e) collocations. List of word sequences that will be treated as single
       types when extracting word-based features
    f) dictionar(y/ies). Named list of dictionaries associated with the corpus
    g) stopwords. List of character elements associated with the stopwords
    h) stemming. TRUE or FALSE depending on whether to use stemming with this
       corpus
    i) clean rules? Such as punctuation/number/case

(3) Index flag (TRUE or FALSE) - gets reset depending on the operation

(4) Note: all options can be overridden when using specific commands (dfm, kwic),
but the corpus settings will determine the defaults. This is for replication
purposes, and for convenience if a user determines that a corpus should have a
"standard" set of option settings.
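To make the proposed layout concrete, here is a minimal sketch of a list-based S3 object with the elements described above. It is only an illustration under stated assumptions: new_corpus_sketch, docvars, docmeta, meta, settings and indexed are hypothetical names chosen for this example, not quanteda's actual code or API, and only a few of the listed fields are filled in.

```r
# Hypothetical sketch of the proposed corpus layout; every name below
# (new_corpus_sketch, docvars, docmeta, settings, ...) is an assumption.
new_corpus_sketch <- function(texts, docvars = NULL, notes = NULL) {
  stopifnot(is.character(texts))
  # Document names become the unique row.names key of both data.frames.
  docnames <- names(texts)
  if (is.null(docnames)) docnames <- sprintf("text%d", seq_along(texts))
  stopifnot(anyDuplicated(docnames) == 0)

  # (2a) attribs: the texts (converted to UTF-8) plus user-defined variables.
  attribs <- data.frame(texts = enc2utf8(unname(texts)),
                        row.names = docnames, stringsAsFactors = FALSE)
  if (!is.null(docvars)) attribs <- cbind(attribs, docvars)

  # (2b) document-level metadata, keyed by the same row.names.
  docmeta <- data.frame(source = rep(NA_character_, length(texts)),
                        language = rep(NA_character_, length(texts)),
                        row.names = docnames, stringsAsFactors = FALSE)

  structure(
    list(attribs = attribs,
         docmeta = docmeta,
         meta = list(notes = notes, created = Sys.time()),  # (2c)
         settings = list(collocations = list(), dictionaries = list(),
                         stopwords = character(0), stemming = FALSE),
         indexed = FALSE),                                  # (3) index flag
    class = "corpus_sketch")
}

x <- new_corpus_sketch(c(d1 = "Hello world.", d2 = "A second text."),
                       docvars = data.frame(party = c("X", "Y")))
row.names(x$attribs)
```

Keeping the settings in one sub-list would let commands such as dfm or kwic read their defaults from a single place while still accepting overrides.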

Methods:

corpus(texts, ...)
    Replaces corpusCreate. Analogous to data.frame(), which creates a
    data.frame. It would be nice to fold the existing reading functions into
    options (file=, directory=, etc.).

print.corpus(corpus)
    Displays summary information on a corpus, esp. metadata and citation
    information, and current settings for things like collocations, stemming,
    dictionaries, etc.

summary.corpus(corpus)
    Details of the texts in a corpus.

'+' corpus
    Concatenates the texts in two corpus objects. Takes the union of the
    metadata; the first corpus gets priority.
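As an illustration of the "union of metadata, first gets priority" rule, the following sketch defines a '+' method for a deliberately tiny stand-in class. mini_corpus is a hypothetical helper for this example only (a named character vector plus a metadata list), not the corpus object proposed above, and the priority rule is read here as "the first corpus wins on fields that both define".

```r
# Hypothetical minimal corpus: a named character vector of texts plus a
# list of corpus-level metadata. Not quanteda's actual representation.
mini_corpus <- function(texts, meta = list()) {
  structure(list(texts = texts, meta = meta), class = "mini_corpus")
}

"+.mini_corpus" <- function(e1, e2) {
  if (anyDuplicated(c(names(e1$texts), names(e2$texts))) > 0)
    stop("document names must be unique across both corpora")
  # modifyList() overwrites entries of its first argument with those of its
  # second, so passing e1$meta last gives the first corpus priority.
  mini_corpus(c(e1$texts, e2$texts), modifyList(e2$meta, e1$meta))
}

a <- mini_corpus(c(d1 = "one"), list(notes = "first", citation = "A"))
b <- mini_corpus(c(d2 = "two"), list(notes = "second", created = "2014"))
ab <- a + b
names(ab$texts)
ab$meta$notes
```

One caveat of this approach: modifyList() drops a field when the overriding list holds NULL for it, so a real implementation would need to decide how empty metadata fields are treated.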

index.corpus(corpus)
    Recompiles the corpus index. Could include counts, word syllable counts,
    and document, paragraph, and sentence locations. Or POS for each word.

subset.corpus()
    As it now exists.

sample.corpus(corpus, level=c("sentence", "document", "word", "paragraph"), size, replace=TRUE, prob=NULL)
    Produces a sample of texts and metadata from a corpus, where the
    resampling of the texts is performed at the unit given by the "level"
    option. Metadata is matched to the sampled document units.

sample.character(characterVector)
    Core engine of sample.corpus.
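A rough sketch of the document-level case may help show how metadata stays matched to the sampled units. It assumes the documents live in a data.frame like the proposed attribs; sample_documents_sketch is a hypothetical name, and the sentence, word and paragraph levels are not covered because they would first need a segmentation step.

```r
# Hypothetical document-level resampling over an attribs-style data.frame.
# Rows are sampled by index, so texts and document variables stay aligned.
sample_documents_sketch <- function(attribs, size = nrow(attribs),
                                    replace = TRUE, prob = NULL) {
  idx <- sample.int(nrow(attribs), size = size, replace = replace, prob = prob)
  # Row subsetting keeps every column; repeated row names are made unique.
  attribs[idx, , drop = FALSE]
}

attribs <- data.frame(texts = c("first text", "second text", "third text"),
                      party = c("X", "Y", "Z"),
                      row.names = c("d1", "d2", "d3"),
                      stringsAsFactors = FALSE)
set.seed(1)
resampled <- sample_documents_sketch(attribs, size = 5)
nrow(resampled)
```

Sampling row indices rather than the texts themselves is what keeps the matching simple: the same index vector selects the text and its variables.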

Extractor/Assignment functions for corpus slots:

documents.corpus(corpus)
    Extracts or assigns the texts (same as current getTexts()).

metadata.corpus(corpus, level=c("documents", "corpus"))
    Extracts or assigns corpus metadata.

stopwords.corpus(corpus)
    Extracts or assigns stopwords associated with the corpus.

collocations.corpus(corpus)
    Extracts or assigns collocations to be treated as "features" when
    extracting features from the corpus.

stemming.corpus(corpus)
    TRUE or FALSE flag to be set with the corpus.

trim.corpus(corpus)
    Min doc and min word trimming features.

encoding(corpus)
    Sets or extracts encodings of attribs$texts.

dictionary(corpus, name="dictionaryname")
    Extracts or sets the dictionaries associated with the corpus.
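In R, an "extracts or assigns" pair is usually written as an accessor plus a replacement function. The sketch below shows that pattern for stopwords only. The names stopwords_sketch and corpus_sketch, and the settings sub-list, are assumptions made for this example and are not taken from quanteda.

```r
# Hypothetical accessor / replacement-function pair for one corpus setting.
stopwords_sketch <- function(x) UseMethod("stopwords_sketch")
stopwords_sketch.corpus_sketch <- function(x) x$settings$stopwords

# The replacement form is what makes `stopwords_sketch(x) <- value` work.
`stopwords_sketch<-` <- function(x, value) UseMethod("stopwords_sketch<-")
`stopwords_sketch<-.corpus_sketch` <- function(x, value) {
  stopifnot(is.character(value))
  x$settings$stopwords <- value
  x  # a replacement function must return the modified object
}

# A bare-bones object standing in for the proposed corpus.
x <- structure(list(settings = list(stopwords = character(0),
                                    stemming = FALSE)),
               class = "corpus_sketch")
stopwords_sketch(x) <- c("the", "a", "of")
stopwords_sketch(x)
```

The same two-function shape would apply to the other settings listed above (collocations, stemming, dictionaries).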

Extractor only (no assignment):

sentences.corpus(corpus)
    Extracts the sentence list from a corpus.

words/vocabulary.corpus(corpus)
    Extracts the list of word types from a corpus (given settings).

Analysis of corpus directly: (also defined for .character whenever applicable)

readability.corpus(corpus, [options])
kwic.corpus(corpus, [options])
collocations.corpus(corpus, [options])

Manipulation/conversion of corpus

export.corpus(to=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), from=c("quanteda"), [options])
import.corpus(from=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), to=c("quanteda"))

pnulty commented on May 24, 2024

Constructor for corpus as outlined here: http://adv-r.had.co.nz/OO-essentials.html#s3

The constructor should be a generic function named "corpus". If no arguments are passed, getTextsGui can be run.
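A generic constructor along these lines could be sketched as follows. This is only an assumption about how it might look: make_corpus, pick_texts_stub and the corpus_sketch class are hypothetical names, and pick_texts_stub merely stands in for an interactive chooser such as getTextsGui, whose real signature is not given in this thread.

```r
# Hypothetical stand-in for an interactive text chooser; returns no texts.
pick_texts_stub <- function() character(0)

# Generic constructor: dispatches on the class of its first argument, and
# falls back to the chooser when called with no arguments at all.
make_corpus <- function(x, ...) {
  if (missing(x)) return(make_corpus(pick_texts_stub(), ...))
  UseMethod("make_corpus")
}

make_corpus.character <- function(x, ...) {
  if (is.null(names(x))) names(x) <- sprintf("text%d", seq_along(x))
  structure(list(texts = x), class = "corpus_sketch")
}

make_corpus.default <- function(x, ...) {
  stop("no make_corpus method for objects of class ", class(x)[1])
}

length(make_corpus(c("one text", "another text"))$texts)
length(make_corpus()$texts)
```

The missing-argument case is handled by calling the generic again rather than by reassigning x, because UseMethod() dispatches on the argument as originally supplied and discards local changes made before it.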

kbenoit commented on May 24, 2024

Some issues resolved by last hackathon, others distributed into new issues.
