pnulty commented on May 24, 2024

Ken's notes:

Proposed the design of a NEW corpus object

(1) The corpus object is an S3 class defined as a special class of list

(2) Corpus list elements:
a) data.frame of documents, called attribs (as now) consisting of:
i. texts
vector of the texts in the corpus, with an Encoding() flag set on
each element
ii. user-defined variables associated with each document
iii. row.names(attribs) will be a unique key of document names
b) data.frame of document-level metadata
automatically defined or defined by the user. row.names correspond to those
in the documents ("attribs") data.frame
- original file name
- source (disk, assignment, etc.)
- notes
- LANGUAGE
- optional info from the "Dublin Scheme"
c) list of corpus-level meta-data, including
- notes
- citation information
- creation details
d) user-supplied variables-level meta-data
- details on each user-defined "attribute"
e) collocations. List of word sequences that will be treated as single
types when extracting word-based features
f) dictionar(y/ies). named list of dictionaries associated with the corpus
g) stopwords. list of character elements associated with the stopwords.
h) stemming. TRUE or FALSE depending on whether to use stemming with this corpus.
i) clean rules? such as punctuation/number/case

(3) Index flag (TRUE or FALSE) - gets reset depending on the operation
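
The structure in (1) to (3) could be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: `new_corpus`, `docvars`, `docmeta`, `indexed` and the other element names (apart from `attribs`) are hypothetical.

```r
# Hypothetical sketch of the proposed corpus list; only "attribs" is a name
# taken from the notes, every other name here is an assumption.
new_corpus <- function(texts, docvars = NULL, notes = NULL) {
  stopifnot(is.character(texts))
  docnames <- names(texts)
  if (is.null(docnames)) docnames <- paste0("text", seq_along(texts))

  # (2a) documents data.frame, keyed by unique document names
  attribs <- data.frame(texts = unname(texts), row.names = docnames,
                        stringsAsFactors = FALSE)
  if (!is.null(docvars)) attribs <- cbind(attribs, docvars)

  # (2b) document-level metadata, with row names matching attribs
  docmeta <- data.frame(source = rep("assignment", length(texts)),
                        language = NA_character_,
                        row.names = docnames, stringsAsFactors = FALSE)

  structure(
    list(attribs = attribs,
         docmeta = docmeta,
         metadata = list(notes = notes, created = Sys.time()),  # (2c)
         collocations = list(),                                 # (2e)
         dictionaries = list(),                                 # (2f)
         stopwords = character(0),                              # (2g)
         stemming = FALSE,                                      # (2h)
         indexed = FALSE),                                      # (3)
    class = c("corpus", "list")
  )
}

crp <- new_corpus(c(doc1 = "First text.", doc2 = "Second text."),
                  docvars = data.frame(year = c(2013, 2014)))
```

Keeping the document names as row names of both data.frames means the two tables can always be aligned by name rather than by position.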

(4) Note: all options can be overridden when using specific commands (dfm, kwic)
but the settings will determine the defaults. This is for replication purposes
and convenience if a user determines that for a corpus, there should be a
"standard" set of option settings.

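One way to get "settings as defaults, overridable per command" in R is to let the argument defaults read from the corpus object. The sketch below assumes a hypothetical `settings` element and a hypothetical `word_counts` command; neither is part of the proposal's wording.

```r
# Hypothetical corpus carrying its own settings; names are illustrative only.
crp <- structure(
  list(texts = c(d1 = "The Cats sat", d2 = "A cat sits"),
       settings = list(lower = TRUE, stopwords = c("the", "a"))),
  class = "corpus")

# Hypothetical analysis command: each default is taken from the corpus
# settings, but the caller can override any of them for a single call.
word_counts <- function(x, lower = x$settings$lower,
                        stopwords = x$settings$stopwords) {
  toks <- unlist(strsplit(x$texts, "[[:space:]]+"))
  if (lower) toks <- tolower(toks)
  toks <- toks[!toks %in% stopwords]
  table(toks)
}

word_counts(crp)                             # uses the corpus defaults
word_counts(crp, stopwords = character(0))   # overrides one setting
```

Because R evaluates default arguments lazily inside the function, `x$settings$lower` is only looked up when the caller has not supplied `lower`.
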
Methods:

corpus(texts, ...) <- replaces corpusCreate. Similar to data.frame which
creates a data.frame. Would be nice to combine existing
functions into options (for reading from file= or
directory= etc options)

print.corpus(corpus) displays summary information on a corpus, esp. metadata
and citation information and current settings for things
like collocations, stemming, dictionaries, etc.

summary.corpus(corpus) details of the texts in a corpus
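
A rough sketch of how the two display methods might divide the work, using a simplified, hypothetical corpus layout (a named `texts` vector plus a few settings) rather than the full structure above:

```r
# Simplified, hypothetical corpus used only for this illustration.
crp <- structure(
  list(texts = c(d1 = "A short text.", d2 = "Another, longer text here."),
       metadata = list(notes = "toy example"),
       stemming = FALSE),
  class = "corpus")

# print: corpus-level information and current settings
print.corpus <- function(x, ...) {
  cat("Corpus of", length(x$texts), "documents\n")
  cat("Notes:", x$metadata$notes, "\n")
  cat("Stemming:", x$stemming, "\n")
  invisible(x)
}

# summary: per-document details of the texts
summary.corpus <- function(object, ...) {
  data.frame(document = names(object$texts),
             characters = unname(nchar(object$texts)),
             words = unname(lengths(strsplit(object$texts, "[[:space:]]+"))),
             stringsAsFactors = FALSE)
}

print(crp)
summary(crp)
```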

'+' corpus concatenate texts in two corpus objects
union of meta-data, first gets priority
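
A minimal sketch of the `+` method, assuming a hypothetical two-element corpus (`texts` and a corpus-level `metadata` list) built by an illustrative `make_corpus` helper:

```r
# Hypothetical two-element corpus: texts plus a corpus-level metadata list.
make_corpus <- function(texts, metadata = list()) {
  structure(list(texts = texts, metadata = metadata), class = "corpus")
}

# S3 method for the "+" operator (dispatched through the Ops group).
"+.corpus" <- function(e1, e2) {
  texts <- c(e1$texts, e2$texts)
  # Union of metadata fields; where both corpora define a field, e1 wins.
  extra <- e2$metadata[setdiff(names(e2$metadata), names(e1$metadata))]
  make_corpus(texts, c(e1$metadata, extra))
}

a <- make_corpus(c(d1 = "one"), list(notes = "first", source = "disk"))
b <- make_corpus(c(d2 = "two"), list(notes = "second", citation = "Smith"))
ab <- a + b
```

Here `ab` keeps the `notes` of `a` and gains the `citation` of `b`. A fuller version would also need to handle clashing document names, which this sketch does not check.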

index.corpus(corpus) recompiles the corpus index. Could include counts,
word syllable counts, document, paragraph, and sentence
locations. Or POS for each word.

subset.corpus() as it now exists

sample.corpus(corpus, level=c("sentence", "document", "word", "paragraph"), size, replace=TRUE, prob=NULL)
for producing a sample of texts and meta-data from a corpus where the resampling
of the texts is performed at the "level" option. Meta-data is matched to the
sampled document units.
sample.character(characterVector) core engine of sample.corpus
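
The character-level engine could look something like the sketch below. It is an illustration only: the name `sample_character` (with an underscore, since base R's `sample` is not an S3 generic), the reduced set of levels, and the crude splitting patterns are all assumptions, and `size` and `prob` are left out.

```r
# Hypothetical resampler for character vectors. At the "document" level the
# elements themselves are resampled; at lower levels each element is split
# into units, the units are resampled, and the result is pasted back together.
sample_character <- function(x, level = c("document", "sentence", "word"),
                             replace = TRUE) {
  level <- match.arg(level)
  if (level == "document") {
    return(x[sample.int(length(x), replace = replace)])
  }
  # Crude splitting rules, for illustration only.
  pattern <- if (level == "sentence") "(?<=[.!?])\\s+" else "\\s+"
  vapply(x, function(txt) {
    units <- strsplit(txt, pattern, perl = TRUE)[[1]]
    paste(units[sample.int(length(units), replace = replace)], collapse = " ")
  }, character(1))
}

x <- c(d1 = "A b. C d! E f?", d2 = "One two three.")
set.seed(42)
sample_character(x, level = "sentence")
```

Because resampling below the document level happens within each element, document names (and therefore any document-level metadata keyed on them) stay attached to the resampled texts.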

Extractor/Assignment functions for corpus slots:


documents.corpus(corpus)
extracts or assigns the texts (same as current getTexts())

metadata.corpus(corpus, level=c("documents", "corpus"))
extracts or assigns corpus metadata

stopwords.corpus(corpus) extracts or assigns stopwords associated with corpus

collocations.corpus(corpus)
extracts or assigns collocations to be treated as "features"
when extracting features from the corpus

stemming.corpus(corpus) TRUE or FALSE flag to be set with corpus

trim.corpus(corpus) min doc and min word trimming features

encoding(corpus) set or extract encodings of attribs$texts

dictionary(corpus, name="dictionaryname")
to extract or set the dictionaries associated with corpus
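
In R, an "extracts or assigns" pair is usually written as an extractor plus a replacement function named `name<-`. A minimal sketch for the stopwords element, assuming a simplified, hypothetical corpus layout:

```r
# Simplified, hypothetical corpus used only for this illustration.
crp <- structure(list(texts = c(d1 = "a text"), stopwords = character(0)),
                 class = "corpus")

# Extractor generic and its corpus method.
stopwords <- function(x, ...) UseMethod("stopwords")
stopwords.corpus <- function(x, ...) x$stopwords

# Assignment ("replacement") generic and method. A replacement function
# must return the modified object.
"stopwords<-" <- function(x, value) UseMethod("stopwords<-")
"stopwords<-.corpus" <- function(x, value) {
  stopifnot(is.character(value))
  x$stopwords <- value
  x
}

stopwords(crp) <- c("the", "a")
stopwords(crp)
```

The same pattern would apply to the other extract-or-assign functions listed above.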

Extractor only (no assignment):


sentences.corpus(corpus) extract sentence list from a corpus
words/vocabulary.corpus(corpus) extract list of word types from a corpus (given settings)

Analysis of corpus directly: (also defined for .character whenever applicable)


readability.corpus(corpus, [options])
kwic.corpus(corpus, [options])
collocations.corpus(corpus, [options])

Manipulation/conversion of corpus


export.corpus(to=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), from=c("quanteda"), [options])
import.corpus(from=c("text", "alceste", "tm", "qdaminerXML", "maxqda"), to=c("quanteda"))

from quanteda.

pnulty commented on May 24, 2024

Constructor for corpus as outlined here: http://adv-r.had.co.nz/OO-essentials.html#s3

The constructor should be a generic function named "corpus". If no arguments are passed, getTextsGui can be run.
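
A generic constructor along these lines might be sketched as follows. The method bodies and the one-element corpus layout are placeholders, and the no-argument branch simply stops where the real package would launch getTextsGui.

```r
# Generic constructor; methods are chosen by the class of the first argument.
corpus <- function(x, ...) {
  if (missing(x)) {
    # The thread suggests running getTextsGui here; this sketch just stops.
    stop("no texts supplied")
  }
  UseMethod("corpus")
}

# Method for a character vector of texts (placeholder corpus layout).
corpus.character <- function(x, ...) {
  structure(list(texts = x), class = "corpus")
}

# Fallback for unsupported inputs.
corpus.default <- function(x, ...) {
  stop("no corpus method for objects of class ", class(x)[1])
}

crp <- corpus(c(d1 = "First text.", d2 = "Second text."))
```

Further input types (a directory, a file, an object from another package) could then be supported by adding methods, without changing the generic.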


kbenoit commented on May 24, 2024

Some issues resolved by last hackathon, others distributed into new issues.
