corpora

What is a corpus? A corpus is a large collection of texts. It is a body of written or spoken material, such as articles, books, conversations, etc. The plural form of corpus is corpora.

Why would you need a corpus? Instead of teaching a computer how to spell, or the rules of grammar for a particular language, you can test against a large collection of texts (a corpus) to determine whether a sample of text is properly spelled or grammatically valid, to interpolate missing words, and so on.

By testing against existing texts, a computer can determine how a word SHOULD be spelled as well as where and how a word SHOULD be used.

How you would actually teach a computer to process text is the subject of Natural Language Processing/Understanding and is not covered here. This is a simple utility to filter and clean up large collections of texts for later processing.

Samples

You can find a few sample texts in the samples folder; they were taken from textfiles.com.

original

APPLE II HISTORY
===== == =======

Compiled and written by Steven Weyhrich
(C) Copyright 1991, Zonker Software

 (PART 6 -- THE APPLE II PLUS)
[v1.1 :: 12 Dec 91]

processed

APPLE II HISTORY.

Compiled and written by Steven Weyhrich
Copyright 1991, Zonker Software.

PART 6 - THE APPLE II PLUS
v1.1 12 Dec 91.

original

This should be
a single sentence!

processed (Sentence Stitching)

This should be a single sentence!
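
The stitching shown above can be expressed as a single pattern/replacement pair. The following is only a sketch: stitchStep and stitch are hypothetical names, and the pattern is an assumption for illustration, not the step actually defined in default_steps.js. It joins a line to the next one when the line does not end in sentence punctuation, and leaves blank-line paragraph breaks alone.

```javascript
// Hypothetical step (not taken from default_steps.js):
// a line that does not end in . ! ? or : is joined to the following line.
var stitchStep = [
	/([^\s.!?:])[ \t]*\n[ \t]*(?=\S)/g,
	'$1 '
]

// Apply the step the same way any pattern/replacement pair is applied.
function stitch(text) {
	return text.replace(stitchStep[0], stitchStep[1])
}

console.log(stitch('This should be\na single sentence!'))
// prints: This should be a single sentence!
```

A real implementation would likely need more care, for example with headings and lists that have no closing punctuation.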

Usage

The only required argument is the input filename or directory. If a folder is specified, it is searched recursively for files to process.

The output argument is the filename to which the resulting corpus is written after all input files have been processed. If no output is specified, results are sent to the console (stdout).

var Corpora = require('./lib/corpora.js')
var corpora = new Corpora()

// Input file or directory (required)
corpora.input = process.argv[2]
// Output file (optional; results go to stdout when omitted)
corpora.output = process.argv[3]
corpora.process(function(){
	console.log('FINITO')
})
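
To illustrate what "searched recursively" means for a folder input, here is a minimal sketch of a recursive file listing using Node's fs and path modules. collectFiles is a hypothetical helper written for this example; it is not part of corpora, and the library's own traversal may differ (in ordering, filtering, or handling of symbolic links).

```javascript
var fs = require('fs')
var path = require('path')

// Hypothetical helper: return every file found under `input`.
// A plain file yields a one-element list; a folder is descended into.
function collectFiles(input) {
	if (!fs.statSync(input).isDirectory()) {
		return [input]
	}
	var files = []
	fs.readdirSync(input).forEach(function (name) {
		files = files.concat(collectFiles(path.join(input, name)))
	})
	return files
}
```

Calling collectFiles('samples') would return the paths of all files beneath that folder, which could then be processed one by one.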

Custom Processing

By default, a series of processing steps is applied. They are defined in default_steps.js. You can specify your own steps:

NOTE: Each step is passed directly to String.prototype.replace as its arguments.

var corpora = new Corpora()

// Step - uppercase
corpora.steps.push(
	[
		/[a-z]+/gi,
		function UPPERCASE(string)
		{
			return string.toUpperCase()
		}
	]
)

// Replace special characters with the NULL character
corpora.steps.push(
	[
		/[@#$%]+/g,
		'\x00' // a string; the number 0x00 would be inserted as the text "0"
	]
)

// Replace a specific string
corpora.steps.push(
	[
		'John Smith',
		'Adam Smith'
	]
)
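
Because each step is a [pattern, replacement] pair handed to String.prototype.replace, running a list of steps amounts to chaining replace calls in order. The sketch below shows that idea; applySteps is a hypothetical helper written for illustration and is not the library's actual implementation.

```javascript
// Hypothetical helper, not part of the corpora API:
// run every [pattern, replacement] step over the text, in order.
function applySteps(text, steps) {
	return steps.reduce(function (result, step) {
		// step[0] is a string or RegExp, step[1] a string or function
		return result.replace(step[0], step[1])
	}, text)
}

var steps = [
	[/[a-z]+/gi, function (match) { return match.toUpperCase() }],
	['JOHN SMITH', 'ADAM SMITH']
]

console.log(applySteps('Hello, John Smith', steps))
// prints: HELLO, ADAM SMITH
```

Two consequences follow from how replace works: the order of steps matters (the second step above only matches because the first has already uppercased the text), and a plain string pattern replaces only its first occurrence, so use a regular expression with the g flag to replace every match.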
