Word Processor

A Python word processor for indexing and searching a collection of documents.

DOCSTRING

Fill in the docstring at the beginning of the Python module and for each function, describing the input, the program functionality, and the output.

USER INPUT

A collection of text documents with the .txt extension. This collection can be found in the folder dataset on Blackboard.

INSTRUCTIONS

1. THE INDEXING MODULE:

The indexing module generates an index for a collection of documents. It starts by reading a collection of text documents, then processes each document's content to extract its terms and their frequencies. After all documents are processed, it creates a dictionary that serves as the index, mapping each term to the list of documents that contain it together with the term's frequency in each of those documents. See the following page for a flowchart.
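To make the shape of the index concrete, here is a minimal in-memory sketch. It is not the assignment's required implementation: the name build_index is hypothetical, and numbering documents from 1, lowercasing, and splitting on whitespace are all assumptions.

```python
from collections import Counter


def build_index(documents):
    """Map each term to {doc_id: frequency} for a list of document strings.

    Hypothetical helper: document ids are assumed to start at 1.
    """
    index = {}
    for doc_id, text in enumerate(documents, start=1):
        # Count how often each whitespace-separated term occurs in this document.
        for term, freq in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = freq
    return index


docs = ["sweet sugar recipe sugar", "recipe for bread"]
index = build_index(docs)
print(index["recipe"])  # {1: 1, 2: 1}
print(index["sugar"])   # {1: 2}
```

Each term maps to a dictionary rather than a list of pairs, which keeps the lookup of one term's frequency in one document a plain dictionary access.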

The following functions should be implemented:

def printMenu(): This function displays the following menu to the user. The user must select 1 for indexing or 3 to exit. The function validates the input: if it is invalid, the function prints an informative message and prompts the user again; if it is valid, the function returns it. Note: you will upgrade this function in milestone 2. Menu:

Please enter 1 for indexing and 3 to exit

1.Indexing

3.Exit
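One possible sketch of the menu loop is shown here. The read and write parameters are hypothetical additions (not part of the specified signature) that default to input and print so the loop can be exercised without a keyboard; returning the choice as a string and the wording of the error message are also assumptions.

```python
def printMenu(read=input, write=print):
    """Show the menu until the user enters a valid option, then return it.

    `read` and `write` are hypothetical hooks for testing; by default the
    function behaves like a plain printMenu() using input() and print().
    """
    valid = {"1", "3"}
    while True:
        write("Please enter 1 for indexing and 3 to exit")
        write("1.Indexing")
        write("3.Exit")
        choice = read().strip()
        if choice in valid:
            return choice  # returned as a string (assumption)
        # Informative message, then the loop prompts again.
        write(f"Invalid input {choice!r}; please enter 1 or 3.")
```

Calling printMenu() with no arguments reads from the keyboard; passing a custom read function feeds it scripted answers instead.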

def readFolderContent(): The code for this function is given. This function reads all the text files in the folder dataset and appends them to a list. Each item in the list is the content of one text file in the dataset.

def indexing(): This function generates an index for a collection of documents. It starts by reading a collection of text documents, then processes each document's content to extract its terms and their frequencies. After all documents are processed, it creates a dictionary that serves as the index, mapping each term to the list of documents that contain it together with the term's frequency in each of those documents. See the flowchart above for more information.
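The provided code for readFolderContent is not reproduced in this document. As a rough idea of what such a function might look like, here is a sketch under stated assumptions: the folder parameter (defaulting to "dataset"), UTF-8 encoding, and sorted file order are all hypothetical choices, not the course's actual code.

```python
import glob
import os


def readFolderContent(folder="dataset"):
    """Return a list with the content of every .txt file in `folder`.

    The `folder` parameter is a hypothetical addition; the described
    function takes no arguments and reads the folder named dataset.
    """
    contents = []
    # sorted() keeps the document order stable across runs.
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, encoding="utf-8") as handle:
            contents.append(handle.read())
    return contents
```

A stable file order matters if document numbers are derived from list positions, since the same file should get the same number on every run.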

def stopWordRemoval(text): This function takes a text as an argument, removes all the stop words from the text, and returns the text. A list of stop words is provided for you on Blackboard. For example, the words “the” and “on” are considered stop words, so they are removed from the following input. Input: “The monkeys jump on the bed.” Output: “monkeys jump bed.”
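The example above implies that matching ignores case ("The" is removed) and that punctuation attached to a kept word survives ("bed." keeps its period). A minimal sketch consistent with that example follows. STOP_WORDS is a small hypothetical stand-in for the provided list, and the stop_words parameter is an illustrative addition.

```python
import string

# Hypothetical subset; the real list is the one provided on Blackboard.
STOP_WORDS = {"the", "on", "a", "an", "and", "of", "in"}


def stopWordRemoval(text, stop_words=STOP_WORDS):
    """Return `text` without stop words, compared case-insensitively."""
    kept = []
    for word in text.split():
        # Ignore surrounding punctuation and case when checking the word,
        # but keep the original spelling of the words that survive.
        bare = word.strip(string.punctuation).lower()
        if bare not in stop_words:
            kept.append(word)
    return " ".join(kept)


print(stopWordRemoval("The monkeys jump on the bed."))  # monkeys jump bed.
```

Note that joining with a single space normalizes any repeated whitespace in the input, which is usually harmless before indexing.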

def punctuationRemoval(text): This function takes a text as an argument, removes all punctuation from the text, and returns the text. Here is a list of punctuation characters you may use in your code: punctuations = '''!()-[]{};:'",<>./?@#$%^&*_~'''
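One way to implement this, sketched here with the punctuation list given above, is to delete the listed characters with str.translate. Using a translation table instead of a character-by-character loop is a design choice, not a requirement of the assignment.

```python
punctuations = '''!()-[]{};:'",<>./?@#$%^&*_~'''


def punctuationRemoval(text):
    """Return `text` with every character listed in `punctuations` deleted."""
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", punctuations))


print(punctuationRemoval("monkeys jump bed."))  # monkeys jump bed
```

Because the hyphen is in the list, a hyphenated word such as "well-known" becomes "wellknown"; replacing hyphens with spaces first would be an alternative if that matters for the index.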

def appendTermDocFreq(cleanText, termDocFreqFile): This function takes a clean text as an argument and appends to the file termDocFreqFile the list of terms, the document in which they appear, and their frequencies. termDocFreqFile format: 3 columns, values separated by a space:

Term doc# freq

recipe 1 5

sweet 1 1

sugar 1 2
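The specified signature does not say where the document number comes from, so the sketch below adds a hypothetical docId parameter for illustration. It also assumes that termDocFreqFile is a file path, that terms are lowercased, and that no header line is written.

```python
from collections import Counter


def appendTermDocFreq(cleanText, termDocFreqFile, docId=1):
    """Append one 'term doc# freq' line per distinct term in `cleanText`.

    `docId` is a hypothetical parameter; `termDocFreqFile` is assumed
    to be a path to a text file.
    """
    counts = Counter(cleanText.lower().split())
    # Mode "a" appends, so lines from earlier documents are preserved.
    with open(termDocFreqFile, "a", encoding="utf-8") as handle:
        for term, freq in counts.items():
            handle.write(f"{term} {docId} {freq}\n")
```

Calling it once per document, for example appendTermDocFreq(text, "termDocFreq.txt", docId=2), accumulates all documents in a single file.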

def genIndex(termDocFreqFile): This function reads termDocFreqFile line by line and updates the global index, which maps each term (key) to the documents that contain it and the term's frequency in each. The index is updated as follows:

For each line in termDocFreqFile:

If the term does not exist in the index, add the term as a key; its value is a dictionary containing docid:freq as a key:value pair.

If the term already exists in the index, add docid:freq to its value (which is a dictionary).
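The update rule above can be sketched as follows. This assumes termDocFreqFile is a file path, that document numbers and frequencies are stored as integers, and that any line whose second and third columns are not numeric (such as a "Term doc# freq" heading, if one is present) is skipped.

```python
index = {}  # global index: term -> {docid: freq}


def genIndex(termDocFreqFile):
    """Read 'term doc# freq' lines and merge them into the global `index`."""
    with open(termDocFreqFile, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            # Skip blank, malformed, or heading lines (assumption).
            if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
                continue
            term, doc_id, freq = parts[0], int(parts[1]), int(parts[2])
            if term not in index:
                index[term] = {}          # new term: start its dictionary
            index[term][doc_id] = freq    # add docid:freq for this term
    return index
```

After the file has been read, index["recipe"] would hold a dictionary such as {1: 5} for the sample rows shown earlier.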

CONTRIBUTORS

soorajsoman
