Git Product home page Git Product logo

csc212spellchecking's Introduction

CSC212: Spell Checking

In this assignment, we will compare and contrast different data structures for creating a SpellChecker.

Rubric (100)

This assignment is out of 100 points. There are three pure coding sections, adding up to (20) points. The rest involves a little more data analysis. (You will have to code, though). All data analysis can be done in Google Sheets or Excel or R (etc.) -- you need not make graphs with Java.

(20) Report & Reflection.

This assignment is more like a scientific investigation. Prepare a report with answers to these questions, and submit it as a PDF.

(10) Conclusion: Which Data Structure do you recommend?

In your report, have a conclusion section where you recommend one of these data structures for "Spell-Checking". Justify your selection.

(15) Measure Creation (Insertion) Speed.

Consider the code in CheckSpelling first.

We create a number of data structures: HashSet, TreeSet, SortedStringListSet, CharTrie, and LLHash. Figure out how many nanoseconds per insert are required. This will involve studying my System.nanoTime() timing code. Since nanoseconds are metric, they are 10^-9 seconds, or 1e-9 in Java notation (hence my division by 1e9 to convert to seconds).

  • How long does it take to fill each data structure?
  • Plot insertion time per element for each of these data structures.
  • Is there a timing difference between constructing HashSet and TreeSet with their input data or calling add in a for loop? If yes, use the words "balancing" and "resizing" to explain what's going on.

(20) Plot Query Speed

Now consider the code in FakeDatasetExperiment.

I have devised a method timeLookup that calculates per-item query time for all the words in a structure. It also prints out the "fraction found" of the dataset.

  • Construct a dataset that has Strings that are both in and not in the dictionary.
  • For full credit, devise a method to inject some percentage of hits and misses. Create a line plot as the percentage of hits goes from (0.0 to 1.0) in steps of 0.1, where each line is a different data structure.

(10) Spell-check a Project Gutenberg book

Now consider the code in BookExperiment.

  • What is the ratio that's "mis-spelled"?
  • Are the query speeds the same over real-world data?
  • What are some of the words that are "mis-spelled"?
  • I gave you WordSplitter again.

Now implement some code!

This is the more traditional "data-structures" portion of the assignment.

(10) CharTrie.countNodes()

Study the CharTrie implementation and complete the countNodes method on the CharTrie.Node class. A recursive solution will be simplest.

You might want to create a unit test so that you count the nodes of a CharTrie that you can draw by hand. (So you know if you get it correct).

For clarity, countNodes should return the number of characters stored in the tree. This should be more than the number of words in the vocabulary, but less than the number of characters in the vocabulary (since a Trie shares prefixes).

(10) SortedStringListSet.binarySearch

Right now, this data structure merely calls Java's built-in Collections.binarySearch. Replace it with your own implementation.

For bonus points, find out why most binary search implementations are incorrect. Try to fix it in your implementation.

Double-check that your query speeds have not changed with your implementation of binary search. If they have, why might that be?

(5) LLHash.countCollisions() and LLHash.countUsedBuckets()

LLHash maintains an ArrayList of Buckets inside of itself. Use this list to compute how many collisions occurred and how many buckets are used. CheckSpelling.main uses them in print statements to compute the load-factor.

Play with the size of LLHash. Does this change your perception of its speed?

csc212spellchecking's People

Contributors

jjfiv avatar

Watchers

 avatar  avatar

csc212spellchecking's Issues

Confusing text about createFakeDataset

Still have text for the non-signature version of this:

For full credit, devise a method to inject some percentage of hits and misses. Create a line plot as the percentage of hits goes from (0.0 to 1.0) in steps of 0.1, where each line is a different data structure.

Update the readme

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.