Git Product home page Git Product logo

inf3800-2018's Introduction

Introduction

There are five obligatory assignments in INF3800 and INF4800. The assignments are a mix of pen-and-paper exercises and coding exercises.

The pen-and-paper exercises are a selection of questions from the final exams from previous years. They are obligatory if you take this course as INF4800 and optional if you take this course as INF3800. You'll find the final exams from previous years as part of this repository.

The coding assignments assume basic familiarity with the Python language. At least version 3.6 will be assumed. If you are on an older version of Python you’re on your own, and you might have to backport code. You can use whatever development environment you want, but you will probably be more productive and have an easier time if you use a good IDE. We can recommend PyCharm. You can use another set of tools if you want, but then don't expect help with solving challenges related to setup or tooling.

You will be provided with some "precode" or "starter code", i.e., a set of helper classes and functions that you can make use of so that you don't have to start the coding assignments completely from scratch. This precode also sets some structure on how you implement the assignments. Please familiarize yourself with what's available. The precode is commented and has some illustrative usage examples.

The precode is in many places annotated with type hints, using the typing module. These type hints are read by the PyCharm type checker and can be specified for both function signatures and variables. This doesn't make Python a statically typed language, but are just hints that are possible to ignore and abuse. They do convey intent and serve as a kind of additional documentation, though, and enable PyCharm and other IDEs to give you better and richer programming support.

Common for the precode is that NotImplementedError is raised in places where you are meant to provide a working implementation. After having provided working implementations, the following should work and run without errors:

>python3 assignments.py

The above invocation runs all tests for all assignments. If you want to only run the tests for, say, assignments A and C, you can pass this as command line arguments:

>python3 assignments.py a c

Making tests pass for one assignment should not result in tests breaking for previous assignments.

If your code raises no exceptions and passes all the assert statements, you should see the following printed to the console:

*************************
*** ALL TESTS PASSED! ***
*************************

Please strive to create readable and modular code. At a minimum, your code should pass all PyCharm's quality checks with PyCharm's standard configuration. (PyCharm warns about quality issues in its right-hand scrollbar. Please fix all these before submitting your code through Devilry.)

All your implementations should be reasonably efficient. If your code is very slow and you want to measure where the time is spent, you can use the built-in cProfile module to do this:

>python3 –m cProfile assignments.py

Output from cProfile can be visualized using, e.g., SnakeViz.

Assignment A

The purpose of this assignment is to build a simple in-memory inverted index and show how to merge posting lists.

Pen-and-paper exercises:

Implementation notes:

  • The InMemoryInvertedIndex class implements a simple inverted index. We create an inverted index from a corpus (i.e., a collection of documents) as represented by the InMemoryCorpus class.
  • After having indexed the corpus (or at least a specified set of fields in the documents) and created an InMemoryInvertedIndex object, we will have created a dictionary of indexed terms as represented by the InMemoryDictionary class, and a posting list for each term. Each posting in a posting list should keep track of the document identifier and the number of times the term occurs in the identified document. The resulting posting lists must be sorted in ascending order by document identifiers.
  • For text normalization and tokenization purposes, you can use the BrainDeadNormalizer and BrainDeadTokenizer classes.
  • You might find the Counter class in the built-in collections module useful.

Your task is to:

  • Familiarize yourself with the precode.
  • Implement the missing indexing code in the InMemoryInvertedIndex class.
  • Implement the missing code for merging two posting lists in the PostingsMerger class.
  • Make all tests pass.

Some optional bonus challenges for the interested student:

  • Above you implemented the AND and OR operators over posting lists. Extend this to also implement the ANDNOT operator.
  • The implementations for the AND and OR operators take two posting lists as arguments. Can you generalize your implementations to be n-ary instead of binary, i.e., how would you efficiently traverse n posting lists in "parallel"?
  • Extend your posting list implementation to include skip lists, extend your indexing code to build these, and extend your merging code to make good use of this additional data structure.

Expected output:

>python3 assignments.py a
*** ASSIGNMENT A ***
INDEXING...
prøve
{'document_id': 1, 'term_frequency': 1}
wtf
test
{'document_id': 0, 'term_frequency': 1}
{'document_id': 1, 'term_frequency': 2}
{'this': [{'document_id': 0, 'term_frequency': 1}], 'is': [{'document_id': 0, 'term_frequency': 1}], 'a': [{'document_id': 0, 'term_frequency': 1}], 'test': [{'document_id': 0, 'term_frequency': 1}, {'document_id': 1, 'term_frequency': 2}], 'prøve': [{'document_id': 1, 'term_frequency': 1}]}
LOADING...
INDEXING...
hydrogen
{'document_id': 11634, 'term_frequency': 1}
{'document_id': 11635, 'term_frequency': 1}
{'document_id': 11636, 'term_frequency': 1}
{'document_id': 11637, 'term_frequency': 1}
{'document_id': 11638, 'term_frequency': 1}
{'document_id': 11639, 'term_frequency': 1}
{'document_id': 19011, 'term_frequency': 1}
{'document_id': 22229, 'term_frequency': 1}
hydrocephalus
{'document_id': 11622, 'term_frequency': 1}
{'document_id': 11623, 'term_frequency': 1}
MERGING...
HIV AND pROtein
{'document_id': 11316, 'fields': {'body': 'hiv core protein p24', 'meta': '20'}}
{'document_id': 11319, 'fields': {'body': 'hiv envelope protein gp120', 'meta': '26'}}
{'document_id': 11320, 'fields': {'body': 'hiv envelope protein gp160', 'meta': '26'}}
{'document_id': 11321, 'fields': {'body': 'hiv envelope protein gp41', 'meta': '25'}}
water OR Toxic
{'document_id': 3078, 'fields': {'body': 'body water', 'meta': '10'}}
{'document_id': 8138, 'fields': {'body': 'epidermal necrolysis, toxic', 'meta': '27'}}
{'document_id': 8635, 'fields': {'body': 'extravascular lung water', 'meta': '24'}}
{'document_id': 9379, 'fields': {'body': 'fresh water', 'meta': '11'}}
{'document_id': 14472, 'fields': {'body': 'megacolon, toxic', 'meta': '16'}}
{'document_id': 18572, 'fields': {'body': 'plants, toxic', 'meta': '13'}}
{'document_id': 23234, 'fields': {'body': 'tar-water', 'meta': '9'}}
{'document_id': 23985, 'fields': {'body': 'toxic actions', 'meta': '13'}}
{'document_id': 25265, 'fields': {'body': 'water', 'meta': '5'}}
{'document_id': 25266, 'fields': {'body': 'water deprivation', 'meta': '17'}}
{'document_id': 25267, 'fields': {'body': 'water intoxication', 'meta': '18'}}
{'document_id': 25268, 'fields': {'body': 'water loss, insensible', 'meta': '22'}}
{'document_id': 25269, 'fields': {'body': 'water microbiology', 'meta': '18'}}
{'document_id': 25270, 'fields': {'body': 'water movements', 'meta': '15'}}
{'document_id': 25271, 'fields': {'body': 'water pollutants', 'meta': '16'}}
{'document_id': 25272, 'fields': {'body': 'water pollutants, chemical', 'meta': '26'}}
{'document_id': 25273, 'fields': {'body': 'water pollutants, radioactive', 'meta': '29'}}
{'document_id': 25274, 'fields': {'body': 'water pollution', 'meta': '15'}}
{'document_id': 25275, 'fields': {'body': 'water pollution, chemical', 'meta': '25'}}
{'document_id': 25276, 'fields': {'body': 'water pollution, radioactive', 'meta': '28'}}
{'document_id': 25277, 'fields': {'body': 'water purification', 'meta': '18'}}
{'document_id': 25278, 'fields': {'body': 'water softening', 'meta': '15'}}
{'document_id': 25279, 'fields': {'body': 'water supply', 'meta': '12'}}
{'document_id': 25280, 'fields': {'body': 'water-electrolyte balance', 'meta': '25'}}
{'document_id': 25281, 'fields': {'body': 'water-electrolyte imbalance', 'meta': '27'}}
*************************
*** ALL TESTS PASSED! ***
*************************

inf3800-2018's People

Contributors

aohrn avatar olafosheimgrostad avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.