Git Product home page Git Product logo

main's Introduction

CS429: Information Retrieval

Illinois Institute of Technology
Spring 2016

Overview

  • Course: CS 429: Information Retrieval
  • Instructor: Dr. Aron Culotta
  • Meetings: 3:15 - 4:30 pm T/R Stuart 104
  • E-mail: culotta at cs.iit.edu
  • Phone: 312-567-5261
  • Office Hours: T/R 10:00 a.m. - 11:00 a.m.
  • Office: Stuart Hall 229B
  • TA:
  • Zhao Wang; Office hour W 1-2 in Stuart 002
  • Ehsan Ardehaly; Office Hours M 12-1 in Stuart 002

Description: Overview of fundamental issues of information retrieval with theoretical foundations. The information-retrieval techniques and theory, covering both effectiveness and run-time performance of information-retrieval systems are covered. The focus is on algorithms and heuristics used to find documents relevant to the user request and to find them fast. The course covers the architecture and components of the search engine such as parser, stemmer, index builder, and query processor. The students learn the material by building a prototype of such a search engine. Prerequisites: CS 331 or CS 401; requires strong programming knowledge. 3-0-3 (C) (T)

Textbook: Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press. 2008.

You can use the electronic version of this book.

Grading

  • 250 points - Assignments (5 @ 50 points each)
  • 100 points - Midterm
  • 100 points - Final
  • 60 points - Quizzes (4 @ 15 points each)
  • 510 total points
Percent Grade
100-90 A
89-80 B
79-70 C
69-60 D
< 60 E

Academic Integrity

  • Please read IIT's Academic Honesty Policy
  • All work you turn in must be done by you alone.
  • All violations will be reported to [email protected] and may result in a failing grade for the assignment and/or course.

Late Submission Policy

  • Late assignments will not be accepted, unless:
    • There is an unavoidable medical, family, or other emergency, and
    • You notify me prior to the due date

Course Outcomes

  1. Explain the information retrieval storage methods (Inverted Index and Signature Files)
  2. Explain retrieval models, such as Boolean model, Vector Space model, Probabilistic model, Inference Networks, and Neural Networks.
  3. Explain retrieval utilities such as Stemming, Relevance Feedback, N-gram, Clustering, and Thesauri, and Parsing and Token recognition.
  4. Design and implement a search engine prototype using the storage methods, retrieval models and utilities.
  5. Apply the research ideas into their experiments in building a search engine prototype

Program Outcomes

  • a. An ability to apply knowledge of computing and mathematics appropriate to the discipline.
  • c. An ability to design, implement and evaluate a computer-based system, process, component, or program to meet desired needs.
  • d. An ability to function effectively on teams to accomplish a common goal.
  • f. An ability to communicate effectively with a range of audiences.
  • i. An ability to use current techniques, skills, and tools necessary for computing practices.
  • j. An ability to apply mathematical foundations, algorithmic principles, and computer science theory in the modeling and design of computer-based systems in a way that demonstrates comprehension of the tradeoffs involved in design choices.
  • k. An ability to apply design and development principles in the construction of software systems of varying complexity.

main's People

Contributors

aronwc avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

main's Issues

assignment question 1.

Q. In searcher.py, why do we keep an inverted index instead of simply a list of document vectors (e.g., dicts)? What is the difference in time and space complexity between the two approaches?

In this question you have asked why do we prefer an inverted index instead of a list ?

git pull not working

do I have to create the a1 folder?
I thought doing a git pull should copy the files for assignment 1 for me.

word count query

Should we perform the word count on tokenized document or normal? [for the spelling corrector]

divide by zero error

Hi Professor,

I receive a divide by zero error in query_to_vector because it seems a term in a query is not present in the documents and the idf cannot be computed. For example, in the first query, "KENNEDY ADMINISTRATION PRESSURE ON NGO DINH DIEM TO STOP SUPPRESSING THE BUDDHISTS ." suppressing is not a word in TIME.ALL

Quizzes / In-class assignments

The syllabus mentions that quizzes / in-class assignments will count for 50/700 points. Have we had any of these yet? If so, what were they? I missed a couple classes, so I'm wondering if I missed some.

Tokenize function -contd

I'm sorry to reply late..I still have one more doubt, If thats the case then under the create_index function the professor has passed a list with sublists as argument create_index([['a', 'b'], ['a', 'c']])
that obviously means that he's trying to pass multiple sublists within a parent list as a paramenter ..am i right ??

Tests for count_doc_frequencies

I would like to check the tests for the function count_doc_frequencies.
In the parameters of the function we have 4 a's but in the test result we have: res['a'] = 3. The test is incorrect, right?

Issue with reading documents.txt

While I was trying to test things in my code, I received the following error trying to read the file:

ValueError: Invalid mode ('rtb')

Looking a little bit deeper, the issue stems from these lines in codecs.py:

876             # Force opening of the file in binary mode
877             mode = mode + 'b'
--> 878     file = __builtin__.open(filename, mode, buffering)

What exactly am I supposed to do in this case? I haven't modified any pre-given function.

doctest possible error assignment 2

doctest for count_doc_frequencies:
res = Index().count_doc_frequencies([['a', 'b', 'a'], ['a', 'b', 'c'], ['a']])
res['a'] should have a value of 4 in this case, not 3.

No tests run for doctest

So as I mentioned in class, I had the issue where when I ran the doctests, no tests ran:

$ python -m doctest boolean_search.py -v
8 items had no tests:
   boolean_search
   boolean_search.create_index
   boolean_search.intersect
   boolean_search.main
   boolean_search.read_lines
   boolean_search.search
   boolean_search.sort_by_num_postings
   boolean_search.tokenize
0 tests in 8 items.
0 passed and 0 failed.
Test passed.

However running the code normally produces the expected results. I doubt it's a path issue since the doctest actually ran, so I'm not sure what the issue could be.

query_to_vector

Hi Professor,

For query_to_vector, wouldn't we also need to provide the list of documents instead of just the query_terms? Since we need to find the idf weight for each term in query_terms, we would need to know the total number of documents and the total number of documents each term appears in. I am not too sure how to achieve this with just the list of query terms.

Output

Do we need exactly the same out of Log.txt? I have this.

QUERY= city Using Champion List
1199 2.679080e-01
3691 2.367081e-01
2680 1.913987e-01
410 1.750654e-01
4256 1.516835e-01
3288 1.516195e-01
811 1.490004e-01
1983 1.416778e-01
2336 1.241685e-01
5362 9.983552e-02

However the Log.txt has this:
QUERY= city Using Champion List
1199 2.674604e-01
3691 2.366818e-01
2680 1.912368e-01
410 1.749006e-01
4256 1.516393e-01
3288 1.516023e-01
811 1.489597e-01
1983 1.415582e-01
2336 1.241447e-01
5362 9.984798e-02

Similar cases for all the queries

doc frequency

i am unable to find the doc frequency, rather i end up finding the number of terms in total. Please Help. Any tips on how to proceed. and for professor i have my recent code checked in.
Thank you

Floating Point Underflow Error

For the Naive Bayes classifier I'm getting an underflow error which makes the floating point value 0.0 which makes it difficult to compare if the message is spam or not spam if both are 0.0. Are there any easy solutions to this problem or underflow error is an error and I need to check my math?

Thanks,
Adrian

Questions regarding TIME.REL

since both this file and the TIME.QUE has 83 entries, does this implies the relevant documents for each query? the issue above says the document id start from 1, but from TIME.ALL there isn't a document with id 1, it starts with 17. thanks.

Private repo 404?

I don't think my private repo was created, is there a way for me to double-check?

Slight error in doctest

In search, there is the following doctest entry:
E.g., below we search for documents containing the phrase 'a b c':
>>> search({'a': [[0, 4], [1, 1]], 'b': [[0, 5], [1, 10]], 'c': [[0, 6], [1, 11]]}, 'a b')

the search is actually performed only for 'a b', thought you might want to correct it for futur uses.

Alexis

Bigrams

Can we use the predefined functions using NLTK in order to find the top bigrams?

compute_doc_length

lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
how can i access docid '0' for both 'a' and 'b' and compare them if they r equal? I have spent so much time trying to do it but no result. how can i access the weight (which i think i can figure once i can access doc id)? Please Help!!

python phrase_search.py

Hi Professor,

Could you post the correct output of python phrase_search.py like you did for assignment0?

Assingment 1 - Short Answers

Hi !

For the first question on the Short Answers part, should we provide the answers using asymptotic complexity or is it just necessary to iterate through the algorithm provided by the book ("intersect with skips") ? (e.g. "the skip pointer is followed three times" for the first question).

Thanks !

Index Benchmark

This may not be the requirement for this assignment, but could you provide some benchmark for the indexing? For example, how long should it take to index a document with approximately 100,000 words?

Tokenize from lecture

Are we allowed to use the tokenize source from the lecture, or are we supposed to write ours differently?

regarding skip lists

for short answer question 1 part a. does it mean how many elements are being skipped? I am not sure what how often is skip pointer "followed" means? please help!!

Short answer

Extend the postings merge algorithm to arbitrary Boolean query formulas. What is
its time complexity?

Do we have to explain how to extend the algorithm. Or do we just answer the rest of the question as if it were extended?

append 2 lists/dictionaries

Is it possible to append two lists or dictionaries? I was trying to do the create_positional_index code part and when I was looking at the expected results I saw this:

index['a']
[[0, 0, 2], [1, 0]]
>>> index['b']
[[0, 1]]
>>> index[('c')]
[[1, 1]]

There are two brackets onthe beginning and end of each result, and when it comes to index of A, there are 2 lists, so to do so I would need to merge/append two dictionaries or lists in my index, but how can I do that?

create_tf_idf_index test

Hi,

I'm not sure I understand this test:

index = Index().create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
sorted(index.keys())
['a', 'b']
index['a']
[[0, 0.0], [1, 0.0]]
index['b'] # doctest:+ELLIPSIS
[[0, 0.301...]]

Both idf value for 'a' are 0 because 'a' appears in all documents, hence log(N/dft)=log(1)=0.
This I understand.

However, I cannot get around the value for 'b'.
It seem to me that the expression would be tf_idf = log(1+1)*log(2/1), which is roughly 0.09. But the result of the doctest is 0.301, which means that N would have been given a value of 10, which would be valid if there were 10 documents total (but would make values for 'a' false), except we don't have this kind of information.

Thanks
Alexis

[EDIT] Just a precision, I understand the number of documents is probably available using len(self.documents) or something similar, however, this is not available for tests.

Last date for assignment 2

Professor,
When is the last day to submit assignment 2. Where could I get this information henceforth.

A0: Normalize tokens?

For the inverted index, are we required to normalize the tokens before the creation of the inverted index? Or just we need to create the tokens and then index them?

No additional normalization is required for this assignment.

Tokenize function

hi, the tokenize function should be built to split each sentence into individual lists within the parent list or is it supposed to split the sentences according to the given sample test case under the tokenize function (i.e) splitting all sentences into one single list ??

setup problem with git

i am trying to setup git by following the instructions given but it gives me error when i am typing
git config --global sjain41 "Smit Jain"
error:key does not contain selection:sjain41
Please help.
capture2

Submission date for Assignment 1

Hi,
I want to confirm the submission date for Assignment 1.
In iit-cs429/main/assignments/assignment1, it is mentioned as 2/06 at 11:59pm but in schedule it is 2/05.
Could you please confirm the date.

query_to_vector

Hi,

Can you give me an example for query_to_vector ? I'm a little confused by the input, as well as the output you're asking for. Could you give me a sample input and sample output? I'm confused by whether we are given a list of lists where each sublist represents one document, or if we are given one list with multiple words in it.

Thanks!

assignment0 file not read

i am running the code in ipython notebook, i cant happen to save documents.txt file. how can i save it there so i can run the code and have the code read the documents.txt file? currently it shows this error obviously:
IOError: [Errno 2] No such file or directory: 'documents.txt'

please help

Output of assignment 2 not matching

I am getting slightly different values (change in decimal point) for document score, therefore the output of my code is not matching with the output given, sometimes even the top 10 document ids are changing.
For e.g my output of QUERY= pop love song
4330 1.619429e+00
2203 9.325103e-01
2205 8.227300e-01
4693 7.713332e-01
312 6.915496e-01
5113 6.480366e-01
3401 6.463378e-01
4996 6.095998e-01
2683 5.734033e-01
2162 5.492592e-01

Is it acceptable? or is there any workaround?

phrase_intersect

There is not a test case provided for this, but if a phrase were to occur multiple times in the same document, should those results returned in one sub-list? For example, if a phrase occurred twice in document 0 and once in document 5, should it look something like
[[0, 3, 9], [5, 4]]
or should it be
[[0, 3], [0, 9], [5, 4]]
or, does it not matter for the purposes of this assignment...?

Copying assignment files

I noticed that the private repo has only the Readme file instead of all other files required to complete the assignment (short answer, boolean search, documents and query). Should I just copy them from the main repository ?

Yes.

How to delete a specific numbers on a dictionary?

Hey guys, I am having a bit of trouble in the last part of the top bigrams function. I was able to store the information of each bigrams in a dictionary and order the number in a decrescent order, so if we use the example that the professor gave in the code, the result would be "[('b c', 3), ('a b', 2), ('c a', 1), ('c d', 1)]". My question is, how can I eliminate some of the elements from the dictionary, but not all. For example, if the user type he wants just the top 3, I would eliminate just the last element of the dict, resulting in "[('b c', 3), ('a b', 2), ('c a', 1)]."

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.