CS429: Information Retrieval @ IIT
For Short Answer question 1 part (a): does it mean how many elements are being skipped? I'm not sure what "how often is the skip pointer followed" means. Please help!
For the Naive Bayes classifier I'm getting an underflow error: the floating-point value becomes 0.0, which makes it difficult to decide whether a message is spam or not spam when both probabilities are 0.0. Are there any easy solutions to this problem, or is underflow a sign that I need to check my math?
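The usual fix is to work in log space: sum log-probabilities instead of multiplying raw probabilities, since the ordering between classes is preserved under the logarithm. A minimal sketch with made-up numbers:

```python
import math

def log_score(prior, likelihoods):
    """Score one class by summing log-probabilities instead of multiplying
    raw probabilities, which would underflow to 0.0 for long messages."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# hypothetical per-token probabilities for a 200-token message
spam_probs = [1e-5] * 200
ham_probs = [1e-5] * 199 + [1e-4]

product = 1.0
for p in spam_probs:
    product *= p
print(product)  # → 0.0 (underflow: 1e-1000 is below the smallest float)

# comparing in log space still gives a meaningful answer
print(log_score(0.5, spam_probs) > log_score(0.5, ham_probs))  # → False
```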
Thanks,
Adrian
Sorry for the late reply. I still have one more doubt: if that's the case, then in the create_index function the professor passes a list of sublists as the argument: create_index([['a', 'b'], ['a', 'c']]). That means he's passing multiple sublists within a parent list as a parameter, am I right?
Hey guys, I am having a bit of trouble with the last part of the top bigrams function. I was able to store the count of each bigram in a dictionary and order them in descending order, so using the example the professor gave in the code, the result would be "[('b c', 3), ('a b', 2), ('c a', 1), ('c d', 1)]". My question is, how can I eliminate some of the elements from the dictionary, but not all? For example, if the user asks for just the top 3, I would eliminate just the last element, resulting in "[('b c', 3), ('a b', 2), ('c a', 1)]".
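Rather than deleting entries from the dictionary, you can sort the items and slice the top k; a sketch reproducing the example (function and variable names are mine):

```python
def top_bigrams(counts, k):
    """Return the k most frequent bigrams; ties are broken alphabetically.
    `counts` maps a bigram string to its frequency."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

counts = {'b c': 3, 'a b': 2, 'c a': 1, 'c d': 1}
print(top_bigrams(counts, 3))  # → [('b c', 3), ('a b', 2), ('c a', 1)]
```

The slice leaves the original dictionary untouched, so nothing actually needs to be removed.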
In search, there is the following doctest entry:
E.g., below we search for documents containing the phrase 'a b c':
>>> search({'a': [[0, 4], [1, 1]], 'b': [[0, 5], [1, 10]], 'c': [[0, 6], [1, 11]]}, 'a b')
the search is actually performed only for 'a b'; thought you might want to correct it for future uses.
Alexis
I would like to check the tests for the function count_doc_frequencies.
In the parameters of the function we have four 'a's, but in the test result we have res['a'] = 3. The test is incorrect, right?
Problem: You follow the instructions to open the lecture notebook (ipython notebook Introduction.ipynb), but you get an error message like "unreadable json notebook."
Solution: Update to the latest version of ipython. These links may help:
http://askubuntu.com/questions/335883/how-to-use-the-newest-ipython-in-ubuntu12-04
https://www.youtube.com/watch?v=llX5bn1_BF4
Even after downloading NLTK, it still prompts me to download; how do I fix this error?
Here is a link to my IPython notebook as an example. Please help:
http://localhost:8888/notebooks/Untitled0.ipynb
Hi !
For the first question in the Short Answers part, should we give the answers in terms of asymptotic complexity, or is it enough to trace through the algorithm provided by the book ("intersect with skips")? (E.g., "the skip pointer is followed three times" for the first question.)
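For anyone tracing it, the book's intersect-with-skips procedure can be sketched in Python with a counter for how many times a skip pointer is actually followed (skip spacing of about sqrt(length) is the textbook heuristic; the exact counting convention here is my assumption):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using evenly spaced skip
    pointers (one every ~sqrt(len) positions). Returns the intersection
    and the number of times a skip pointer was followed."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, skips_followed = [], 0
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip only if it lands on a value <= p2[j]
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
                skips_followed += 1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
                skips_followed += 1
            else:
                j += 1
    return answer, skips_followed
```

Running this on the lists from the question and reading off `skips_followed` is one way to sanity-check a hand trace.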
Thanks !
I am getting slightly different values (differences in the decimal places) for the document scores, so the output of my code does not match the given output; sometimes even the top 10 document IDs change.
For example, my output for QUERY= pop love song is:
4330 1.619429e+00
2203 9.325103e-01
2205 8.227300e-01
4693 7.713332e-01
312 6.915496e-01
5113 6.480366e-01
3401 6.463378e-01
4996 6.095998e-01
2683 5.734033e-01
2162 5.492592e-01
Is this acceptable, or is there a workaround?
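Small score differences like these usually come from the log base, the exact tf-idf variant, or floating-point summation order. If you want to check your output against the reference, compare doc IDs exactly and scores within a tolerance; a sketch with hypothetical values:

```python
import math

mine     = [(4330, 1.619429), (2203, 0.9325103)]  # my scores (hypothetical)
expected = [(4330, 1.619125), (2203, 0.9324987)]  # reference scores (hypothetical)

# same ranking, and scores equal within a small relative tolerance
same = all(d1 == d2 and math.isclose(s1, s2, rel_tol=1e-3)
           for (d1, s1), (d2, s2) in zip(mine, expected))
print(same)  # → True
```

If the doc IDs themselves differ, though, the discrepancy is probably a formula difference rather than rounding.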
Since both this file and TIME.QUE have 83 entries, does this imply these are the relevant documents for each query? The issue above says the document IDs start from 1, but in TIME.ALL there isn't a document with ID 1; it starts at 17. Thanks.
Do we need exactly the same output as Log.txt? I have this:
QUERY= city Using Champion List
1199 2.679080e-01
3691 2.367081e-01
2680 1.913987e-01
410 1.750654e-01
4256 1.516835e-01
3288 1.516195e-01
811 1.490004e-01
1983 1.416778e-01
2336 1.241685e-01
5362 9.983552e-02
However the Log.txt has this:
QUERY= city Using Champion List
1199 2.674604e-01
3691 2.366818e-01
2680 1.912368e-01
410 1.749006e-01
4256 1.516393e-01
3288 1.516023e-01
811 1.489597e-01
1983 1.415582e-01
2336 1.241447e-01
5362 9.984798e-02
Similar discrepancies occur for all the queries.
I noticed that the private repo has only the README file instead of all the other files required to complete the assignment (short answer, boolean search, documents, and queries). Should I just copy them from the main repository?
Yes.
It was pointed out that queries.txt does not match what was used to generate Log.txt.
I updated queries.txt to
lion
What has
why because
has four legs
did the chicken
why did the
vies
the moo-vies
why did
why does
what's
This may not be the requirement for this assignment, but could you provide some benchmark for the indexing? For example, how long should it take to index a document with approximately 100,000 words?
Basically a giant unit test, start from the basics all the way to more advanced topics.
https://github.com/gregmalcolm/python_koans
I've done these, and would be happy to help to get you going if you need.
Hi, should the tokenize function split each sentence into an individual list within a parent list, or should it split the sentences as shown in the sample test case under the tokenize function, i.e., all sentences into one single list?
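For what it's worth, the sample test case suggests one flat token list per document. A minimal tokenizer sketch (the exact regex is an assumption; match whatever the doctest under tokenize shows):

```python
import re

def tokenize(document):
    """Split a document string into a single flat list of lowercase
    alphanumeric tokens, rather than one sub-list per sentence."""
    return re.findall(r'\w+', document.lower())

print(tokenize("The cat. The DOG!"))  # → ['the', 'cat', 'the', 'dog']
```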
Do I have to create the a1 folder? I thought doing a git pull would copy the files for assignment 1 for me.
Hi,
Can you give me an example for query_to_vector? I'm a little confused by both the input and the output you're asking for; could you give a sample of each? In particular, are we given a list of lists where each sublist represents one document, or a single list with multiple words in it?
Thanks!
Hi ! For the Assignment 1, can we use the WordNet Lemmatizer to build the stem(tokens) function?
Thanks !
Hi Professor,
Could you post the correct output of python phrase_search.py
like you did for assignment0?
Q. In searcher.py, why do we keep an inverted index instead of simply a list of document vectors (e.g., dicts)? What is the difference in time and space complexity between the two approaches?
In this question, are you asking why we prefer an inverted index over a list?
So, as I mentioned in class, I had the issue where, when I ran the doctests, no tests ran:
$ python -m doctest boolean_search.py -v
8 items had no tests:
boolean_search
boolean_search.create_index
boolean_search.intersect
boolean_search.main
boolean_search.read_lines
boolean_search.search
boolean_search.sort_by_num_postings
boolean_search.tokenize
0 tests in 8 items.
0 passed and 0 failed.
Test passed.
However running the code normally produces the expected results. I doubt it's a path issue since the doctest actually ran, so I'm not sure what the issue could be.
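"0 tests in 8 items" usually means doctest imported the module fine but found no examples: doctest only collects docstring lines that begin with the `>>> ` prompt, with the expected output on the very next line. A minimal sketch (function name hypothetical):

```python
def intersect(a, b):
    """Return the sorted common elements of two lists.

    Doctest only picks this up because of the `>>> ` prompt below,
    with the expected output on the line immediately after it:

    >>> intersect([1, 2, 3], [2, 3, 4])
    [2, 3]
    """
    return sorted(set(a) & set(b))
```

If the docstrings were pasted without the `>>> ` prompts (or with the prompt but no space after it), doctest reports exactly "0 tests in N items" while the code itself still runs normally.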
Do the document IDs in TIME.REL start from 1?
Can we use the predefined NLTK functions to find the top bigrams?
it says "I don't like windows"
Is it possible to append two lists or dictionaries? I was working on the create_positional_index code, and when I looked at the expected results I saw this:
>>> index['a']
[[0, 0, 2], [1, 0]]
>>> index['b']
[[0, 1]]
>>> index['c']
[[1, 1]]
There are two brackets at the beginning and end of each result, and for the index of 'a' there are two lists, so I would need to merge/append two lists (or dictionaries) in my index. How can I do that?
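A common way to build up postings like these is a defaultdict(list): append a new [doc_id, pos] posting when a term first appears in a document, and extend that posting's position list on repeat occurrences. A sketch that reproduces the doctest (the function name matches it, but the details are my assumptions):

```python
from collections import defaultdict

def create_positional_index(docs):
    """Map each term to a list of postings [doc_id, pos1, pos2, ...]."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            # extend the last posting if it is for the same document...
            if index[term] and index[term][-1][0] == doc_id:
                index[term][-1].append(pos)
            else:  # ...otherwise start a new [doc_id, pos] posting
                index[term].append([doc_id, pos])
    return dict(index)

index = create_positional_index([['a', 'b', 'a'], ['a', 'c']])
print(index['a'])  # → [[0, 0, 2], [1, 0]]
print(index['b'])  # → [[0, 1]]
print(index['c'])  # → [[1, 1]]
```

`list.append` adds one element to the end of a list, which is all the nesting here requires; no dictionaries need to be merged.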
Are we allowed to use the tokenize source from the lecture, or are we supposed to write ours differently?
The syllabus mentions that quizzes / in-class assignments will count for 50/700 points. Have we had any of these yet? If so, what were they? I missed a couple classes, so I'm wondering if I missed some.
Professor,
When is the last day to submit assignment 2? Where can I find this information henceforth?
doctest for count_doc_frequencies:
res = Index().count_doc_frequencies([['a', 'b', 'a'], ['a', 'b', 'c'], ['a']])
res['a'] should have a value of 4 in this case, not 3.
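For what it's worth, document frequency is conventionally the number of *documents* containing a term, not the number of occurrences, which would make 3 the expected value here: 'a' occurs four times, but in only three documents. A quick check of that definition (a sketch, not necessarily the assignment's implementation):

```python
def count_doc_frequencies(docs):
    """Document frequency: the number of documents containing each term,
    counting a term at most once per document."""
    df = {}
    for tokens in docs:
        for term in set(tokens):  # set() so repeats within a doc count once
            df[term] = df.get(term, 0) + 1
    return df

df = count_doc_frequencies([['a', 'b', 'a'], ['a', 'b', 'c'], ['a']])
print(df['a'])  # → 3: four occurrences, but only three documents
```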
While I was trying to test things in my code, I received the following error trying to read the file:
ValueError: Invalid mode ('rtb')
Looking a little bit deeper, the issue stems from these lines in codecs.py:
876 # Force opening of the file in binary mode
877 mode = mode + 'b'
--> 878 file = __builtin__.open(filename, mode, buffering)
What exactly am I supposed to do in this case? I haven't modified any pre-given function.
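The 'rtb' error happens because codecs.open forces binary mode by appending 'b' to whatever mode you pass, so a text mode like 'rt' becomes the invalid 'rtb'. Passing plain 'r' (or using io.open) avoids it; a minimal sketch with a hypothetical file name:

```python
import codecs
import io

# create a small file to read back (hypothetical name and contents)
with io.open('example.txt', 'w', encoding='utf-8') as f:
    f.write('hello')

# codecs.open forces binary mode internally ('rt' -> invalid 'rtb'),
# so pass plain 'r' and let the codec handle decoding:
with codecs.open('example.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text)  # → hello
```

If the 'rt' mode comes from pre-given starter code, the caller higher up the stack is likely passing 'rt' into codecs.open; changing that one mode string to 'r' should clear the error.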
There is not a test case provided for this, but if a phrase occurs multiple times in the same document, should those results be returned in one sub-list? For example, if a phrase occurred twice in document 0 and once in document 5, should it look something like
[[0, 3, 9], [5, 4]]
or should it be
[[0, 3], [0, 9], [5, 4]]
or, does it not matter for the purposes of this assignment...?
lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
How can I access docid 0 for both 'a' and 'b' and compare whether they are equal? I have spent so much time trying to do it but with no result. How can I access the weight (which I think I can figure out once I can access the doc ID)? Please help!
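One way to get at the doc IDs and weights is to unpack each posting as you iterate, since each posting in this index is a [doc_id, weight] pair. A sketch of compute_doc_lengths under that assumption:

```python
import math
from collections import defaultdict

def compute_doc_lengths(index):
    """For each doc_id, the Euclidean length of its weight vector.
    Each posting is assumed to be a [doc_id, weight] pair, per the doctest."""
    sums = defaultdict(float)
    for term, postings in index.items():
        for doc_id, weight in postings:  # unpacking yields both fields at once
            sums[doc_id] += weight ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in sums.items()}

lengths = compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
print(lengths)  # → {0: 5.0}, since sqrt(3**2 + 4**2) = 5
```

Accumulating into a dict keyed by doc_id is what lets postings for the same document (here doc 0 under both 'a' and 'b') meet up.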
These seem to contain the exact same results. Is this correct?
Hi,
I'm not sure I understand this test:
index = Index().create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
sorted(index.keys())
['a', 'b']
index['a']
[[0, 0.0], [1, 0.0]]
index['b'] # doctest:+ELLIPSIS
[[0, 0.301...]]
Both idf values for 'a' are 0 because 'a' appears in all documents; hence log(N/df_t) = log(1) = 0.
This I understand.
However, I cannot work out the value for 'b'.
It seems to me that the expression would be tf_idf = log(1+1)*log(2/1), which is roughly 0.09. But the doctest result is 0.301, which would mean N had been given a value of 10; that would be valid if there were 10 documents total (but it would make the values for 'a' wrong), except we don't have that kind of information.
Thanks
Alexis
[EDIT] Just a clarification: I understand the number of documents is probably available via len(self.documents) or something similar; however, that is not available in the tests.
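For reference, 0.301 is log10(2), so the doctest is consistent with base-10 logarithms, raw term frequency (not log(1 + tf)), and N taken from the documents argument (here N = 2) with df from the doc_freqs dict. A sketch of one weighting that reproduces the doctest (an assumption; the assignment may specify a different tf component):

```python
import math

def create_tfidf_index(docs, doc_freqs):
    """Weight each term as w = tf * log10(N / df), with N = number of
    documents passed in and df taken from the doc_freqs dict."""
    n = len(docs)
    index = {}
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):
            tf = tokens.count(term)  # raw term frequency in this document
            w = tf * math.log10(n / doc_freqs[term])
            index.setdefault(term, []).append([doc_id, w])
    return index

index = create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
print(index['a'])  # → [[0, 0.0], [1, 0.0]]  (log10(2/2) = 0)
print(index['b'])  # → [[0, 0.301...]]       (log10(2/1) ≈ 0.30103)
```

Under this reading, N = 2 throughout, so the 'a' values stay 0 and 'b' comes out to exactly log10(2).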
For the inverted index, are we required to normalize the tokens before creating the inverted index, or do we just need to create the tokens and then index them?
No additional normalization is required for this assignment.
I go to https://github.com/iit-cs429/[my-github-id]-asg but get a 404 error.
Be sure you're logged into github first and confirm I have created a private repo for you.
Hi,
I want to confirm the submission date for Assignment 1.
In iit-cs429/main/assignments/assignment1 it is listed as 2/06 at 11:59pm, but the schedule says 2/05.
Could you please confirm the date?
Should we perform the word count on the tokenized document or the raw one? [for the spelling corrector]
Hi Professor,
I receive a divide-by-zero error in query_to_vector because a term in the query is not present in the documents, so its idf cannot be computed. For example, in the first query, "KENNEDY ADMINISTRATION PRESSURE ON NGO DINH DIEM TO STOP SUPPRESSING THE BUDDHISTS .", "suppressing" is not a word in TIME.ALL.
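One common workaround (an assumption about what the assignment expects): give query terms that never occur in the collection an idf of 0 so they contribute nothing to the query vector, rather than dividing by a zero document frequency. Sketch with hypothetical numbers:

```python
import math

def idf(term, doc_freqs, n_docs):
    """Return the idf for a query term; terms absent from the collection
    get weight 0 instead of raising a ZeroDivisionError.
    (Add-one smoothing of df is another common option.)"""
    df = doc_freqs.get(term, 0)
    if df == 0:
        return 0.0  # unseen term contributes nothing to the query vector
    return math.log10(n_docs / df)

print(idf('suppressing', {'kennedy': 10.}, 423))  # → 0.0
print(idf('kennedy', {'kennedy': 10.}, 100))      # → 1.0
```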
I am unable to find the doc frequency; instead I end up finding the total number of terms. Please help; any tips on how to proceed? And for the professor: I have my most recent code checked in.
Thank you
How do I copy those files from main to my repository?
I don't think my private repo was created, is there a way for me to double-check?
Regarding the question "Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity?": do we have to explain how to extend the algorithm, or do we just answer the rest of the question as if it were extended?
I am running the code in an IPython notebook and can't manage to save the documents.txt file. How can I save it there so the code can read documents.txt? Currently it shows this error, as expected:
IOError: [Errno 2] No such file or directory: 'documents.txt'
Please help.
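One way around this is to create documents.txt from inside the notebook itself, either with the %%writefile cell magic or with plain Python; a sketch with hypothetical contents:

```python
# Write documents.txt from inside the notebook so the next cell can read it.
# (In IPython you can also put  %%writefile documents.txt  at the top of a cell.)
docs = ["the cat sat", "the dog ran"]  # hypothetical contents
with open('documents.txt', 'w') as f:
    f.write('\n'.join(docs))

# now the code that expects documents.txt can find it
with open('documents.txt') as f:
    lines = f.read().split('\n')
print(lines)  # → ['the cat sat', 'the dog ran']
```

The file lands in the notebook server's working directory, which is where a relative path like 'documents.txt' is resolved.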
Hi Professor,
For query_to_vector, wouldn't we also need the list of documents, not just query_terms? Since we need the idf weight for each term in query_terms, we would need the total number of documents and the number of documents each term appears in. I'm not sure how to achieve this with just the list of query terms.