CS429: Information Retrieval @ IIT
For Short Answer question 1 part (a): does it mean how many elements are being skipped? I'm not sure what "how often is the skip pointer followed" means. Please help!
For the Naive Bayes classifier I'm getting an underflow error: the floating-point value becomes 0.0, which makes it difficult to decide whether a message is spam or not spam when both probabilities are 0.0. Are there any easy solutions to this problem, or is underflow a sign that I need to check my math?
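The usual fix is to work in log space: sum log-probabilities instead of multiplying raw probabilities, since the ordering between classes is preserved under the logarithm. A minimal sketch with made-up numbers:

```python
import math

def log_score(prior, likelihoods):
    """Score one class by summing log-probabilities instead of multiplying
    raw probabilities, which would underflow to 0.0 for long messages."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# hypothetical per-token probabilities for a 200-token message
spam_probs = [1e-5] * 200
ham_probs = [1e-5] * 199 + [1e-4]

product = 1.0
for p in spam_probs:
    product *= p
print(product)  # → 0.0 (underflow: 1e-1000 is below the smallest float)

# comparing in log space still gives a meaningful answer
print(log_score(0.5, spam_probs) > log_score(0.5, ham_probs))  # → False
```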
Thanks,
Adrian
Sorry for the late reply. I still have one more doubt: if that's the case, then in the create_index function the professor passes a list of sublists as the argument: create_index([['a', 'b'], ['a', 'c']]). That means he's passing multiple sublists within a parent list as a parameter, am I right?
Hey guys, I am having a bit of trouble with the last part of the top bigrams function. I was able to store the count of each bigram in a dictionary and order them in descending order, so using the example the professor gave in the code, the result would be "[('b c', 3), ('a b', 2), ('c a', 1), ('c d', 1)]". My question is, how can I eliminate some of the elements from the dictionary, but not all? For example, if the user asks for just the top 3, I would eliminate just the last element, resulting in "[('b c', 3), ('a b', 2), ('c a', 1)]".
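Rather than deleting entries from the dictionary, you can sort the items and slice the top k; a sketch reproducing the example (function and variable names are mine):

```python
def top_bigrams(counts, k):
    """Return the k most frequent bigrams; ties are broken alphabetically.
    `counts` maps a bigram string to its frequency."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

counts = {'b c': 3, 'a b': 2, 'c a': 1, 'c d': 1}
print(top_bigrams(counts, 3))  # → [('b c', 3), ('a b', 2), ('c a', 1)]
```

The slice leaves the original dictionary untouched, so nothing actually needs to be removed.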
In search, there is the following doctest entry:
E.g., below we search for documents containing the phrase 'a b c':
>>> search({'a': [[0, 4], [1, 1]], 'b': [[0, 5], [1, 10]], 'c': [[0, 6], [1, 11]]}, 'a b')
the search is actually performed only for 'a b'; thought you might want to correct it for future uses.
Alexis
I would like to check the tests for the function count_doc_frequencies.
In the parameters of the function we have four 'a's, but in the test result we have res['a'] = 3. The test is incorrect, right?
Problem: You follow the instructions to open the lecture notebook (ipython notebook Introduction.ipynb), but you get an error message like "unreadable json notebook."
Solution: Update to the latest version of ipython. These links may help:
http://askubuntu.com/questions/335883/how-to-use-the-newest-ipython-in-ubuntu12-04
https://www.youtube.com/watch?v=llX5bn1_BF4
Even after downloading NLTK, it still prompts me to download; how do I fix this error?
Here is a link to my IPython notebook as an example. Please help:
http://localhost:8888/notebooks/Untitled0.ipynb
Hi !
For the first question in the Short Answers part, should we give the answers in terms of asymptotic complexity, or is it enough to trace through the algorithm provided by the book ("intersect with skips")? (E.g., "the skip pointer is followed three times" for the first question.)
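For anyone tracing it, the book's intersect-with-skips procedure can be sketched in Python with a counter for how many times a skip pointer is actually followed (skip spacing of about sqrt(length) is the textbook heuristic; the exact counting convention here is my assumption):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using evenly spaced skip
    pointers (one every ~sqrt(len) positions). Returns the intersection
    and the number of times a skip pointer was followed."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, skips_followed = [], 0
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip only if it lands on a value <= p2[j]
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
                skips_followed += 1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
                skips_followed += 1
            else:
                j += 1
    return answer, skips_followed
```

Running this on the lists from the question and reading off `skips_followed` is one way to sanity-check a hand trace.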
Thanks !
I am getting slightly different values (differences in the decimal places) for the document scores, so the output of my code does not match the given output; sometimes even the top 10 document IDs change.
For example, my output for QUERY= pop love song is:
4330 1.619429e+00
2203 9.325103e-01
2205 8.227300e-01
4693 7.713332e-01
312 6.915496e-01
5113 6.480366e-01
3401 6.463378e-01
4996 6.095998e-01
2683 5.734033e-01
2162 5.492592e-01
Is this acceptable, or is there a workaround?
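Small score differences like these usually come from the log base, the exact tf-idf variant, or floating-point summation order. If you want to check your output against the reference, compare doc IDs exactly and scores within a tolerance; a sketch with hypothetical values:

```python
import math

mine     = [(4330, 1.619429), (2203, 0.9325103)]  # my scores (hypothetical)
expected = [(4330, 1.619125), (2203, 0.9324987)]  # reference scores (hypothetical)

# same ranking, and scores equal within a small relative tolerance
same = all(d1 == d2 and math.isclose(s1, s2, rel_tol=1e-3)
           for (d1, s1), (d2, s2) in zip(mine, expected))
print(same)  # → True
```

If the doc IDs themselves differ, though, the discrepancy is probably a formula difference rather than rounding.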
Since both this file and TIME.QUE have 83 entries, does this imply these are the relevant documents for each query? The issue above says the document IDs start from 1, but in TIME.ALL there isn't a document with ID 1; it starts at 17. Thanks.
Do we need exactly the same output as Log.txt? I have this:
QUERY= city Using Champion List
1199 2.679080e-01
3691 2.367081e-01
2680 1.913987e-01
410 1.750654e-01
4256 1.516835e-01
3288 1.516195e-01
811 1.490004e-01
1983 1.416778e-01
2336 1.241685e-01
5362 9.983552e-02
However the Log.txt has this:
QUERY= city Using Champion List
1199 2.674604e-01
3691 2.366818e-01
2680 1.912368e-01
410 1.749006e-01
4256 1.516393e-01
3288 1.516023e-01
811 1.489597e-01
1983 1.415582e-01
2336 1.241447e-01
5362 9.984798e-02
Similar discrepancies occur for all the queries.
I noticed that the private repo has only the README file instead of all the other files required to complete the assignment (short answer, boolean search, documents, and queries). Should I just copy them from the main repository?
Yes.
It was pointed out that queries.txt does not match what was used to generate Log.txt.
I updated queries.txt to
lion
What has
why because
has four legs
did the chicken
why did the
vies
the moo-vies
why did
why does
what's
This may not be the requirement for this assignment, but could you provide some benchmark for the indexing? For example, how long should it take to index a document with approximately 100,000 words?
Basically a giant unit test, start from the basics all the way to more advanced topics.
https://github.com/gregmalcolm/python_koans
I've done these, and would be happy to help to get you going if you need.
Hi, should the tokenize function split each sentence into an individual list within a parent list, or should it split the sentences as shown in the sample test case under the tokenize function, i.e., all sentences into one single list?
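For what it's worth, the sample test case suggests one flat token list per document. A minimal tokenizer sketch (the exact regex is an assumption; match whatever the doctest under tokenize shows):

```python
import re

def tokenize(document):
    """Split a document string into a single flat list of lowercase
    alphanumeric tokens, rather than one sub-list per sentence."""
    return re.findall(r'\w+', document.lower())

print(tokenize("The cat. The DOG!"))  # → ['the', 'cat', 'the', 'dog']
```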
Do I have to create the a1 folder? I thought doing a git pull would copy the files for assignment 1 for me.
Hi,
Can you give me an example for query_to_vector? I'm a little confused by both the input and the output you're asking for; could you give a sample of each? In particular, are we given a list of lists where each sublist represents one document, or a single list with multiple words in it?
Thanks!
Hi ! For the Assignment 1, can we use the WordNet Lemmatizer to build the stem(tokens) function?
Thanks !
Hi Professor,
Could you post the correct output of python phrase_search.py
like you did for assignment0?
Q. In searcher.py, why do we keep an inverted index instead of simply a list of document vectors (e.g., dicts)? What is the difference in time and space complexity between the two approaches?
In this question, are you asking why we prefer an inverted index over a list?
So, as I mentioned in class, I had the issue where, when I ran the doctests, no tests ran:
$ python -m doctest boolean_search.py -v
8 items had no tests:
boolean_search
boolean_search.create_index
boolean_search.intersect
boolean_search.main
boolean_search.read_lines
boolean_search.search
boolean_search.sort_by_num_postings
boolean_search.tokenize
0 tests in 8 items.
0 passed and 0 failed.
Test passed.
However running the code normally produces the expected results. I doubt it's a path issue since the doctest actually ran, so I'm not sure what the issue could be.
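"0 tests in 8 items" usually means doctest imported the module fine but found no examples: doctest only collects docstring lines that begin with the `>>> ` prompt, with the expected output on the very next line. A minimal sketch (function name hypothetical):

```python
def intersect(a, b):
    """Return the sorted common elements of two lists.

    Doctest only picks this up because of the `>>> ` prompt below,
    with the expected output on the line immediately after it:

    >>> intersect([1, 2, 3], [2, 3, 4])
    [2, 3]
    """
    return sorted(set(a) & set(b))
```

If the docstrings were pasted without the `>>> ` prompts (or with the prompt but no space after it), doctest reports exactly "0 tests in N items" while the code itself still runs normally.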
Do the document IDs in TIME.REL start from 1?
Can we use the predefined NLTK functions to find the top bigrams?
it says "I don't like windows"
Is it possible to append two lists or dictionaries? I was working on the create_positional_index code, and when I looked at the expected results I saw this:
>>> index['a']
[[0, 0, 2], [1, 0]]
>>> index['b']
[[0, 1]]
>>> index['c']
[[1, 1]]
There are two brackets at the beginning and end of each result, and for the index of 'a' there are two lists, so I would need to merge/append two lists (or dictionaries) in my index. How can I do that?
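A common way to build up postings like these is a defaultdict(list): append a new [doc_id, pos] posting when a term first appears in a document, and extend that posting's position list on repeat occurrences. A sketch that reproduces the doctest (the function name matches it, but the details are my assumptions):

```python
from collections import defaultdict

def create_positional_index(docs):
    """Map each term to a list of postings [doc_id, pos1, pos2, ...]."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            # extend the last posting if it is for the same document...
            if index[term] and index[term][-1][0] == doc_id:
                index[term][-1].append(pos)
            else:  # ...otherwise start a new [doc_id, pos] posting
                index[term].append([doc_id, pos])
    return dict(index)

index = create_positional_index([['a', 'b', 'a'], ['a', 'c']])
print(index['a'])  # → [[0, 0, 2], [1, 0]]
print(index['b'])  # → [[0, 1]]
print(index['c'])  # → [[1, 1]]
```

`list.append` adds one element to the end of a list, which is all the nesting here requires; no dictionaries need to be merged.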
Are we allowed to use the tokenize source from the lecture, or are we supposed to write ours differently?
The syllabus mentions that quizzes / in-class assignments will count for 50/700 points. Have we had any of these yet? If so, what were they? I missed a couple classes, so I'm wondering if I missed some.
Professor,
When is the last day to submit assignment 2? Where can I find this information henceforth?
doctest for count_doc_frequencies:
res = Index().count_doc_frequencies([['a', 'b', 'a'], ['a', 'b', 'c'], ['a']])
res['a'] should have a value of 4 in this case, not 3.
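For what it's worth, document frequency is conventionally the number of *documents* containing a term, not the number of occurrences, which would make 3 the expected value here: 'a' occurs four times, but in only three documents. A quick check of that definition (a sketch, not necessarily the assignment's implementation):

```python
def count_doc_frequencies(docs):
    """Document frequency: the number of documents containing each term,
    counting a term at most once per document."""
    df = {}
    for tokens in docs:
        for term in set(tokens):  # set() so repeats within a doc count once
            df[term] = df.get(term, 0) + 1
    return df

df = count_doc_frequencies([['a', 'b', 'a'], ['a', 'b', 'c'], ['a']])
print(df['a'])  # → 3: four occurrences, but only three documents
```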
While I was trying to test things in my code, I received the following error trying to read the file:
ValueError: Invalid mode ('rtb')
Looking a little bit deeper, the issue stems from these lines in codecs.py:
876 # Force opening of the file in binary mode
877 mode = mode + 'b'
--> 878 file = __builtin__.open(filename, mode, buffering)
What exactly am I supposed to do in this case? I haven't modified any pre-given function.
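The 'rtb' error happens because codecs.open forces binary mode by appending 'b' to whatever mode you pass, so a text mode like 'rt' becomes the invalid 'rtb'. Passing plain 'r' (or using io.open) avoids it; a minimal sketch with a hypothetical file name:

```python
import codecs
import io

# create a small file to read back (hypothetical name and contents)
with io.open('example.txt', 'w', encoding='utf-8') as f:
    f.write('hello')

# codecs.open forces binary mode internally ('rt' -> invalid 'rtb'),
# so pass plain 'r' and let the codec handle decoding:
with codecs.open('example.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text)  # → hello
```

If the 'rt' mode comes from pre-given starter code, the caller higher up the stack is likely passing 'rt' into codecs.open; changing that one mode string to 'r' should clear the error.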
There is not a test case provided for this, but if a phrase occurs multiple times in the same document, should those results be returned in one sub-list? For example, if a phrase occurred twice in document 0 and once in document 5, should it look something like
[[0, 3, 9], [5, 4]]
or should it be
[[0, 3], [0, 9], [5, 4]]
or, does it not matter for the purposes of this assignment...?
lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
How can I access docid 0 for both 'a' and 'b' and compare whether they are equal? I have spent so much time trying to do it but with no result. How can I access the weight (which I think I can figure out once I can access the doc ID)? Please help!
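One way to get at the doc IDs and weights is to unpack each posting as you iterate, since each posting in this index is a [doc_id, weight] pair. A sketch of compute_doc_lengths under that assumption:

```python
import math
from collections import defaultdict

def compute_doc_lengths(index):
    """For each doc_id, the Euclidean length of its weight vector.
    Each posting is assumed to be a [doc_id, weight] pair, per the doctest."""
    sums = defaultdict(float)
    for term, postings in index.items():
        for doc_id, weight in postings:  # unpacking yields both fields at once
            sums[doc_id] += weight ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in sums.items()}

lengths = compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
print(lengths)  # → {0: 5.0}, since sqrt(3**2 + 4**2) = 5
```

Accumulating into a dict keyed by doc_id is what lets postings for the same document (here doc 0 under both 'a' and 'b') meet up.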
These seem to contain the exact same results. Is this correct?
Hi,
I'm not sure I understand this test:
index = Index().create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
sorted(index.keys())
['a', 'b']
index['a']
[[0, 0.0], [1, 0.0]]
index['b'] # doctest:+ELLIPSIS
[[0, 0.301...]]
Both idf values for 'a' are 0 because 'a' appears in all documents; hence log(N/df_t) = log(1) = 0.
This I understand.
However, I cannot work out the value for 'b'.
It seems to me that the expression would be tf_idf = log(1+1)*log(2/1), which is roughly 0.09. But the doctest result is 0.301, which would mean N had been given a value of 10; that would be valid if there were 10 documents total (but it would make the values for 'a' wrong), except we don't have that kind of information.
Thanks
Alexis
[EDIT] Just a clarification: I understand the number of documents is probably available via len(self.documents) or something similar; however, that is not available in the tests.
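For reference, 0.301 is log10(2), so the doctest is consistent with base-10 logarithms, raw term frequency (not log(1 + tf)), and N taken from the documents argument (here N = 2) with df from the doc_freqs dict. A sketch of one weighting that reproduces the doctest (an assumption; the assignment may specify a different tf component):

```python
import math

def create_tfidf_index(docs, doc_freqs):
    """Weight each term as w = tf * log10(N / df), with N = number of
    documents passed in and df taken from the doc_freqs dict."""
    n = len(docs)
    index = {}
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):
            tf = tokens.count(term)  # raw term frequency in this document
            w = tf * math.log10(n / doc_freqs[term])
            index.setdefault(term, []).append([doc_id, w])
    return index

index = create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
print(index['a'])  # → [[0, 0.0], [1, 0.0]]  (log10(2/2) = 0)
print(index['b'])  # → [[0, 0.301...]]       (log10(2/1) ≈ 0.30103)
```

Under this reading, N = 2 throughout, so the 'a' values stay 0 and 'b' comes out to exactly log10(2).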
For the inverted index, are we required to normalize the tokens before creating the inverted index, or do we just need to create the tokens and then index them?
No additional normalization is required for this assignment.
I go to https://github.com/iit-cs429/[my-github-id]-asg but get a 404 error.
Be sure you're logged into github first and confirm I have created a private repo for you.
Hi,
I want to confirm the submission date for Assignment 1.
In iit-cs429/main/assignments/assignment1 it is listed as 2/06 at 11:59pm, but the schedule says 2/05.
Could you please confirm the date?
Should we perform the word count on the tokenized document or the raw one? [for the spelling corrector]
Hi Professor,
I receive a divide-by-zero error in query_to_vector because a term in the query is not present in the documents, so its idf cannot be computed. For example, in the first query, "KENNEDY ADMINISTRATION PRESSURE ON NGO DINH DIEM TO STOP SUPPRESSING THE BUDDHISTS .", "suppressing" is not a word in TIME.ALL.
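One common workaround (an assumption about what the assignment expects): give query terms that never occur in the collection an idf of 0 so they contribute nothing to the query vector, rather than dividing by a zero document frequency. Sketch with hypothetical numbers:

```python
import math

def idf(term, doc_freqs, n_docs):
    """Return the idf for a query term; terms absent from the collection
    get weight 0 instead of raising a ZeroDivisionError.
    (Add-one smoothing of df is another common option.)"""
    df = doc_freqs.get(term, 0)
    if df == 0:
        return 0.0  # unseen term contributes nothing to the query vector
    return math.log10(n_docs / df)

print(idf('suppressing', {'kennedy': 10.}, 423))  # → 0.0
print(idf('kennedy', {'kennedy': 10.}, 100))      # → 1.0
```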
I am unable to find the doc frequency; instead I end up finding the total number of terms. Please help; any tips on how to proceed? And for the professor: I have my most recent code checked in.
Thank you
How do I copy those files from main to my repository?
I don't think my private repo was created, is there a way for me to double-check?
Regarding the question "Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity?": do we have to explain how to extend the algorithm, or do we just answer the rest of the question as if it were extended?
I am running the code in an IPython notebook and can't manage to save the documents.txt file. How can I save it there so the code can read documents.txt? Currently it shows this error, as expected:
IOError: [Errno 2] No such file or directory: 'documents.txt'
Please help.
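One way around this is to create documents.txt from inside the notebook itself, either with the %%writefile cell magic or with plain Python; a sketch with hypothetical contents:

```python
# Write documents.txt from inside the notebook so the next cell can read it.
# (In IPython you can also put  %%writefile documents.txt  at the top of a cell.)
docs = ["the cat sat", "the dog ran"]  # hypothetical contents
with open('documents.txt', 'w') as f:
    f.write('\n'.join(docs))

# now the code that expects documents.txt can find it
with open('documents.txt') as f:
    lines = f.read().split('\n')
print(lines)  # → ['the cat sat', 'the dog ran']
```

The file lands in the notebook server's working directory, which is where a relative path like 'documents.txt' is resolved.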
Hi Professor,
For query_to_vector, wouldn't we also need the list of documents, not just query_terms? Since we need the idf weight for each term in query_terms, we would need the total number of documents and the number of documents each term appears in. I'm not sure how to achieve this with just the list of query terms.