prima's Introduction

prima

Personal Research Management for the IA

Installation

Note: This project runs on Python 2.7 and C. Some of the tools also depend on gensim, the NLTK tokenizer, pandas, pdfminer, and the internetarchive Python library. Install each dependency by following the instructions on its website.

Clone this repository to a location on your file system.
To install the tool, first download SQLite and place the files sqlite3.c and sqlite3.h in the prima/src directory. Then run:
```
 $ sudo python setup.py install
```
To create the appropriate directories on your file system, run the following commands, where project_name is the name of the directory the collection will be saved in:
```
 $ init_workspace.sh
 $ cd workspace
 $ init_project.sh project_name
 $ cd project_name
 $ init_collection.sh
```
To download a collection into the auto-generated source/ directory, run the following, where collection_name is a valid id for a collection on archive.org (for example, the collection with id toronto would use collection_name=toronto):
```
 $ fetch_collection.sh collection_name 
```
After completing the steps above, run the following to get stats on your collection, where toolname is one of the tools listed below and params are its parameters:
```
 $ toolname params
```
For every new collection, make sure you are in the workspace directory, then repeat the init_project.sh, cd project_name, and init_collection.sh commands, followed by fetch_collection.sh with the new collection name, before using any tools.

Tools

The tools currently included in prima, with basic usage examples, are:

BM25 (default k=10)
```
 $ bm25.sh [k] "sample query here"
```
K-means clustering (default k=3)
```
 $ k_means_clusterer.sh [k]
```
Latent Dirichlet allocation (default k=100)
```
 $ lda.sh [k]
```
Latent semantic indexing (default k=100)
```
 $ lsi.sh [k]
```

MinHash (default k=10)
```
 $ min_hash.sh
 $ min_hash_sim.sh source/folder/document [k]
```

tf-idf
```
 $ tfidf.sh
```
Word count
```
 $ word_count.sh
```

More detail on how exactly to use these can be found in the wiki.
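
As a rough illustration of what the MinHash tools compute, the sketch below estimates the Jaccard similarity of two texts from MinHash signatures built over word shingles. It is not prima's implementation: the function names, the shingle size, the number of hash functions, and the use of seeded MD5 as a stand-in hash family are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(shingle, seed):
    """Hash a shingle under a given seed (seeded MD5 is an assumption)."""
    data = ("%d:%s" % (seed, shingle)).encode("utf-8")
    return int(hashlib.md5(data).hexdigest(), 16)


def signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded function."""
    if not shingle_set:
        raise ValueError("cannot build a signature from an empty shingle set")
    return [min(seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / float(len(sig_a))


def jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / float(len(set_a | set_b))


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox leaps over the lazy dog")
    print("exact:    ", jaccard(a, b))
    print("estimated:", estimated_similarity(signature(a), signature(b)))
```

More hash functions give a tighter estimate at the cost of more computation per document, which is the trade-off behind making that number configurable.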

prima's Issues

Hash function in minHash.c may not be the best choice

I implemented the first function from this page, which has historically worked well, caused no collisions on my small sample, and produced hashes that gave good results for min_hash_sim.py. However, reading this page makes me wonder whether we can find a better one in the future.
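
One way to compare candidates is to count collisions on the shingles that are actually hashed. The sketch below does that for two well-known string hashes, djb2 and 32-bit FNV-1a. They are used purely as examples; neither is claimed to be the function in minHash.c, and the helper name count_collisions is hypothetical.

```python
def djb2(s):
    """djb2 string hash, truncated to 32 bits."""
    h = 5381
    for byte in bytearray(s.encode("utf-8")):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def fnv1a_32(s):
    """32-bit FNV-1a string hash."""
    h = 2166136261  # FNV offset basis
    for byte in bytearray(s.encode("utf-8")):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime
    return h


def count_collisions(hash_fn, keys, num_buckets=None):
    """Count distinct keys whose hash was already produced by another key.

    If num_buckets is given, hashes are reduced modulo that value first,
    which mimics using the hash to index a fixed-size table.
    """
    seen = set()
    collisions = 0
    for key in set(keys):
        h = hash_fn(key)
        if num_buckets is not None:
            h %= num_buckets
        if h in seen:
            collisions += 1
        seen.add(h)
    return collisions


if __name__ == "__main__":
    sample = ["the quick brown", "quick brown fox", "brown fox jumps"]
    for fn in (djb2, fnv1a_32):
        print(fn.__name__, count_collisions(fn, sample, num_buckets=1024))
```

Running this over the shingles of a real collection, rather than a small sample, would give a more meaningful comparison.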

TODO: Keeping/creating databases

shingles.db is created the first time min_hash.py is run and is then reused on later calls to that function (it isn't recreated unless the user deletes it).
inverted_index.db used to be created by tfidf.py so that my function k_means_clusterer.py would work, but I changed that code so it no longer needs a database.
Right now I'm thinking of keeping shingles.db the way it is, to save computation time on repeated calls to min_hash.py, and getting rid of inverted_index.db altogether.
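
The create-once, reuse-afterwards behaviour described for shingles.db could be sketched with Python's standard sqlite3 module as follows. The table name, its columns, and the function names are hypothetical; the actual schema used by min_hash.py is not shown here.

```python
import os
import sqlite3


def open_shingle_db(path="shingles.db"):
    """Open the shingle database, creating the (hypothetical) table if needed.

    Returns (connection, created). created is True when the file did not
    exist before this call, meaning the shingles still have to be computed.
    """
    created = not os.path.exists(path)
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shingles ("
        "doc_id TEXT NOT NULL, shingle TEXT NOT NULL, "
        "PRIMARY KEY (doc_id, shingle))"
    )
    conn.commit()
    return conn, created


def store_shingles(conn, doc_id, shingle_set):
    """Save a document's shingles; rows already present are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO shingles (doc_id, shingle) VALUES (?, ?)",
        [(doc_id, s) for s in shingle_set],
    )
    conn.commit()


def load_shingles(conn, doc_id):
    """Return the stored shingle set for a document (empty if none stored)."""
    rows = conn.execute(
        "SELECT shingle FROM shingles WHERE doc_id = ?", (doc_id,)
    )
    return {row[0] for row in rows}
```

With this shape, a caller would compute and store shingles only when created is True and otherwise go straight to load_shingles, so deleting the file remains the way to force a rebuild.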

TODO: Command line arguments

Should the user be able to specify the file type they want saved in the command line arguments?
In min_hash, should they be able to specify the shingle size (k-shingles) and the number of hash functions?
In bm25, should they be able to specify the number of documents to return or a minimum score for returned documents?
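
If these options were exposed, the parsers might look something like the sketch below, written with Python's standard argparse module. Every flag name here is hypothetical and none of them exist in prima today. The default of 10 for bm25 matches the documented default k, while the min_hash defaults are placeholders.

```python
import argparse


def build_bm25_parser():
    """Hypothetical bm25 options: result count and/or a score threshold."""
    parser = argparse.ArgumentParser(prog="bm25")
    parser.add_argument("query", help="query string to rank documents against")
    parser.add_argument("-k", "--top-k", type=int, default=10,
                        help="number of documents to return")
    parser.add_argument("--min-score", type=float, default=None,
                        help="drop documents scoring below this value")
    return parser


def build_min_hash_parser():
    """Hypothetical min_hash options: shingle size and hash function count."""
    parser = argparse.ArgumentParser(prog="min_hash")
    parser.add_argument("--shingle-size", type=int, default=3,
                        help="number of tokens per shingle (placeholder default)")
    parser.add_argument("--num-hashes", type=int, default=100,
                        help="number of hash functions (placeholder default)")
    return parser


if __name__ == "__main__":
    args = build_bm25_parser().parse_args()
    print(args.query, args.top_k, args.min_score)
```

Allowing both --top-k and --min-score together, rather than making them mutually exclusive, lets a user ask for "at most k documents, all above a threshold"; whether that is desirable is part of the open question.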
