

prima

Personal Research Management for the Internet Archive (IA)

Installation

Note: Some of the tools in this project use gensim, the NLTK tokenizer, pandas, and pdfminer, in addition to the internetarchive Python library. The project runs with Python 2.7 and C. Install each dependency by following the instructions on its website.

  1. Clone this repository to somewhere in your file system.

  2. To install the tool, first download SQLite and place the files titled sqlite3.c and sqlite3.h in the prima/src directory. Then run

     $ sudo python setup.py install
    
  3. To create the directories the tools expect, run the following commands, where project_name is the name of the directory in which a collection will be saved:

     $ init_workspace.sh
     $ cd workspace
     $ init_project.sh project_name
     $ cd project_name
     $ init_collection.sh
    
  4. To download a collection into the auto-generated source/ directory, run the following, where collection_name is a valid ID for a collection on archive.org (for example, the collection with the ID toronto would use collection_name=toronto):

     $ fetch_collection.sh collection_name 
    
  5. After completing steps 1-4, run the following to get stats on your collection, where toolname is one of the tools listed below and params are its parameters:

     $ toolname params
    
  6. For every new collection, make sure you're in the workspace directory, then repeat the last three commands of step 3 and all of step 4 with the new names before using any tools.

Tools

The tools currently included in prima, with basic usage examples, are:

  1. BM25 (default k=10)

     $ bm25.sh [k] "sample query here"
    
  2. K-means clustering (default k=3)

     $ k_means_clusterer.sh [k]
    
  3. Latent Dirichlet allocation (default k=100)

     $ lda.sh [k]
    
  4. Latent semantic indexing (default k=100)

     $ lsi.sh [k]
    
  5. MinHash (default k=10)

     $ min_hash.sh
     $ min_hash_sim.sh source/folder/document [k]
    
  6. tf-idf

     $ tfidf.sh
    
  7. Word count

     $ word_count.sh
    

More detail on how exactly to use these can be found in the wiki.
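
As a rough illustration of what a BM25 ranking with a top-k cutoff computes, here is a minimal sketch. It is not prima's implementation: the function name bm25_rank, the whitespace tokenizer, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration, and treating k as the number of results returned is an assumption about bm25.sh. The document list is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k=10, k1=1.5, b=0.75):
    """Return the top-k (doc_index, score) pairs for query, best first.

    Hypothetical helper, not part of prima. k1 and b are common
    textbook values, assumed here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / float(n_docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF; the 1.0 inside the log keeps it positive.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append((idx, score))

    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


docs = [
    "the history of toronto",
    "a cookbook of soups",
    "toronto toronto city records",
]
top = bm25_rank("toronto history", docs, k=2)
```

Here top holds the two best-scoring documents as (index, score) pairs. The document that matches both query terms ranks ahead of the one that only repeats a single term.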

prima's People

Contributors

dmbarbosa, elmtree8


prima's Issues

TODO: Keeping/creating databases

shingles.db is created the first time min_hash.py is run and is then reused on repeated calls (it isn't recreated unless the user deletes it).
inverted_index.db used to be created by tfidf.py so that my function k_means_clusterer.py would work, but I changed that code so it no longer needs a database.
Right now I'm thinking of keeping shingles.db the way it is, to save computation time on repeated calls to min_hash.py, and getting rid of inverted_index.db altogether.
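
The create-once, reuse-afterwards behaviour described for shingles.db could look roughly like the sketch below. The table layout, the function names shingle and load_or_build_shingles, and the character-level shingling are assumptions for illustration, not the schema or logic min_hash.py actually uses.

```python
import os
import sqlite3


def shingle(text, k=5):
    """Return the set of k-character shingles of text (illustrative choice)."""
    return set(text[i:i + k] for i in range(len(text) - k + 1))


def load_or_build_shingles(db_path, docs, k=5):
    """Return {doc_id: set of shingles}, building db_path only if missing.

    Hypothetical helper: the schema below is assumed, not prima's.
    docs maps a document id to its text.
    """
    # Check before connecting, because sqlite3.connect creates the file.
    is_new = not os.path.exists(db_path)
    conn = sqlite3.connect(db_path)
    try:
        if is_new:
            # First run: compute the shingles and store them.
            conn.execute("CREATE TABLE shingles (doc_id TEXT, shingle TEXT)")
            rows = [(doc_id, s) for doc_id, text in docs.items()
                    for s in shingle(text, k)]
            conn.executemany("INSERT INTO shingles VALUES (?, ?)", rows)
            conn.commit()
        # Every run, including the first, reads back from the database.
        result = {}
        for doc_id, s in conn.execute("SELECT doc_id, shingle FROM shingles"):
            result.setdefault(doc_id, set()).add(s)
        return result
    finally:
        conn.close()
```

With this layout, deleting the database file is the only way to force a rebuild. The trade-off is that the cache does not notice when the source documents change.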

TODO: Command line arguments

Should the user be able to specify, through the command line arguments, the file type they want saved?
In min_hash, should they be able to specify k for the k-shingles and the number of hash functions?
In bm25, should they be able to specify the number of documents to be returned or a minimum score for the documents returned?
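
One way to expose the min_hash and bm25 options raised here is with optional flags that fall back to defaults. The sketch below uses Python's standard argparse module. Every flag name is hypothetical, and the default shingle size and hash count are assumed values, since these questions are still open.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical flags for the open questions."""
    parser = argparse.ArgumentParser(description="prima tool options (sketch)")
    sub = parser.add_subparsers(dest="tool")

    min_hash = sub.add_parser("min_hash")
    min_hash.add_argument("--shingle-size", type=int, default=5,
                          help="k for k-shingles (default value assumed)")
    min_hash.add_argument("--num-hashes", type=int, default=100,
                          help="number of hash functions (default value assumed)")

    bm25 = sub.add_parser("bm25")
    bm25.add_argument("query")
    # Either cap the number of results or filter by score, not both.
    limit = bm25.add_mutually_exclusive_group()
    limit.add_argument("-k", "--top-k", type=int, default=10,
                       help="number of documents to return")
    limit.add_argument("--min-score", type=float, default=None,
                       help="only return documents scoring at least this")
    return parser


args = build_parser().parse_args(["bm25", "-k", "5", "sample query here"])
```

Putting the result limit and the score threshold in a mutually exclusive group is one possible answer to the bm25 question: the user picks one or the other, and the parser rejects a call that passes both.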
