
bm25's Introduction

Features

A simple information retrieval (IR) library that indexes files and retrieves keywords using scoring algorithms.
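The project name suggests that Okapi BM25 is one of the scoring algorithms in question. As a rough illustration only, the following Python sketch scores tokenized documents against a query with a common BM25 formulation. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions made for this example and are not taken from the library's code.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Hypothetical helper: name, defaults and IDF variant are assumptions.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "berkshire hathaway inc annual report".split(),
    "to the shareholders of berkshire hathaway inc".split(),
    "our experience has differed".split(),
]
scores = bm25_scores(["annual", "report"], docs)
```

Only the first document contains the query terms, so it is the only one that receives a non-zero score.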

Example

Upload a PDF, and the system indexes its keywords and sentences for retrieval.

Example output:

1 || 1 || berkshire hathaway inc
1 || 2 ||  annual report
3 || 1 || berkshire hathaway inc
3 || 2 ||  annual report table of contents berkshires performance vs
3 || 3 || the sp
3 || 63 ||  chairmans letter
3 || 141 || - form-k businessdescription
3 || 212 || k- riskfactors
3 || 290 || k- description ofproperties
3 || 358 || k- managementsdiscussion
3 || 425 || k- managementsreport oninternalcontrols
3 || 479 || k- independentauditorsreport
3 || 543 || k- consolidatedfinancialstatements
3 || 603 || k- notes toconsolidatedfinancialstatements
3 || 656 || k- appendices  shareholderevent andmeetinginformation
3 || 709 || a- propertycasualtyinsurance
3 || 774 || a- operatingcompanies
3 || 845 || a- stocktransferagent
3 || 916 || a- directors andofficers ofthecompany
3 || 967 || insidebackcover bywarrenebuffett copyright allrightsreserved
4 || 1 || berkshires performance vs
4 || 2 || the sp  annual percentage change in per-share in sp  market value of with dividends berkshire included year
4 || 4236 ||   compoundedannualgain -
4 || 4281 ||   overallgain -
4 || 4337 ||   note data arefor calendaryears withthese exceptionsandyear endedmonths ended
5 || 1 || berkshire hathaway inc
5 || 2 || to the shareholders of berkshire hathaway inc charlie munger my long-time partner and i have the job of managing the savings of a great number of individuals
5 || 3 || we are grateful for their enduring trust a relationship that often spans much of their adult lifetime
5 || 4 || it is those dedicated savers that are forefront in my mind as i write this letter
5 || 5 || a common belief is that people choose to save when young expecting thereby to maintain their living standards after retirement
5 || 6 || any assets that remain at death this theory says will usually be left to their families or possibly to friends and philanthropy
5 || 7 || our experience has differed

Classes

[x] Corpus
[x] Indexer
[x] Retriever
[x] Scorer
[x] Tokenizer

First priority

[x] Translate Corpus into Inverted Index.
[x] Cleanup Existing Code Base
  - [x] Ensure class files and folder names are updated accordingly.
  - [x] Ensure class files are not the same as namespaces for sake of using conventions.
[x] Update the Indexer class. <--
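For readers unfamiliar with the first item, an inverted index maps each term to the documents (and, optionally, positions) where it occurs. The sketch below is a minimal Python illustration of that idea; the function `build_inverted_index` and the dictionary layout are hypothetical and do not reflect the project's actual Corpus or Indexer classes.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each term to {doc_id: [positions]}.

    `corpus` is assumed to be a dict of doc_id -> list of tokens
    (a hypothetical layout chosen for this example).
    """
    index = defaultdict(dict)
    for doc_id, tokens in corpus.items():
        for position, term in enumerate(tokens):
            # Record every position at which the term occurs in this document.
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


corpus = {
    1: ["berkshire", "hathaway", "inc"],
    2: ["annual", "report"],
    3: ["berkshire", "hathaway", "inc", "annual", "report"],
}
index = build_inverted_index(corpus)
```

Looking up `index["report"]` then returns the documents containing the term without scanning the whole corpus.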

Second Priority

[ ] Add S3 Capabilities.
[ ] Update Retriever class.
[ ] Build a Tokenizer class (to tokenize, lemmatize, and normalize each document).
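As a rough idea of what the tokenize and normalize steps might involve, here is a small Python sketch. The function `tokenize` and its rules (lowercasing, dropping apostrophes, replacing other punctuation with spaces) are assumptions for illustration only. Lemmatization is deliberately left out because it typically needs an external language resource.

```python
import re


def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace.

    Hypothetical helper; lemmatization is not attempted here.
    """
    # Drop straight and curly apostrophes so "chairman's" becomes "chairmans".
    normalized = text.lower().replace("'", "").replace("\u2019", "")
    # Replace anything that is not a letter, digit, hyphen or space with a space.
    normalized = re.sub(r"[^a-z0-9\- ]+", " ", normalized)
    return normalized.split()


tokens = tokenize("Berkshire Hathaway Inc. Annual Report!")
```

The resulting token list can be fed directly into an indexing or scoring step.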
