Build a search engine for the Environmental News NLP archive.

Built a corpus from the archive with at least 418 documents.
- Clone the git repo and run the setup script:
git clone https://github.com/ShashankMG123/InformationRetrieval.git
cd InformationRetrieval
python3 setup.py
Set up these libraries on your system:
- nltk
- BTrees
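
Before running anything, it can help to confirm that both libraries are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_libraries` is hypothetical and not part of this repo.

```python
import importlib.util

# Import names of the libraries listed above.
REQUIRED = ["nltk", "BTrees"]


def missing_libraries(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_libraries(REQUIRED)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

If anything is reported missing, install it with your usual package manager before running the query driver.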
├── 0226_0286_0298_1557_AIR_Report.pdf
├── bigramIndex
├── documentInfo
├── ElasticSearchUtil
│   ├── createIndex.py
│   └── jsonInputs
├── indexes
├── README.md
├── setup.py
└── src
    ├── input
    │   ├── SimpleSearch.json
    │   ├── PhraseSearch.json
    │   └── WildCard.json
    ├── compareES.py
    ├── genetateBigramIndex.py
    ├── indexConstruction.py
    ├── phraseQuery.py
    ├── queryDriver.py
    ├── simpleSearch.py
    ├── utils.py
    └── wildcardQuery.py
mode :
- 0 (Single File search)
- 1 (All File search)

fileName : name of the file to search, used when mode is 0
search : list of terms for simple search
must : phrase for positional search
wildcard : pattern for wildcard query
top : number of docs required
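
The parameters above can be assembled into a query file programmatically. The sketch below is only an illustration: the key names come from the list above, but the value types (integer `mode` and `top`, list of strings for `search`) and the helper name `build_query` are assumptions, so compare the result against the sample files in the input directory before relying on it.

```python
import json


def build_query(mode, top, file_name="", search=None, must="", wildcard=""):
    """Build a query dict using the parameter names described above.

    Value types are assumed, not taken from the repo.
    """
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (single file) or 1 (all files)")
    if mode == 0 and not file_name:
        raise ValueError("fileName is required when mode is 0")
    return {
        "mode": mode,
        "fileName": file_name,
        "search": list(search or []),
        "must": must,
        "wildcard": wildcard,
        "top": top,
    }


if __name__ == "__main__":
    # Hypothetical simple search over all files, returning the top 10 docs.
    query = build_query(mode=1, top=10, search=["climate", "change"])
    print(json.dumps(query, indent=2))
```

Printing the result rather than writing it to disk keeps the sketch from overwriting an existing input file.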
- Edit the json file for your query type in the input directory, then run:
python3 queryDriver.py <type of query> <compare with ES flag>
If the compare flag is on, make sure ES has started with all indexes running on port 9200.
Sample command:
python3 queryDriver.py SimpleSearch 0
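
When the compare flag is on, a quick check that something is listening on port 9200 can save a failed run. This is a minimal sketch using only the standard library; the helper name `is_port_open` and the `localhost` default are assumptions. It confirms only that the port accepts connections, not that the indexes have been created.

```python
import socket


def is_port_open(host="localhost", port=9200, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Port 9200 is reachable; ES appears to be running.")
    else:
        print("Nothing is listening on port 9200; start ES or run with the compare flag off.")
```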