The ElasticSearch docker-compose config was adapted from Elastic's Docker documentation. A custom Dockerfile bootstraps the installation with the ElasticSearch Learning to Rank Plugin. For local development we run 2 containers, the default number recommended by ElasticSearch; in Invitae's production infrastructure this can be scaled to more containers.
To start the ElasticSearch cluster locally, navigate to ./elasticsearch and run:
docker-compose up
To load test documents into the index, start up the cluster and run:
$ test_index.sh
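As a rough illustration of what loading documents into an index can involve, the sketch below builds a request body in the newline-delimited JSON format that the ElasticSearch bulk API expects. It does not reproduce what test_index.sh does, which is not shown here. The index name `test_index`, the helper `build_bulk_payload`, and the sample documents are all hypothetical.

```python
import json


def build_bulk_payload(index_name, docs):
    """Build an NDJSON body for the ElasticSearch _bulk endpoint.

    `docs` maps a document id to its source dict. Each document becomes
    two lines: an action line and the document source itself.
    """
    lines = []
    for doc_id, source in docs.items():
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document body.
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, for illustration only.
sample_docs = {
    "1": {"title": "first test document"},
    "2": {"title": "second test document"},
}

payload = build_bulk_payload("test_index", sample_docs)
print(payload)
```

A body like this could then be sent to the cluster's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header, assuming the cluster is reachable from the host.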
Note: Code and a sample dataset are included. The original data files are not. The sample dataset was generated from the original dataset with the script make_data_sample.py.
Steps:
- Run bag_of_words.py to convert the corpus to a doc-by-word BoW matrix.
- Run doc_labels.py to build a doc-by-label matrix.
- Run classifiers.py to train and test various classifiers.
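To make the two matrix shapes concrete, here is a minimal sketch of a doc-by-word BoW matrix and a doc-by-label matrix, using only the standard library. It is not the code in bag_of_words.py or doc_labels.py. The function names, the whitespace tokenization, and the toy corpus are assumptions made for illustration. The classifier step is left out because the classifiers used are not named here.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    # Assumption: lowercase the text and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[column[word]] = count
        matrix.append(row)
    return vocab, matrix


def label_matrix(doc_labels):
    """Return (labels, matrix) where matrix[i][k] is 1 if doc i has label k."""
    labels = sorted({label for labs in doc_labels for label in labs})
    matrix = [[1 if label in labs else 0 for label in labels] for labs in doc_labels]
    return labels, matrix


# Toy corpus and labels, for illustration only.
corpus = ["the cat sat", "the dog sat down", "cat and dog"]
tags = [["animal"], ["animal", "action"], ["animal"]]

vocab, bow = bag_of_words(corpus)
labels, lab = label_matrix(tags)

print(vocab)   # ['and', 'cat', 'dog', 'down', 'sat', 'the']
print(bow)     # [[0, 1, 0, 0, 1, 1], [0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
print(labels)  # ['action', 'animal']
print(lab)     # [[0, 1], [1, 1], [0, 1]]
```

Each row of `bow` and `lab` refers to the same document, so the two matrices can be passed together as features and targets to whichever classifier is being trained.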