anlessini

Requirements

  • Anserini: an open-source information retrieval toolkit built on Lucene.
  • Java 11+
  • Python 3.7+
  • AWS CLI
  • AWS SAM CLI

Get Started

First, let's build the project.

$ mvn clean install

Anlessini uses AWS SAM/CloudFormation to describe the infrastructure, so let's create an S3 bucket for storing the artifacts.

$ ./bin/create-artifact-bucket.sh

Now let's provision the AWS infrastructure for Anlessini. We recommend that you spin up a separate CloudFormation stack for each collection, as they are logically isolated. The following is an example of Anlessini serving the COVID-19 Open Research Dataset (CORD-19).

# package the artifact and upload to S3
$ sam package --template-file template.yaml --s3-bucket $(cat artifact-bucket.txt) --output-template-file cloudformation/cord19.yaml --s3-prefix cord19
# create cloudformation stack
$ sam deploy --template-file cloudformation/cord19.yaml --s3-bucket $(cat artifact-bucket.txt) --s3-prefix cord19 --stack-name cord19 --capabilities CAPABILITY_NAMED_IAM
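
The steps that follow read the index bucket name, the DynamoDB table name, and the API URL from the stack's outputs, each time with the same describe-stacks query. As a convenience, that query could be wrapped in a small shell function. This is only a sketch: stack_output is a hypothetical name used for illustration and is not part of Anlessini.

```shell
# Hypothetical helper (not part of Anlessini): print one output value of a
# CloudFormation stack.
# Usage: stack_output <stack-name> <output-key>
stack_output() {
    aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
        --output text
}

# Example, assuming the cord19 stack created above:
#   export INDEX_BUCKET=$(stack_output cord19 IndexBucketName)
```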

Now that our infrastructure is up, we can populate S3 with our index files and import the corpus into DynamoDB.

We will be using Anserini to index our corpus, so please refer to the documentation for your specific corpus.

First, download and extract the corpus.

$ cd /path/to/anserini
$ curl https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-10-09.tar.gz -o collections/cord19-2020-10-09.tar.gz
$ pushd collections && tar -zxvf cord19-2020-10-09.tar.gz && rm cord19-2020-10-09.tar.gz && popd

Now we will build the Lucene index. To keep the index minimal, we do not enable -storeContents, -storeRaw, or -storePositions. A small index helps speed up search queries.

$ cd /path/to/anserini
$ mvn clean package appassembler:assemble
$ target/appassembler/bin/IndexCollection \
    -collection Cord19AbstractCollection -generator Cord19Generator \
    -threads 8 -input collections/cord19-2020-10-09 \
    -index indexes/lucene-index-cord19-abstract-2020-10-09 \
    -storeDocvectors

Now let's upload the index files to S3.

$ cd /path/to/anserini
$ export INDEX_BUCKET=$(aws cloudformation describe-stacks --stack-name cord19 --query "Stacks[0].Outputs[?OutputKey=='IndexBucketName'].OutputValue" --output text)
$ aws s3 cp indexes/lucene-index-cord19-abstract-2020-10-09/ s3://$INDEX_BUCKET/cord19/ --recursive
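
Before moving on, it can be worth confirming that the index files actually reached the bucket. A minimal sketch, assuming INDEX_BUCKET is still set from the command above; list_index_files is a hypothetical helper name, not something Anlessini provides.

```shell
# Hypothetical helper (not part of Anlessini): list what was uploaded under
# the cord19/ prefix, with per-file sizes and a summary at the end.
list_index_files() {
    aws s3 ls "s3://$INDEX_BUCKET/cord19/" \
        --recursive --human-readable --summarize
}

# Compare the listing against the local index directory:
#   ls -lh indexes/lucene-index-cord19-abstract-2020-10-09/
#   list_index_files
```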

To import the corpus into DynamoDB, use the ImportCollection utility. You can first run the command with the -dryrun option to validate the input without writing to DynamoDB. If the dry run succeeds, run the command again without -dryrun to write the document contents to DynamoDB.

$ cd /path/to/anlessini
$ export DYNAMO_TABLE=$(aws cloudformation describe-stacks --stack-name cord19 --query "Stacks[0].Outputs[?OutputKey=='DynamoTableName'].OutputValue" --output text)
$ utils/target/appassembler/bin/ImportCollection \
    -collection Cord19AbstractCollection -generator Cord19Generator \
    -dynamo.table $DYNAMO_TABLE \
    -threads 8 -input /path/to/anserini/collections/cord19-2020-10-09
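
The dry-run-then-write sequence described above could be scripted as follows. This is a sketch under assumptions: import_collection, IMPORT_BIN, and CORPUS_DIR are hypothetical names introduced for illustration, and it assumes that ImportCollection accepts -dryrun alongside the other options and exits with a non-zero status when validation fails.

```shell
# Hypothetical wrapper around the ImportCollection command shown above.
# IMPORT_BIN and CORPUS_DIR are illustrative variables, not part of Anlessini.
IMPORT_BIN="${IMPORT_BIN:-utils/target/appassembler/bin/ImportCollection}"
CORPUS_DIR="${CORPUS_DIR:-/path/to/anserini/collections/cord19-2020-10-09}"

import_collection() {
    # Any extra arguments (for example -dryrun) are appended to the command.
    "$IMPORT_BIN" \
        -collection Cord19AbstractCollection -generator Cord19Generator \
        -dynamo.table "$DYNAMO_TABLE" \
        -threads 8 -input "$CORPUS_DIR" \
        "$@"
}

# Validate first; write to DynamoDB only if the dry run exits successfully:
#   import_collection -dryrun && import_collection
```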

Now we can try invoking our function:

$ export API_URL=$(aws cloudformation describe-stacks --stack-name cord19 --query "Stacks[0].Outputs[?OutputKey=='SearchApiUrl'].OutputValue" --output text)
$ curl $API_URL\?query\=incubation\&max_docs\=3
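
The backslashes in the command above stop the shell from interpreting ?, = and &. For queries that contain spaces or other special characters, one alternative is to let curl build and encode the query string. A minimal sketch, assuming API_URL is set as above and that the endpoint takes the query and max_docs parameters shown; search is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Anlessini): query the search API.
# Usage: search <query> [max_docs]
search() {
    # -G sends the --data-urlencode pairs as a URL query string on a GET
    # request; -s hides the progress meter.
    curl -sG "$API_URL" \
        --data-urlencode "query=$1" \
        --data-urlencode "max_docs=${2:-3}"
}

# Example:
#   search "incubation period" 5
```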
