Stratio’s Cassandra Lucene Index

Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide near real-time search similar to that of ElasticSearch or Solr, including full-text search capabilities and free multivariable, geospatial and bitemporal search. This is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s BigData platform is based.

Index relevance searches allow you to retrieve the n most relevant results satisfying a search. The coordinator node sends the search to each node in the cluster, each node returns its n best results, and the coordinator then combines these partial results and returns the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.
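The scatter-gather step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's implementation: the `Hit` tuples, the scores and the row keys are made-up stand-ins, and the function names are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# A scored hit: (relevance score, row key). Both are illustrative stand-ins.
Hit = Tuple[float, str]

def node_top_n(hits: Iterable[Hit], n: int) -> List[Hit]:
    """Return the n best-scoring hits found on a single node."""
    return heapq.nlargest(n, hits)

def coordinator_top_n(partials: Iterable[List[Hit]], n: int) -> List[Hit]:
    """Combine the per-node partial results and keep the n best overall."""
    merged = (hit for partial in partials for hit in partial)
    return heapq.nlargest(n, merged)

# Three hypothetical nodes, each holding its own locally indexed rows.
node_hits = [
    [(0.9, "row-a"), (0.2, "row-b"), (0.5, "row-c")],
    [(0.8, "row-d"), (0.7, "row-e")],
    [(0.1, "row-f"), (0.95, "row-g"), (0.3, "row-h")],
]

partials = [node_top_n(hits, 2) for hits in node_hits]
top = coordinator_top_n(partials, 2)
print(top)  # [(0.95, 'row-g'), (0.9, 'row-a')]
```

Because every node already returns its own n best hits, the coordinator only has to look at n results per node, however many rows each node stores.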

Index filtered searches are a powerful aid when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark. Adding Lucene filters to the job input can dramatically reduce the amount of data to be processed, avoiding a full scan.

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses and page on the filtered results.

This project is not intended to replace Apache Cassandra denormalized tables, inverted indexes, and/or secondary indexes. It is just a tool for performing certain kinds of queries that are very hard to address with Apache Cassandra’s out-of-the-box features.

More detailed information is available at Stratio’s Cassandra Lucene Index documentation.

Features

Stratio’s Cassandra Lucene Index and its integration with Lucene search technology provide:

  • Full text search
  • Geospatial search
  • Bitemporal search
  • Boolean (and, or, not) search
  • Near real-time search
  • Relevance scoring and sorting
  • General top-k queries
  • Custom analyzers
  • CQL complex types (list, set, map, tuple and UDT)
  • CQL user defined functions (UDF)
  • Third-party CQL-based drivers compatibility
  • Spark compatibility
  • Hadoop compatibility
  • Paging over filtering searches

Not yet supported:

  • Thrift API
  • Legacy compact storage option
  • Indexing counter columns
  • Columns with TTL
  • Static columns
  • Partitioners other than Murmur3
  • Paging over top-k searches

Requirements

  • Cassandra (identified by the first three numbers of the plugin version)
  • Java >= 1.8 (OpenJDK and Sun have been tested)
  • Maven >= 3.0

Build and install

Stratio’s Cassandra Lucene Index is distributed as a plugin for Apache Cassandra. Thus, you just need to build a JAR containing the plugin and add it to Cassandra’s classpath:

  • Build the plugin with Maven: mvn clean package
  • Copy the generated JAR to the lib folder of your compatible Cassandra installation:

    cp plugin/target/cassandra-lucene-index-plugin-*.jar <CASSANDRA_HOME>/lib/

  • Start/restart Cassandra as usual

Alternatively, you can patch an existing installation with the following Maven profile, specifying the path of your Cassandra installation. This task also deletes any previous versions of the plugin’s JAR from the CASSANDRA_HOME/lib/ directory:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_HOME>

If you don’t have an installed version of Cassandra, there is also an alternative profile to let Maven download and patch the proper version of Apache Cassandra:

mvn clean package -Pdownload_and_patch -Dcassandra_home=<CASSANDRA_HOME>

Now you can run Cassandra and do some tests using the Cassandra Query Language:

<CASSANDRA_HOME>/bin/cassandra -f
<CASSANDRA_HOME>/bin/cqlsh

Lucene’s index files will be stored in the same directories as Cassandra’s. The default data directory is /var/lib/cassandra/data, and each index is placed next to the SSTables of its indexed column family.

For more details about Apache Cassandra please see its documentation.

Example

We will create the following table to store tweets:

CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
    id INT PRIMARY KEY,
    user TEXT,
    body TEXT,
    time TIMESTAMP,
    latitude FLOAT,
    longitude FLOAT
);

Now you can create a custom Lucene index on it with the following statement:

CREATE CUSTOM INDEX tweets_index ON tweets ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id    : {type : "integer"},
            user  : {type : "string"},
            body  : {type : "text", analyzer : "english"},
            time  : {type : "date", pattern : "yyyy/MM/dd", sorted : true},
            place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
        }
    }'
};

This will index all the columns in the table with the specified types, and the index will be refreshed once per second. Alternatively, you can explicitly refresh all the index shards with an empty search with consistency ALL:

CONSISTENCY ALL
SELECT * FROM tweets WHERE expr(tweets_index,'{refresh:true}');
CONSISTENCY QUORUM

Now, to search for tweets within a certain date range:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"}
}') limit 100;

The same search can be performed forcing an explicit refresh of the involved index shards:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
    refresh : true
}') limit 100;

Now, to search for the 100 most relevant tweets whose body field contains the phrase “big data gives organizations” within the aforementioned date range:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1}
}') limit 100;

To refine the search to get only the tweets written by users whose name starts with “a”:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1}
}') limit 100;
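When searches are built from application code, it can be easier to assemble the expression as a data structure and serialize it than to concatenate strings by hand. The following Python sketch does that for the search above. It assumes the plugin also accepts strictly quoted JSON (the examples in this document use unquoted keys), and `lucene_search` is a hypothetical helper, not part of any driver or of the plugin.

```python
import json

def lucene_search(table: str, index: str, search: dict, limit: int) -> str:
    """Render a CQL SELECT whose expr() argument is the JSON-encoded search."""
    expression = json.dumps(search)
    # Single quotes delimit CQL string literals, so double any that appear in the JSON.
    expression = expression.replace("'", "''")
    return f"SELECT * FROM {table} WHERE expr({index}, '{expression}') LIMIT {limit};"

search = {
    "filter": {"type": "boolean", "must": [
        {"type": "range", "field": "time", "lower": "2014/04/25", "upper": "2014/05/01"},
        {"type": "prefix", "field": "user", "value": "a"},
    ]},
    "query": {"type": "phrase", "field": "body",
              "value": "big data gives organizations", "slop": 1},
}

statement = lucene_search("tweets", "tweets_index", search, 100)
print(statement)
```

The table and index names are interpolated as-is, so in this sketch they must come from trusted code rather than from user input.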

To get the 100 most recent filtered results, you can use the sort option:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}') limit 100;

The previous search can be restricted to a geographical bounding box:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"},
                   {type:"geo_bbox",
                    field:"place",
                    min_latitude:40.225479,
                    max_latitude:40.560174,
                    min_longitude:-3.999278,
                    max_longitude:-3.378550} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}') limit 100;

Alternatively, you can restrict the search to retrieve tweets that are within a specific distance from a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"},
                   {type:"geo_distance",
                    field:"place",
                    latitude:40.393035,
                    longitude:-3.732859,
                    max_distance:"10km",
                    min_distance:"100m"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1},
    sort   : {fields: [ {field:"time", reverse:true} ] }
}') limit 100;
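To make the meaning of the distance bounds concrete, the Python sketch below checks whether a point lies between a minimum and a maximum distance from a centre, using the haversine formula on a spherical Earth. It illustrates the geometry only: it is not how the plugin evaluates geo_distance internally, the function names are hypothetical, the inclusive bounds are an assumption, and the test coordinates other than the centre are examples that do not come from this document.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_distance(lat: float, lon: float,
                    center_lat: float, center_lon: float,
                    min_m: float, max_m: float) -> bool:
    """True if the point lies between min_m and max_m metres from the centre (inclusive)."""
    return min_m <= haversine_m(lat, lon, center_lat, center_lon) <= max_m

# Centre and bounds taken from the search above: 100 m to 10 km.
CENTER = (40.393035, -3.732859)
print(within_distance(40.4168, -3.7038, *CENTER, 100, 10_000))   # a few km away
print(within_distance(41.3874, 2.1686, *CENTER, 100, 10_000))    # hundreds of km away
print(within_distance(*CENTER, *CENTER, 100, 10_000))            # the centre itself
```

The first point falls inside the ring, the second is too far away, and the centre itself is excluded because it is closer than the minimum distance.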

Finally, you can restrict the search to a certain token range:

SELECT * FROM tweets WHERE expr(tweets_index,'{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"} ,
                   {type:"geo_distance",
                    field:"place",
                    latitude:40.393035,
                    longitude:-3.732859,
                    max_distance:"10km",
                    min_distance:"100m"} ] },
    query  : {type:"phrase", field:"body", value:"big data gives organizations", slop:1}
}') AND token(id) >= token(0) AND token(id) < token(10000000) limit 100;

This last kind of search is the basis for the support of Hadoop, Spark and other MapReduce frameworks.
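As a rough illustration of how a parallel job could use token restrictions, the Python sketch below divides the full Murmur3 token range (-2^63 to 2^63 - 1) into contiguous, non-overlapping slices and renders one statement per slice, so that each worker reads a disjoint part of the data. The helper names are hypothetical, the bounds are written as literal token values, and real connectors normally align their splits with the cluster's actual token ownership, which this sketch does not attempt.

```python
from typing import List, Tuple

MIN_TOKEN = -2**63      # smallest token in the Murmur3 partitioner range
MAX_TOKEN = 2**63 - 1   # largest token in the Murmur3 partitioner range

def token_ranges(splits: int) -> List[Tuple[int, int]]:
    """Divide the token range into `splits` contiguous slices with inclusive bounds."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    step, extra = divmod(MAX_TOKEN - MIN_TOKEN + 1, splits)
    ranges, start = [], MIN_TOKEN
    for i in range(splits):
        size = step + (1 if i < extra else 0)  # spread any remainder over the first slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

def range_statement(search: str, start: int, end: int, limit: int) -> str:
    """Render the search restricted to one token slice of the tweets table."""
    return (f"SELECT * FROM tweets WHERE expr(tweets_index, '{search}') "
            f"AND token(id) >= {start} AND token(id) <= {end} LIMIT {limit};")

search = '{filter : {type:"prefix", field:"user", value:"a"}}'
for start, end in token_ranges(4):
    print(range_statement(search, start, end, 100))
```

With four splits the slices begin at -2^63, -2^62, 0 and 2^62, and together they cover every possible token exactly once.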

For further details, please refer to Stratio’s comprehensive Cassandra Lucene Index documentation.
