Based on and motivated by the following resources:
- Apache project list
- Edd Dumbill's "What is Apache Hadoop?"
- Edd Dumbill's "The SMAQ stack for big data"
- My "Interactive analysis of large-scale datasets" post
- Accumulo, http://accumulo.apache.org/ - a sorted, distributed key/value store
- Cassandra, http://cassandra.apache.org/ - distributed column-oriented database
- Cayenne, http://cayenne.apache.org/ - object-relational mapping (ORM) and remoting services
- CouchDB, http://couchdb.apache.org/ - NoSQL document-oriented datastore
- Gora, http://gora.apache.org/ - provides an in-memory data model and persistence for big data
- Hadoop, http://hadoop.apache.org/ - a distributed computing platform:
- HDFS - distributed redundant file system for Hadoop
- MapReduce - parallel computation on server clusters
- HBase, http://hbase.apache.org/ - column-oriented database on top of Hadoop
- Hive, http://hive.apache.org/ - data warehouse with SQL-like access
- Flume, http://flume.apache.org/ - collection and import of log and event data
- Lucene, http://lucene.apache.org/ - full-text indexing and search library
- Mahout, http://mahout.apache.org/ - library of machine learning and data mining algorithms on top of Hadoop
- Pig, http://pig.apache.org/ - high-level programming language for Hadoop computations
- Oozie, http://oozie.apache.org/ - orchestration and workflow management for Hadoop
- Solr, http://lucene.apache.org/solr/ - Lucene-based enterprise search platform
- Sqoop, http://sqoop.apache.org/ - bulk data transfer between relational databases and Hadoop
- Whirr, http://whirr.apache.org/ - cloud-agnostic deployment of clusters
- ZooKeeper, http://zookeeper.apache.org/ - configuration management and coordination service
- Ambari, http://incubator.apache.org/ambari/ - deployment, configuration and monitoring of Hadoop clusters
- Blur, http://incubator.apache.org/blur/ - platform for searching massive amounts of data in a cloud computing environment
- Chukwa, http://incubator.apache.org/chukwa/ - log collection and analysis framework for Apache Hadoop clusters
- Crunch, http://incubator.apache.org/crunch/ - a Java library for writing, testing, and running pipelines of MapReduce jobs
- Drill, http://incubator.apache.org/drill/ - interactive analysis of large-scale data
- HCatalog, http://incubator.apache.org/hcatalog/ - schema and data type sharing over Pig, Hive and MapReduce
- Kafka, http://incubator.apache.org/kafka/ - distributed publish-subscribe messaging system
- Mesos, http://incubator.apache.org/mesos/ - a cluster manager that provides resource sharing and isolation across cluster applications
- S4, http://incubator.apache.org/s4/ - distributed platform for processing continuous unbounded streams of data
- Tashi, http://incubator.apache.org/tashi/ - infrastructure for service providers to build applications that harness cluster computing resources to efficiently access repositories of rich data
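To make the MapReduce entry above concrete, here is a minimal sketch of the map/shuffle/reduce model as a local word count in plain Python. It has no Hadoop dependency, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are my own labels for the phases, not Hadoop API; on a real cluster the framework shards the input, shuffles by key over the network, and runs reducers in parallel.

```python
# Toy illustration of the MapReduce model: local word count.
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word, as a mapper would.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key (the "shuffle and sort" step between phases).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts for one word, as a reducer would.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

Pig and Hive, listed above, exist precisely so that computations like this can be expressed in a few lines of a high-level language instead of hand-written map and reduce functions.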