Big Data Made Easy

A list of frameworks, libraries, resources, and shiny things. Inspired by awesome-... stuff. The most frequently used or well-known items are not listed here; for those, refer to the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.

Projects

### Storage Design and Data Structures

  • Db-readings - Readings in Databases .
  • Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .
  • BitSliceIndex - Experiments on bit-slice indexing .
  • RoaringBitmap - Roaring Bitmap .
  • Cpp-btree - C++ in-memory containers based on a B-tree data structure.
  • Graphillion - Fast, lightweight graphset operation library .
  • Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .
  • Splay Map - STL map implemented with splay tree .
  • Cedar - C++ implementation of efficiently-updatable double-array trie .
  • WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .
  • Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .
  • Expgram - An ngram toolkit with succinct storage .
  • Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .
  • PackedArray - Random access array of tightly packed unsigned integers .
  • FrameOfReference - C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference .
  • FFBF - Feed-forward Bloom filters .
  • Concurrent Trees - C++ implementation of concurrent Binary Search Trees .
  • Concurrent B-Tree - A working project for High-concurrency B-tree source code in C .
  • Block-graph - A succinct implementation of a block-graph data structure .
  • RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .
  • RLZ - Contains the RLZ compression and self-index source code .
  • Serangequerying - Space-Efficient Structures for Range Querying .
  • Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .
  • Sdsl-lite - Succinct Data Structure Library 2.0 .
  • Relative-FMIndex - Relative FM-index which is smaller but slower than plain FMIndex.
  • GCSA - Generalized Compressed Suffix Array.
  • Succinct - A collection of succinct data structures .
  • Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .
  • YuNomi - Compressed Array Library .
  • DACs - Directly Addressable Codes (DACs) are a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence while keeping the representation compact .
  • Cpi00 - The compressed permuterm index .
  • Smbt - Succinct Multibit Tree for similarity search .
  • Gwt - Graph-indexing wavelet tree for graph similarity search .
  • Webgraphs - Fast and Compact Web Graph Representations .
  • Erika-trie - Erika-trie: succinct trie library .
  • Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .
  • Sumire-tries - A variety of succinct tries .
  • Trie4j - (Succinct) trie implementation in Java .
  • SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .
  • Marisa-trie - Marisa succinct trie .
  • LibCDS - Compact Data Structures Library .
  • HSDS - Succinct Data Structure Library Collection. Includes bit-vector/wavelet-matrix/trie .
  • BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .
  • Hip-hyperloglog - C++ implementation of an approximate distinct counter by HIP estimator on HyperLogLog .
  • Gonzalo Navarro - Publications of Gonzalo Navarro .
  • Kvtx - Transaction over CAS see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .
  • Fatcache - Memcache on SSD .
  • WiredTiger - WiredTiger's source tree http://source.wiredtiger.com/ .
  • FD-Tree - FD-Tree: a Tree Index on Solid State Drives .
  • Silo - Multicore in-memory storage engine .
  • MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.
  • Libart - Adaptive Radix Trees implemented in C .
  • Masstree - Masstree, a fast, multi-core key-value store .
  • NVMKV - NVM key-value store API library repository. http://opennvm.github.io/nvmkv-documents/ .
  • HYRISE - In-Memory Hybrid Storage Engine .
  • HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .
  • NoVoHT - NoVoHT: a Lightweight Dynamic Persistent NoSQL Key/Value Store on NVRAM .
  • HERD - A Highly Efficient key-value system for RDMA .
  • Cayley - An open-source graph database .
  • Forestdb - A Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie .
  • STSDB - Waterfalltree .
  • Mdbm - A very fast memory-mapped key/value store by Yahoo .
  • Nldb - Nanolat Database supporting 1M transactions per second .
  • Sophia - Modern embeddable key-value database designed for a high load environment .
  • FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .
  • Weaver - A scalable, fast, consistent graph store http://weaver.systems .
  • FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .
  • Jump Consistent Hash - A Go implementation of the jump consistent hash .
  • Content Defined Chunking - High Performance Content Defined Chunking .
  • SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0 .
  • Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .
  • Article-Key-Value - Implementing a Key-Value Store .
  • Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .
  • Article-SSD - Solid-state revolution: in-depth on how SSDs really work .
  • Dexter - Dexter database research group .
  • Streaminer - A collection of algorithms for mining data streams http://mayconbordin.github.io/streaminer/ .
  • Article-Art of Approximating - The Art of Approximating Distributions: Histograms and Quantiles at Scale .
  • Article-Sketch of the Day - Sketch of the Day: Frugal Streaming .
  • Article-Sketch of the Day - Sketch of the Day: K-Minimum Values .
  • Article-Sketch of the Day - Sketch of the Day: K-Minimum Values: Sketching Error, Hash Functions, and You .
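
To illustrate the Jump Consistent Hash entry above, here is a minimal Python sketch of the algorithm as published by Lamping and Veach. It is not taken from the linked Go project; the function name `jump_consistent_hash` and the `MASK64` helper are chosen for this example only.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wrap-around


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map an integer key to a bucket index in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step acting as the pseudo-random generator
        key = (key * 2862933555777941757 + 1) & MASK64
        # Jump forward; the divisor lies in [1, 2**31], so j always exceeds b
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    before = [jump_consistent_hash(k, 10) for k in range(10000)]
    after = [jump_consistent_hash(k, 11) for k in range(10000)]
    moved = sum(1 for x, y in zip(before, after) if x != y)
    print(f"{moved} of {len(before)} keys moved when adding an 11th bucket")
```

When the bucket count grows from n to n + 1, a key either stays in its bucket or moves to the new bucket n, which is what keeps resharding cheap.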

### Distributed System

  • Pequod - A distributed key-value cache with built-in materialized views, see "Easy Freshness with Pequod Cache Joins" .
  • Crate - CRATE: Your Elastic Data Store .
  • Elliptics - Distributed hashtable storage .
  • Mcrouter - Mcrouter is a memcached protocol router for scaling memcached deployments .
  • Codis - Yet another fast distributed solution for Redis .
  • zBase - A high-performance, elastic, distributed key-value store .
  • Sirius - A distributed system library for managing application reference data from Comcast .
  • Machi - Reliable, distributed, highly available large file store based on Chain Replication .
  • Dynomite - A generic dynamo implementation for different k-v storage engines .
  • AsterixDB - Full-function BDMS (Big Data Management System) .
  • RAMCloud - A new class of storage for large-scale datacenter applications. It is a key-value store that keeps all data in DRAM at all times .
  • Geode - Open source version of Gemfire .
  • Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .
  • Seaweed-FS - A simple and highly scalable distributed file system. There are two objectives: to store billions of files and to serve the files fast .
  • InfiniSQL - InfiniSQL is the database for always on, rapid growth applications that need to collect and analyze in real time--even for complex transactions .
  • Wasp - A megastore-like system http://alibaba.github.io/wasp/ .
  • Haeinsa - Linearly scalable multi-row, multi-table transaction library for HBase based on Percolator .
  • Yrmcds - Memcached compatible KVS with master/slave replication. http://cybozu.github.io/yrmcds/ .
  • 3levelmemcache - Memcache improvements by Data.com .
  • Vitess - Vitess provides servers and tools which facilitate scaling of MySQL databases for large scale web services .
  • Cotton - MySQL over Mesos .
  • Replicant - A system for maintaining replicated state machines .
  • CorfuDB - Tango: Distributed Data Structures over a Shared Log .
  • Skipgraph - Implementation of skipgraph on messagepack-rpc .
  • Pinpoint - Non-intrusive Dapper-like APM solution .
  • CAT - APM solution at Dianping Inc .
  • Brave - Java version of OpenZipkin .
  • Appdash - Golang version of Dapper .
  • Druid - Real-time Exploratory Analytics on Large Datasets http://druid.io .
  • Pinot - Something like Druid .
  • Kylin - Data Cube based OLAP .
  • Pulsar - Business level monitor and analysis .
  • Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .
  • REEF - The Retainable Evaluator Execution Framework .
  • Sparrow - Sparrow low-latency scheduling platform .
  • Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .
  • Hydra - A distributed data processing and storage system originally developed at AddThis .
  • Hystrix - A latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable .
  • Phantom - High performance proxy for accessing distributed services inspired by Twitter Finagle and Netflix Hystrix .
  • rDSN - Open framework for quickly building and managing high performance and robust distributed systems .
  • Nativetask - A high performance C++ API & runtime for Hadoop MapReduce .
  • Taskgraph - A fault tolerant, distributed task driven framework written in Go.
  • Project Eru - Docker Cloud inspired by Kubernetes/Borg.
  • Summingbird - Streaming MapReduce with Scalding and Storm https://twitter.com/summingbird .
  • Hustle - A column oriented, embarrassingly distributed relational event database .
  • Embulk - A plugin-based parallel bulk data loader that makes painful data integration work relaxed .
  • Gobblin - Data ingestion as a service .
  • Chronos - Chronos: A Replacement for Cron, see http://nerds.airbnb.com/introducing-chronos/ .
  • Ochopod - Orchestration overlay over Mesos, K8S and more .
  • Helios - Docker orchestration platform of Spotify .
  • SDC - Joyent Smart Datacenter .
  • Apollo - Mesos cluster provisioning and orchestration .
  • Microservices-infrastructure - Microservice infrastructure of CiscoCloud .
  • Vamp - Microservices orchestration platform .
  • Tyrant - Golang job scheduler based on Mesos.
  • Firmament - Cluster scheduler based on Quincy to be included into Kubernetes .
  • Cocaine - An open-source PaaS (platform as a service) system for creating custom cloud hosting apps from Yandex .
  • Weave - The Docker Network .
  • QJump - Optimizing network latency of DataCenter .
  • ConcourseDB - Distributed database with ACID (2PC) .
  • RebornDB - Distributed database fully compatible with redis protocol (modified from Codis) .
  • Calvin - Distributed database with ACID without 2PC .
  • Bottledwater-pg - PostgreSQL replication made easy .
  • MDCC - Multi-DataCenter Consistency protocol .
  • URingPaxos - High throughput atomic multicast protocol .
  • CorfuDB - Distributed logging (CORFU) .
  • Course-CS6452 - Datacenter Networks and Services .
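
The Hystrix entry above describes stopping cascading failures at points of access to remote services. The general pattern behind that idea is the circuit breaker, sketched below in Python under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and all parameter names are hypothetical, and this is not Hystrix's API (Hystrix is a Java library with far more features).

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a struggling service
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open state, let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to use a fallback.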

### Concurrency

  • Concurrent Queue - A fast multiple-producer, multi-consumer lock-free concurrent queue for C++11 .
  • CAF - An Open Source Implementation of the Actor Model in C++ .
  • TAMER - C++ extensions for readable event-driven programming .
  • C++React - A reactive programming library for C++11 .
  • Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .
  • CDS - Header only C++ Concurrent Data Structures library .
  • Libcds - A C++ template library of lock-free and fine-grained algorithms .
  • Locksmith - A library for debugging locking in C, C++, or Objective C programs .
  • Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .
  • Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .
  • Nanahan - An implementation of Hopscotch hashing for single thread .
  • Scalex - Code snippets for the workshop on concurrent data structure implementation .
  • CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .
  • Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .
  • Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .
  • disruptor-- - Disruptor concurrency pattern in C++ .
  • Lockfree Queue - Lock-free Condition Wait for Lock-free Multi-producer Multi-consumer Queue, see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .
  • Ssmem - A simple object-based memory allocator with epoch-based garbage collection, see the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures" .
  • CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .
  • Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .
  • Quasar-thrift - Quasar fiber based Thrift RPC .
  • Article-TM - Transactional Memory: History and Development .

### Compression

### System Performance And Profiling

  • Vmmlib - A templatized C++ vector and matrix math library .
  • Blaze-lib - A high performance C++ math library .
  • Light-matrix - A Light-weight and Fast Template Matrix Library .
  • Light-simd - A light weight library for SIMD based computation .
  • MathSimd - SIMD-optimized math library in C++ .
  • Opti - Experiment of x86/x64 optimization .
  • Fmath - Fast log and exp functions for x86/x64 SSE http://homepage1.nifty.com/herumi/soft/fmath.html .
  • Mie - Fast string library with SSE4.2 .
  • Libsimdpp - Header-only zero-overhead C++ wrapper for SIMD intrinsics of multiple instruction sets .
  • SEQAN - An open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data .
  • Fastsocket - A highly scalable socket and its underlying networking implementation of Linux kernel .
  • Smart - SMT-aware Real-time scheduler for Linux from Yandex.
  • Simple Binary Encoding - Serialization with ultra low latency .
  • Libdivide - An open source library for optimizing integer division .
  • Farmhash - FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash .
  • Proxygen - A collection of C++ HTTP libraries including an easy to use HTTP server .
  • Yamail - YMail General Purpose Library .
  • mTCP - A Highly Scalable User-level TCP Stack for Multicore Systems .
  • WDT - Warp speed Data Transfer (WDT) is an embeddable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths .
  • UNetStack - Userspace TCP/IP stack .
  • CamIO - Userspace IO abstraction .
  • Ktap - A lightweight script-based dynamic tracing tool for Linux http://ktap.org .
  • Perfbook - Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • Article-GC-Java - Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications | LinkedIn Engineering .
  • Article-Memory Management - Optimizing Linux Memory Management for Low-latency / High-throughput Databases | LinkedIn Engineering .
  • Article-Modern Microprocessors - Modern Microprocessors: A 90-Minute Guide! .
  • Article-Cache Oblivious Array - Cache oblivious array operations .
  • Article-Understanding Memory - Understanding Memory .
  • Article-1975 Programming - So what's wrong with 1975 programming? .
  • Article-Database Research - Database Research on Modern Computing Architecture .
  • Article-Linux Learn From Solaris - What Linux can learn from Solaris performance and vice-versa .
  • Brendan D. Gregg - Blog of Brendan D. Gregg .
  • Course-CMU 18-645 - How to Write Fast Code .
  • ParallelismBook - A book about parallel computing & code optimization .
  • Blackhole - Yet another logging library. http://blackhole-logger.herokuapp.com .
  • Handystats - C++ library for collecting user-defined in-process runtime statistics with low overhead .

### Search Engine and Information Retrieval

  • SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .
  • Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .
  • Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .
  • Surf - SUccinct Retrieval Framework .
  • FastPFor - Fast integer compression .
  • Simdcomp - A simple C library for compressing lists of integers .
  • SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .
  • TurboPFor - Fastest Integer Compression .
  • Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .
  • MaskedVByte - SIMD-accelerated VByte Compression, Publication "Vectorized VByte Decoding" .
  • Wavelet - Information Retrieval based on Wavelet Tree .
  • Shuffla - Search engine using kd-tree .
  • RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .
  • Dualsorted - Dual sorted inverted index based on Wavelet Tree .
  • Treap - Faster and Smaller Inverted Indices with Treaps .
  • Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .
  • Libface - Fastest auto-complete in the east .
  • SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov inverted Index Compression algorithms .
  • Groonga - Open-source fulltext search engine and column store .
  • Pastec - An open source index and search engine for image recognition .
  • Enterprise-search - An open source search engine for corporate data and websites. http://www.searchdaimon.com/ .
  • Verticut - Image search engine on Infiniband .
  • Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .
  • Mg4j - Academic search engine with succinct design (say quasi-succinct indices) .
  • Argos - A structural data search engine .
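
Several entries above (FastPFor, Simdcomp, MaskedVByte) deal with compressing the sorted integer lists that make up inverted indexes. A minimal scalar sketch of the underlying idea, delta encoding followed by VByte, is shown below in Python. It is not the code of any listed library; the function names are made up for this example, and the choice of setting the high bit on every non-final byte is only one of the conventions in use.

```python
from typing import Iterable, List


def vbyte_encode(values: Iterable[int]) -> bytes:
    """Encode non-negative integers using 7 payload bits per byte."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("VByte handles non-negative integers only")
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            v >>= 7
        out.append(v)                      # final byte, continuation flag clear
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                     # more bytes follow for this value
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids: Iterable[int]) -> bytes:
    """Store a sorted posting list as VByte-compressed gaps between IDs."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return vbyte_encode(gaps)


def decode_postings(data: bytes) -> List[int]:
    """Rebuild document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Storing gaps instead of absolute IDs keeps most values small, so most of them fit in a single byte. The SIMD libraries listed above speed up decoding of this kind of format.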

### Large Scale Machine Learning

  • LASER - A Scalable Response Prediction Platform For Online Advertising .
  • Parameter Server - A distributed machine learning framework. http://parameterserver.org .
  • Petuum - A distributed machine learning framework implementing parameter server model .
  • Paracel - Parameter server by Douban Inc .
  • H2O - Fastest in-memory platform for machine learning and predictive analytics on big data .
  • Oryx - Simple real-time large-scale machine learning infrastructure implementing Lambda Architecture .
  • Admm_Allreduce - ADMM optimizer on Apache Hadoop with allReduce .
  • Hivemall - Scalable machine learning library for Hive/Hadoop .
  • Ml-ease - ADMM based large scale logistic regression .
  • douban_pGBRT - Parallel GBRT from Douban Inc .
  • Parlearn - Parallel SGD implementation .
  • Xgboost - eXtreme Gradient Boosting (Tree) Library .
  • AcroMUSASHI Stream-ML - Machine Learning Library .
  • DIMSUM - All-pairs similarity via DIMSUM .
  • StreamSVM - StreamSVM is the fastest implementation for learning linear SVMs on large datasets that cannot fit in your computer's memory .
  • Distributed-liblinear - Libraries for Large-scale Linear Classification on Distributed Environments .
  • SparkADMM - ADMM implementation on Spark Cluster .
  • NOMAD - Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion .
  • Stream-ml - Streaming SGD inspired by http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent .
  • LIBPMF - A Library for Large-scale Parallel Matrix Factorization .
  • LIBMF - A Matrix-factorization Library for Recommender Systems .
  • KnittingBoar - Parallel Iterative Algorithm (SGD) on Hadoop's YARN framework .
  • Trident-ml - Trident-ML : A realtime online machine learning library .
  • Mlpack - A scalable C++ machine learning library .
  • LASSO - A parallel regression model learning system based on MRML.
  • Jubatus - Distributed Online Machine Learning Framework .
  • Vowpal_Wabbit - A fast online learning algorithm http://hunch.net/~vw/ .
  • DeepDist - Lightning-Fast Deep Learning on Spark via parallel stochastic gradient updates (compared with MLLib) .
  • DMLC - Distributed (Deep) Machine Learning Common .
  • SINGA - A General Distributed Deep Learning Platform .
  • BIDMach - CPU and GPU-accelerated Machine Learning Library in Scala .
  • Spark-Multiboost - An implementation of the multi-class/multi-label classifier, of which the training is carried out using AdaBoost.MH on Apache Spark .
  • Veles - Distributed platform for rapid Deep learning application development by Samsung.
  • Chainer - A flexible framework of neural networks for deep learning http://chainer.org by PFINetwork.
