Pravega


Pravega is an open source distributed storage service implementing Streams. It offers Stream as the main primitive for the foundation of reliable storage systems: a high-performance, durable, elastic, and unlimited append-only byte stream with strict ordering and consistency.

To learn more about Pravega, visit https://pravega.io

Prerequisites

  • Java 11+

Although JDK 11+ is required to build this project, the client artifacts (and their dependencies) must remain compatible with a Java 8 runtime. All other components are built and run with JDK 11+.

The clientJavaVersion project property determines the version used to build the client (defaults to 8).

Building Pravega

Check out the source code:

git clone https://github.com/pravega/pravega.git
cd pravega

Build the pravega distribution:

./gradlew distribution

Install the Pravega JAR files into the local Maven repository. This is handy for running pravega-samples locally against a custom version of Pravega.

./gradlew install

Run the unit tests:

./gradlew test

Setting up your IDE

Pravega uses Project Lombok, so you should ensure your IDE is set up with the required plugins. IntelliJ IDEA is recommended.

To import the source into IntelliJ:

  1. Import the project directory into IntelliJ IDEA. It will automatically detect the Gradle project and import it correctly.
  2. Enable annotation processing by going to Build, Execution, Deployment > Compiler > Annotation Processors and checking 'Enable annotation processing'.
  3. Install the Lombok plugin. It can be found under Preferences > Plugins. Restart your IDE.
  4. Pravega should now compile properly.

For Eclipse, you can generate Eclipse project files by running ./gradlew eclipse.

Note: Some unit tests will create (and delete) a significant number of files. For improved performance on Windows machines, be sure to add the appropriate Microsoft Defender exclusions.

Releases

The latest Pravega releases can be found on the GitHub Releases page.

Snapshot artifacts

All snapshot artifacts from master and release branches are available in the GitHub Packages registry.

Add the following to your repositories list and import dependencies as usual.

maven {
    url "https://maven.pkg.github.com/pravega/pravega"
    credentials {
        username = "pravega-public"
        password = "\u0067\u0068\u0070\u005F\u0048\u0034\u0046\u0079\u0047\u005A\u0031\u006B\u0056\u0030\u0051\u0070\u006B\u0079\u0058\u006D\u0035\u0063\u0034\u0055\u0033\u006E\u0032\u0065\u0078\u0039\u0032\u0046\u006E\u0071\u0033\u0053\u0046\u0076\u005A\u0049"
    }
}

Note: GitHub Packages requires authentication to download packages, so the credentials above are required. Use the provided password as-is; please do not decode it.
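With the repository configured, a dependency can be declared as usual. This is a sketch only: the artifact coordinates and the <version> placeholder below are illustrative; substitute the actual snapshot version you need.

```groovy
dependencies {
    // Coordinates are illustrative; replace <version> with the snapshot
    // version published from master or a release branch.
    implementation "io.pravega:pravega-client:<version>-SNAPSHOT"
}
```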

If you need a dedicated token to use in your repository (and GitHub Actions) please reach out to us.

As an alternative, you can use JitPack (https://jitpack.io/#pravega/pravega) to get pre-release artifacts.

Quick Start

Read the Getting Started page for more information, and visit the sample-apps repository for more example applications.

Running Pravega

Pravega can be installed locally or in a distributed environment. The installation and deployment of Pravega are covered in the Running Pravega guide.

Support

Don't hesitate to ask! Contact the developers and community on Slack (signup) if you need any help. If you find a bug, open an issue on GitHub Issues.

Documentation

The Pravega documentation is hosted on the website: https://pravega.io/docs/latest or in the documentation directory of the source code.

Contributing

Become one of the contributors! We strive to build a welcoming and open community for anyone who wants to use the system or contribute to it. Here we describe how to contribute to Pravega. You can see the roadmap document here.

About

Pravega is 100% open source and community-driven. All components are available under Apache 2 License on GitHub.

Contributors

a6dulaleem, abhijeet-jadhav, abhinb, adrianmo, andreipaduroiu, anirudhkovuru, anishakj, aparnarr, arvindkandhare, bhargav-gulavani, bhupender-y, co-jo, derekm, eolivelli, fpj, jiazhai, kevinhan88, kotlasaicharanreddy, medvedevigorek, pbelgundi, prabha-veerubhotla, raulgracia, sachin-j-joshi, shiveshr, shrids, shshashwat, shwethasnayak, skrishnappa, tkaitchuck, tristan1900


Pravega Issues

Multi-protocol access to messages in Tier 2 storage

Ability to consume Events in Streams via S3, HDFS, and NFS. Once events are persisted to Tier 2, they should be accessible in read-only mode via S3, HDFS, and NFS heads by other services, such as a MapReduce job. This must be enforced by the platform.

Maintaining multiple replicas of the same data should be avoided.

HDFS is still a powerful paradigm for batch analytics. This helps in analyzing messages using traditional Hadoop frameworks like MapReduce and Hive.

Automated elastic scaling of Streams based on Throughput

Pravega provides a durable transaction log for high-speed writes and tailing reads of data Streams. The events in the streaming storage layer will 'tier' to long-term deep storage (like HDFS) through configurable policies.

It is desirable for Pravega to handle varying stream throughput elegantly. Rather than statically defining a certain number of Streaming Nodes to handle particular Streams, Pravega should be able to adjust its surface based on the throughput a Stream is currently experiencing.

Sub-tasks:

  • Preliminary design: #378
  • Design complete:
  • Implementation: PR #444

Ability to enable reads/writes to/from Streams across all data-centers

Pravega should have the ability to handle geo-replicated Streams with reads and writes enabled on all replicated sites. We refer to this configuration as Active-Active (in contrast to an Active-Passive configuration, where Streams are actively written to from only one primary site and the others are read-only).

This requirement is part of #21.

Note: Active-Passive should be included in EAP2 releases, not Active-Active.

Ability to deploy Pravega on VM sandbox

The sandbox must be a simple downloadable environment with minimal prerequisites to run (e.g., a VirtualBox / VMware OVA). It should have the following features:

  • Self-contained, with Pravega functionality enabled
  • Production-grade stability
  • Example configuration pre-configured with a realistic environment, including existing accounts, buckets, topics, sample data…

API Compatibility (Producer and Consumer APIs)

A vast majority of organizations with data pipelines in production today use Kafka as their primary messaging cluster. To enable them to migrate easily to Pravega, it is essential to give them a Kafka-compatible API on top of Streaming Storage so that they don't have to recode their production producers/consumers.

Pravega should be 100% compatible with the Kafka 0.10.X client API (Producers/Consumers, data plane):

  • Drop-in replacement for existing Kafka clusters.
  • Reuse developer familiarity with Kafka APIs.
  • Leverage existing tools like Kafka Connect, Schema Registry, and MirrorMaker.

Some APIs might not be supportable (like getMetrics()); those should be documented.

Reliable and Guaranteed message persistence

Events persisted in the system should be reliable/fault-tolerant. We do not want to provide an option for non-guaranteed (best-effort) delivery. The expectation is that the Streaming Storage layer persists reliably to disk before the Producer receives an ACK.

When data is tiered to long-term HDFS or a compatible storage engine, that engine's protection mechanisms are leveraged.

Messages are always available after being ACKed and maintain the ordering guarantees of the Streaming API.
We have not heard of customers needing possibly-lossy persistence; customer scenarios rarely tolerate losses.

Concurrent setupAppend requests from the same connection

Some requests implicitly assume they will not be interleaved with others.
For example, the server implementation assumes that a client will not issue multiple setupAppend requests concurrently from the same connection, and it may become confused if this happens.
Note that the client as written should not do this, but the server should be better protected.
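As a sketch of the kind of protection the server could add (the class and method names below are hypothetical, not Pravega's actual server code), a per-connection flag can reject a second setupAppend while one is already in flight:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative per-connection guard (hypothetical class, not Pravega's
// actual server code): reject a second setupAppend while one is pending.
public class ConnectionState {
    private final AtomicBoolean setupAppendInProgress = new AtomicBoolean(false);

    // Returns true if this setupAppend may proceed; false if one is already
    // in flight on this connection and the request should be failed.
    public boolean trySetupAppend() {
        return setupAppendInProgress.compareAndSet(false, true);
    }

    // Called when the in-flight setupAppend completes (success or failure).
    public void setupAppendComplete() {
        setupAppendInProgress.set(false);
    }
}
```

A request that fails trySetupAppend() would then be rejected with an error instead of being silently interleaved with the in-flight setup.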

Deployment surface configuration (no. of streaming nodes)

The operator needs to be able to define the streaming deployment surface (i.e., the number of Streaming Nodes).

Ability to define/configure the number of Streaming Nodes that will be available to service tenants' requests.

For example: out of 100 nodes, the operator needs 25 instances for Pravega.

Collect/publish Producer & Consumer level metrics

Apart from metrics/performance of Streams, it is useful to display Producer/Consumer metrics such as:

  • How many Producers are producing to any particular Stream
  • How many messages they have attempted to add to the stream (but possibly failed to, due to timeouts, etc.)
  • How many Consumer Groups are pulling from a particular stream
  • How many consumers exist in each of those Consumer Groups
  • How many messages have been consumed by a particular consumer group (are they all caught up, or lagging by 2000 messages?)
  • (Possibly) the health of Producer/Consumer instances

These metrics would help gain insights like: "Stream OilRefinerySensor124 has 15 producers. Producer 5 is probably dead because it hasn't produced messages in a while. Consumer Group C1 is all caught up. Consumer Group C2 is lagging by 2000 events."

Ability to assign Streams human-readable names

We have received feedback that it would be useful to have Partitions queried and identified by strings instead of IDs.
Stream names need to be unique within a Tenant. Between tenants, Stream names don't have to be unique.

Ability to track/share schema of Stream message payload (Schema Registry or equivalent)

Ability to store and manage schemas for messages to be shared between Producers/Consumers - Schema Registry
Schemas are expected to be used in the following ways:

  • Pravega will validate messages against schema when they are received from producers
  • Pravega will add metadata to each message referencing the appropriate schema
  • Consumers will be able to reference schema to deserialize messages
  • Index and Search capabilities will use schema to select fields for indexing
  • Analytics capabilities will use schema to facilitate selecting relevant fields and deserializing them
  • Gracefully propagating schema changes as they evolve

Ability to specify minimum consumer parallelism

Auto-scaling will allow a stream's segments to scale up and down based on the event throughput rate. From the consumer's perspective, to avoid starvation while auto-scaling happens, the user should be able to specify a minimum parallelism when creating a stream.

Publish sample Java Producer/Consumer applications

To demonstrate the capabilities and advantages of Pravega, we need to publish sample applications. These will be provided to early users to try out the service. Some examples could be reading a Twitter firehose into a Stream and a consumer which drains it into a Spark or Flink cluster.

Mesos-based deployment & resource management

It is expected that Pravega will utilize Mesos for its resource management, thus providing a true converged-infrastructure experience. This will probably imply that a 'Streaming' Mesos framework needs to be implemented.

Implement Exponential Backoff for reconnection

We need exponential backoff and reconnect logic to be built in; the current code does not implement this properly.
This can be done on the connection's event loop:

private void doConnect() {
    Bootstrap b = ...;
    b.connect().addListener((ChannelFuture f) -> {
        if (!f.isSuccess()) {
            long nextRetryDelay = nextRetryDelay(...);
            f.channel().eventLoop().schedule(() -> doConnect(),
                    nextRetryDelay, TimeUnit.MILLISECONDS);
            // Or give up at some point by simply not rescheduling.
        }
    });
}
But this still requires a custom event to reset the backoff clock (because we don't want to reset it until the connection is working), and a message up the stack to notify of any reconnects.
Realistically, however, we probably want to handle this at a higher layer, because segments can move between hosts.
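The delay computation itself can be sketched as follows (the class name, nextRetryDelay signature, and constants are illustrative assumptions, not the actual client code): start from a small initial delay, double it on each failed attempt, and cap it at a maximum.

```java
// Illustrative capped exponential backoff; constants are assumptions,
// not values from the Pravega client.
public class ReconnectBackoff {
    static final long INITIAL_DELAY_MS = 100;
    static final long MAX_DELAY_MS = 30_000;

    // Delay (ms) before retry number `attempt` (0-based): doubles each
    // attempt and is capped at MAX_DELAY_MS.
    public static long nextRetryDelay(int attempt) {
        if (attempt >= 9) {
            return MAX_DELAY_MS; // 100ms << 9 already exceeds the cap
        }
        return Math.min(INITIAL_DELAY_MS << attempt, MAX_DELAY_MS);
    }
}
```

A successful connection would reset the attempt counter to zero, which is exactly the "custom event to reset the clock" that complicates the simple event-loop sketch.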

Multi-node DC/OS deployment with Pravega

This does not include any work for the distributed log. We need to do this for the Pravega controller and streaming service nodes. Pravega artifacts should be available on Universe so that any DC/OS user can install them.

Publish a well documented and versioned Streaming API for Developer

To establish Pravega as a credible streaming platform, we will need to publish and document a well-defined set of APIs which will let apps provision, monitor, configure and manage Pravega.

This will be distinct from the Kafka API, which will also be supported in future releases.

Ability to assign arbitrary Key-Value pairs to streams

User-defined metadata attached to streams, which provides further information about or categorization of the stream. Custom metadata is formatted as key-value pairs that are set when creating a stream. E.g.:

  • Client = Dell EMC
  • Event = Dell EMC World 2017
  • Billing ID = 123
  • ..

Logical isolation of streams provisioned by different tenants

There is an expectation of logical isolation of streams/events/long-term data storage between tenants.
Each tenant can have its own stream named "Foo". Multi-tenancy must be supported in both Tier 1 and Tier 2 layers. The events put by a tenant should be isolated from other tenants on the same host.

Ability for Operator to define soft quotas for tenants

To ensure rogue tenants are not abusing Pravega, the admin should have the ability to define soft quotas based on the aggregate throughput seen by a tenant's streams. If a soft quota is breached, alerts/notifications will help the tenant react.

Ability to manually scale up/down the Pravega surface

The number of Pravega Nodes needed in the system might vary depending on the number of tenants/streams/segments. The operator needs the ability to scale the streaming surface up/down to maintain QoS guarantees and quotas.

Ability to Geo-Replicate events

Ability to geo-replicate events in Streams in Tier 2 storage

  1. Stream content with its segments
  2. Stream configuration, metadata, and ACLs as well (#22)
