Pravega


Pravega is an open source distributed storage service implementing Streams. It offers Stream as the main primitive for the foundation of reliable storage systems: a high-performance, durable, elastic, and unlimited append-only byte stream with strict ordering and consistency.

To learn more about Pravega, visit https://pravega.io

Prerequisites

  • Java 11+

Although JDK 11+ is required to build this project, the client artifacts (and their dependencies) must remain compatible with a Java 8 runtime. All other components are built and run with JDK 11+.

The clientJavaVersion project property determines the version used to build the client (defaults to 8).

Building Pravega

Check out the source code:

git clone https://github.com/pravega/pravega.git
cd pravega

Build the pravega distribution:

./gradlew distribution

Install the Pravega JAR files into the local Maven repository. This is handy for running pravega-samples locally against a custom version of Pravega.

./gradlew install

Run the unit tests:

./gradlew test

Setting up your IDE

Pravega uses Project Lombok, so you should ensure your IDE is set up with the required plugins. IntelliJ IDEA is recommended.

To import the source into IntelliJ:

  1. Import the project directory into IntelliJ IDEA. It will automatically detect the Gradle project and import it correctly.
  2. Enable annotation processing by going to Build, Execution, Deployment > Compiler > Annotation Processors and checking 'Enable annotation processing'.
  3. Install the Lombok plugin. It can be found under Preferences > Plugins. Restart your IDE.
  4. Pravega should now compile properly.

For Eclipse, you can generate Eclipse project files by running ./gradlew eclipse.

Note: Some unit tests will create (and delete) a significant number of files. For improved performance on Windows machines, be sure to add the appropriate Microsoft Defender exclusions.

Releases

The latest Pravega releases can be found on the GitHub Releases page.

Snapshot artifacts

All snapshot artifacts from master and release branches are available in the GitHub Packages registry.

Add the following to your repositories list and import dependencies as usual.

maven {
    url "https://maven.pkg.github.com/pravega/pravega"
    credentials {
        username = "pravega-public"
        password = "\u0067\u0068\u0070\u005F\u0048\u0034\u0046\u0079\u0047\u005A\u0031\u006B\u0056\u0030\u0051\u0070\u006B\u0079\u0058\u006D\u0035\u0063\u0034\u0055\u0033\u006E\u0032\u0065\u0078\u0039\u0032\u0046\u006E\u0071\u0033\u0053\u0046\u0076\u005A\u0049"
    }
}

Note: GitHub Packages requires authentication to download packages, so the credentials above are required. Use the provided password as-is; please do not decode it.
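With the repository configured, a dependency can be declared as usual. This is a sketch only: the artifact coordinates and the <version> placeholder below are illustrative; substitute the actual snapshot version you need.

```groovy
dependencies {
    // Coordinates are illustrative; replace <version> with the snapshot
    // version published from master or a release branch.
    implementation "io.pravega:pravega-client:<version>-SNAPSHOT"
}
```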

If you need a dedicated token to use in your repository (and GitHub Actions) please reach out to us.

As an alternative, you can use JitPack (https://jitpack.io/#pravega/pravega) to get pre-release artifacts.

Quick Start

Read the Getting Started page for more information, and visit the sample-apps repository for more example applications.

Running Pravega

Pravega can be installed locally or in a distributed environment. The installation and deployment of Pravega are covered in the Running Pravega guide.

Support

Don't hesitate to ask! Contact the developers and community on Slack (signup) if you need any help. If you find a bug, open an issue on GitHub Issues.

Documentation

The Pravega documentation is hosted on the website: https://pravega.io/docs/latest or in the documentation directory of the source code.

Contributing

Become one of the contributors! We strive to build a welcoming and open community for anyone who wants to use the system or contribute to it. Here we describe how to contribute to Pravega. You can see the roadmap document here.

About

Pravega is 100% open source and community-driven. All components are available under Apache 2 License on GitHub.

Contributors

a6dulaleem, abhijeet-jadhav, abhinb, adrianmo, andreipaduroiu, anirudhkovuru, anishakj, aparnarr, arvindkandhare, bhargav-gulavani, bhupender-y, co-jo, derekm, eolivelli, fpj, jiazhai, kevinhan88, kotlasaicharanreddy, medvedevigorek, pbelgundi, prabha-veerubhotla, raulgracia, sachin-j-joshi, shiveshr, shrids, shshashwat, shwethasnayak, skrishnappa, tkaitchuck, tristan1900


Pravega Issues

Multi-protocol access to messages in Tier 2 storage

Ability to consume Events in Streams via S3, HDFS, and NFS. Once events are persisted to Tier 2, they should be accessible in read-only mode via S3, HDFS, and NFS heads by other services, such as a MapReduce job. This must be enforced by the platform.

Maintaining multiple replicas of the same data should be avoided.

HDFS is still a powerful paradigm for batch analytics. This helps in analyzing messages using traditional Hadoop frameworks like MapReduce and Hive.

Automated elastic scaling of Streams based on Throughput

Pravega provides a durable transaction log for high-speed writes and tailing reads of data Streams. The events in the streaming storage layer will 'tier' to long-term deep storage (like HDFS) through configurable policies.

It is desirable for Pravega to handle varying stream throughput elegantly. Rather than statically defining a certain number of Streaming Nodes to handle particular Streams, Pravega should be able to adjust its surface based on the throughput a Stream is currently experiencing.

Sub-tasks:

  • Preliminary design: #378
  • Design complete:
  • Implementation: PR #444

Ability to enable reads/writes to/from Streams across all data-centers

Pravega should have the ability to handle geo-replicated Streams with reads and writes enabled on all replicated sites. We refer to this configuration as Active-Active (in contrast to an Active-Passive configuration, where Streams are actively written to from only one primary site and the others are read-only).

This requirement is part of #21.

Note: Active-Passive should be included in EAP2 releases, not Active-Active.

Ability to deploy Pravega on VM sandbox

The sandbox must be a simple downloadable environment with minimal prerequisites to run (e.g., a VirtualBox / VMware OVA). It should have the following features:

  • Self-contained, with Pravega functionality enabled
  • Production-grade stability
  • Example configuration pre-configured with a realistic environment, including existing accounts, buckets, topics, sample data…

API Compatibility (Producer and Consumer APIs)

A vast majority of organizations with data pipelines in production today use Kafka as their primary messaging cluster. To enable them to migrate easily to Pravega, it is essential to give them a Kafka-compatible API on top of Streaming Storage so that they don't have to recode their production producers/consumers.

Pravega should be 100% compatible with the Kafka 0.10.X client API (Producers/Consumers, data plane):

  • Drop-in replacement for existing Kafka clusters.
  • Reuse developer familiarity with Kafka APIs.
  • Leverage existing tools like Kafka Connect, Schema Registry, and MirrorMaker.

Some APIs might not be supportable (like getMetrics()); those should be documented.

Reliable and Guaranteed message persistence

Events persisted in the system should be reliable/fault-tolerant. We do not want to provide an option for non-guaranteed (best-effort) delivery. The expectation is that the Streaming Storage layer persists reliably to disk before the Producer receives an ACK.

When data is tiered to long-term HDFS or a compatible storage engine, that engine's protection mechanisms are leveraged.

Messages are always available after being ACKed and maintain the ordering guarantees of the Streaming API.
We have not heard of customers needing possibly-lossy persistence; customer scenarios rarely tolerate losses.

Concurrent setupAppend requests from the same connection

Some requests implicitly assume they will not be interleaved with others.
For example, the server implementation assumes that a client will not issue multiple setupAppend requests concurrently from the same connection, and it may become confused if this happens.
Note that the client as written should not do this, but the server should be better protected.
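As a sketch of the kind of protection the server could add (the class and method names below are hypothetical, not Pravega's actual server code), a per-connection flag can reject a second setupAppend while one is already in flight:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative per-connection guard (hypothetical class, not Pravega's
// actual server code): reject a second setupAppend while one is pending.
public class ConnectionState {
    private final AtomicBoolean setupAppendInProgress = new AtomicBoolean(false);

    // Returns true if this setupAppend may proceed; false if one is already
    // in flight on this connection and the request should be failed.
    public boolean trySetupAppend() {
        return setupAppendInProgress.compareAndSet(false, true);
    }

    // Called when the in-flight setupAppend completes (success or failure).
    public void setupAppendComplete() {
        setupAppendInProgress.set(false);
    }
}
```

A request that fails trySetupAppend() would then be rejected with an error instead of being silently interleaved with the in-flight setup.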

Deployment surface configuration (no. of streaming nodes)

The operator needs to be able to define the streaming deployment surface (i.e., the number of Streaming Nodes).

Ability to define/configure the number of Streaming Nodes that will be available to service tenants' requests.

For example: out of 100 nodes, the operator needs 25 instances for Pravega.

Collect/publish Producer & Consumer level metrics

Apart from metrics/performance of Streams, it is useful to display Producer/Consumer metrics such as:

  • How many Producers are producing to any particular Stream
  • How many messages they have attempted to add to the stream (but possibly failed to, due to timeouts, etc.)
  • How many Consumer Groups are pulling from a particular stream
  • How many consumers exist in each of those Consumer Groups
  • How many messages have been consumed by a particular consumer group (are they all caught up, or lagging by 2000 messages?)
  • (Possibly) the health of Producer/Consumer instances

These metrics would help gain insights like: "Stream OilRefinerySensor124 has 15 producers. Producer 5 is probably dead because it hasn't produced messages in a while. Consumer Group C1 is all caught up. Consumer Group C2 is lagging by 2000 events."

Ability to assign Streams human-readable names

We have received feedback that it would be useful to have Partitions queried and identified by strings instead of IDs.
Stream names need to be unique within a Tenant. Between tenants, Stream names don't have to be unique.

Ability to track/share schema of Stream message payload (Schema Registry or equivalent)

Ability to store and manage schemas for messages to be shared between Producers/Consumers - Schema Registry
Schemas are expected to be used in the following ways:

  • Pravega will validate messages against schema when they are received from producers
  • Pravega will add metadata to each message referencing the appropriate schema
  • Consumers will be able to reference schema to deserialize messages
  • Index and Search capabilities will use schema to select fields for indexing
  • Analytics capabilities will use schema to facilitate selecting relevant fields and deserializing them
  • Gracefully propagating schema changes as they evolve

Ability to specify minimum consumer parallelism

Auto-scaling will allow a stream's segments to scale up and down based on the event throughput rate. From the consumer's perspective, to avoid starvation while auto-scaling happens, the user should be able to specify a minimum parallelism when creating a stream.

Publish sample Java Producer/Consumer applications

To demonstrate the capabilities and advantages of Pravega, we need to publish sample applications. These will be provided to early users to try out the service. Some examples could be reading a Twitter firehose into a Stream and a consumer which drains it into a Spark or Flink cluster.

Mesos-based deployment & resource management

It is expected that Pravega will utilize Mesos for its resource management, thus providing a true converged-infrastructure experience. This will probably imply that a 'Streaming' Mesos framework needs to be implemented.

Implement Exponential Backoff for reconnection

We need exponential backoff and reconnect logic to be built in; the current code does not implement this properly.
This can be done on the connection's event loop:

private void doConnect() {
    Bootstrap b = ...;
    b.connect().addListener((ChannelFuture f) -> {
        if (!f.isSuccess()) {
            long nextRetryDelay = nextRetryDelay(...);
            f.channel().eventLoop().schedule(() -> doConnect(),
                    nextRetryDelay, TimeUnit.MILLISECONDS);
            // Or give up at some point by simply not rescheduling.
        }
    });
}
But this still requires a custom event to reset the backoff clock (because we don't want to reset it until the connection is working), and a message up the stack to notify of any reconnects.
Realistically, however, we probably want to handle this at a higher layer, because segments can move between hosts.
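The delay computation itself can be sketched as follows (the class name, nextRetryDelay signature, and constants are illustrative assumptions, not the actual client code): start from a small initial delay, double it on each failed attempt, and cap it at a maximum.

```java
// Illustrative capped exponential backoff; constants are assumptions,
// not values from the Pravega client.
public class ReconnectBackoff {
    static final long INITIAL_DELAY_MS = 100;
    static final long MAX_DELAY_MS = 30_000;

    // Delay (ms) before retry number `attempt` (0-based): doubles each
    // attempt and is capped at MAX_DELAY_MS.
    public static long nextRetryDelay(int attempt) {
        if (attempt >= 9) {
            return MAX_DELAY_MS; // 100ms << 9 already exceeds the cap
        }
        return Math.min(INITIAL_DELAY_MS << attempt, MAX_DELAY_MS);
    }
}
```

A successful connection would reset the attempt counter to zero, which is exactly the "custom event to reset the clock" that complicates the simple event-loop sketch.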

Multi-node DC/OS deployment with Pravega

This does not include any work for the distributed log. We need to do this for the Pravega controller and streaming service nodes. Pravega artifacts should be available on Universe so that any DC/OS user can install them.

Publish a well documented and versioned Streaming API for Developer

To establish Pravega as a credible streaming platform, we will need to publish and document a well-defined set of APIs which will let apps provision, monitor, configure and manage Pravega.

This will be distinct from the Kafka API, which will also be supported in future releases.

Ability to assign arbitrary Key-Value pairs to streams

User-defined metadata attached to streams, which provides further information about or categorization of the stream. Custom metadata is formatted as key-value pairs that are set when creating a stream. E.g.:

  • Client = Dell EMC
  • Event = Dell EMC World 2017
  • Billing ID = 123
  • ..

Logical isolation of streams provisioned by different tenants

There is an expectation of logical isolation of streams/events/long-term data storage between tenants.
Each tenant can have its own stream named "Foo". Multi-tenancy must be supported in both Tier 1 and Tier 2 layers. The events put by a tenant should be isolated from other tenants on the same host.

Ability for Operator to define soft quotas for tenants

To ensure rogue tenants are not abusing Pravega, the admin should have the ability to define soft quotas based on the aggregate throughput seen by a tenant's streams. If a soft quota is breached, alerts/notifications will help the tenant react.

Ability to manually scale up/down the Pravega surface

The number of Pravega Nodes needed in the system might vary depending on the number of tenants/streams/segments. The operator needs the ability to scale the streaming surface up/down to maintain QoS guarantees and quotas.

Ability to Geo-Replicate events

Ability to geo-replicate events in Streams in Tier 2 storage

  1. Stream content with its segments
  2. Stream configuration, metadata, and ACLs as well (#22)
