streamsets / tutorials Goto Github PK

StreamSets Tutorials

License: Apache License 2.0

Java 72.21% Dockerfile 1.98% Shell 25.82%

tutorials's Introduction

StreamSets DataOps Platform Tutorials

The following tutorials demonstrate features of StreamSets Data Collector, StreamSets Transformer, StreamSets Control Hub and StreamSets SDK For Python.

StreamSets Data Collector -- Basic Tutorials

Log Shipping to Elasticsearch - Read weblog files from a local filesystem directory, decorate some of the fields (e.g. GeoIP Lookup), and write them to Elasticsearch.
Simple Kafka Enablement using StreamSets Data Collector
What’s the Biggest Lot in the City of San Francisco? - Read city lot data from JSON, calculate lot areas in JavaScript, and write them to Hive.
Ingesting Local Data into Azure Data Lake Store - Read records from a local CSV-formatted file, mask out PII (credit card numbers) and send them to a JSON-formatted file in Azure Data Lake Store.
Working with StreamSets Data Collector and Microsoft Azure - Integrate Azure Blob Storage, Apache Kafka on HDInsight, Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage.

StreamSets Data Collector -- Writing Custom Pipeline Stages

Creating a Custom StreamSets Origin - Build a simple custom origin that reads a Git repository's commit log and produces the corresponding records.
Creating a Custom Multithreaded StreamSets Origin - A more advanced tutorial focusing on building an origin that supports parallel execution, so the pipeline can run in multiple threads.
Creating a Custom StreamSets Processor - Build a simple custom processor that reads metadata tags from image files and writes them to the records as fields.
Creating a Custom StreamSets Destination - Build a simple custom destination that writes batches of records to a webhook.

We have a DataCollector API Java Docs to share in case of need, please reach out to us if you need them.

StreamSets Data Collector -- Advanced Features

Ingesting Drifting Data into Hive and Impala - Build a pipeline that handles schema changes in MySQL, creating and altering Hive tables accordingly.
Creating a StreamSets Spark Transformer in Java - Build a simple Java Spark Transformer that computes a credit card's issuing network from its number.
Creating a StreamSets Spark Transformer in Scala - Build a simple Scala Spark Transformer that computes a credit card's issuing network from its number.
Creating a CRUD Microservice Pipeline - Build a microservice pipeline to implement a RESTful web service that reads from and writes to a database via JDBC.

The Data Collector documentation also includes an extended tutorial that walks through basic Data Collector functionality, including creating, previewing and running a pipeline, and creating alerts.

StreamSets Data Collector -- Kubernetes-based Deployment

Kubernetes-based Deployment - Example configurations for Kubernetes-based deployments of StreamSets Data Collector.

StreamSets Control Hub

Creating Custom Data Protector Procedure - Create, build and deploy your own custom data protector procedure that you can use as protection method to apply to record fields.

StreamSets Transformer

Creating a Custom Processor for StreamSets Transformer - Create a simple custom processor, using Java and Scala, that will compute the type of a credit card from its number, and configure Transformer to use it.
Creating a Custom Scala Project For StreamSets Transformer - This tutorial explains how to create a custom Scala project and import the compiled jar into StreamSets Transformer.

StreamSets SDK for Python

Common

Find SDK methods and fields of an object available - Object examples can be instances of a pipeline or SCH job or a stage under the pipeline.

Control Hub

Getting started with StreamSets SDK for Python - Design and publish a pipeline. Then create, start, and stop a job using StreamSets SDK for Python.
Jobs related tutorials
- Sample ways to fetch one or more jobs - Sample ways to fetch one or more jobs.
- Start a job and monitor that specific job - Start a job and monitor that specific job using metrics and time series metrics.
- Move jobs from dev to prod using data_collector_labels - Move jobs from dev to prod by updating data_collector label.
- Generate a report for a specific job - Generate a report for a specific job and then; fetch and download it.
- See logs for a data-collector where a job is running - Get the DataCollector where a job is running and then see its logs.
Pipelines related tutorials
- Common pipeline methods - Common operations for StreamSets Control Hub pipelines like update, duplicate , import, export.
- Loop over pipelines and stages and make an edit to stages - When there are many pipelines and stages that need an update, SDK for Python makes it easy to update them with just a few lines of code.
- Create CI CD pipeline used in demo - This covers the steps to create CI CD pipeline as used in the SCH CI CD demo. The steps include how to add stages like JDBC, some processors and Kineticsearch; and how to set stage configurations. Also shows, the use of runtime parameters.

License

StreamSets Data Collector and its tutorials are built on open source technologies; the tutorials and accompanying code are licensed with the Apache License 2.0.

Contributing Tutorials

We welcome contributors! Please check out our guidelines to get started.

tutorials's People

Contributors

Stargazers

Watchers

Forkers

lchen89 jobigeorge metadaddy rushah codeaudit vaseem-mv dachinzo sksundaram-learning hemanthmareddy prasadbmvs dogvile-gr rpatil524 mkristiansen irinamax iasindev mezianehadjadj chengsh7 dppmackay9 abhi10901 pawanrana billthebest bravecharge vikki1107 lynch0227 virtualramblas mrugesh1989 parrondo olummy bcoryat spendyala avvaramesh sureddy iwrigley jasonto87 forestlzj apurvaanand boghbogh cammette kioco olivertheobald maduhu deepaktewani shaweifeng adityakonda sergeevo froberts71 chattg1 suixingbugai ankdadoo rayfirez iamontheinet maymieeiei nightwatchers martinhum gfn9cho lupingqiu jeahyunkim dougsfo thuylevn emediacode pedlopes rmarshasatx gridl tarunsinghal92 tyyzqmf imsaaks mccoyzhu stevenflood dusts66 jackljohnson hrdkgtm sonalirungta szulijun 2415897-ontarioinc easel andyzheng wangyi008 danish-moengage razvansimion bestnaja myselfmohsin sunnyzyc rhofiratos aaatu khmoez07 sathishnatarajan2 mohanmithran qiurongzheng president-zhou chenshijielt fzeeshan ndjido vannazux junweiyu leezs1991 qianqiancan oleggorj c0rnholi0 coolbrise ajaytrivedi

tutorials's Issues

Is it possible to send error out record in separate pipline while using custom processor

Hi,

I am new for StreamSets development. I implemented custom processor successfully. But I am stuck with to send custom processor error records to other destination pipeline.

Thanks,
Ajit

ConfigIssue deprecated

Tag 3.1.0.0-RC2 has ConfigIssue marked deprecated, but it appears in the sample project and in readme.md. Consider updating to reflect best practices in SDC 3.x?

how to support multi tenants ?

hi , does streamsets have plan to support multi tenants ?

each tenant has its own workspace ,can only see their own pipelines, but not others

Please add the junit test cases

It would be great if you could add some junit test cases for the custom stream processor as they are missing in the code .
Currently , I am trying to write the unit test case for one of my custom processor code but its giving me error .
Below is the error which I am facing :-
Caused by: java.lang.IllegalStateException: fileRefsBaseDir has not been set

Dicrectory Preview shows only filepath

Hi,
When performing preview on directory it is just showing the path of json file. Following are the links of snapshots of configuration of directory stage.
[https://drive.google.com/open?id=1Vr49NgsLQ0vaTB7Nxxo4Nv8oCV1Zy9Ya](Snapshot 1)
[https://drive.google.com/open?id=1zKKLzvSBSdP_1dC2vxCoMf0UkWiMqGLm](Snapshot 2)
[https://drive.google.com/open?id=1Qvjj3-pwxMFAPxaCROGObMmaYDGduiXd](Snapshot 3)

Replace Hive Streaming in tutorials with Hive Sync

The original tutorial uses hive streaming which is over 2 years old and not well supported. Switch tutorial to hive sync solution.

SPARK_01 - Specified class: 'com.streamset.spark.processor.CustomTransformer' was not found in classpath

Hi Team,

I am trying to run the Spark Evaluator in streamset datacollector but i am getting error like "SPARK_01 - Specified class: 'com.streamset.spark.processor.CustomTransformer' was not found in classpath" in Spark Transformer Class argument

.

Please help me to fix.

Downloaded the maven data collector project - but seems incomplete

SO I followed the directions to get the project -- and brought it up in IntelliJ -- but it seems to be incomplete (or maybe I am doing something wrong) -- here is a screenshot of the project:

Tutorial not working with latest version of streamset

'Creating a Custom StreamSets Destination' tutorial problems

I wish to create a 'StreamSets Destination' to interface with my firms software, so I am walking through the 'Creating a Custom StreamSets Destination' tutorial. When I get to the 'Extract the tarball to SDC’s user-libs directory, restart SDC, and you should see the sample stages in the stage library' step ... I don't see 'Sample' or 'sample' listed in the 'Select Destination to connect' dropdown.

Some details:

I have downloaded and built version 3.7.1 of datacollector, datacollector-api, datacollector-plugin-api ... and pulled the latest version of datacollector-edge from git.
I untarred the tarballs all into ~/work. I also 'created a new custom stage project using the Maven archetype' in ~/work creating the ~/work/samplestage directory.
After I built datacollector, I did 'cp ~/work/samplestage/target/samplestage-1.0-SNAPSHOT.jar ~/work/datacollector/dist/target/streamsets-datacollector-3.7.1/streamsets-datacollector-3.7.1/user-libs' to copy the jar into the datacollector tree.
I then ran datacollector with "~/work/datacollector/dist/target/streamsets-datacollector-3.7.1/streamsets-datacollector-3.7.1/bin/streamsets dc"
I then logged into streamsets as 'admin/admin', created a new 'Data Collector Pipeline', performed 'Select Origin->Directory' to create an origin, then pulled down the 'Select Destination to connect' dropdown ... in which I could not find 'Sample' or 'sample'.

Tutorial not working with ver 3.5

I am trying to work with the basic tutorial involving Kafka:
https://github.com/streamsets/tutorials/blob/master/tutorial-2/directory_to_kafkaproducer.md

It's not working. It seems that you user interface(UI) has changed or evolved but the tutorial is still referring to the old user interface and steps.

The moment I am at this step:
Defining the Kafka Producer

There is no way I can follow it as the user interface and the navigation is not in sync with this documentation.
Products like stream sets which are having sucha heavy duty user interface and the navigation should be having almost the perfect documentation and updated as per the latest versions. Or may be the tutorial should clearly specify with what versions it is going to work?

JDBC_16 - Table 'xxx' does not exist or PDB is incorrect. Make sure the correct PDB was specified

JDBC Producer
Insert to oracle error
use jar ojdbc8 or ojdbc6

JDBC_16 - Table 'xxx' does not exist or PDB is incorrect. Make sure the correct PDB was specified

tutorial-processor

It's throwing Nullpointerexception in test TestSampleProcessor while compiling code. It's working fine with the below code in

for (String fieldPath : record.getEscapedFieldPaths()) {
      Field field = record.get(fieldPath);
      if (field.getType() == Field.Type.STRING) {
        String reversed = (new StringBuilder(field.getValueAsString())).reverse().toString();
        record.set(fieldPath + ".reversed", Field.create(reversed));
      }
    }

But while writing code for whole file transfer throwing Nullpointerexception.

FileRef fileRef = record.get("/fileRef").getValueAsFileRef();
    Metadata metadata;
    try {
      metadata = ImageMetadataReader.readMetadata(fileRef.createInputStream(getContext(), InputStream.class));
    } catch (ImageProcessingException | IOException e) {
      String filename = record.get("/fileInfo/filename").getValueAsString();
      LOG.info("Exception getting metadata from {}", filename, e);
      throw new OnRecordErrorException(record, Errors.SAMPLE_02, e);
    }

    // A Metadata object contains multiple Directory objects
    for (Directory directory : metadata.getDirectories()) {
      // Each Directory stores values in Tag objects
      for (Tag tag : directory.getTags()) {
        LOG.info("TAG: {}", tag);
      }
      // Each Directory may also contain error messages
      if (directory.hasErrors()) {
        for (String error : directory.getErrors()) {
          LOG.info("ERROR: {}", error);
        }
      }
    }

how to speed up "File Tail" ？

I have a log named access.log size 3.8G

I create a simple pipeline filetail-trash

not rate limited . but I found Record Throughput only less then 20 records/s

pipeline with default config

how to optimize ？

Hive DRIFT

tutorial needs to be updated to show using post events to invalidate the hive metadata

Pipeline Status: RUN_ERROR: java.lang.NoClassDefFoundError

Hello

I'm trying to follow the spark transformer tutorial but when I come to run my own custom transformer I get class related issues which suggest some sort of config or class path problem but I cannot see how to remedy the situation as the docs suggest when running a pipeline in stand alone mode streamsets will use the built in stage libraries ie I should not have to deploy any dependencies such as xml parsers or spark libs

First run produces the following exception

Pipeline Status: RUN_ERROR: java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/type/ReferenceType

whilst a subsequent run produces

Pipeline Status: RUN_ERROR: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$

I have followed the readme and used the supplied code and pom file so I'm obviously missing something here

any thoughts?

thanks

Steve

Please add a license to the repo

Hi,

Could you please add an explicit LICENSE.txt file to the repo so that it's clear under what terms the content is provided, and under what terms user contributions are licensed?

Per GitHub docs on licensing:

Generally speaking, the absence of a license means that the default copyright laws apply. This means that you retain all rights to your source code and that nobody else may reproduce, distribute, or create derivative works from your work. This might not be what you intend.

Thanks!

curl -XPUT syntax error

used the following with a single quote at the end and it worked

'{
"mappings": {
"logs" : {
"properties" : {
"timestamp": {"type": "date"},
"geo": {"type": "geo_point"},
"city": {
"type": "string",
"index": "not_analyzed"}
}
}
}
}'

does the JDBCQueryConsumer support hive JDBC?

com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported
at com.streamsets.pipeline.lib.jdbc.JdbcUtil.createDataSourceForRead(JdbcUtil.java:774)
at com.streamsets.pipeline.stage.origin.jdbc.JdbcSource.init(JdbcSource.java:216)
at com.streamsets.pipeline.api.base.BaseStage.init(BaseStage.java:48)
at com.streamsets.pipeline.configurablestage.DStage.init(DStage.java:36)
at com.streamsets.datacollector.runner.StageRuntime.init(StageRuntime.java:159)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:101)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:49)
at com.streamsets.datacollector.runner.Pipeline.initPipe(Pipeline.java:382)
at com.streamsets.datacollector.runner.Pipeline.init(Pipeline.java:295)
at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:49)
at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:206)
at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$0(AsyncPreviewer.java:94)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:249)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:245)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported
at com.zaxxer.hikari.pool.HikariPool.initializeConnections(HikariPool.java:581)
at com.zaxxer.hikari.pool.HikariPool.(HikariPool.java:152)
at com.zaxxer.hikari.HikariDataSource.(HikariDataSource.java:73)
at com.streamsets.pipeline.lib.jdbc.JdbcUtil.createDataSourceForRead(JdbcUtil.java:767)
... 20 more
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveConnection.setReadOnly(HiveConnection.java:1198)
at com.zaxxer.hikari.pool.PoolElf.setupConnection(PoolElf.java:184)
at com.zaxxer.hikari.pool.HikariPool.addConnection(HikariPool.java:497)
at com.zaxxer.hikari.pool.HikariPool.initializeConnections(HikariPool.java:565)
... 23 more

how to load PipelineConfiguration in origin

I created a new lib follow the tutorials,then i do something like load schema from the origin database,i wish the data can display on web.But i can't find the way to transform the schema data.

Are Field types immutable?

I've built a working 'streamsets destination' to feed data into the analytics platform my team is working on, but I have a question about 'schema'.

With respect to the 'field type' (ie what is returned from Field.getType()), are these types immutable for a given fieldPath ... or might these types change over time for any single given fieldPath?

Thanks in advance.

proxy support

I've been through the tutorial from behind a corporate firewall so had to add proxy support. In case anyone else wants/needs to do this I had to:

in SampleTarget.java
import org.apache.http.HttpHost
add viaproxy line in Request.Post (before the bodyString) e.g.
.viaProxy(new HttpHost("my.proxy.host", 8080))

in sdc-security.policy
change sdc permission to my proxy host rather than requestb.in

tutorials-1:java.lang.RuntimeException: error while performing request

2017-03-02 11:28:36,736 [user:*admin] [pipeline:tutorial01/3a5c571e-2d7a-49ce-aa5f-61006e68ea72] [thread:preview-pool-1-thread-1] WARN Pipeline - Stage 'com_streamsets_pipeline_stage_origin_spooldir_SpoolDirDSource_1' initialization error: java.lang.RuntimeException: error while performing request java.lang.RuntimeException: error while performing request at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:638) at org.elasticsearch.client.RestClient.performRequest(RestClient.java:212) at org.elasticsearch.client.RestClient.performRequest(RestClient.java:184) at org.elasticsearch.client.RestClient.performRequest(RestClient.java:146) at com.streamsets.pipeline.stage.destination.elasticsearch.ElasticSearchTarget.init(ElasticSearchTarget.java:290) at com.streamsets.pipeline.api.base.BaseStage.init(BaseStage.java:52) at com.streamsets.pipeline.configurablestage.DStage.init(DStage.java:40) at com.streamsets.datacollector.runner.StageRuntime.init(StageRuntime.java:156) at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:105) at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:53) at com.streamsets.datacollector.runner.Pipeline.initPipe(Pipeline.java:260) at com.streamsets.datacollector.runner.Pipeline.init(Pipeline.java:252) at com.streamsets.datacollector.runner.Pipeline.validateConfigs(Pipeline.java:171) at com.streamsets.datacollector.runner.preview.PreviewPipeline.validateConfigs(PreviewPipeline.java:63) at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.validateConfigs(SyncPreviewer.java:122) at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer$1.call(AsyncPreviewer.java:76) at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:233) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.http.ProtocolException: Not a valid protocol version: This is not a HTTP port at org.apache.http.impl.nio.codecs.AbstractMessageParser.parse(AbstractMessageParser.java:209) at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:245) at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) ... 1 more Caused by: org.apache.http.ParseException: Not a valid protocol version: This is not a HTTP port at org.apache.http.message.BasicLineParser.parseProtocolVersion(BasicLineParser.java:148) at org.apache.http.message.BasicLineParser.parseStatusLine(BasicLineParser.java:366) at org.apache.http.impl.nio.codecs.DefaultHttpResponseParser.createMessage(DefaultHttpResponseParser.java:112) at org.apache.http.impl.nio.codecs.DefaultHttpResponseParser.createMessage(DefaultHttpResponseParser.java:50) at org.apache.http.impl.nio.codecs.AbstractMessageParser.parseHeadLine(AbstractMessageParser.java:156) at org.apache.http.impl.nio.codecs.AbstractMessageParser.parse(AbstractMessageParser.java:207) ... 11 more 2017-03-02 11:28:37,749 [user:admin] [pipeline:tutorial01/3a5c571e-2d7a-49ce-aa5f-61006e68ea72] [thread:webserver-21] WARN StandaloneAndClusterPipelineManager - Evicting idle previewer '3a5c571e-2d7a-49ce-aa5f-61006e68ea72::null'::'61816234-927e-4ac1-bc87-8572d47fdf97' in status 'INVALID'

Build Fails, missing dependency

When trying to follow the sampleprocessor tutorial, when maven is downloading dependencies from this:
mvn clean package -DskipTests
This line is returned:
[WARNING] The POM for com.streamsets:streamsets-datacollector-api:jar:1.3.0.0-SNAPSHOT is missing, no dependency information available
This then causes the build to fail.

RequestBin destination is not created.

Despite building the projects and copied the jar file to desired location custom requestbin not created.

does your solution support k8s or mesos

i am currently evaluating rancher for our internal infrastructure ,so here comes question
does streamset support k8s,mesos or rancher such docker container architecture?

Single pipeline for CRUD operations

I have been trying reading data from mysql db and writing to another table in mysql db.But insertion is happening but update and deleted records is not reflecting in my db.

How can I use JDBC Query Consumer to query data from hive?

Hi,i want to query data from hive by JDBC Query Consumer .But it didn't work .
Error is "JJDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported"
How can i solve it?

how to build streamsets-datacollector-core ?

Exception during pool initialization: Method not supported

JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported

Not able to read multiple files from directory origin

I am trying to read multiple json files from directory using directory origin. But its reading only first file from directory. Please help me on the same..

stateful calculations processors ?

can the streamsets on yarn streaming cluster model do stateful calculations?

Such as real-time statistics PV, UV or certain fields aggregation ?

will you give some advice and code sample ?

mvn clean install for 2.1.0.0 DC fails under Windows

Under Windows:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:
run (rat-report) on project streamsets-datacollector: An Ant BuildException has
occured: Execute failed: java.io.IOException: Cannot run program "\bin\bash": Cr
eateProcess error=2, The system cannot find the file specified
[ERROR] around Ant part ...<exec dir="C:\Users\madison\Downloads\2.1.0.0_dc\d atacollector-2.1.0.0" executable="/bin/bash" failonerror="true">... @ 10:120 in
C:\Users\madison\Downloads\2.1.0.0_dc\datacollector-2.1.0.0\target\antrun\bui
ld-main.xml

save all pipeline json config kv to mysql ?

as I know streamsets save/update piepline config info and offset as json file at local dir

does streamsets support save them into mysql ?

how to keep High Availability ?

version 3.0.0 not support kafka 0.8 ?

Elasticsearch Update Operation Tutorial

Hi,

Is there a tutorial on ElasticSearch update operation?

Pipeline Flow:
Have create a minimal pipeline to read data from ES 6.2.2 (origin) and push it to RDBMS (destination 1). Thereafter we update same ES document (destination 2) to update a field @Version from 1 to 2. So that on the next run of the pipeline this document from ES will not be picked for moving to RDBMS.

Can someone point to the right direction as we are getting below error, after following the Streamset documents at the Destination (Also if someone could point on debugging the flow it would be helpful as the error is not clear where exactly the issue is) :

2018-05-28 08:27:35,171 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] INFO Pipeline - Destroying pipeline with reason=FAILURE
2018-05-28 08:27:35,177 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] INFO Pipeline - Processing lifecycle stop event
2018-05-28 08:27:35,177 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] INFO Pipeline - Pipeline finished destroying with final reason=FAILURE
2018-05-28 08:27:35,185 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] ERROR ProductionPipelineRunnable - An exception occurred while running the pipeline, com.streamsets.pipeline.api.StageException: ELASTICSEARCH_17 - Could not index '347' records: One or more operations failed
com.streamsets.pipeline.api.StageException: ELASTICSEARCH_17 - Could not index '347' records: One or more operations failed
at com.streamsets.pipeline.stage.destination.elasticsearch.ElasticsearchTarget.write(ElasticsearchTarget.java:342)
at com.streamsets.pipeline.configurablestage.DTarget.write(DTarget.java:34)
at com.streamsets.datacollector.runner.StageRuntime.lambda$execute$2(StageRuntime.java:249)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:195)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:257)
at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:219)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.processPipe(ProductionPipelineRunner.java:801)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.lambda$executeRunner$3(ProductionPipelineRunner.java:846)
at com.streamsets.datacollector.runner.PipeRunner.executeBatch(PipeRunner.java:136)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.executeRunner(ProductionPipelineRunner.java:845)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.runSourceLessBatch(ProductionPipelineRunner.java:823)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.processBatch(ProductionPipelineRunner.java:472)
at com.streamsets.datacollector.runner.StageRuntime$3.run(StageRuntime.java:321)
at java.security.AccessController.doPrivileged(Native Method)
at com.streamsets.datacollector.runner.StageRuntime.processBatch(StageRuntime.java:317)
at com.streamsets.datacollector.runner.StageContext.processBatch(StageContext.java:252)
at com.streamsets.pipeline.stage.origin.elasticsearch.ElasticsearchSource$ElasticsearchTask.run(ElasticsearchSource.java:231)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Spark Transformer tutorials should send some output to the log file

private static final Logger LOG = LoggerFactory.getLogger(Transformer.class);

LOG.info("Number Five is alive!");

how to define general label for Custom Stage Libraries

i follow the tutorials to create my own hbase stage library, and i want to support different version for
hbase，i guess i can do it by re-define the general label,

but i don't konw how to do that,

Not able to find maven dependacies

Hi I am going through tutorial "Creating a Custom StreamSets Destination", in that two object instances 1) DataGenerator and 2) DelimitedCharDataGenerator could not able to resolved.


import com.streamsets.pipeline.lib.generator.DataGenerator;
import com.streamsets.pipeline.lib.generator.delimited.DelimitedCharDataGenerator;

both are not able to resolved.

The mentioned maven repository is not available so adding below mentioned dependency in pom.xml file cause error too,

<dependency>
      <groupId>com.streamsets</groupId>
      <artifactId>streamsets-datacollector-commonlib</artifactId>
      <version>1.2.2.0</version>
      <scope>provided</scope>
    </dependency>

On maven repository available version is 1.1.3 which is quite older. Your help will be appreciated.
Thanks in advance.

Issue building cdh_6_0-lib subproject

Maven CDH downloads for the datacollector build kept hanging until I went in and changed the CDH repo to the following in cdh_6_0-lib/pom.xml:

<!--<url>http://bits.cloudera.com/f93c6c9d/cdh6/6.0.0-beta1/maven-repository/</url>-->
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>

Custom processor Field value getting turncated

Hi,

I developed custom processor based on tutorial. Pipline is working good. But when i print JSON value from field its getting truncated json.
Is any length limit for field.getValueAsString()?

Below is code:
for (String fieldPath : record.getEscapedFieldPaths()) {
Field field = record.get(fieldPath);
if (field.getType() == Field.Type.STRING) {
String json = field.getValueAsString();
LOG.info("Field Value:: = " + json);
}
}

ThankYou,
Ajit

Hadoop FS destination File Type

The tutorial is not clear on what file type should be used for the output file when AVRO is selected as a data format. Also the tutorial needs to be updated to show current configuration windows and options as it's out of date.

This is for the HIVE Drift solution

Requirements

Please note:
-The maven/ant build can only run on Linux (not Windows)
-On Linux the maven build needs at least a version of 3.2.1 or higher
-On Linux the maven build only works on jdk1.8 (jdk1.7 did not work for me)

Create a tutorial on how to setup tls for kafka

Is a future tutorial on setting TLS for kafka planned? I couldn't find any information on how to set it up.

Rename the Hive Drift Solution => Drift Synchronization Solution for Hive

filetail, log file move to history dir ,how to configure file generation and archive strategies

I have nginx log file to collect in "/data/logs/nginx/xyz.log"
every day ,mv the log and compress to /data/logs/nginx/2017/05/xyz.log.tar.gz
and recreate new log file /data/logs/nginx/xyz.log
how to configure file generation and archive strategies ?
i try use default "Active File with Reverse Counter " naming type ,but i found streamsets try to collect /data/logs/nginx/2017 ,but that is a dir

hivetarget【hive streaming】 Opened a connection to metastore, current connections too much , which made pipeline failed

HIVE_05 - Hive Metastore error: Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused) at org.apache.thrift.transport.TSocket.open(TSocket.java:187) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:414) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:179) at com.streamsets.pipeline.stage.destination.hive.HiveTarget$1.run(HiveTarget.java:251) at com.streamsets.pipeline.stage.destination.hive.HiveTarget$1.run(HiveTarget.java:245) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at com.streamsets.pipeline.stage.destination.hive.HiveTarget.initHiveMetaStoreClient(HiveTarget.java:244) at com.streamsets.pipeline.stage.destination.hive.HiveTarget.init(HiveTarget.java:195) at com.streamsets.pipeline.api.base.BaseStage.init(BaseStage.java:52) at com.streamsets.pipeline.configurablestage.DStage.init(DStage.java:40) at com.streamsets.datacollector.runner.StageRuntime.init(StageRuntime.java:136) at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:104) at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:53) at com.streamsets.datacollector.runner.Pipeline.init(Pipeline.java:158) at com.streamsets.datacollector.execution.runner.common.ProductionPipeline.run(ProductionPipeline.java:97) at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunnable.run(ProductionPipelineRunnable.java:72) at com.streamsets.datacollector.execution.runner.standalone.StandaloneRunner.start(StandaloneRunner.java:761) at com.streamsets.datacollector.execution.runner.common.AsyncRunner$4.call(AsyncRunner.java:147) at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:233) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.thrift.transport.TSocket.open(TSocket.java:182) ... 27 more

how to send jobs to spark cluster

spark evaluator runs in local, is there any way to send jobs to a spark cluster to execute ? I have a spark cluster managed in standalone mode, how to send jobs to it ? I call setMaster() method to point spark's master to my spark cluster in init method of my spark transformer, but it seems not work .

ip address not found in geo database

Trying to do the logs to elasticsearch tutorial but get ip address not found in the geo database:

The mmdb is in the resources directory:

Uses of OnRecordError API

Hi,

Can i use OnRecordError API on my custom processor to to send the error records to other pipeline.

Any docs or URL for this solutions, Please share with me.