
tutorials's Introduction

StreamSets DataOps Platform Tutorials

The following tutorials demonstrate features of StreamSets Data Collector, StreamSets Transformer, StreamSets Control Hub, and the StreamSets SDK for Python.

StreamSets Data Collector -- Basic Tutorials

StreamSets Data Collector -- Writing Custom Pipeline Stages

We have Data Collector API Javadocs to share in case of need; please reach out to us if you need them.

StreamSets Data Collector -- Advanced Features

The Data Collector documentation also includes an extended tutorial that walks through basic Data Collector functionality, including creating, previewing and running a pipeline, and creating alerts.

StreamSets Data Collector -- Kubernetes-based Deployment

StreamSets Control Hub

StreamSets Transformer

StreamSets SDK for Python

Common

Control Hub

License

StreamSets Data Collector and its tutorials are built on open source technologies; the tutorials and accompanying code are licensed under the Apache License 2.0.

Contributing Tutorials

We welcome contributors! Please check out our guidelines to get started.

tutorials's People

Contributors

alejandrosanchezcabana, dougsfo, iamontheinet, iwrigley, j-mcavoy, kiritbasu, kirtiv1, kunickiaj, lchen89, metadaddy, onefoursix, rushah, sameersrinivas, timgrossmann, virtualramblas, xverges


tutorials's Issues

ConfigIssue deprecated

Tag 3.1.0.0-RC2 marks ConfigIssue as deprecated, but it still appears in the sample project and in readme.md. Consider updating to reflect best practices in SDC 3.x?

How to support multiple tenants?

Hi, does StreamSets have a plan to support multiple tenants?

Each tenant would have its own workspace and could only see its own pipelines, not those of other tenants.

Please add JUnit test cases

It would be great if you could add some JUnit test cases for the custom stream processor, as they are missing from the code.
Currently I am trying to write a unit test case for one of my custom processors, but it gives me an error.
Below is the error I am facing:
Caused by: java.lang.IllegalStateException: fileRefsBaseDir has not been set
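In case it helps, here is a minimal sketch of such a test, modeled on the ProcessorRunner pattern used in the tutorial's TestSampleProcessor. The SampleDProcessor class name and the "config" setting are assumptions taken from the stage archetype, and the fileRefsBaseDir error above is typically tied to whole-file records, which this sketch does not exercise:

import com.streamsets.pipeline.api.Field;
import com.streamsets.pipeline.api.Record;
import com.streamsets.pipeline.sdk.ProcessorRunner;
import com.streamsets.pipeline.sdk.RecordCreator;
import com.streamsets.pipeline.sdk.StageRunner;
import org.junit.Assert;
import org.junit.Test;

import java.util.Collections;

public class TestSampleProcessor {
  @Test
  public void testStringFieldIsProcessed() throws Exception {
    // Build a runner for the processor under test (class name from the archetype).
    ProcessorRunner runner = new ProcessorRunner.Builder(SampleDProcessor.class)
        .addConfiguration("config", "value")
        .addOutputLane("output")
        .build();

    runner.runInit();
    try {
      // A record with a single string field at /text.
      Record record = RecordCreator.create();
      record.set(Field.create(Collections.singletonMap("text", Field.create("hello"))));

      StageRunner.Output output = runner.runProcess(Collections.singletonList(record));

      // One record should arrive on the output lane.
      Assert.assertEquals(1, output.getRecords().get("output").size());
    } finally {
      runner.runDestroy();
    }
  }
}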

Directory preview shows only file path

Hi,
When performing a preview on the Directory origin, it just shows the path of the JSON file. Following are links to snapshots of the Directory stage configuration:
[Snapshot 1](https://drive.google.com/open?id=1Vr49NgsLQ0vaTB7Nxxo4Nv8oCV1Zy9Ya)
[Snapshot 2](https://drive.google.com/open?id=1zKKLzvSBSdP_1dC2vxCoMf0UkWiMqGLm)
[Snapshot 3](https://drive.google.com/open?id=1Qvjj3-pwxMFAPxaCROGObMmaYDGduiXd)

'Creating a Custom StreamSets Destination' tutorial problems

I wish to create a StreamSets destination to interface with my firm's software, so I am walking through the 'Creating a Custom StreamSets Destination' tutorial. When I get to the step 'Extract the tarball to SDC's user-libs directory, restart SDC, and you should see the sample stages in the stage library', I don't see 'Sample' or 'sample' listed in the 'Select Destination to connect' dropdown.

Some details:

  • I have downloaded and built version 3.7.1 of datacollector, datacollector-api, and datacollector-plugin-api, and pulled the latest version of datacollector-edge from Git.
  • I untarred the tarballs into ~/work. I also 'created a new custom stage project using the Maven archetype' in ~/work, creating the ~/work/samplestage directory.
  • After I built datacollector, I ran 'cp ~/work/samplestage/target/samplestage-1.0-SNAPSHOT.jar ~/work/datacollector/dist/target/streamsets-datacollector-3.7.1/streamsets-datacollector-3.7.1/user-libs' to copy the jar into the datacollector tree.
  • I then ran datacollector with "~/work/datacollector/dist/target/streamsets-datacollector-3.7.1/streamsets-datacollector-3.7.1/bin/streamsets dc".
  • I then logged into StreamSets as 'admin/admin', created a new 'Data Collector Pipeline', performed 'Select Origin->Directory' to create an origin, then pulled down the 'Select Destination to connect' dropdown, in which I could not find 'Sample' or 'sample'.

Tutorial not working with version 3.5

I am trying to work through the basic tutorial involving Kafka:
https://github.com/streamsets/tutorials/blob/master/tutorial-2/directory_to_kafkaproducer.md

It's not working. It seems that your user interface (UI) has changed or evolved, but the tutorial still refers to the old user interface and steps.

The moment I reach this step:
Defining the Kafka Producer

there is no way I can follow it, as the user interface and navigation are not in sync with this documentation.
Products like StreamSets, which have such a heavy-duty user interface and navigation, should have near-perfect documentation, kept updated for the latest versions. Or maybe the tutorial should clearly state which versions it works with?

tutorial-processor

It throws a NullPointerException in the TestSampleProcessor test when compiling the code. It works fine with the code below:

for (String fieldPath : record.getEscapedFieldPaths()) {
  Field field = record.get(fieldPath);
  if (field.getType() == Field.Type.STRING) {
    String reversed = (new StringBuilder(field.getValueAsString())).reverse().toString();
    record.set(fieldPath + ".reversed", Field.create(reversed));
  }
}

But the code below, for whole file transfer, throws a NullPointerException:

Field fileRefField = record.get("/fileRef");
// record.get("/fileRef") returns null unless the record is a whole-file
// record, and calling getValueAsFileRef() on null is the likely source of
// the NullPointerException reported above.
if (fileRefField == null) {
  throw new OnRecordErrorException(record, Errors.SAMPLE_02);
}
FileRef fileRef = fileRefField.getValueAsFileRef();
Metadata metadata;
try {
  metadata = ImageMetadataReader.readMetadata(fileRef.createInputStream(getContext(), InputStream.class));
} catch (ImageProcessingException | IOException e) {
  String filename = record.get("/fileInfo/filename").getValueAsString();
  LOG.info("Exception getting metadata from {}", filename, e);
  throw new OnRecordErrorException(record, Errors.SAMPLE_02, e);
}

// A Metadata object contains multiple Directory objects
for (Directory directory : metadata.getDirectories()) {
  // Each Directory stores values in Tag objects
  for (Tag tag : directory.getTags()) {
    LOG.info("TAG: {}", tag);
  }
  // Each Directory may also contain error messages
  if (directory.hasErrors()) {
    for (String error : directory.getErrors()) {
      LOG.info("ERROR: {}", error);
    }
  }
}

How to speed up "File Tail"?

I have a log named access.log, 3.8 GB in size.

I created a simple pipeline: File Tail to Trash.

It is not rate limited, but I found the record throughput is only less than 20 records/s,

with the default pipeline config.

How can I optimize this?

Hive DRIFT

The tutorial needs to be updated to show how to use post events to invalidate the Hive metadata.

Pipeline Status: RUN_ERROR: java.lang.NoClassDefFoundError

Hello

I'm trying to follow the Spark transformer tutorial, but when I run my own custom transformer I get class-related issues that suggest some sort of config or classpath problem. I cannot see how to remedy the situation: the docs suggest that when running a pipeline in standalone mode, StreamSets uses the built-in stage libraries, i.e. I should not have to deploy any dependencies such as XML parsers or Spark libs.

First run produces the following exception

Pipeline Status: RUN_ERROR: java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/type/ReferenceType

whilst a subsequent run produces

Pipeline Status: RUN_ERROR: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$

I have followed the readme and used the supplied code and POM file, so I'm obviously missing something here.

Any thoughts?

thanks

Steve

Please add a license to the repo

Hi,

Could you please add an explicit LICENSE.txt file to the repo so that it's clear under what terms the content is provided, and under what terms user contributions are licensed?

Per GitHub docs on licensing:

Generally speaking, the absence of a license means that the default copyright laws apply. This means that you retain all rights to your source code and that nobody else may reproduce, distribute, or create derivative works from your work. This might not be what you intend.

Thanks!

curl -XPUT syntax error

I used the following, with a single quote at the end, and it worked:

'{
  "mappings": {
    "logs": {
      "properties": {
        "timestamp": {"type": "date"},
        "geo": {"type": "geo_point"},
        "city": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'

Does the JDBC Query Consumer support Hive JDBC?

com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported
at com.streamsets.pipeline.lib.jdbc.JdbcUtil.createDataSourceForRead(JdbcUtil.java:774)
at com.streamsets.pipeline.stage.origin.jdbc.JdbcSource.init(JdbcSource.java:216)
at com.streamsets.pipeline.api.base.BaseStage.init(BaseStage.java:48)
at com.streamsets.pipeline.configurablestage.DStage.init(DStage.java:36)
at com.streamsets.datacollector.runner.StageRuntime.init(StageRuntime.java:159)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:101)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:49)
at com.streamsets.datacollector.runner.Pipeline.initPipe(Pipeline.java:382)
at com.streamsets.datacollector.runner.Pipeline.init(Pipeline.java:295)
at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:49)
at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:206)
at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$0(AsyncPreviewer.java:94)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:249)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:245)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported
at com.zaxxer.hikari.pool.HikariPool.initializeConnections(HikariPool.java:581)
at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:152)
at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:73)
at com.streamsets.pipeline.lib.jdbc.JdbcUtil.createDataSourceForRead(JdbcUtil.java:767)
... 20 more
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveConnection.setReadOnly(HiveConnection.java:1198)
at com.zaxxer.hikari.pool.PoolElf.setupConnection(PoolElf.java:184)
at com.zaxxer.hikari.pool.HikariPool.addConnection(HikariPool.java:497)
at com.zaxxer.hikari.pool.HikariPool.initializeConnections(HikariPool.java:565)
... 23 more

How to load PipelineConfiguration in an origin

I created a new stage library following the tutorials. I then want to do something like load the schema from the origin database, and I wish that data to be displayed on the web UI, but I can't find a way to pass the schema data through.

Are Field types immutable?

I've built a working StreamSets destination to feed data into the analytics platform my team is working on, but I have a question about 'schema'.

With respect to the field type (i.e., what is returned from Field.getType()): are these types immutable for a given fieldPath, or might the type change over time for any single fieldPath?

Thanks in advance.
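The tutorials don't answer this directly, but since Data Collector is built around handling data drift, a defensive destination can check the type on every record rather than caching the type seen on the first record. A minimal sketch; the stringValueOrNull helper and the class wrapper are illustrative, not from the original:

import com.streamsets.pipeline.api.Field;
import com.streamsets.pipeline.api.Record;

public class FieldTypeCheck {
  // Check the field's type per record instead of assuming the type seen on
  // the first record holds for the whole stream.
  public static String stringValueOrNull(Record record, String fieldPath) {
    Field field = record.get(fieldPath);
    if (field != null && field.getType() == Field.Type.STRING) {
      return field.getValueAsString();
    }
    return null; // absent field, or a non-string value at this path
  }
}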

proxy support

I've been through the tutorial from behind a corporate firewall, so I had to add proxy support. In case anyone else wants or needs to do this, I had to:

In SampleTarget.java:
import org.apache.http.HttpHost
and add a viaProxy line in Request.Post (before the bodyString), e.g.
.viaProxy(new HttpHost("my.proxy.host", 8080))

In sdc-security.policy:
change the sdc permission to my proxy host rather than requestb.in.
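Put together, a minimal sketch of the modified call, assuming the tutorial's Apache HttpClient fluent API; the endpoint URL, proxy host, and port below are placeholders, and the ProxiedPost wrapper class is illustrative:

import org.apache.http.HttpHost;
import org.apache.http.client.fluent.Request;
import org.apache.http.entity.ContentType;

public class ProxiedPost {
  public static void send(String json) throws java.io.IOException {
    Request.Post("https://requestb.in/your-bin-id")      // placeholder endpoint
        .viaProxy(new HttpHost("my.proxy.host", 8080))   // route through the corporate proxy
        .bodyString(json, ContentType.APPLICATION_JSON)  // JSON body, as in the tutorial
        .execute();
  }
}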

tutorials-1: java.lang.RuntimeException: error while performing request

2017-03-02 11:28:36,736 [user:*admin] [pipeline:tutorial01/3a5c571e-2d7a-49ce-aa5f-61006e68ea72] [thread:preview-pool-1-thread-1] WARN Pipeline - Stage 'com_streamsets_pipeline_stage_origin_spooldir_SpoolDirDSource_1' initialization error: java.lang.RuntimeException: error while performing request
java.lang.RuntimeException: error while performing request
at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:638)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:212)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:184)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:146)
at com.streamsets.pipeline.stage.destination.elasticsearch.ElasticSearchTarget.init(ElasticSearchTarget.java:290)
at com.streamsets.pipeline.api.base.BaseStage.init(BaseStage.java:52)
at com.streamsets.pipeline.configurablestage.DStage.init(DStage.java:40)
at com.streamsets.datacollector.runner.StageRuntime.init(StageRuntime.java:156)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:105)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:53)
at com.streamsets.datacollector.runner.Pipeline.initPipe(Pipeline.java:260)
at com.streamsets.datacollector.runner.Pipeline.init(Pipeline.java:252)
at com.streamsets.datacollector.runner.Pipeline.validateConfigs(Pipeline.java:171)
at com.streamsets.datacollector.runner.preview.PreviewPipeline.validateConfigs(PreviewPipeline.java:63)
at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.validateConfigs(SyncPreviewer.java:122)
at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer$1.call(AsyncPreviewer.java:76)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.ProtocolException: Not a valid protocol version: This is not a HTTP port
at org.apache.http.impl.nio.codecs.AbstractMessageParser.parse(AbstractMessageParser.java:209)
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:245)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
... 1 more
Caused by: org.apache.http.ParseException: Not a valid protocol version: This is not a HTTP port
at org.apache.http.message.BasicLineParser.parseProtocolVersion(BasicLineParser.java:148)
at org.apache.http.message.BasicLineParser.parseStatusLine(BasicLineParser.java:366)
at org.apache.http.impl.nio.codecs.DefaultHttpResponseParser.createMessage(DefaultHttpResponseParser.java:112)
at org.apache.http.impl.nio.codecs.DefaultHttpResponseParser.createMessage(DefaultHttpResponseParser.java:50)
at org.apache.http.impl.nio.codecs.AbstractMessageParser.parseHeadLine(AbstractMessageParser.java:156)
at org.apache.http.impl.nio.codecs.AbstractMessageParser.parse(AbstractMessageParser.java:207)
... 11 more
2017-03-02 11:28:37,749 [user:admin] [pipeline:tutorial01/3a5c571e-2d7a-49ce-aa5f-61006e68ea72] [thread:webserver-21] WARN StandaloneAndClusterPipelineManager - Evicting idle previewer '3a5c571e-2d7a-49ce-aa5f-61006e68ea72::null'::'61816234-927e-4ac1-bc87-8572d47fdf97' in status 'INVALID'

Build Fails, missing dependency

When trying to follow the sampleprocessor tutorial, while Maven is downloading dependencies for:
mvn clean package -DskipTests
this line is returned:
[WARNING] The POM for com.streamsets:streamsets-datacollector-api:jar:1.3.0.0-SNAPSHOT is missing, no dependency information available
This then causes the build to fail.

Does your solution support k8s or Mesos?

I am currently evaluating Rancher for our internal infrastructure, so here comes the question:
does StreamSets support Kubernetes, Mesos, or a Rancher-style Docker container architecture?

Single pipeline for CRUD operations

I have been trying to read data from a MySQL DB and write to another table in MySQL. Insertion happens, but updated and deleted records are not reflected in my DB.

How can I use the JDBC Query Consumer to query data from Hive?

Hi, I want to query data from Hive with the JDBC Query Consumer, but it doesn't work.
The error is "JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported".
How can I solve it?

Exception during pool initialization: Method not supported

JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException: JDBC_06 - Failed to initialize connection pool: com.zaxxer.hikari.pool.PoolInitializationException: Exception during pool initialization: Method not supported

Stateful calculation processors?

Can the StreamSets Spark-on-YARN streaming cluster mode do stateful calculations,

such as real-time PV/UV statistics or aggregations over certain fields?

Will you give some advice and a code sample?

mvn clean install for 2.1.0.0 DC fails under Windows

Under Windows:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (rat-report) on project streamsets-datacollector: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "\bin\bash": CreateProcess error=2, The system cannot find the file specified
[ERROR] around Ant part ...<exec dir="C:\Users\madison\Downloads\2.1.0.0_dc\datacollector-2.1.0.0" executable="/bin/bash" failonerror="true">... @ 10:120 in C:\Users\madison\Downloads\2.1.0.0_dc\datacollector-2.1.0.0\target\antrun\build-main.xml

Save all pipeline JSON config KV to MySQL?

As I understand it, StreamSets saves/updates pipeline config info and offsets as JSON files in a local dir.

Does StreamSets support saving them into MySQL?

How to keep high availability?

Elasticsearch Update Operation Tutorial

Hi,

Is there a tutorial on the Elasticsearch update operation?

Pipeline Flow:
We have created a minimal pipeline to read data from ES 6.2.2 (origin) and push it to an RDBMS (destination 1). Thereafter we update the same ES document (destination 2), bumping a @Version field from 1 to 2, so that on the next run of the pipeline this document will not be picked up again for moving to the RDBMS.

Can someone point us in the right direction? We are getting the error below at the destination after following the StreamSets documents. (Also, pointers on debugging the flow would be helpful, as the error does not make clear where exactly the issue is.)

2018-05-28 08:27:35,171 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] INFO Pipeline - Destroying pipeline with reason=FAILURE
2018-05-28 08:27:35,177 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] INFO Pipeline - Processing lifecycle stop event
2018-05-28 08:27:35,177 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] INFO Pipeline - Pipeline finished destroying with final reason=FAILURE
2018-05-28 08:27:35,185 [user:*admin] [pipeline:ESUpdatePipeline/ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c] [runner:] [thread:ProductionPipelineRunnable-ESUpdatePipeline5fdde3c4-2c84-480b-a853-f2085333df2c-ESUpdatePipeline] ERROR ProductionPipelineRunnable - An exception occurred while running the pipeline, com.streamsets.pipeline.api.StageException: ELASTICSEARCH_17 - Could not index '347' records: One or more operations failed
com.streamsets.pipeline.api.StageException: ELASTICSEARCH_17 - Could not index '347' records: One or more operations failed
at com.streamsets.pipeline.stage.destination.elasticsearch.ElasticsearchTarget.write(ElasticsearchTarget.java:342)
at com.streamsets.pipeline.configurablestage.DTarget.write(DTarget.java:34)
at com.streamsets.datacollector.runner.StageRuntime.lambda$execute$2(StageRuntime.java:249)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:195)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:257)
at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:219)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.processPipe(ProductionPipelineRunner.java:801)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.lambda$executeRunner$3(ProductionPipelineRunner.java:846)
at com.streamsets.datacollector.runner.PipeRunner.executeBatch(PipeRunner.java:136)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.executeRunner(ProductionPipelineRunner.java:845)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.runSourceLessBatch(ProductionPipelineRunner.java:823)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.processBatch(ProductionPipelineRunner.java:472)
at com.streamsets.datacollector.runner.StageRuntime$3.run(StageRuntime.java:321)
at java.security.AccessController.doPrivileged(Native Method)
at com.streamsets.datacollector.runner.StageRuntime.processBatch(StageRuntime.java:317)
at com.streamsets.datacollector.runner.StageContext.processBatch(StageContext.java:252)
at com.streamsets.pipeline.stage.origin.elasticsearch.ElasticsearchSource$ElasticsearchTask.run(ElasticsearchSource.java:231)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Not able to find Maven dependencies

Hi, I am going through the tutorial "Creating a Custom StreamSets Destination"; in it, two classes, 1) DataGenerator and 2) DelimitedCharDataGenerator, cannot be resolved.


import com.streamsets.pipeline.lib.generator.DataGenerator;
import com.streamsets.pipeline.lib.generator.delimited.DelimitedCharDataGenerator;

Both imports fail to resolve.

The mentioned Maven repository is not available, so adding the dependency below to the pom.xml file causes an error too:

<dependency>
  <groupId>com.streamsets</groupId>
  <artifactId>streamsets-datacollector-commonlib</artifactId>
  <version>1.2.2.0</version>
  <scope>provided</scope>
</dependency>

The version available on the Maven repository is 1.1.3, which is quite old. Your help will be appreciated.
Thanks in advance.

Issue building cdh_6_0-lib subproject

Maven CDH downloads for the datacollector build kept hanging until I went in and changed the CDH repo in cdh_6_0-lib/pom.xml to the following:

<!--<url>http://bits.cloudera.com/f93c6c9d/cdh6/6.0.0-beta1/maven-repository/</url>-->
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>

Custom processor field value getting truncated

Hi,

I developed a custom processor based on the tutorial. The pipeline is working well, but when I print a JSON value from a field, the JSON is truncated.
Is there any length limit for field.getValueAsString()?

Below is the code:
for (String fieldPath : record.getEscapedFieldPaths()) {
  Field field = record.get(fieldPath);
  if (field.getType() == Field.Type.STRING) {
    String json = field.getValueAsString();
    LOG.info("Field Value:: = " + json);
  }
}

ThankYou,
Ajit
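One way to narrow this down (a debugging sketch, not from the original report): log the string's length next to the value, to tell whether the Field value itself is short or whether only the logged line is being truncated by the logging layer. The FieldLengthDebug wrapper class is illustrative:

import com.streamsets.pipeline.api.Field;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FieldLengthDebug {
  private static final Logger LOG = LoggerFactory.getLogger(FieldLengthDebug.class);

  // If the logged length matches the full document, the Field is intact and
  // the truncation is happening in the log output, not in getValueAsString().
  public static void logField(Field field) {
    String json = field.getValueAsString();
    LOG.info("Field length = {}", json.length());
    LOG.info("Field Value:: = {}", json);
  }
}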

Hadoop FS destination File Type

The tutorial is not clear on what file type should be used for the output file when Avro is selected as the data format. The tutorial also needs to be updated to show the current configuration windows and options, as it is out of date.

This is for the Hive Drift solution.

Requirements

Please note:
  • The Maven/Ant build can only run on Linux (not Windows).
  • On Linux, the Maven build needs at least version 3.2.1.
  • On Linux, the Maven build only works with JDK 1.8 (JDK 1.7 did not work for me).

File Tail: log file moved to a history dir; how to configure file generation and archive strategies?

I have an nginx log file to collect at /data/logs/nginx/xyz.log.
Every day the log is moved and compressed to /data/logs/nginx/2017/05/xyz.log.tar.gz,
and a new log file /data/logs/nginx/xyz.log is recreated.
How should I configure the file generation and archive strategies?
I tried the default "Active File with Reverse Counter" naming type, but I found StreamSets trying to collect /data/logs/nginx/2017, which is a directory.

HiveTarget (Hive Streaming) opens too many connections to the metastore, which makes the pipeline fail

HIVE_05 - Hive Metastore error: Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:414)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:234)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:179)
at com.streamsets.pipeline.stage.destination.hive.HiveTarget$1.run(HiveTarget.java:251)
at com.streamsets.pipeline.stage.destination.hive.HiveTarget$1.run(HiveTarget.java:245)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at com.streamsets.pipeline.stage.destination.hive.HiveTarget.initHiveMetaStoreClient(HiveTarget.java:244)
at com.streamsets.pipeline.stage.destination.hive.HiveTarget.init(HiveTarget.java:195)
at com.streamsets.pipeline.api.base.BaseStage.init(BaseStage.java:52)
at com.streamsets.pipeline.configurablestage.DStage.init(DStage.java:40)
at com.streamsets.datacollector.runner.StageRuntime.init(StageRuntime.java:136)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:104)
at com.streamsets.datacollector.runner.StagePipe.init(StagePipe.java:53)
at com.streamsets.datacollector.runner.Pipeline.init(Pipeline.java:158)
at com.streamsets.datacollector.execution.runner.common.ProductionPipeline.run(ProductionPipeline.java:97)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunnable.run(ProductionPipelineRunnable.java:72)
at com.streamsets.datacollector.execution.runner.standalone.StandaloneRunner.start(StandaloneRunner.java:761)
at com.streamsets.datacollector.execution.runner.common.AsyncRunner$4.call(AsyncRunner.java:147)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
... 27 more

How to send jobs to a Spark cluster

The Spark Evaluator runs locally; is there any way to send jobs to a Spark cluster to execute? I have a Spark cluster managed in standalone mode; how do I send jobs to it? I call the setMaster() method to point Spark's master at my cluster in the init method of my Spark transformer, but it doesn't seem to work.

Uses of OnRecordError API

Hi,

Can I use the OnRecordError API in my custom processor to send error records to another pipeline?

Any docs or URLs for this solution? Please share them with me.

Thanks,
Ajit
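For what it's worth, a hedged sketch of the usual error-handling pattern generated by the stage archetype: records passed to getContext().toError() go to the stage's error stream, and the pipeline's own Error Records configuration then decides where they land (which can include writing to another pipeline). The MyTarget class is illustrative, and Errors.SAMPLE_01 is assumed to be the archetype's example ErrorCode enum:

import com.streamsets.pipeline.api.Batch;
import com.streamsets.pipeline.api.Record;
import com.streamsets.pipeline.api.StageException;
import com.streamsets.pipeline.api.base.BaseTarget;

import java.util.Iterator;

public class MyTarget extends BaseTarget {
  @Override
  public void write(Batch batch) throws StageException {
    Iterator<Record> it = batch.getRecords();
    while (it.hasNext()) {
      Record record = it.next();
      try {
        // ... deliver the record to the external system ...
      } catch (Exception e) {
        // Honor the stage's "On Record Error" setting.
        switch (getContext().getOnErrorRecord()) {
          case DISCARD:
            break;                           // drop the record silently
          case TO_ERROR:
            getContext().toError(record, e); // route to the error stream
            break;
          case STOP_PIPELINE:
            // Errors.SAMPLE_01: the archetype's example ErrorCode (assumption).
            throw new StageException(Errors.SAMPLE_01, e.toString());
          default:
            throw new IllegalStateException("Unknown OnError value: " + getContext().getOnErrorRecord());
        }
      }
    }
  }
}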
