apache / seatunnel

SeaTunnel is a next-generation, super high-performance, distributed, massive data integration tool.
Home Page: https://seatunnel.apache.org/
License: Apache License 2.0
RDD vs DataFrame vs Dataset:
https://stackoverflow.com/questions/37301226/difference-between-dataset-api-and-dataframe
https://dzone.com/articles/the-dominant-apis-of-spark-datasets-dataframes-and
http://spark.apache.org/docs/latest/streaming-programming-guide.html
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
http://www.csdn.net/article/2015-04-03/2824407
https://stackoverflow.com/questions/29383578/how-to-convert-rdd-object-to-dataframe-in-spark
https://indatalabs.com/blog/data-engineering/convert-spark-rdd-to-dataframe-dataset
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
Config parsing
Plugin pipeline architecture
Input, Filter, Output plugin development
End-to-end workflow simplification
Chinese and English documentation
[Go live at this milestone]
IP libraries: besides geoip2, we need to pick one China-focused IP library to support:
ipip.net (accurate inside China, inaccurate abroad)
CZ88 (纯真) IP database (accurate inside China, inaccurate abroad): http://www.cz88.net/, http://www.cnblogs.com/anpengapple/p/5384985.html
geoip2 (accurate abroad, inaccurate inside China)
filterConfObj.foreach does not iterate the filters in the top-to-bottom order in which the plugins appear in the config file.
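A likely cause is that the parsed config stores plugin sections in an unordered map. One way to keep file order is to parse each object into an ordered list of pairs instead; a Python sketch with the standard json module (the config keys here just mirror the example config below, the real parser is HOCON-based):

```python
import json

# A SeaTunnel-style config where filter order matters (Split must run
# before Sql). object_pairs_hook=list keeps every JSON object as an
# ordered list of (key, value) pairs, so file order survives parsing.
conf_text = '{"filter": {"Split": {"source_field": "raw_message"}, "Sql": {"table_name": "test"}}}'

pairs = json.loads(conf_text, object_pairs_hook=list)
filters = dict(pairs)["filter"]
print([name for name, _ in filters])  # ['Split', 'Sql'] -- file order kept
```

Iterating `filters` then applies the plugins in exactly the order the user wrote them.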
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
https://www.slideshare.net/julesdamji/a-deep-dive-into-structured-streaming-in-apache-spark
example:
if [a.b.c != 3] {
} else {
}
references:
logstash configuration
http://enear.github.io/2016/03/31/parser-combinators/
https://github.com/sirthias/parboiled2
http://www.lihaoyi.com/fastparse/
https://gist.github.com/nicerobot/4189552
https://dzone.com/articles/rolling-your-own-dsl-in-scala
https://www.slideshare.net/abhijit.sharma/writing-dsls-in-scala
https://www.slideshare.net/holograph/a-field-guide-to-dsl-design-in-scala
http://jmespath.org/
http://www.antlr.org/
Search keyword: Grammars, workflows, DSL, expression, eval
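Whatever parser generator is chosen, evaluating a condition like `[a.b.c != 3]` boils down to resolving a dotted field path against the nested config and applying an operator. A minimal Python sketch of that evaluation step (the operators and path syntax are assumptions, not the final grammar):

```python
# Minimal sketch of evaluating a condition like [a.b.c != 3] against a
# nested config, in the spirit of the DSL above.
def lookup(conf, dotted_path):
    """Resolve 'a.b.c' against nested dicts."""
    node = conf
    for key in dotted_path.split("."):
        node = node[key]
    return node

def eval_condition(conf, path, op, value):
    left = lookup(conf, path)
    return left != value if op == "!=" else left == value

conf = {"a": {"b": {"c": 3}}}
print(eval_condition(conf, "a.b.c", "!=", 3))  # False: a.b.c equals 3
```

The parser's job is then only to produce the `(path, op, value)` triples and the if/else branch structure.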
https://interestinglab.github.io/waterdrop/#/
Description
The plugins mainly involved are the Split and Sql filters.
In the filter.Sql plugin, the sql field loses its last character when read.
app.conf
spark {
spark.streaming.batchDuration = 5
spark.master = "local[2]"
spark.app.name = "Waterdrop-1"
spark.ui.port = 13000
}
input {
kafka {
topics = "sinabip_test"
consumer.auto.offset.reset = "largest"
}
}
filter {
Split {
source_field = "raw_message"
fields = ["times", "info"]
}
Sql {
table_name = "test",
sql = "select info from test where info='hello'"
}
}
output {
Stdout {}
}
Config printed in the Main function:
[INFO] Parsed Config:
{
"filter" : [
{
"entries" : {
"fields" : [
"times",
"info"
],
"source_field" : "raw_message"
},
"name" : "Split"
},
{
"entries" : {
"sql" : "select info from test where info='hello",
"table_name" : "test"
},
"name" : "Sql"
}
],
"input" : [
{
"name" : "kafka"
}
],
"output" : [
{
"entries" : {},
"name" : "Stdout"
}
],
"spark" : {
"spark" : {
"app" : {
"name" : "Waterdrop-1"
},
"master" : "local[2]",
"streaming" : {
"batchDuration" : 5
},
"ui" : {
"port" : 13000
}
}
}
}
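The truncated sql value in the parsed output above (the closing quote of 'hello' is gone) is the classic signature of an off-by-one slice when stripping surrounding quotes. A hypothetical Python illustration of the symptom only, not the actual Waterdrop code:

```python
# Hypothetical: if quote-stripping code slices one character too many,
# the value's real last character is dropped.
sql = "select info from test where info='hello'"
truncated = sql[:-1]   # off-by-one slice chops the closing quote
print(truncated)       # select info from test where info='hello
```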
Antlr tutorial:
https://tomassetti.me/antlr-mega-tutorial/
http://sqtds.github.io/tags/antlr4/
https://alexecollins.com/antlr4-and-maven-tutorial/
http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html#LexerBasics
http://progur.com/2016/09/how-to-create-language-using-antlr4.html
https://yq.aliyun.com/articles/11366
http://www.cnblogs.com/sld666666/p/6145854.html
http://blog.csdn.net/dc_726/article/details/45399371
https://github.com/antlr/antlr4/blob/master/doc/index.md
https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4
https://plugins.jetbrains.com/plugin/7358-antlr-v4-grammar-plugin
https://stackoverflow.com/questions/21534316/is-there-a-simple-example-of-using-antlr4-to-create-an-ast-from-java-source-code
https://stackoverflow.com/questions/23092081/antlr4-visitor-pattern-on-simple-arithmetic-example
https://stackoverflow.com/questions/6487593/what-does-fragment-mean-in-antlr
http://floris.briolas.nl/floris/2008/10/antlr-common-pittfals/
https://github.com/odiszapc/nginx-java-parser
https://codevomit.wordpress.com/2015/04/25/antlr4-project-with-maven-tutorial-episode-3/
https://stackoverflow.com/questions/1931307/antlr-is-there-a-simple-example
https://stackoverflow.com/questions/29971097/how-to-create-ast-with-antlr4
Listener vs Vistor:
https://stackoverflow.com/questions/20714492/antlr4-listeners-and-visitors-which-to-implement?rq=1
http://jakubdziworski.github.io/java/2016/04/01/antlr_visitor_vs_listener.html
ANTLRv4: How to read double quote escaped double quotes in string?
nested boolean expression parsing:
https://stackoverflow.com/questions/25096713/parser-lexer-logical-expression
https://stackoverflow.com/questions/30976962/nested-boolean-expression-parser-using-antlr
parsing comment:
https://stackoverflow.com/questions/7070763/parse-comment-line?rq=1
https://stackoverflow.com/questions/28674875/antlr-4-how-to-parse-comments
http://meri-stuff.blogspot.com/2012/09/tackling-comments-in-antlr-compiler.html
design pattern: visitor
https://dzone.com/articles/design-patterns-visitor
books:
"The Definitive Antlr4 Reference"
Things we could do this week:
(1) Filter UDFs:
a) Find the full list of UDFs that ship with Spark SQL and see what they can do; later the Waterdrop docs can reference these UDFs, which users can use directly;
b) For the Filters we plan to implement, can we also provide matching UDFs, and if so, how?
c) Do our Filters overlap with Spark SQL's built-in UDFs, and can we reuse them?
(2) Finalize the BaseFilter interface definition;
one approach is to go through all the Filter plugins and see what BaseFilter interface they need;
the other is to think through how a future user would use the BaseFilter interface to develop a plugin of their own.
(3) Support multiple inputs and outputs in the pipeline code.
(4) [Only after the first three are done] Finalize the BaseInput and BaseOutput interface definitions; this involves a few tricky technical points, more on that next week.
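For (2), the shape of the interface can be prototyped before committing to Scala signatures. A Python sketch of the contract under discussion (all names and the row-batch model are hypothetical, chosen to match the Split example config elsewhere in these notes):

```python
from abc import ABC, abstractmethod

class BaseFilter(ABC):
    """Hypothetical plugin contract: each filter receives its config
    entries and transforms one batch of rows into another."""

    def __init__(self, entries):
        self.entries = entries   # this plugin's section from the config

    @abstractmethod
    def process(self, rows):
        """Transform a batch of row dicts; return the new batch."""

class Split(BaseFilter):
    def process(self, rows):
        src = self.entries["source_field"]
        fields = self.entries["fields"]
        # split the source field and fan the parts out into named fields
        return [dict(row, **dict(zip(fields, row[src].split(" ", len(fields) - 1))))
                for row in rows]

rows = [{"raw_message": "2018-01-01 hello"}]
out = Split({"source_field": "raw_message", "fields": ["times", "info"]}).process(rows)
print(out[0]["info"])  # hello
```

A user-written plugin would then only subclass BaseFilter and implement process; the pipeline code stays unchanged.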
markdown + docsify
For the plugin architecture and concrete implementations, see:
logstash
hangout
Spark SQL UDF,UDAF
Hive UDF
flume
fluent
kafka stream
https://github.com/onurakpolat/awesome-bigdata#data-ingestion
Filter plugin requests:
Distinct, Sample
User request:
repartition, to increase or decrease parallelism and the number of output files.
scala code format:
http://scalameta.org/scalafmt/
travis.ci
Debug mode: make it easy for users to see how the data changes at each stage.
Local mode: use Spark's local mode so users can debug and develop locally.
Visualize the Filter pipeline.
Work out the application scenarios for users, and simplify the corresponding deployment and run workflow.
You can use the pipe() function on RDDs to call external code. It passes data to an external program through stdin / stdout. For Spark Streaming, you would do dstream.transform(rdd => rdd.pipe(...)) to call it on each RDD.
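The stdin/stdout contract that rdd.pipe() relies on can be shown with a plain subprocess, no Spark required: lines go to the child's stdin, transformed lines come back on its stdout (the child program here is a stand-in for any external tool):

```python
import subprocess, sys

# Same contract as rdd.pipe(): feed lines to the child's stdin, read
# transformed lines back from its stdout. The child here is a trivial
# Python one-liner standing in for an arbitrary external program.
child = [sys.executable, "-c",
         "import sys\nfor line in sys.stdin: print(line.strip().upper())"]
result = subprocess.run(child, input="hello\nworld\n",
                        capture_output=True, text=True, check=True)
print(result.stdout.split())  # ['HELLO', 'WORLD']
```

In Spark itself this becomes `rdd.pipe("your-program")`, with the same one-line-in, one-line-out convention.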
Flume, Http
Clone, Dict, Geoip
Http
https://docs.databricks.com/spark/latest/data-sources/index.html
https://www.slideshare.net/databricks/yin-huai-20150325meetupwithdemos
https://mapr.com/blog/spark-data-source-api-extending-our-spark-sql-query-engine/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-datasource-api.html
http://www.spark.tc/exploring-the-apache-spark-datasource-api/
https://mapr.com/blog/how-integrate-custom-data-sources-apache-spark/
https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/
Support passing Spark settings (e.g. resource sizes) as start-waterdrop.sh command-line arguments, taking precedence over application.conf, similar to spark-submit.
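The precedence rule itself is a simple merge: command-line values override file values, the same way spark-submit --conf does. A sketch (flag syntax and keys are illustrative):

```python
# Sketch of the precedence rule: settings given on the
# start-waterdrop.sh command line override application.conf.
def merge_conf(file_conf, cli_conf):
    merged = dict(file_conf)
    merged.update(cli_conf)   # command line wins
    return merged

file_conf = {"spark.executor.memory": "1g", "spark.app.name": "Waterdrop-1"}
cli_conf = {"spark.executor.memory": "4g"}
print(merge_conf(file_conf, cli_conf)["spark.executor.memory"])  # 4g
```

Keys absent from the command line keep their application.conf values.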
Chinese documentation progress:
Configuration
(Garyelephant) Common options
Input plugins
Filter plugins
Output plugins
English documentation progress:
5 important future directions for Waterdrop:
Support the Flink/Beam compute engines
Support [stateful] [real-time] [aggregate] computation on Flink (user-specified time granularity, dimensions, and metrics)
Interactive UI (interactive pipeline building, interactive SQL execution, visual tools for feature and performance diagnosis)
Performance optimization deep in Spark/Flink internals, driven by application scenarios
Grow the number of companies running Waterdrop in production (technical support for companies in China, English community promotion)
Self-service, interactive problem diagnosis and performance tuning, see Alibaba Arthas
Add page comments/discussion, e.g. Disqus
Add real-time feedback:
https://www.hotjar.com
Applies to filters such as grok and split.
With \s:
pattern = "\s"
running fails with an error.
With \\s:
pattern = "\\s"
it works, because configParse unescapes the backslash.
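What configParse does to backslashes can be shown with JSON, whose quoted-string escaping HOCON shares: the parser consumes one level of escaping, so the file must say "\\s" for the regex engine to see \s.

```python
import json, re

# The parser consumes one level of escaping: "\\s" in the file becomes
# the two-character regex \s after parsing.
good = json.loads(r'{"pattern": "\\s"}')["pattern"]
print(good)                           # \s  (a valid whitespace regex)
print(bool(re.search(good, "a b")))   # True: matches the space

try:
    json.loads(r'{"pattern": "\s"}')  # single backslash: invalid escape
except json.JSONDecodeError:
    print("a lone \\s in the file is rejected by the parser")
```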
Current directory structure:
filter/
├── BaseFilter.scala
└── Split.scala
└── Sql.scala
Wouldn't it be better to add a directory level above the concrete files? Some plugins cannot be implemented in a single file, and keeping everything in one flat directory is hard to manage. Taking grok as an example:
filter/
├── BaseFilter.scala
├── grok
│ ├── Grok.scala
│ └── PatternGrok.scala
├── split
│ └── Split.scala
└── sql
└── Sql.scala
When target_field is ***ROOT***, the schema must be derived from the raw data, and there is currently no solution for that.
After upgrading Spark to 2.2, the accompanying Scala upgrade to 2.11 breaks compatibility with Spark 1.6.
JDK 1.7 compatibility.
Wanted to create a test directory src/test/java and a class with a Main method, org.interestinglab.waterdrop.WaterDropTest; running it fails with:
Error:(13, 14) BoolExprBaseVisitor is already defined as object BoolExprBaseVisitor
public class BoolExprBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements BoolExprVisitor<T> {
^