apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.

Home Page: https://seatunnel.apache.org/

License: Apache License 2.0

Languages: Shell 0.25%, Java 99.46%, Python 0.08%, JavaScript 0.05%, Batchfile 0.16%
Topics: data-integration, high-performance, offline, real-time, apache, batch, cdc, change-data-capture, data-ingestion, elt

seatunnel's Issues

Spark Streaming & Spark SQL

M1 TODO

  • Configuration parsing

    • Parsing of common Spark config (spark.*, plus appname and duration)
    • Error reporting and error location for configuration files
    • [Optional] Implement if..else logic in code [directly tied to the plugin workflow system]
    • [Optional] User-predefined template variables and system environment variable substitution
  • Plugin workflow system

    • Finalize the BaseFilter interface definition (key point: filters, including those from other developers, are automatically registered as UDFs when needed)
    • Finalize the BaseInput and BaseOutput interface definitions (considering the use of broadcast variables and accumulators, and the relationship to Spark input/output formats)
    • Support multiple inputs and outputs in the pipeline code
    • Relationship between Serializer and the other plugins
    • Support integrating plugins from external developers (Java/Scala)
    • [Optional] Field Reference
    • [Optional] Support if..else logic
  • Input, Filter, and Output plugin development

    • Input plugins
    • Filter plugins
    • Output plugins
    • Functional tests for the Input, Filter, and Output plugins (Spark on YARN [client and cluster modes], Spark on Mesos, local mode)
  • End-to-end workflow simplification

    • Separate build.sbt files for different targets
    • Take over the whole spark + waterdrop workflow, while still allowing waterdrop to run as a plain Spark job.
    • Installation
    • Deployment (3 deployment modes)
    • Plugin integration
    • Configuration
    • Running
  • Chinese and English documentation

    • A unified documentation format for plugin definitions
    • Complete Chinese documentation (plugin docs are the priority)
    • Complete English documentation (plugin docs are the priority)

[Release at this milestone]


  • Performance report
    • [Optional] Stability, throughput, and consistency tests on large data volumes.
    • [Optional] Performance report
    • [Optional] Performance tuning (parallelism, filter system code)

Structured Streaming & Window Operations

Support complex logic in the configuration file (e.g. if/else logic, template variables, predefined variables)
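As a purely hypothetical sketch of what such configuration syntax could look like (none of this is implemented; `env` and `table_name` are made-up variables, and the `if` block syntax is invented for illustration):

```
# Hypothetical syntax only -- not implemented:
filter {
  Sql {
    table_name = ${table_name}            # template variable substitution
    sql = "select * from "${table_name}
  }
  if ${env} == "production" {             # conditional block
    Repartition { num_partitions = 100 }
  }
}
```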

configParse drops the last character of a configuration value

The sql field read by the filter.Sql plugin is missing its last character

app.conf

spark {
  spark.streaming.batchDuration = 5
  spark.master = "local[2]"
  spark.app.name = "Waterdrop-1"
  spark.ui.port = 13000
}

input {
  kafka {
    topics = "sinabip_test"
    consumer.auto.offset.reset = "largest"
  }
}

filter {
  Split {
    source_field = "raw_message"
    fields = ["times", "info"]
  }
  Sql {
    table_name = "test"
    sql = "select info from test where info='hello'"
  }
}

output {
  Stdout {}
}

Configuration printed in the Main function:

[INFO] Parsed Config: 
{
    "filter" : [
        {
            "entries" : {
                "fields" : [
                    "times",
                    "info"
                ],
                "source_field" : "raw_message"
            },
            "name" : "Split"
        },
        {
            "entries" : {
                "sql" : "select info from test where info='hello",
                "table_name" : "test"
            },
            "name" : "Sql"
        }
    ],
    "input" : [
        {           
            "name" : "kafka"
        }
    ],
    "output" : [
        {
            "entries" : {},
            "name" : "Stdout"
        }
    ],
    "spark" : {
        "spark" : {
            "app" : {
                "name" : "Waterdrop-1"
            },
            "master" : "local[2]",
            "streaming" : {
                "batchDuration" : 5
            },
            "ui" : {
                "port" : 13000
            }
        }
    }
}
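Note that the parsed `sql` value has lost its final `'`, i.e. exactly one character is stripped from the end of the payload. That pattern is consistent with an off-by-one when removing the surrounding double quotes. This is a hypothetical reproduction, not the actual Waterdrop parsing code:

```java
// Hypothetical sketch of the suspected off-by-one, not the real parser code.
public class QuoteStrip {
    // Buggy variant: substring's end index is exclusive, so length() - 2
    // removes the closing quote AND the last payload character.
    static String stripBuggy(String quoted) {
        return quoted.substring(1, quoted.length() - 2);
    }

    // Correct variant: length() - 1 removes only the closing quote.
    static String stripFixed(String quoted) {
        return quoted.substring(1, quoted.length() - 1);
    }

    public static void main(String[] args) {
        String raw = "\"select info from test where info='hello'\"";
        System.out.println(stripBuggy(raw)); // select info from test where info='hello
        System.out.println(stripFixed(raw)); // select info from test where info='hello'
    }
}
```

The buggy variant reproduces the truncated value shown in the parsed config above.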

Antlr4 tutorials

Antlr tutorial:

https://tomassetti.me/antlr-mega-tutorial/
http://sqtds.github.io/tags/antlr4/
https://alexecollins.com/antlr4-and-maven-tutorial/
http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html#LexerBasics
http://progur.com/2016/09/how-to-create-language-using-antlr4.html
https://yq.aliyun.com/articles/11366
http://www.cnblogs.com/sld666666/p/6145854.html
http://blog.csdn.net/dc_726/article/details/45399371
https://github.com/antlr/antlr4/blob/master/doc/index.md
https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4
https://plugins.jetbrains.com/plugin/7358-antlr-v4-grammar-plugin
https://stackoverflow.com/questions/21534316/is-there-a-simple-example-of-using-antlr4-to-create-an-ast-from-java-source-code
https://stackoverflow.com/questions/23092081/antlr4-visitor-pattern-on-simple-arithmetic-example

https://stackoverflow.com/questions/6487593/what-does-fragment-mean-in-antlr
http://floris.briolas.nl/floris/2008/10/antlr-common-pittfals/
https://github.com/odiszapc/nginx-java-parser
https://codevomit.wordpress.com/2015/04/25/antlr4-project-with-maven-tutorial-episode-3/
https://stackoverflow.com/questions/1931307/antlr-is-there-a-simple-example
https://stackoverflow.com/questions/29971097/how-to-create-ast-with-antlr4

Listener vs Visitor:

https://stackoverflow.com/questions/20714492/antlr4-listeners-and-visitors-which-to-implement?rq=1

http://jakubdziworski.github.io/java/2016/04/01/antlr_visitor_vs_listener.html

ANTLRv4: How to read double quote escaped double quotes in string?

https://stackoverflow.com/questions/17897651/antlrv4-how-to-read-double-quote-escaped-double-quotes-in-string
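Whatever lexer rule ends up matching such a string token, the visitor still has to unescape it afterwards. A minimal Java sketch of that post-processing step (the class and method names are made up for illustration):

```java
// Unescape a string token in which a double quote is escaped by doubling it,
// e.g. the token  "say ""hi"""  represents the value  say "hi"
public class CsvStyleString {
    static String unescape(String token) {
        // Strip the surrounding quotes, then collapse each doubled quote.
        String inner = token.substring(1, token.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unescape("\"say \"\"hi\"\"\"")); // say "hi"
    }
}
```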

nested boolean expression parsing:

https://stackoverflow.com/questions/25096713/parser-lexer-logical-expression
https://stackoverflow.com/questions/30976962/nested-boolean-expression-parser-using-antlr

parsing comment:

https://stackoverflow.com/questions/7070763/parse-comment-line?rq=1
https://stackoverflow.com/questions/28674875/antlr-4-how-to-parse-comments
http://meri-stuff.blogspot.com/2012/09/tackling-comments-in-antlr-compiler.html

design pattern: visitor

https://dzone.com/articles/design-patterns-visitor

books:

"The Definitive ANTLR 4 Reference"

Scala code generation and runtime compilation

TODO for the week of 2017-09-04

Here is what we can work on this week:

(1) Filter UDFs:

a) Find the full list of Spark SQL built-in UDFs and see what each can do; later we can point to these ready-to-use UDFs in the Waterdrop documentation;

b) For the Filters we plan to implement, can each also be exposed as a corresponding UDF, and if so, how?

c) Do our Filters overlap with Spark SQL's built-in UDFs, and can we reuse the latter?

(2) Finalize the BaseFilter interface definition;

One approach is to go through all the Filter plugins and work out what BaseFilter interface they need;

the other is to think through how a future user would build their own plugin on top of the BaseFilter interface.

(3) Support multiple inputs and outputs in the pipeline code

(4) [Only after the first three are done] Finalize the BaseInput and BaseOutput interface definitions; this involves a few tricky technical points that I'll explain next week.
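To make the BaseFilter question concrete, here is a deliberately simplified sketch of what such a contract could look like. This is not the real interface: the actual one would operate on Spark DataFrames, while here a row is just a `Map`, and `SplitFilter` is a toy stand-in for the Split plugin:

```java
import java.util.*;

// Hypothetical, simplified BaseFilter contract (a row is a Map, not a DataFrame).
interface BaseFilter {
    Map<String, Object> filter(Map<String, Object> row);
}

// Toy Split plugin: splits source_field into the configured target fields.
class SplitFilter implements BaseFilter {
    private final String sourceField;
    private final List<String> fields;
    private final String delimiter;

    SplitFilter(String sourceField, List<String> fields, String delimiter) {
        this.sourceField = sourceField;
        this.fields = fields;
        this.delimiter = delimiter;
    }

    @Override
    public Map<String, Object> filter(Map<String, Object> row) {
        // Split into at most fields.size() parts and assign them to the targets.
        String[] parts = ((String) row.get(sourceField)).split(delimiter, fields.size());
        for (int i = 0; i < parts.length; i++) {
            row.put(fields.get(i), parts[i]);
        }
        return row;
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        BaseFilter split = new SplitFilter("raw_message", List.of("times", "info"), " ");
        Map<String, Object> row = new HashMap<>(Map.of("raw_message", "12:00 hello"));
        System.out.println(split.filter(row).get("info")); // hello
    }
}
```

A single-method contract like this is also what would make automatic UDF registration feasible, since each filter reduces to one function over a row.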

Notes on usability improvements

  1. Debug mode: let users easily see how the data changes at each stage

  2. Local mode: use Spark's local mode to make debugging and local development easy.

  3. Visualization of the Filter pipeline.

  4. Anticipate users' application scenarios and simplify the corresponding deployment and run workflows.

Multi-language support for Spark RDD processing

You can use the pipe() function on RDDs to call external code. It passes data to an external program through stdin / stdout. For Spark Streaming, you would do dstream.transform(rdd => rdd.pipe(...)) to call it on each RDD.
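Outside of Spark, the same stdin/stdout contract that `pipe()` uses can be sketched with `ProcessBuilder`. This is an illustration of the mechanism only, assuming a POSIX environment where `cat` is available as the external program:

```java
import java.io.*;
import java.util.*;

public class PipeDemo {
    // Send records to an external command's stdin and collect its stdout lines,
    // mirroring what RDD.pipe() does per partition.
    // Note: writing everything before reading is fine for small inputs but can
    // deadlock on large ones; a real implementation reads and writes concurrently.
    static List<String> pipe(List<String> records, String... command) throws Exception {
        Process p = new ProcessBuilder(command).start();
        try (PrintWriter out = new PrintWriter(p.getOutputStream())) {
            records.forEach(out::println);
        }
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        }
        p.waitFor();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pipe(List.of("a", "b"), "cat")); // [a, b]
    }
}
```

In Spark itself the external program is launched once per partition, so its startup cost is amortized across the partition's records.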

Development and testing of the project framework (Event, data flow, DataFrame, UDF, UDAF)

Conditionals and expressions in the configuration file

Chinese/English documentation

Chinese documentation progress:


  • Configuration

  • (Garyelephant) Common configuration

  • Input plugins

    • Fake
    • File
    • Hdfs
    • Kafka
    • S3
    • Socket
  • Filter plugins

    • (Rickyhuo)Add
    • Checksum
    • (Rickyhuo)Convert
    • (Rickyhuo)Date
    • Drop
    • Geoip
    • Grok
    • Json
    • Kv
    • (Rickyhuo)Lowercase
    • (Rickyhuo)Remove
    • (Rickyhuo)Rename
    • Repartition
    • Replace
    • Sample
    • Split
    • SQL
    • Table
    • Truncate
    • (Rickyhuo)Uppercase
    • Uuid
  • Output plugins

    • Elasticsearch
    • File
    • Hdfs
    • Jdbc
    • (Rickyhuo)Kafka
    • MySQL
    • S3
    • Stdout

  • Deployment (Garyelephant, Jan 10)
  • Monitoring (Garyelephant)
  • Performance and tuning
  • (Rickyhuo, Jan 2) Plugin development
  • (Garyelephant, Jan 10) Roadmap
  • Contributing code

English documentation progress:

Roadmap

Waterdrop's key future directions:

  • Support the Flink/Beam compute engines

  • Support [stateful] [real-time] [aggregation] computation on Flink (users can specify time granularity, dimensions, and metrics)

  • Interactive UI (interactive pipeline building, interactive SQL execution, and visual tools for functional and performance diagnostics)

  • Scenario-driven performance optimization deep in the Spark/Flink internals.

  • Grow the number of companies running Waterdrop in production (technical support for companies in China, promotion in the English-language community)

  • Self-service, interactive problem diagnosis and performance optimization, following the example of Alibaba Arthas

Coding style improvements

  • Add Checkstyle checks for the Java code
  • Configure stricter coding-style rules in Scalastyle
  • Configure Codacy to match

Structural issues in the plugin system

Current directory layout

filter/
├── BaseFilter.scala
├── Split.scala
└── Sql.scala

Would it be better to add a directory level above the individual files? Some plugin implementations cannot fit into a single file, and keeping everything in one flat directory is hard to manage. Take Grok as an example:

filter/
├── BaseFilter.scala
├── grok
│   ├── Grok.scala
│   └── PatternGrok.scala
├── split
│   └── Split.scala
└── sql
    └── Sql.scala

Current problems with the JSON plugin

  1. When target_field is ***ROOT***, the schema has to be derived from the raw data, and there is currently no solution for this
  2. Nested JSON structures cannot be parsed into multiple levels
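One common way to handle the second problem is to flatten nested structures into dotted column names. A hypothetical sketch of that idea, using nested `Map`s to stand in for parsed JSON so no JSON library is needed:

```java
import java.util.*;

public class FlattenJson {
    // Recursively flatten nested maps into dotted keys:
    // {"user": {"name": "ann"}}  ->  {"user.name": "ann"}
    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(String prefix, Map<String, Object> node,
                                       Map<String, Object> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                flatten(key, (Map<String, Object>) e.getValue(), out); // recurse
            } else {
                out.put(key, e.getValue()); // leaf value becomes a column
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> parsed = Map.of("user", Map.of("name", "ann", "age", 7));
        System.out.println(flatten("", parsed, new TreeMap<>()));
        // {user.age=7, user.name=ann}
    }
}
```

This sidesteps multi-level parsing by turning arbitrarily nested fields into flat columns, which also maps naturally onto a DataFrame schema.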

Backward compatibility

  • After upgrading Spark to 2.2, the accompanying Scala upgrade to 2.11 breaks compatibility with Spark 1.6

  • Stay compatible with JDK 1.7

Running a main method under a newly added src/test/java directory fails

I created a test directory src/test/java containing a class with a main method:

org.interestinglab.waterdrop.WaterDropTest

Running it fails with:

Error:(13, 14) BoolExprBaseVisitor is already defined as object BoolExprBaseVisitor
public class BoolExprBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements BoolExprVisitor<T> {
             ^
