baidu / bigflow

Baidu Bigflow is an interface for writing distributed computing programs that provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4 PB+ of data inside Baidu and runs about 10k jobs every day.

Home Page: http://baidu.github.io/bigflow

License: Apache License 2.0

CMake 2.12% C++ 50.44% Shell 0.45% Python 42.60% Scala 1.36% Perl 2.92% C 0.10% Thrift 0.01%

bigflow's Introduction

Bigflow

What is Bigflow?

Baidu Bigflow (hereinafter "Bigflow") is a computing framework from Baidu. It aims to provide a simple, easy-to-use interface for describing users' computation tasks, and allows the same code to run on different execution engines.

Much of its design draws on ideas from Google FlumeJava and Google Cloud Dataflow, and part of its interface design draws on Apache Spark.

Users can largely ignore where Bigflow's computation actually runs and write their logic as if writing a single-machine program; Bigflow dispatches these computations to the appropriate execution engine.
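For a feel of this, here is a minimal sketch (hedged: base.Pipeline.create and the 'spark' engine name follow the Quick Start conventions, while pipeline.read, input.TextFile, distinct and get all appear elsewhere on this page; exact names may differ):

from bigflow import base, input

# Create a pipeline bound to an execution engine; 'spark' is the only
# engine available in the open-source release.
pipeline = base.Pipeline.create('spark')

# The call chain reads like local code; Bigflow dispatches the actual
# computation to the engine, and get() fetches the result.
lines = pipeline.read(input.TextFile('hdfs:///tmp/example.txt'))
print lines.distinct().get()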

Bigflow's goals: make distributed programs easier to write, easier to test, more efficient to run, easier to maintain, and cheaper to migrate.

Inside Baidu, Bigflow currently targets the company's internal batch engine DCE (similar to the community's Tez), the iterative engine Spark, and the internal streaming engine Gemini.

In the open-source release, only Bigflow on Spark is available.

Why use Bigflow?

  • High performance  Bigflow's interface design lets it capture more fine-grained properties of the user's computation, and it optimizes jobs based on those properties. Moreover, its execution layer is implemented in C++, and some of the user's code logic is translated into C++ for execution, bringing a substantial performance gain. In real business tests inside Baidu, its performance is far higher than that of hand-written jobs; averaged over jobs rewritten from existing workloads, performance improved by 100%+ over the original user code. A benchmark for the open-source version is in preparation.

  • Easy to use  On the surface, Bigflow's interface looks a lot like Spark's, but in actual use you will find that several unique design choices make Bigflow code read more like a single-machine program: for example, it hides the concept of a partitioner and supports nested distributed datasets (see the sketch after this list), which makes the interface easier to understand and the code more reusable. In particular, in many scenarios that would otherwise require tuning, Bigflow can automatically optimize for performance and memory usage, so users can avoid much of the optimization work otherwise forced by OOMs or insufficient performance, lowering the cost of use.

  • Python is a first-class citizen here  Python is currently our natively supported language. Many PySpark users are troubled by PySpark's inefficiency, by its lack of support for some CPython libraries, or by features that are available only in Scala and Java and not yet in PySpark. In Bigflow, Python is a first-class citizen (after all, it is currently the only language we support), so none of these are problems: performance, functionality, and usability are all friendly to Python users.
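As promised in the second bullet, a short hedged sketch of the nested-dataset style (records is assumed to be a PCollection of (key, value) pairs; group_by is an assumed transform name, while apply_values and flatten_values appear verbatim in the "API Plan" issue below):

# Each group's values form a nested distributed dataset, so per-group
# logic is written with the same calls as top-level code; no partitioner
# is ever mentioned.
grouped = records.group_by(lambda kv: kv[0])
deduped = grouped.apply_values(lambda values: values.distinct())
result = deduped.flatten_values()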

Online Trial

The online trial page (password: bigflow) contains some simple examples introducing Bigflow's concepts and API usage. You can also write Python code online to try out Bigflow's features, with smart code completion.

Note: this page is provided for trial use only and has no security protection. The machines behind it are wiped periodically; please do not use it to store code.

Bigflow Documentation

Bigflow Home Page

Quick Start

Programming Guide

API Reference

Building from Source

How to Contribute

Design Documents

Paper

http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9702-3

Contact Us

To join the Bigflow WeChat technical discussion group, please add WeChat ID iacmol or himddheart, with the note: "Join the Bigflow technical discussion group".

bigflow's People

Contributors

acmol, advancedxy, chunyang-wen, himdd, jimmycasey, scaugrated, tushushu, wanglun, wzhiqing, yshysh


bigflow's Issues

A benchmark system is needed

We should have a benchmark for this open-source version of Bigflow.
And we should go even further and build a benchmark system that can give us benchmark results every day.

readline install failed

(Screenshot: pip install readline fails while the build continues)

As illustrated in the screenshot above, pip install readline failed but the build continued.

The build scripts should be improved so that readline builds successfully and this kind of error is detected (see the sketch below).
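A minimal sketch of the requested fail-fast behavior, written here in Python for illustration (the actual build scripts are shell, so this is an assumption about shape, not a patch):

import subprocess
import sys

def pip_install(package):
    """Install a package and abort the build if pip fails."""
    ret = subprocess.call([sys.executable, '-m', 'pip', 'install', package])
    if ret != 0:
        sys.exit('pip install %s failed with exit code %d' % (package, ret))

pip_install('readline')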

Continuous Integration Support

Continuous integration is required for Bigflow, and we should have a system to support continuous integration.

Maybe we can use travis-ci.org or TeamCity, or we can set up a Jenkins instance on BCC (Baidu Cloud Compute)?

Better handle LINK_ALL_SYMBOLS option in cmake/generic.cmake

We implemented two methods (cc_library, cc_binary) in cmake/generic.cmake that accept a LINK_ALL_SYMBOLS argument (meaning all symbols are exported when linking, by wrapping the libs with the gcc options "-Wl,--whole-archive" and "-Wl,--no-whole-archive").

  1. A binary or a library can use ALL_SYMBOLS_DEPS to tell the linker that it needs all the symbols in its deps.

  2. A library that has the LINK_ALL_SYMBOLS attribute should export all symbols to the targets that depend on it. We expect these symbols to be exported recursively; however, currently only the targets that depend on this library directly link all the symbols.

  3. cc_test doesn't have similar functionality.

So cmake/generic.cmake's cc_library/cc_binary/cc_test need improvements to handle LINK_ALL_SYMBOLS.

urllib2.URLError

Environment

OS: CentOS 7.2
Spark: HDP-2.1
Hadoop: HDP-2.7

I have set SPARK_HOME, HADOOP_HOME and BIGFLOW_PYTHON_HOME.

Description

I get an error after running print pipeline.read(input.TextFile("hdfs://localhost:8020/data/test/new.txt")).get():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bigflow/ptype.py", line 87, in get
    if not self._is_readable():
  File "bigflow/ptype.py", line 127, in _is_readable
    return requests.is_node_cached(self._pipeline.id(), self._node.id())
  File "bigflow/rpc/requests.py", line 68, in _wrapper
    result, status = func(*args, **kw)
  File "bigflow/rpc/requests.py", line 318, in is_node_cached
    response = _service.request(request, "is_node_cached")
  File "bigflow/rpc/service.py", line 89, in request
    response = urllib2.urlopen(req, request_json, 365 * 24 * 60 * 60)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

I tried hdfs://data/test/new.txt and /data/test/new.txt, but it still doesn't work.

Hive Support

Read/write the InputFormat/OutputFormat and SerDe information from/to the Hive Metastore.
Read/write data from/to a Hive table or partition.
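For discussion only, one possible API shape for this feature request, mirroring the existing input.TextFile style. input.HiveTable, output.HiveTable and daily_report are purely hypothetical names; nothing like them exists in Bigflow today:

# Hypothetical sketch of the requested feature; these classes do not exist.
events = pipeline.read(input.HiveTable('warehouse_db', 'events',
                                       partition={'dt': '20180101'}))
# daily_report is a placeholder for whatever transform the user applies.
pipeline.write(daily_report(events),
               output.HiveTable('warehouse_db', 'events_daily'))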

Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure

[ 46%] Built target flume-runtime-spark-profile_scripts

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] javac: invalid target release: 1.8
[ERROR] Usage: javac
[ERROR] use -help for a list of possible options
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
make[2]: *** [flume/CMakeFiles/bulid_spark_launcher_jar] Error 1
make[1]: *** [flume/CMakeFiles/bulid_spark_launcher_jar.dir/all] Error 2
make: *** [all] Error 2
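For what it's worth, "javac: invalid target release: 1.8" usually means the javac that Maven found is older than JDK 8; pointing JAVA_HOME at a JDK 8 installation typically resolves it.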

English documents are required

We have some documents in English, but they contain many mistakes and are outdated.

They need to be refined.

Our documents are generated by sphinx-doc, and sphinx-doc has internationalization support.

So I think all the comments should be written in English, especially the ones that are sources of the generated documents. Chinese documents should then be generated with the help of sphinx-doc's internationalization support.

Compiling error due to mvn

Some errors occur when executing commands that involve mvn.

After I added .m2/settings.xml, the build process continued; I am not sure what happened.

I think the build documentation should include instructions about this.

Enable travis' build matrix

Some combinations are:

OS                Compiler
Ubuntu            gcc
Ubuntu            clang
Mac               clang
Docker (CentOS)   gcc
Docker (CentOS)   clang
Windows (WSL)     gcc (TBD)
Windows (WSL)     clang (TBD)
Windows           x

manage byproducts

While running a Bigflow program, I find that it outputs some byproducts, e.g.:

entity-*
.flume
...

After several runs, they leave the folder in a mess.
Could you please put those byproducts into a pre-specified subfolder so that they can be managed conveniently? (A possible interim workaround is sketched below.)
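One possible interim workaround, sketched here under the assumption that the byproducts are written to the current working directory:

import errno
import os

# Run the job from a scratch directory so byproducts such as entity-*
# and .flume land there instead of cluttering the project folder.
workdir = os.path.join(os.getcwd(), 'bigflow_scratch')
try:
    os.makedirs(workdir)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise
os.chdir(workdir)
# ... create the Pipeline and run the job from here ...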

Remove code which is used for streaming process

Since we haven't open-sourced the streaming processing part of Bigflow, we should have removed all the code related to it. We've done most of the work, but some streaming-related code still remains.
So, remove the unused code and keep the codebase clean.

Support structured IO formats on Spark

Definitions

Structured input formats specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (reading only) and Parquet files with its own loaders, as DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details of how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

(Figure: Parquet support architecture on DCE)

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow RecordBatch. Spark already supports transforming a Dataset to RDD[ArrowPayload] (see Dataset.scala), though not publicly.

It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV files.

Implementation details for adding read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task (a small pyarrow sketch of the RecordBatch flow follows this list)
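As context for steps 2-3, a small pyarrow sketch, unrelated to Bigflow's internals, of producing and consuming the Arrow RecordBatches that a component like PythonFromRecordBatchProcessor would receive (iter_batches and to_pylist require a reasonably recent pyarrow):

import pyarrow.parquet as pq

def process(record):
    """Stand-in for handing one record (a dict) to user code."""
    print(record)

reader = pq.ParquetFile('part-00000.parquet')
for batch in reader.iter_batches():      # Arrow RecordBatches
    for record in batch.to_pylist():     # one dict per row
        process(record)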

Write support

Bigflow uses its own sinker implementation to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is needed, namely:

  1. Refactor the current ParquetSinker and the Arrow Schema Converter
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more of its power. See the Arrow SlideShare.

cc @himdd @chunyang-wen @bb7133 @acmol for comments; PRs are appreciated

API Plan optimization is needed

We should add an optimization layer within the current API layer; it could be called the API Plan layer.

At the moment, the API layer transforms the user's code into a LogicalPlan directly, and some information is lost. For example, the LogicalPlan doesn't know what a join is; it only knows that two nodes are cogrouped and that a Processor then processes the two cogrouped results.

E.g.,

pc1.distinct().join(pc2)

is equivalent to

pc1.cogroup(pc2) \
       .apply_values(lambda p1, p2: p1.distinct().cartesian(p2)) \
       .flatten_values()

But we can't optimize this automatically without the help of an API Plan.

So the API Plan is meant to keep all the information we can get from the user's code, and to optimize the plan using that information.
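To make the point concrete, a toy sketch (hypothetical classes, not Bigflow code) of the kind of information an API Plan would retain and the LogicalPlan has already lost:

class ApiNode(object):
    """One node of a hypothetical API Plan: keeps the user-level op name."""
    def __init__(self, op, *inputs):
        self.op = op          # 'join', 'distinct', 'cogroup', ...
        self.inputs = inputs

def has_distinct_input(node):
    # A rewrite rule can only test for 'join'/'distinct' because the API
    # Plan still records them; the LogicalPlan sees only a cogroup plus
    # an opaque Processor.
    return node.op == 'join' and any(i.op == 'distinct' for i in node.inputs)

plan = ApiNode('join', ApiNode('distinct', ApiNode('source')), ApiNode('source'))
assert has_distinct_input(plan)   # pc1.distinct().join(pc2) is recognizable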
