baidu / bigflow

Baidu Bigflow is an interface for writing distributed computing programs that provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4 PB+ of data inside Baidu and runs about 10k jobs every day.

Home Page: http://baidu.github.io/bigflow

License: Apache License 2.0

CMake 2.12% C++ 50.44% Shell 0.45% Python 42.60% Scala 1.36% Perl 2.92% C 0.10% Thrift 0.01%

bigflow's Introduction

Bigflow

What is Bigflow?

Baidu Bigflow (hereinafter "Bigflow") is a computing framework from Baidu. It aims to provide a simple, easy-to-use interface for describing users' computation tasks, and allows the same code to run on different execution engines.

Much of its design draws on ideas from Google FlumeJava and Google Cloud Dataflow, and part of its interface design draws on Apache Spark.

Users can largely ignore where Bigflow's computation actually runs and write their logic as if writing a single-machine program; Bigflow dispatches these computations to the appropriate execution engine.
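For a feel of this, here is a minimal sketch (hedged: base.Pipeline.create and the 'spark' engine name follow the Quick Start conventions, while pipeline.read, input.TextFile, distinct and get all appear elsewhere on this page; exact names may differ):

from bigflow import base, input

# Create a pipeline bound to an execution engine; 'spark' is the only
# engine available in the open-source release.
pipeline = base.Pipeline.create('spark')

# The call chain reads like local code; Bigflow dispatches the actual
# computation to the engine, and get() fetches the result.
lines = pipeline.read(input.TextFile('hdfs:///tmp/example.txt'))
print lines.distinct().get()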

Bigflow's goals: make distributed programs easier to write, easier to test, more efficient to run, easier to maintain, and cheaper to migrate.

Inside Baidu, Bigflow currently targets the company's internal batch engine DCE (similar to the community's Tez), the iterative engine Spark, and the internal streaming engine Gemini.

In the open-source release, only Bigflow on Spark is available.

Why use Bigflow?

  • High performance  Bigflow's interface design lets it capture more fine-grained properties of the user's computation, and it optimizes jobs based on those properties. Moreover, its execution layer is implemented in C++, and some of the user's code logic is translated into C++ for execution, bringing a substantial performance gain. In real business tests inside Baidu, its performance is far higher than that of hand-written jobs; averaged over jobs rewritten from existing workloads, performance improved by 100%+ over the original user code. A benchmark for the open-source version is in preparation.

  • Easy to use  On the surface, Bigflow's interface looks a lot like Spark's, but in actual use you will find that several unique design choices make Bigflow code read more like a single-machine program: for example, it hides the concept of a partitioner and supports nested distributed datasets (see the sketch after this list), which makes the interface easier to understand and the code more reusable. In particular, in many scenarios that would otherwise require tuning, Bigflow can automatically optimize for performance and memory usage, so users can avoid much of the optimization work otherwise forced by OOMs or insufficient performance, lowering the cost of use.

  • Python is a first-class citizen here  Python is currently our natively supported language. Many PySpark users are troubled by PySpark's inefficiency, by its lack of support for some CPython libraries, or by features that are available only in Scala and Java and not yet in PySpark. In Bigflow, Python is a first-class citizen (after all, it is currently the only language we support), so none of these are problems: performance, functionality, and usability are all friendly to Python users.
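As promised in the second bullet, a short hedged sketch of the nested-dataset style (records is assumed to be a PCollection of (key, value) pairs; group_by is an assumed transform name, while apply_values and flatten_values appear verbatim in the "API Plan" issue below):

# Each group's values form a nested distributed dataset, so per-group
# logic is written with the same calls as top-level code; no partitioner
# is ever mentioned.
grouped = records.group_by(lambda kv: kv[0])
deduped = grouped.apply_values(lambda values: values.distinct())
result = deduped.flatten_values()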

Online Trial

The online trial page (password: bigflow) contains some simple examples introducing Bigflow's concepts and API usage. You can also write Python code online to try out Bigflow's features, with smart code completion.

Note: this page is provided for trial use only and has no security protection. The machines behind it are wiped periodically; please do not use it to store code.

Bigflow Documentation

Bigflow Home Page

Quick Start

Programming Guide

API Reference

Building from Source

How to Contribute

Design Documents

Paper

http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9702-3

Contact Us

To join the Bigflow WeChat technical discussion group, please add WeChat ID iacmol or himddheart, with the note: "Join the Bigflow technical discussion group".

bigflow's People

Contributors

acmol, advancedxy, chunyang-wen, himdd, jimmycasey, scaugrated, tushushu, wanglun, wzhiqing, yshysh


bigflow's Issues

A benchmark system is needed

We should have a benchmark for this open-source version of Bigflow.
And we should go even further and build a benchmark system that can give us benchmark results every day.

readline install failed

(Screenshot: pip install readline fails while the build continues)

As illustrated in the screenshot above, pip install readline failed but the build continued.

The build scripts should be improved so that readline builds successfully and this kind of error is detected (see the sketch below).
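A minimal sketch of the requested fail-fast behavior, written here in Python for illustration (the actual build scripts are shell, so this is an assumption about shape, not a patch):

import subprocess
import sys

def pip_install(package):
    """Install a package and abort the build if pip fails."""
    ret = subprocess.call([sys.executable, '-m', 'pip', 'install', package])
    if ret != 0:
        sys.exit('pip install %s failed with exit code %d' % (package, ret))

pip_install('readline')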

Continuous Integration Support

Continuous integration is required for Bigflow, and we should have a system to support continuous integration.

Maybe we can use travis-ci.org or TeamCity, or we can set up a Jenkins instance on BCC (Baidu Cloud Compute)?

Better handle LINK_ALL_SYMBOLS option in cmake/generic.cmake

We implemented two methods (cc_library, cc_binary) in cmake/generic.cmake that accept a LINK_ALL_SYMBOLS argument (meaning all symbols are exported when linking, by wrapping the libs with the gcc options "-Wl,--whole-archive" and "-Wl,--no-whole-archive").

  1. A binary or a library can use ALL_SYMBOLS_DEPS to tell the linker that it needs all the symbols in its deps.

  2. A library that has the LINK_ALL_SYMBOLS attribute should export all symbols to the targets that depend on it. We expect these symbols to be exported recursively; however, currently only the targets that depend on this library directly link all the symbols.

  3. cc_test doesn't have similar functionality.

So cmake/generic.cmake's cc_library/cc_binary/cc_test need improvements to handle LINK_ALL_SYMBOLS.

urllib2.URLError

Environment

OS: CentOS 7.2
Spark: HDP-2.1
Hadoop: HDP-2.7

I have set SPARK_HOME, HADOOP_HOME and BIGFLOW_PYTHON_HOME.

Description

I get an error after running print pipeline.read(input.TextFile("hdfs://localhost:8020/data/test/new.txt")).get():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bigflow/ptype.py", line 87, in get
    if not self._is_readable():
  File "bigflow/ptype.py", line 127, in _is_readable
    return requests.is_node_cached(self._pipeline.id(), self._node.id())
  File "bigflow/rpc/requests.py", line 68, in _wrapper
    result, status = func(*args, **kw)
  File "bigflow/rpc/requests.py", line 318, in is_node_cached
    response = _service.request(request, "is_node_cached")
  File "bigflow/rpc/service.py", line 89, in request
    response = urllib2.urlopen(req, request_json, 365 * 24 * 60 * 60)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/opt/bigflow/python_runtime/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

I tried hdfs://data/test/new.txt and /data/test/new.txt, but it still doesn't work.

Hive Support

Read/write the InputFormat/OutputFormat and SerDe information from/to the Hive Metastore.
Read/write data from/to a Hive table or partition.
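For discussion only, one possible API shape for this feature request, mirroring the existing input.TextFile style. input.HiveTable, output.HiveTable and daily_report are purely hypothetical names; nothing like them exists in Bigflow today:

# Hypothetical sketch of the requested feature; these classes do not exist.
events = pipeline.read(input.HiveTable('warehouse_db', 'events',
                                       partition={'dt': '20180101'}))
# daily_report is a placeholder for whatever transform the user applies.
pipeline.write(daily_report(events),
               output.HiveTable('warehouse_db', 'events_daily'))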

Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure

[ 46%] Built target flume-runtime-spark-profile_scripts

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project flume-runtime: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] javac: invalid target release: 1.8
[ERROR] Usage: javac
[ERROR] use -help for a list of possible options
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
make[2]: *** [flume/CMakeFiles/bulid_spark_launcher_jar] Error 1
make[1]: *** [flume/CMakeFiles/bulid_spark_launcher_jar.dir/all] Error 2
make: *** [all] Error 2
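For what it's worth, "javac: invalid target release: 1.8" usually means the javac that Maven found is older than JDK 8; pointing JAVA_HOME at a JDK 8 installation typically resolves it.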

English documents are required

We have some documents in English, but they contain many mistakes and are outdated.

They need to be refined.

Our documents are generated by sphinx-doc, and sphinx-doc has internationalization support.

So I think all the comments should be written in English, especially the ones that are sources of the generated documents. Chinese documents should then be generated with the help of sphinx-doc's internationalization support.

Compiling error due to mvn

Some errors occur when executing commands that involve mvn.

After I added .m2/settings.xml, the build process continued; I am not sure what happened.

I think the build documentation should include instructions about this.

Enable travis' build matrix

Some combinations are:

OS                Compiler
Ubuntu            gcc
Ubuntu            clang
Mac               clang
Docker (CentOS)   gcc
Docker (CentOS)   clang
Windows (WSL)     gcc (TBD)
Windows (WSL)     clang (TBD)
Windows           x

manage byproducts

While running a Bigflow program, I find that it outputs some byproducts, e.g.:

entity-*
.flume
...

After several runs, they leave the folder in a mess.
Could you please put those byproducts into a pre-specified subfolder so that they can be managed conveniently? (A possible interim workaround is sketched below.)
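One possible interim workaround, sketched here under the assumption that the byproducts are written to the current working directory:

import errno
import os

# Run the job from a scratch directory so byproducts such as entity-*
# and .flume land there instead of cluttering the project folder.
workdir = os.path.join(os.getcwd(), 'bigflow_scratch')
try:
    os.makedirs(workdir)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise
os.chdir(workdir)
# ... create the Pipeline and run the job from here ...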

Remove code which is used for streaming process

Since we haven't open-sourced the streaming processing part of Bigflow, we should have removed all the code related to it. We've done most of the work, but some streaming-related code still remains.
So, remove the unused code and keep the codebase clean.

Support structured IO formats on Spark

Definitions

Structured input formats specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (reading only) and Parquet files with its own loaders, as DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details of how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

(Figure: Parquet support architecture on DCE)

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow RecordBatch. Spark already supports transforming a Dataset to RDD[ArrowPayload] (see Dataset.scala), though not publicly.

It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV files.

Implementation details for adding read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task (a small pyarrow sketch of the RecordBatch flow follows this list)
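As context for steps 2-3, a small pyarrow sketch, unrelated to Bigflow's internals, of producing and consuming the Arrow RecordBatches that a component like PythonFromRecordBatchProcessor would receive (iter_batches and to_pylist require a reasonably recent pyarrow):

import pyarrow.parquet as pq

def process(record):
    """Stand-in for handing one record (a dict) to user code."""
    print(record)

reader = pq.ParquetFile('part-00000.parquet')
for batch in reader.iter_batches():      # Arrow RecordBatches
    for record in batch.to_pylist():     # one dict per row
        process(record)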

Write support

Bigflow uses its own sinker implementation to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is needed, namely:

  1. Refactor the current ParquetSinker and the Arrow Schema Converter
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more of its power. See the Arrow SlideShare.

cc @himdd @chunyang-wen @bb7133 @acmol for comments; PRs are appreciated

API Plan optimization is needed

We should add an optimization layer within the current API layer; it could be called the API Plan layer.

At the moment, the API layer transforms the user's code into a LogicalPlan directly, and some information is lost. For example, the LogicalPlan doesn't know what a join is; it only knows that two nodes are cogrouped and that a Processor then processes the two cogrouped results.

E.g.,

pc1.distinct().join(pc2)

is equivalent to

pc1.cogroup(pc2) \
       .apply_values(lambda p1, p2: p1.distinct().cartesian(p2)) \
       .flatten_values()

But we can't optimize this automatically without the help of an API Plan.

So the API Plan is meant to keep all the information we can get from the user's code, and to optimize the plan using that information.
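To make the point concrete, a toy sketch (hypothetical classes, not Bigflow code) of the kind of information an API Plan would retain and the LogicalPlan has already lost:

class ApiNode(object):
    """One node of a hypothetical API Plan: keeps the user-level op name."""
    def __init__(self, op, *inputs):
        self.op = op          # 'join', 'distinct', 'cogroup', ...
        self.inputs = inputs

def has_distinct_input(node):
    # A rewrite rule can only test for 'join'/'distinct' because the API
    # Plan still records them; the LogicalPlan sees only a cogroup plus
    # an opaque Processor.
    return node.op == 'join' and any(i.op == 'distinct' for i in node.inputs)

plan = ApiNode('join', ApiNode('distinct', ApiNode('source')), ApiNode('source'))
assert has_distinct_input(plan)   # pc1.distinct().join(pc2) is recognizable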
