SparkCube

SparkCube is an open-source project for extremely fast OLAP data analysis, implemented as an extension of Apache Spark.

Build from source

mvn -DskipTests package

The default Spark version used is 2.4.4.
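If the project's pom.xml exposes the Spark version as a Maven property (an assumption — check the pom for the actual property name), a different version could be selected at build time:

```shell
# Hypothetical: override the Spark version property if the pom defines one,
# e.g. a property named spark.version. Verify the property name in pom.xml first.
mvn -DskipTests -Dspark.version=2.4.5 package
```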

Run tests

mvn test

Use with Apache Spark

Add the following settings to your Spark configuration.

| config | value | comment |
|---|---|---|
| spark.sql.extensions | com.alibaba.sparkcube.SparkCube | Adds the extension. Required. |
| spark.sql.cache.tab.display | true | Shows the web UI in a given application, typically the Spark Thriftserver. Required. |
| spark.sql.cache.useDatabase | db1,db2,dbn | Comma-separated list of database names; only tables and views from these databases are considered for cube building. Required. |
| spark.sql.cache.cacheByPartition | true/false | Stores the cache by partition. Optional. |
| spark.driver.extraClassPath | /path/to/this/jar | For web UI resources. Required. |

With the configurations above set in your Spark Thriftserver, the "Cube Management" tab appears in the Thriftserver web UI after any SELECT command has run. From that page you can create, delete, and build cubes.
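As a sketch, the settings above might be passed on the command line when launching the Thriftserver; the jar path and database names below are placeholders:

```shell
# Launch Spark Thriftserver with the SparkCube extension enabled.
# Replace the jar path and database list with your own values.
./sbin/start-thriftserver.sh \
  --conf spark.sql.extensions=com.alibaba.sparkcube.SparkCube \
  --conf spark.sql.cache.tab.display=true \
  --conf spark.sql.cache.useDatabase=db1,db2 \
  --conf spark.sql.cache.cacheByPartition=true \
  --conf spark.driver.extraClassPath=/path/to/spark-cube.jar
```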

Once an appropriate cube has been created, any spark-sql client can query it through Spark SQL. Note that cubes can be built on tables or views, so you can join tables into a view to back a more complex cube.
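For illustration, a cube could be backed by a view that joins a fact table with a dimension table; the database, table, and column names here are hypothetical:

```sql
-- Hypothetical schema: join fact and dimension tables into a view,
-- then create the cube against this view from the "Cube Management" tab.
CREATE VIEW db1.sales_enriched AS
SELECT s.order_id, s.amount, d.region, d.category
FROM db1.sales s
JOIN db1.dim_product d ON s.product_id = d.product_id;
```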

For a more detailed tutorial on creating, building, and dropping cubes, see https://help.aliyun.com/document_detail/149293.html

Learning materials

(Slides)

https://www.slidestalk.com/AliSpark/SparkRelationalCache78971

https://www.slidestalk.com/AliSpark/SparkRelationalCache2019_57927

(Blogs)

https://yq.aliyun.com/articles/703046

https://yq.aliyun.com/articles/703154

https://yq.aliyun.com/articles/713746

https://yq.aliyun.com/articles/725413

(Blogs In English)

https://community.alibabacloud.com/blog/rewriting-the-execution-plan-in-the-emr-spark-relational-cache_595267

https://www.alibabacloud.com/blog/use-emr-spark-relational-cache-to-synchronize-data-across-clusters_595301

https://www.alibabacloud.com/blog/using-data-preorganization-for-faster-queries-in-spark-on-emr_595599

Contributors

adrian-wang, alibaba-oss, dependabot[bot], frankleaf


sparkcube's Issues

Cannot create raw cache and cube cache.

When creating a raw cache, it shows the following error message:
Cannot persistent default.cube_test into hive metastore as table property keys may not start with 'spark.sql.': [spark.sql.cached.info.raw.enableRewrite, spark.sql.cached.info.raw.cacheColumns, spark.sql.cached.info.raw, spark.sql.cached.info.raw.zorderBy, spark.sql.cached.info.raw.storagePath, spark.sql.cached.info.raw.provider, spark.sql.cached.info.raw.lastUpdatedTime, spark.sql.cached.info.raw.partitionBy, spark.sql.cached];

When creating a cube cache, it shows the following error message:
Cannot persistent default.cube_test into hive metastore as table property keys may not start with 'spark.sql.': [spark.sql.cached.info.cube, spark.sql.cached.info.cube.lastUpdatedTime, spark.sql.cached.info.cube.storagePath, spark.sql.cached.info.cube.provider, spark.sql.cached.info.cube.enableRewrite, spark.sql.cached, spark.sql.cached.info.cube.measures, spark.sql.cached.info.cube.partitionBy, spark.sql.cached.info.cube.dims];

[WEBUI]Can't start web ui

When I run this, I hit an issue with the web UI. I have checked the code and it seems fine; is there anything I have missed?

java.lang.Exception: Could not find resource path for Web UI: com/alibaba/sparkcube/execution/ui/static
	at org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:197)
	at org.apache.spark.ui.WebUI.addStaticHandler(WebUI.scala:121)
	at org.apache.spark.ui.SparkCubeTab.<init>(SparkCubeTab.scala:42)
	at org.apache.spark.sql.CubeSharedState$$anonfun$3.apply(CubeSharedState.scala:47)
	at org.apache.spark.sql.CubeSharedState$$anonfun$3.apply(CubeSharedState.scala:47)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.sql.CubeSharedState.<init>(CubeSharedState.scala:47)
	at org.apache.spark.sql.CubeSharedState$.get(CubeSharedState.scala:63)
	at com.alibaba.sparkcube.CubeManager.cubeCatalog(CubeManager.scala:213)
	at com.alibaba.sparkcube.CubeManager.listAllCaches(CubeManager.scala:230)
	at org.apache.spark.ui.SparkCubePage.render(SparkCubePage.scala:54)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84)
	at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
	at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
	at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
	at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.spark_project.jetty.server.Server.handle(Server.java:539)
	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
	at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
	at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)

dependency and how to use the jar

Hi all,
I would like to know whether the new release is compatible with Spark 3, and how to use the SparkCube jar file. When I tried to compile the code, the following dependency was unavailable:

<groupId>com.swoop</groupId>
<artifactId>spark-alchemy_2.11</artifactId>
<version>0.3.28</version>

Failed to collect dependencies at com.swoop:spark-alchemy_2.11:jar:0.3.28: Failed to read artifact descriptor for com.swoop:spark-alchemy_2.11:jar:0.3.28: Could not transfer artifact com.swoop:spark-alchemy_2.11:pom:0.3.28 from/to swoop-inc (https://dl.bintray.com/swoop-inc/maven/): Access denied to: https://dl.bintray.com/swoop-inc/maven/com/swoop/spark-alchemy_2.11/0.3.28/spark-alchemy_2.11-0.3.28.pom , ReasonPhrase:Forbidden

If I change the dependency in pom.xml to spark-alchemy-test_2.12 version 1.0.1, I get the errors below:

private implicit def cacheIdToTableIdent(cacheIdentifier: CacheIdentifier): TableIdentifier = {
[WARNING]                        ^
[ERROR] /home/rym/Downloads/SparkCube-0.3.0/src/main/scala/com/alibaba/sparkcube/execution/PreCountDistinctTransformer.scala:20: object spark is not a member of package com.swoop.alchemy
[ERROR] import com.swoop.alchemy.spark.expressions.hll.HyperLogLogInitSimpleAgg
[ERROR]                          ^
[ERROR] /home/rym/Downloads/SparkCube-0.3.0/src/main/scala/com/alibaba/sparkcube/execution/PreCountDistinctTransformer.scala:44: not found: value HyperLogLogInitSimpleAgg
[ERROR]           HyperLogLogInitSimpleAgg(childExpr, relativeSD)
[ERROR]           ^
[ERROR] /home/rym/Downloads/SparkCube-0.3.0/src/main/scala/com/alibaba/sparkcube/optimizer/GenPlanFromCache.scala:22: object spark is not a member of package com.swoop.alchemy
[ERROR] import com.swoop.alchemy.spark.expressions.hll.{HyperLogLogCardinality, HyperLogLogMerge}
[ERROR]                          ^
[ERROR] /home/rym/Downloads/SparkCube-0.3.0/src/main/scala/com/alibaba/sparkcube/optimizer/GenPlanFromCache.scala:330: not found: value HyperLogLogCardinality
[ERROR]                   case _: CardinalityAfter => HyperLogLogCardinality(attrs.head)
[ERROR]                                               ^
[ERROR] /home/rym/Downloads/SparkCube-0.3.0/src/main/scala/com/alibaba/sparkcube/optimizer/GenPlanFromCache.scala:417: not found: value HyperLogLogCardinality
[ERROR]                   Some(HyperLogLogCardinality(other))
[ERROR]                        ^
[ERROR] /home/rym/Downloads/SparkCube-0.3.0/src/main/scala/com/alibaba/sparkcube/optimizer/GenPlanFromCache.scala:509: not found: value HyperLogLogMerge
[ERROR]       HyperLogLogMerge(args.head, hllpp.mutableAggBufferOffset, hllpp.inputAggBufferOffset)
[ERROR]       ^
[WARNING] one warning found
[ERROR] 6 errors found

Thank you.
