
tensorframes's Introduction


TensorFrames (Deprecated)

Note: TensorFrames is deprecated. You can use pandas UDF instead.
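As an illustration, here is a minimal sketch of the pandas UDF alternative (assuming PySpark 2.3+ with pyarrow installed; the function name plus_three is only an example). It reproduces the "add 3 to a column" example shown further below:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_three(x):
    # x is a pandas Series holding a batch of values from the input column
    return x + 3

# df.withColumn("z", plus_three(df["x"]))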

Experimental TensorFlow binding for Scala and Apache Spark.

TensorFrames (TensorFlow on Spark DataFrames) lets you manipulate Apache Spark's DataFrames with TensorFlow programs.

This package is experimental and is provided as a technical preview only. While the interfaces are all implemented and working, there are still some areas of low performance.

Supported platforms:

This package officially supports only Linux 64-bit platforms as a target. Contributions are welcome for other platforms.

See the file project/Dependencies.scala for adding your own platform.

Officially TensorFrames supports Spark 2.4+ and Scala 2.11.

See the user guide for extensive information about the API.

For questions, see the TensorFrames mailing list.

TensorFrames is available as a Spark package.

Requirements

  • A working version of Apache Spark (2.4 or greater)

  • Java 8+

  • (Optional) python 2.7+/3.6+ if you want to use the python interface.

  • (Optional) the python TensorFlow package if you want to use the python interface. See the official instructions on how to get the latest release of TensorFlow.

  • (Optional) pandas >= 0.19.1 if you want to use the python interface

Additionally, for development, you need the following dependencies:

  • protoc 3.x

  • nose >= 1.3

How to run in python

Assuming that SPARK_HOME is set, you can launch PySpark with TensorFrames like any other Spark package:

$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.6.0-s_2.11

Here is a small program that uses TensorFlow to add 3 to an existing column.

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(10)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")
    # The output that adds 3 to x
    z = tf.add(x, 3, name='z')
    # The resulting dataframe
    df2 = tfs.map_blocks(z, df)

# The transform is lazy as for most DataFrame operations. This will trigger it:
df2.collect()

# Notice that z is an extra column next to x

# [Row(z=3.0, x=0.0),
#  Row(z=4.0, x=1.0),
#  Row(z=5.0, x=2.0),
#  Row(z=6.0, x=3.0),
#  Row(z=7.0, x=4.0),
#  Row(z=8.0, x=5.0),
#  Row(z=9.0, x=6.0),
#  Row(z=10.0, x=7.0),
#  Row(z=11.0, x=8.0),
#  Row(z=12.0, x=9.0)]

The second example demonstrates block-wise reducing operations: we compute the sum and the element-wise minimum of a column containing vectors of doubles, working with blocks of rows for more efficient processing.

# Build a DataFrame of vectors
data = [Row(y=[float(y), float(-y)]) for y in range(10)]
df = sqlContext.createDataFrame(data)
# Because the dataframe contains vectors, we need to analyze it first to find the
# dimensions of the vectors.
df2 = tfs.analyze(df)

# The information gathered by TF can be printed to check the content:
tfs.print_schema(df2)
# root
#  |-- y: array (nullable = false) double[?,2]

# Let's use the analyzed dataframe to compute the sum and the elementwise minimum 
# of all the vectors:
# First, let's make a copy of the 'y' column. This will be very cheap in Spark 2.0+
df3 = df2.select(df2.y, df2.y.alias("z"))
with tf.Graph().as_default() as g:
    # The placeholders. Note the special names that end with '_input':
    y_input = tfs.block(df3, 'y', tf_name="y_input")
    z_input = tfs.block(df3, 'z', tf_name="z_input")
    y = tf.reduce_sum(y_input, [0], name='y')
    z = tf.reduce_min(z_input, [0], name='z')
    # The resulting dataframe
    (data_sum, data_min) = tfs.reduce_blocks([y, z], df3)

# The final results are numpy arrays:
print(data_sum)
# [45., -45.]
print(data_min)
# [0., -9.]

Notes

Note the scoping of the graphs above. This is important because TensorFrames finds which DataFrame column to feed to TensorFlow based on the names of the placeholders in the graph. It is also good practice to keep graphs small when sending them to Spark.

For small tensors (scalars and vectors), TensorFrames usually infers the shapes of the tensors without requiring a preliminary analysis. If it cannot do it, an error message will indicate that you need to run the DataFrame through tfs.analyze() first.

Look at the python documentation of the TensorFrames package to see what methods are available.

How to run in Scala

The Scala support is a bit more limited than the Python one. In Scala, operations can be loaded from an existing graph defined in the protocol buffers format, or built using a simple Scala DSL. The Scala DSL only features a subset of TensorFlow transforms. It is easy to extend, though, so other transforms will be added without much effort in the future.

You simply use the published package:

$SPARK_HOME/bin/spark-shell --packages databricks:tensorframes:0.6.0-s_2.11

Here is the same program as before:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._

val df = spark.createDataFrame(Seq(1.0->1.1, 2.0->2.2)).toDF("a", "b")

// As in Python, scoping is recommended to prevent name collisions.
val df2 = tf.withGraph {
    val a = df.block("a")
    // Unlike python, the scala syntax is more flexible:
    val out = a + 3.0 named "out"
    // The 'mapBlocks' method is added using implicits to dataframes.
    df.mapBlocks(out).select("a", "out")
}

// The transform is all lazy at this point, let's execute it with collect:
df2.collect()
// res0: Array[org.apache.spark.sql.Row] = Array([1.0,4.0], [2.0,5.0])   

How to compile and install for developers

It is recommended that you use a Conda environment to guarantee that the build environment can be reproduced. Once you have installed Conda, you can set up the environment from the root of the project:

conda create -q -n tensorframes-environment python=$PYTHON_VERSION

This will create an environment for your project. We recommend using Python version 3.7 or 2.7.13. After the environment is created, you can activate it and install all dependencies as follows:

conda activate tensorframes-environment
pip install --user -r python/requirements.txt

You also need to compile the scala code. The recommended procedure is to use the assembly:

build/sbt tfs_testing/assembly
# Builds the spark package:
build/sbt distribution/spDist

Assuming that SPARK_HOME is set and that you are in the root directory of the project:

$SPARK_HOME/bin/spark-shell --jars $PWD/target/testing/scala-2.11/tensorframes-assembly-0.6.1-SNAPSHOT.jar

If you want to run the python version:

PYTHONPATH=$PWD/target/testing/scala-2.11/tensorframes-assembly-0.6.1-SNAPSHOT.jar \
$SPARK_HOME/bin/pyspark --jars $PWD/target/testing/scala-2.11/tensorframes-assembly-0.6.1-SNAPSHOT.jar

Acknowledgements

Before TensorFlow released its Java API, this project was built on the great javacpp project, which implements the low-level bindings between TensorFlow and the Java virtual machine.

Many thanks to Google for the release of TensorFlow.

tensorframes's People

Contributors

felixcheung, jaceklaskowski, jkbradley, kai-chen, loquats, lu-wang-dl, mariusvniekerk, mengxr, mrbago, phi-dbq, smurching, thunterdb, tjhunter, tomasatdatabricks, yupbank


tensorframes's Issues

groupBy example does not work with spark 2.1

The example in the wiki triggers the following error:

AttributeError: 'GroupedData' object has no attribute '_jdf'

df3 = tfs.aggregate([x, count], gb)

tensorframes/core.py in aggregate(fetches, grouped_data)
    294     fetches = _check_fetches(fetches)
    295     graph = _get_graph(fetches)
--> 296     builder = _java_api().aggregate_blocks(grouped_data._jdf)

Can not import tensorframes

Hi,

I am trying to load tensorframes from databricks using the following script but could not succeed so far. Can you please let me know what I am doing wrong?

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:tensorframes:0.2.3-s_2.10 pyspark-shell'

import pyspark
try:
    import tensorflow as tf
    import tensorframes as tfs
    print("Tensorframes successfully imported")
except:
    print("Failed to import tensorframes")

Implement the different block mapping strategies

For some operations, it does not make sense (and it is more costly) to try to append to existing row content. Here is the suggestion:

  • mapRows -> appends content
  • mapBlocks -> overload with a new parameter 'keepExisting: Bool' that indicates if the existing columns should be kept in the output (default is true). Behavior of non-overloaded version may be changed to try to support an auto mode if the sizes are known to work together.

The shape resolution may depend on exact shape information only available at runtime, not analysis time.
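A hypothetical sketch of what the overloaded call could look like from Python (the keep_existing flag does not exist in the current API; it is shown only to illustrate the intent):

import tensorflow as tf
import tensorframes as tfs

with tf.Graph().as_default():
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name="z")
    # keep_existing=True (default): output keeps 'x' and adds 'z'
    # keep_existing=False: output contains only the computed column 'z'
    df2 = tfs.map_blocks(z, df, keep_existing=False)  # hypothetical parameter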

Basic support for TensorFlow sessions

For some complex graphs such as neural networks, it is better to keep a session open across multiple operations. Offer some low-level primitives to open and close sessions on the workers.

These sessions should be resilient to changes on the workers, so using variables this way may be a bit more tricky.
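As a rough illustration of the idea (not part of the TensorFrames API), here is a minimal sketch that reuses one TensorFlow session per partition with plain PySpark; the function name plus_three_partition is only an example, and df is a DataFrame with a numeric column 'x' as in the first README example:

import tensorflow as tf

def plus_three_partition(rows):
    # Build the graph and open the session once per partition, not once per row.
    with tf.Graph().as_default():
        x = tf.placeholder(tf.float64, shape=[None], name="x")
        z = tf.add(x, 3, name="z")
        with tf.Session() as sess:
            for row in rows:
                yield float(sess.run(z, feed_dict={x: [row.x]})[0])

result = df.rdd.mapPartitions(plus_three_partition).collect()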

Using tensorframes on Databricks (Community Edition)

Hi I have installed the Maven Artifact of tensorframes on Databricks. It was successfuly added and was attached to the cluster. However, when I tried to import tensorflow in the code, I got an import error that no module named tensorflow exist. I would like to know if you have any idea to work around this. Thanks.

Merge performance changes

One of the branches includes important performance changes that reduce the cost of I/O between Spark and TensorFlow.

Transfer ownership to Databricks

The repository of record will be the fork currently owned by the Databricks repository.

I hope this will not affect current users and branches.

NoClassDefFoundError: org/apache/spark/Logging

I am trying to run the sample code with $ pyspark --packages databricks:tensorframes:0.2.3-s_2.10
I am getting this error (attached below) with spark-1.6, spark-2.0, and a custom Spark build from branch-2.0.
I tried to find an org.apache.spark.Logging jar, which I am unable to find anywhere.
I am using Mac OS El Capitan.

Please help.

Py4JJavaError Traceback (most recent call last)
in ()
8 # The TensorFlow placeholder that corresponds to column 'x'.
9 # The shape of the placeholder is automatically inferred from the DataFrame.
---> 10 x = tfs.block(df, "x")
11 # The output that adds 3 to x
12 z = tf.add(x, 3, name='z')

/private/var/folders/_f/fzxfs9dj33j0381wvy5bdsdm0000gn/T/spark-1831a94f-7cdd-4691-a32a-4cbc3d9bc3a5/userFiles-66cd46aa-619d-41cf-806d-3dc6e509e13d/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py in block(df, col_name, tf_name)
313 :return: a TensorFlow placeholder.
314 """
--> 315 return _auto_placeholder(df, col_name, tf_name, block = True)
316
317 def row(df, col_name, tf_name = None):

/private/var/folders/_f/fzxfs9dj33j0381wvy5bdsdm0000gn/T/spark-1831a94f-7cdd-4691-a32a-4cbc3d9bc3a5/userFiles-66cd46aa-619d-41cf-806d-3dc6e509e13d/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py in _auto_placeholder(df, col_name, tf_name, block)
331
332 def _auto_placeholder(df, col_name, tf_name, block):
--> 333 info = _java_api().extra_schema_info(df._jdf)
334 col_shape = [x.shape() for x in info if x.fieldName() == col_name]
335 if len(col_shape) == 0:

/private/var/folders/_f/fzxfs9dj33j0381wvy5bdsdm0000gn/T/spark-1831a94f-7cdd-4691-a32a-4cbc3d9bc3a5/userFiles-66cd46aa-619d-41cf-806d-3dc6e509e13d/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py in _java_api()
28 # You cannot simply call the creation of the the class on the _jvm due to classloader issues
29 # with Py4J.
---> 30 return _jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName)
31 .newInstance()
32

/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in call(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.pyc in deco(_a, *_kw)
61 def deco(_a, *_kw):
62 try:
---> 63 return f(_a, *_kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(

Py4JJavaError: An error occurred while calling o41.loadClass.
: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 22 more

Readme Example throwing Py4J error

I am using Spark 2.0.2, Python 2.7.12, iPython 5.1.0 on macOS 10.12.1.

I am launching pyspark like this

$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.3-s_2.10

From the demo, this block

with tf.Graph().as_default() as g:
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_blocks(z, df)

crashes with the following traceback:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-3-e7ae284146c3> in <module>()
      4     # The TensorFlow placeholder that corresponds to column 'x'.
      5     # The shape of the placeholder is automatically inferred from the DataFrame.
----> 6     x = tfs.block(df, "x")
      7     # The output that adds 3 to x
      8     z = tf.add(x, 3, name='z')

/private/var/folders/tb/r74wwyk17b3fn_frdb0gd0780000gn/T/spark-b3c869d7-6d28-4bce-9a2e-d46f43fc83df/userFiles-64edc1b1-03db-40e6-9d7f-f062d0491a77/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py in block(df, col_name, tf_name)
    313     :return: a TensorFlow placeholder.
    314     """
--> 315     return _auto_placeholder(df, col_name, tf_name, block = True)
    316
    317 def row(df, col_name, tf_name = None):

/private/var/folders/tb/r74wwyk17b3fn_frdb0gd0780000gn/T/spark-b3c869d7-6d28-4bce-9a2e-d46f43fc83df/userFiles-64edc1b1-03db-40e6-9d7f-f062d0491a77/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py in _auto_placeholder(df, col_name, tf_name, block)
    331
    332 def _auto_placeholder(df, col_name, tf_name, block):
--> 333     info = _java_api().extra_schema_info(df._jdf)
    334     col_shape = [x.shape() for x in info if x.fieldName() == col_name]
    335     if len(col_shape) == 0:

/private/var/folders/tb/r74wwyk17b3fn_frdb0gd0780000gn/T/spark-b3c869d7-6d28-4bce-9a2e-d46f43fc83df/userFiles-64edc1b1-03db-40e6-9d7f-f062d0491a77/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py in _java_api()
     28     # You cannot simply call the creation of the the class on the _jvm due to classloader issues
     29     # with Py4J.
---> 30     return _jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
     31         .newInstance()
     32

/Users/damien/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/Users/damien/spark-2.0.2-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/Users/damien/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o47.loadClass.
: java.lang.NoClassDefFoundError: org/apache/spark/Logging
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 22 more

Issue building from source

I was trying to work around the GPU requirement in the currently published package (My environment does not have GPUs) and build the package from source with a different build of tensorflow. The import works, but once I try to run a command from the tensorframes package (tfs.block() from the demo) I get a gigantic stack trace, that concludes with
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
Exception in thread "Thread-90" java.lang.NoClassDefFoundError: org/apache/spark/sql/RelationalGroupedDataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:367)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:319)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.RelationalGroupedDataset
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
Exception in thread "Thread-91" java.lang.NoClassDefFoundError: org/apache/spark/sql/RelationalGroupedDataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:367)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:319)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.RelationalGroupedDataset
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Exception in thread "Thread-92" ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
java.lang.NoClassDefFoundError: org/apache/spark/sql/RelationalGroupedDataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:367)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:319)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.RelationalGroupedDataset
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Exception in thread "Thread-93" ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
^Cjava.lang.NoClassDefFoundError: org/apache/spark/sql/RelationalGroupedDataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:367)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:319)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.RelationalGroupedDataset
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Exception in thread "Thread-94" java.lang.NoClassDefFoundError: org/apache/spark/sql/RelationalGroupedDataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:367)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:319)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.RelationalGroupedDataset
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Traceback (most recent call last):
File "", line 1, in
File "/app/home/nzshiffm/tensorframes/target/scala-2.11/tensorframes-assembly-0.2.4.jar/tensorframes/core.py", line 315, in block
File "/app/home/nzshiffm/tensorframes/target/scala-2.11/tensorframes-assembly-0.2.4.jar/tensorframes/core.py", line 333, in _auto_placeholder
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 811, in call
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 631, in send_command
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 749, in send_command
File "/usr/lib64/python2.7/logging/init.py", line 1182, in exception
self.error(msg, *args, **kwargs)
File "/usr/lib64/python2.7/logging/init.py", line 1175, in error
self._log(ERROR, msg, args, **kwargs)
File "/usr/lib64/python2.7/logging/init.py", line 1268, in _log
self.handle(record)
File "/usr/lib64/python2.7/logging/init.py", line 1278, in handle
self.callHandlers(record)
File "/usr/lib64/python2.7/logging/init.py", line 1318, in callHandlers
hdlr.handle(record)
File "/usr/lib64/python2.7/logging/init.py", line 749, in handle
self.emit(record)
File "/usr/lib64/python2.7/logging/init.py", line 879, in emit
self.handleError(record)
File "/usr/lib64/python2.7/logging/init.py", line 802, in handleError
None, sys.stderr)
File "/usr/lib64/python2.7/traceback.py", line 125, in print_exception
print_tb(tb, limit, file)
File "/usr/lib64/python2.7/traceback.py", line 69, in print_tb
line = linecache.getline(filename, lineno, f.f_globals)
File "/usr/lib64/python2.7/linecache.py", line 14, in getline
lines = getlines(filename, module_globals)
File "/usr/lib64/python2.7/linecache.py", line 40, in getlines
return updatecache(filename, module_globals)
File "/usr/lib64/python2.7/linecache.py", line 128, in updatecache
lines = fp.readlines()
RuntimeError: maximum recursion depth exceeded while calling a Python object

The NoClassDefFoundError errors don't occur when I run with the precompiled version included with --packages. I couldn't get build/sbt tfPackage to work; there was a 'no key found tfPackage' message when I tried to run that. Have I not completed the build if I just ran build/sbt assembly?

Py4JError("Answer from Java side is empty") while testing

I have been experimenting with TensorFrames from quite some days. I have spark-1.6.1 and openjdk7 installed on my ubuntu 14.04 64bit machine. I am using IPython notebook for testing.

import tensorframes as tfs command is working perfectly fine, but when i do tfs.print_schema(df), where df is a dataframe. The below error pops recursively till max. depth is reached.

ERROR:py4j.java_gateway:Error while sending or receiving. Traceback (most recent call last): File "/home/prakhar/utilities/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command raise Py4JError("Answer from Java side is empty") Py4JError: Answer from Java side is empty

Does not work with Python3

I just started using this with Python3, these are my commands run and the output messages.

$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.3-s_2.10

Python 3.4.3 (default, Mar 26 2015, 22:03:40)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-1.5.2/assembly/target/scala-2.10/spark-assembly-1.5.2-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
databricks#tensorframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found databricks#tensorframes;0.2.3-s_2.10 in spark-packages
found org.apache.commons#commons-lang3;3.4 in central
:: resolution report :: resolve 98ms :: artifacts dl 4ms
:: modules in use:
databricks#tensorframes;0.2.3-s_2.10 from spark-packages in [default]
org.apache.commons#commons-lang3;3.4 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/3ms)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 3.4.3 (default, Mar 26 2015 22:03:40)
SparkContext available as sc, SQLContext available as sqlContext.

import tensorflow as tf
import tensorframes as tfs

Traceback (most recent call last):
File "", line 1, in
File "/tmp/spark-349c9955-ccd8-4fcd-938a-7e719fc45653/userFiles-bb935142-224f-4238-a144-f1cece7a5aa2/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/init.py", line 36, in
ImportError: No module named 'core'

Error kmeans example

Hi,
When I run the k-means example I got the following error

Py4JJavaError: An error occurred while calling o109.buildDF.
: java.lang.AssertionError: assertion failed: Op type not registered 'StridedSlice'

Does anyone help me?
Thanks

Support a feed_dict for the mapping and reducing operations

In iterative algorithms, it is useful to broadcast some extra tensors, for example in k-means. This can be combined with leaving sessions opened and provide a performance boost.

The current workaround is to compute a new graph with a different constant.
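For reference, a minimal sketch of the current workaround, following the vector example from the README above (the centers value, the distance computation, and the use of sqlContext are assumptions for illustration):

import numpy as np
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# An analyzed DataFrame of 2-d vectors, as in the README example.
df = tfs.analyze(sqlContext.createDataFrame([Row(y=[float(i), float(-i)]) for i in range(10)]))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # the value we would like to broadcast
with tf.Graph().as_default():
    # Workaround: bake the value into the graph as a constant instead of feeding it.
    c = tf.constant(centers, dtype=tf.float64, name="centers")
    y = tfs.block(df, "y")
    # squared distance of every row vector to every center, appended as column 'd'
    d = tf.reduce_sum(tf.square(tf.expand_dims(y, 1) - c), axis=2, name="d")
    df2 = tfs.map_blocks(d, df)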

Couldn't run the example

Hi All,

I am trying to use TensorFrames. While I am running the python example, I am facing the error like below.

Traceback (most recent call last):
  File "/home_bunch/jeahn/projects/spark-examples/test_tensorframes.py", line 18, in <module>
    df2 = tfs.map_blocks(z, df)
  File "/home/jeahn/.ivy2/jars/databricks_tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py", line 217, in map_blocks
  File "/cm/shared/apps/spark/spark-1.6.1-bin-without-hadoop/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/cm/shared/apps/spark/spark-1.6.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/cm/shared/apps/spark/spark-1.6.1-bin-without-hadoop/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o77.buildDF.
: java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path
        at java.lang.ClassLoader.loadLibrary(Unknown Source)
        at java.lang.Runtime.loadLibrary0(Unknown Source)
        at java.lang.System.loadLibrary(Unknown Source)
        at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:654)
        at org.bytedeco.javacpp.Loader.load(Loader.java:492)
        at org.bytedeco.javacpp.Loader.load(Loader.java:409)
        at org.bytedeco.javacpp.tensorflow.<clinit>(tensorflow.java:10)
        at org.tensorframes.impl.TensorFlowOps$._init$lzycompute(TensorFlowOps.scala:21)
        at org.tensorframes.impl.TensorFlowOps$._init(TensorFlowOps.scala:19)
        at org.tensorframes.impl.TensorFlowOps$.initTensorFlow(TensorFlowOps.scala:27)
        at org.tensorframes.impl.TensorFlowOps$.analyzeGraph(TensorFlowOps.scala:85)
        at org.tensorframes.impl.DebugRowOps.mapBlocks(DebugRowOps.scala:307)
        at org.tensorframes.impl.DebugRowOps.mapBlocks(DebugRowOps.scala:290)
        at org.tensorframes.impl.PythonOpBuilder.buildDF(PythonInterface.scala:126)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsatisfiedLinkError: /tmp/javacpp3334753250106708/libjnitensorflow.so: libcudart.so.7.5: cannot open shared object file: No such file or directory
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(Unknown Source)
        at java.lang.ClassLoader.loadLibrary(Unknown Source)
        at java.lang.Runtime.load0(Unknown Source)
        at java.lang.System.load(Unknown Source)
        at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:637)
        ... 21 more

It seems that the TensorFlow build bundled with the package is the GPU-enabled version. I am wondering whether the python example assumes GPU-enabled TensorFlow or not. Actually, my system does not have any GPU.

If not, does anyone experience the same issue? I am currently using Spark 1.6 and Tensorflow r0.12.

Thanks,
Jeongseob

demo does not work in pyCharm

I'm currently trying to run the demo in PyCharm; my file looks like this (the attached pyFile is the location of my tensorframes jar):

import tensorflow as tf
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonSQL") \
    .getOrCreate()

spark.sparkContext.addPyFile('/home/hdiuser/PycharmProjects/TensorFramesTest/tensorframes-0.2.3-s_2.10.jar')
import tensorframes as tfs

data = [Row(x=float(x)) for x in range(10)]
df = spark.createDataFrame(data)
print df.show()
with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")
    # The output that adds 3 to x
    z = tf.add(x, 3, name='z')
    # The resulting dataframe
    df2 = tfs.map_blocks(z, df)

The transform is lazy as for most DataFrame operations. This will trigger it:

df2.collect()

the output is as follows:

| x|
+---+
|0.0|
|1.0|
|2.0|
|3.0|
|4.0|
|5.0|
|6.0|
|7.0|
|8.0|
|9.0|
+---+

None
Traceback (most recent call last):
File "/home/hdiuser/PycharmProjects/TensorFramesTest/tensor_frame_test.py", line 40, in
x = tfs.block(df, "x")
File "/tmp/spark-8785f696-946f-4b70-8c4f-11d5b1e1ecbc/userFiles-bb752893-c175-4844-968e-5ee426ef7e03/tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py", line 315, in block
File "/tmp/spark-8785f696-946f-4b70-8c4f-11d5b1e1ecbc/userFiles-bb752893-c175-4844-968e-5ee426ef7e03/tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py", line 333, in _auto_placeholder
File "/tmp/spark-8785f696-946f-4b70-8c4f-11d5b1e1ecbc/userFiles-bb752893-c175-4844-968e-5ee426ef7e03/tensorframes-0.2.3-s_2.10.jar/tensorframes/core.py", line 30, in _java_api
File "/home/hdiuser/anaconda2/envs/tensorflow/lib/python2.7/site-packages/py4j/java_gateway.py", line 1133, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 63, in deco
return f(_a, *_kw)
File "/home/hdiuser/anaconda2/envs/tensorflow/lib/python2.7/site-packages/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o67.loadClass.
: java.lang.ClassNotFoundException: org.tensorframes.impl.DebugRowOps
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)

Support pandas dataframe as input

Accepting a pandas DataFrame as input simplifies debugging and provides a simple (non-distributed) implementation.

For this task, all of the TensorFrames Python APIs should support the df argument being a pandas DataFrame, and should use the Python tensorflow package instead of passing the data to Java. This should be a simple wrapper around creating a new TF session, running it, and returning the result values.
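A minimal local sketch of such a wrapper, assuming plain pandas and the Python TensorFlow package (the helper name map_blocks_pandas and its placeholder-matching rule are hypothetical, not the TensorFrames API):

import pandas as pd
import tensorflow as tf

def map_blocks_pandas(fetches, df):
    # Normalize to a list of tensors and grab their graph.
    fetches = fetches if isinstance(fetches, list) else [fetches]
    graph = fetches[0].graph
    # Feed every placeholder named '<col>' or '<col>_input' from the matching column.
    feed = {}
    for op in graph.get_operations():
        if op.type == "Placeholder":
            col = op.name[:-len("_input")] if op.name.endswith("_input") else op.name
            feed[op.outputs[0]] = df[col].values
    with tf.Session(graph=graph) as sess:
        results = sess.run(fetches, feed_dict=feed)
    out = df.copy()
    for tensor, values in zip(fetches, results):
        out[tensor.op.name] = values
    return out

# Example: pdf2 = map_blocks_pandas(z, pd.DataFrame({"x": [float(i) for i in range(10)]}))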

Running Second README.md example (block-wise reducing operations) fails with `java.lang.AssertionError: assertion failed: NodeDef mentions attr 'Tidx'`

While the first example worked, the second example on block-wise reducing operations failed. The code snippet:

with tf.Graph().as_default() as g:
    # The placeholders. Note the special name that end with '_input':
    y_input = tfs.block(df3, 'y', tf_name="y_input")
    z_input = tfs.block(df3, 'z', tf_name="z_input")
    y = tf.reduce_sum(y_input, [0], name='y')
    z = tf.reduce_min(z_input, [0], name='z')
    # The resulting dataframe
    (data_sum, data_min) = tfs.reduce_blocks([y, z], df3)

# The final results are numpy arrays:
print data_sum
# [45.0, -45.0]
print data_min
# [0.0, -9.0]

results in the error:

Py4JJavaError: An error occurred while calling o2619.buildRow. : java.lang.AssertionError: assertion failed: NodeDef mentions attr 'Tidx' not in Op<name=Sum; signature=input:T, reduction_indices:int32 -> output:T; attr=keep_dims:bool,default=false; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT64, DT_INT32, DT_UINT8, DT_UINT16, DT_INT16, DT_INT8, DT_COMPLEX64, DT_COMPLEX128, DT_QINT8, DT_QUINT8, DT_QINT32, DT_HALF]>; NodeDef: y = Sum[T=DT_DOUBLE, Tidx=DT_INT32, keep_dims=false](y_input, y/reduction_indices) at scala.Predef$.assert(Predef.scala:179) at org.tensorframes.impl.TensorFlowOps$$anonfun$analyzeGraph$4.apply(TensorFlowOps.scala:100) at org.tensorframes.impl.TensorFlowOps$$anonfun$analyzeGraph$4.apply(TensorFlowOps.scala:97) at org.tensorframes.impl.TensorFlowOps$.withSession(TensorFlowOps.scala:58) at org.tensorframes.impl.TensorFlowOps$.analyzeGraph(TensorFlowOps.scala:97) at org.tensorframes.impl.SchemaTransforms$class.reduceBlocksSchema(DebugRowOps.scala:81) at org.tensorframes.impl.SchemaTransforms$.reduceBlocksSchema(DebugRowOps.scala:272) at org.tensorframes.impl.PythonOpBuilder.buildRow(PythonInterface.scala:112) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745)

Analysis step does not clear nullability

This example:

data = [Row(y=[float(y), float(-y)]) for y in range(10)]
df = sqlContext.createDataFrame(data)
df2 = tfs.analyze(df)

shows that the array is still nullable

java.lang.ClassNotFoundException: org.tensorframes.impl.DebugRowOps

I built the jar by following the README and then ran it in PyCharm:
https://www.dropbox.com/s/qmrs72l0p8p4bc2/Screen%20Shot%202016-07-06%20at%2011.40.26%20PM.png?dl=0
I added the self-built jar as a content root; I guess that is what causes the error.

Line 11 is x = tfs.block(df, "x")

code:

import tensorflow as tf
import tensorframes as tfs
from pyspark.shell import sqlContext
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(10)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")
    # The output that adds 3 to x
    z = tf.add(x, 3, name='z')
    # The resulting dataframe
    df2 = tfs.map_blocks(z, df)

# The transform is lazy as for most DataFrame operations. This will trigger it:
df2.collect()

log

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/06 23:28:43 INFO SparkContext: Running Spark version 1.6.1
16/07/06 23:28:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/06 23:28:44 INFO SecurityManager: Changing view acls to: julian_qian
16/07/06 23:28:44 INFO SecurityManager: Changing modify acls to: julian_qian
16/07/06 23:28:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(julian_qian); users with modify permissions: Set(julian_qian)
16/07/06 23:28:44 INFO Utils: Successfully started service 'sparkDriver' on port 60597.
16/07/06 23:28:45 INFO Slf4jLogger: Slf4jLogger started
16/07/06 23:28:45 INFO Remoting: Starting remoting
16/07/06 23:28:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:60598]
16/07/06 23:28:45 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 60598.
16/07/06 23:28:45 INFO SparkEnv: Registering MapOutputTracker
16/07/06 23:28:45 INFO SparkEnv: Registering BlockManagerMaster
16/07/06 23:28:45 INFO DiskBlockManager: Created local directory at /private/var/folders/9c/h8czn5n53yd69xz45wjhk0fw0000gn/T/blockmgr-5174cef3-29d9-4d2a-a84e-279a0e3d2f83
16/07/06 23:28:45 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/07/06 23:28:45 INFO SparkEnv: Registering OutputCommitCoordinator
16/07/06 23:28:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/07/06 23:28:45 INFO Utils: Successfully started service 'SparkUI' on port 4041.
16/07/06 23:28:45 INFO SparkUI: Started SparkUI at http://10.63.21.172:4041
16/07/06 23:28:45 INFO Executor: Starting executor ID driver on host localhost
16/07/06 23:28:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60599.
16/07/06 23:28:45 INFO NettyBlockTransferService: Server created on 60599
16/07/06 23:28:45 INFO BlockManagerMaster: Trying to register BlockManager
16/07/06 23:28:45 INFO BlockManagerMasterEndpoint: Registering block manager localhost:60599 with 511.1 MB RAM, BlockManagerId(driver, localhost, 60599)
16/07/06 23:28:45 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.10 (default, Dec  1 2015 20:00:13)
SparkContext available as sc, HiveContext available as sqlContext.
16/07/06 23:28:46 INFO HiveContext: Initializing execution hive, version 1.2.1
16/07/06 23:28:46 INFO ClientWrapper: Inspected Hadoop version: 2.6.0
16/07/06 23:28:46 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/07/06 23:28:46 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/07/06 23:28:46 INFO ObjectStore: ObjectStore, initialize called
16/07/06 23:28:46 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/07/06 23:28:46 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/07/06 23:28:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/07/06 23:28:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/07/06 23:28:48 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/07/06 23:28:48 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:48 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:49 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/07/06 23:28:49 INFO ObjectStore: Initialized ObjectStore
16/07/06 23:28:49 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/07/06 23:28:49 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/07/06 23:28:49 INFO HiveMetaStore: Added admin role in metastore
16/07/06 23:28:49 INFO HiveMetaStore: Added public role in metastore
16/07/06 23:28:49 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/07/06 23:28:49 INFO HiveMetaStore: 0: get_all_databases
16/07/06 23:28:49 INFO audit: ugi=julian_qian   ip=unknown-ip-addr  cmd=get_all_databases   
16/07/06 23:28:49 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/07/06 23:28:49 INFO audit: ugi=julian_qian   ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
16/07/06 23:28:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:49 INFO SessionState: Created HDFS directory: /tmp/hive/julian_qian
16/07/06 23:28:49 INFO SessionState: Created local directory: /var/folders/9c/h8czn5n53yd69xz45wjhk0fw0000gn/T/julian_qian
16/07/06 23:28:49 INFO SessionState: Created local directory: /var/folders/9c/h8czn5n53yd69xz45wjhk0fw0000gn/T/f9d3c8e6-6b5d-4a0c-b2cf-a50aa101fb62_resources
16/07/06 23:28:49 INFO SessionState: Created HDFS directory: /tmp/hive/julian_qian/f9d3c8e6-6b5d-4a0c-b2cf-a50aa101fb62
16/07/06 23:28:49 INFO SessionState: Created local directory: /var/folders/9c/h8czn5n53yd69xz45wjhk0fw0000gn/T/julian_qian/f9d3c8e6-6b5d-4a0c-b2cf-a50aa101fb62
16/07/06 23:28:49 INFO SessionState: Created HDFS directory: /tmp/hive/julian_qian/f9d3c8e6-6b5d-4a0c-b2cf-a50aa101fb62/_tmp_space.db
16/07/06 23:28:49 INFO HiveContext: default warehouse location is /user/hive/warehouse
16/07/06 23:28:49 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
16/07/06 23:28:49 INFO ClientWrapper: Inspected Hadoop version: 2.6.0
16/07/06 23:28:49 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/07/06 23:28:50 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/07/06 23:28:50 INFO ObjectStore: ObjectStore, initialize called
16/07/06 23:28:50 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/07/06 23:28:50 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/07/06 23:28:50 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/07/06 23:28:50 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/07/06 23:28:51 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/07/06 23:28:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:52 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:52 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:52 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/07/06 23:28:52 INFO ObjectStore: Initialized ObjectStore
16/07/06 23:28:52 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/07/06 23:28:52 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/07/06 23:28:52 INFO HiveMetaStore: Added admin role in metastore
16/07/06 23:28:52 INFO HiveMetaStore: Added public role in metastore
16/07/06 23:28:52 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/07/06 23:28:52 INFO HiveMetaStore: 0: get_all_databases
16/07/06 23:28:52 INFO audit: ugi=julian_qian   ip=unknown-ip-addr  cmd=get_all_databases   
16/07/06 23:28:53 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/07/06 23:28:53 INFO audit: ugi=julian_qian   ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
16/07/06 23:28:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/07/06 23:28:53 INFO SessionState: Created local directory: /var/folders/9c/h8czn5n53yd69xz45wjhk0fw0000gn/T/77eb618d-61cc-470e-abb4-18d356833efb_resources
16/07/06 23:28:53 INFO SessionState: Created HDFS directory: /tmp/hive/julian_qian/77eb618d-61cc-470e-abb4-18d356833efb
16/07/06 23:28:53 INFO SessionState: Created local directory: /var/folders/9c/h8czn5n53yd69xz45wjhk0fw0000gn/T/julian_qian/77eb618d-61cc-470e-abb4-18d356833efb
16/07/06 23:28:53 INFO SessionState: Created HDFS directory: /tmp/hive/julian_qian/77eb618d-61cc-470e-abb4-18d356833efb/_tmp_space.db

error log:

Traceback (most recent call last):
  File "/Users/julian_qian/PycharmProjects/tensorflow/tfs.py", line 11, in <module>
    x = tfs.block(df, "x")
  File "/Users/julian_qian/etc/work/python/tensorframes/tensorframes-assembly-0.2.3.jar/tensorframes/core.py", line 315, in block
  File "/Users/julian_qian/etc/work/python/tensorframes/tensorframes-assembly-0.2.3.jar/tensorframes/core.py", line 333, in _auto_placeholder
  File "/Users/julian_qian/etc/work/python/tensorframes/tensorframes-assembly-0.2.3.jar/tensorframes/core.py", line 30, in _java_api
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.loadClass.
: java.lang.ClassNotFoundException: org.tensorframes.impl.DebugRowOps
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

Store high-order tensors as 1d vectors

Because Catalyst is optimized for 0d and 1d tensors, all the tensors should be stored this way. Of course, users can still provide arrays of arrays as inputs, but the outputs should be optimized for 1d arrays. This should be the recommended output format for anything above 3d tensors.

This can be done only with a more flexible interpretation of the metadata.

One concern is that the data storage as seen by SQL may differ from the interpretation seen by TensorFrames. On the positive side, this will simplify the low-level operations.

Expected modifications:

  • the default storage layout is row-major (with room left for a potential column-major option)
  • all operations should accept both nested arrays and flattened tensors at ingest
  • all operations should output flattened tensors for tensors >= 2 dimensions -> this is a user-facing change
  • analyze will be the conversion point between flattened and nested representations, with an extra option compact_storage. This option will accept either a single boolean (all numerical types), the letter 'R' (all columns compacted in row order), or a list of column names (only those columns are compacted in row order). A dictionary could be supported later. See the sketch after this list.
  • print_schema will differentiate between tensors stored in 1-d and in n-d form
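
A minimal sketch of what the proposed compact_storage option could look like from Python. This is a proposal, not an existing argument of tfs.analyze, and the column names are made up:

import tensorframes as tfs

# Proposed API (does not exist yet): flatten every numerical column.
df2 = tfs.analyze(df, compact_storage=True)

# Proposed: compact only the listed columns, in row-major order.
df3 = tfs.analyze(df, compact_storage=["image", "weights"])

# print_schema would then show whether a column is stored flattened (1-d)
# or nested (n-d).
tfs.print_schema(df3)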

Preliminary support for sparse vectors

Ideally, a block of MLlib sparse vectors should be represented as a sparse matrix on the input side. No output support is considered for now.

This should be done after supporting mllib's dense format.
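
For illustration only (this conversion is not part of TensorFrames): a block of MLlib sparse vectors and one sparse-matrix form it could take as a TensorFlow SparseTensor. The vector contents are made up:

import tensorflow as tf
from pyspark.mllib.linalg import SparseVector

# A block of two sparse vectors of size 4 (made-up values).
block = [SparseVector(4, {0: 1.0, 3: 2.0}),
         SparseVector(4, {2: 5.0})]

# A COO-style sparse matrix for the block: (row, column) indices and values.
indices = [[row, int(i)] for row, v in enumerate(block) for i in v.indices]
values = [float(x) for v in block for x in v.values]
sp = tf.SparseTensor(indices=indices, values=values,
                     dense_shape=[len(block), 4])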

Experimental catalyst conversions for low-dimension tensors

Directly integrate with the logical plan of Catalyst when working on small dimensions.

For columns containing scalars or vectors, perform an unsafe memory copy to a buffer instead of going through Scala collections, and provide a fallback for higher-order tensors. Higher-order tensors should be flattened, because the Tungsten representation of nested data is not very memory-efficient.

Implementing this for the 4 basic numeric types will duplicate some code, so it is worth investigating a templating library or Scala macros to generate the code for each type.

The benchmark for this task will be the computation of the covariance matrix of a dataset (hopefully, it can get faster than the current computation in Spark).
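
For reference, the quantity the proposed benchmark would compute, written in plain NumPy (illustrative only; the data here is random):

import numpy as np

X = np.random.randn(1000, 4)              # 1000 samples, 4 features
mean = X.mean(axis=0)
cov = np.dot((X - mean).T, (X - mean)) / (X.shape[0] - 1)

# Sanity check against NumPy's built-in estimator.
assert np.allclose(cov, np.cov(X, rowvar=False))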

our buddy, protobuf, is causing problems again

example zeppelin notebook: http://demo.pipeline.io:8080/#/notebook/2BNQNXKH5

getting the following error:

java.lang.VerifyError: class org.tensorflow.framework.AttrValue$Builder overrides final method mergeUnknownFields.(Lcom/google/protobuf/UnknownFieldSet;)Lcom/google/protobuf/GeneratedMessage$Builder;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.tensorflow.framework.AttrValue.toBuilder(AttrValue.java:3085)
    at org.tensorflow.framework.AttrValue.newBuilder(AttrValue.java:3079)
    at org.tensorframes.dsl.DslImpl$ShapeToAttr.toAttr(DslImpl.scala:21)
    at org.tensorframes.dsl.DslImpl$.placeholder(DslImpl.scala:86)
    at org.tensorframes.dsl.DslImpl$.extractPlaceholder(DslImpl.scala:104)
    at org.tensorframes.dsl.package$.block(package.scala:97)
    at org.tensorframes.dsl.DFImplicits$RichDataFrame.block(Implicits.scala:96)
...

spark 1.6.1, TENSORFRAMES_VERSION=0.2.2

all versions of everything can be found here: https://github.com/fluxcapacitor/pipeline/blob/master/Dockerfile

starting zeppelin with the following

# --repositories used to resolve --packages
export SPARK_REPOSITORIES=http://dl.bintray.com/spark-packages/maven,https://oss.sonatype.org/content/repositories/snapshots,https://repository.apache.org/content/groups/snapshots

# --packages used to pass into our Spark jobs
export SPARK_SUBMIT_PACKAGES=tjhunter:tensorframes:$TENSORFRAMES_VERSION-s_2.10,amplab:spark-indexedrdd:$INDEXEDRDD_VERSION,org.apache.spark:spark-streaming-kafka-assembly_2.10:$SPARK_VERSION,org.elasticsearch:elasticsearch-spark_2.10:$SPARK_ELASTICSEARCH_CONNECTOR_VERSION,com.datastax.spark:spark-cassandra-connector_2.10:$SPARK_CASSANDRA_CONNECTOR_VERSION,redis.clients:jedis:$JEDIS_VERSION,com.twitter:algebird-core_2.10:$ALGEBIRD_VERSION,com.databricks:spark-avro_2.10:$SPARK_AVRO_CONNECTOR_VERSION,com.databricks:spark-csv_2.10:$SPARK_CSV_CONNECTOR_VERSION,org.apache.nifi:nifi-spark-receiver:$SPARK_NIFI_CONNECTOR_VERSION,com.madhukaraphatak:java-sizeof_2.10:0.1,com.databricks:spark-xml_2.10:$SPARK_XML_VERSION,edu.stanford.nlp:stanford-corenlp:$STANFORD_CORENLP_VERSION,org.jblas:jblas:$JBLAS_VERSION,graphframes:graphframes:$GRAPHFRAMES_VERSION

# We still need to include a reference to a local stanford-corenlp-$STANFORD_CORENLP_VERSION-models.jar because SparkSubmit doesn't support a classifier in --packages
export SPARK_SUBMIT_JARS=$MYAPPS_HOME/codegen/spark/1.6.1/target/scala-2.10/codegen-spark-1-6-1_2.10-1.0.jar,$MYAPPS_HOME/spark/redis/lib/spark-redis_2.10-$SPARK_REDIS_CONNECTOR_VERSION.jar,$MYSQL_CONNECTOR_JAR,$MYAPPS_HOME/spark/ml/lib/spark-corenlp_2.10-0.1.jar,$MYAPPS_HOME/spark/ml/lib/stanford-corenlp-$STANFORD_CORENLP_VERSION-models.jar,$MYAPPS_HOME/spark/ml/target/scala-2.10/ml_2.10-1.0.jar,$MYAPPS_HOME/spark/sql/target/scala-2.10/sql_2.10-1.0.jar,$MYAPPS_HOME/spark/core/target/scala-2.10/core_2.10-1.0.jar,$MYAPPS_HOME/spark/streaming/target/scala-2.10/streaming_2.10-1.0.jar

Feed dictionaries

Add feed dictionaries to the interface. The current workaround is to create a new graph with different constants at each step, but this requires redoing the analysis each time.

This is useful for more complex algorithms that need to update their parameters at each iteration, but it is unclear if this is going to make too much of a performance difference at this point.
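
A minimal sketch of the workaround described above, assuming a double column named 'x' and a changing scalar parameter (both made up for illustration):

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(x=float(i)) for i in range(10)])

results = []
for scale in [1.0, 2.0, 3.0]:
    with tf.Graph().as_default():
        # The parameter is baked into the graph as a constant, so a fresh
        # graph is built (and the dataframe re-analyzed) at every step.
        df_a = tfs.analyze(df)
        x = tfs.block(df_a, "x")
        z = tf.multiply(x, tf.constant(scale, dtype=tf.float64), name="z")
        results.append(tfs.map_blocks(z, df_a))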

Specify the shapes to the analysis step

If the user has extra information about the shapes of the tensors, there is no need to run deep analysis.

The analyze method will be modified to take an extra argument shapes=<dict>, where dict maps column names (strings) to lists, each list giving the shape of a cell. The limitation of this approach is that it will not infer the sizes of the blocks, even if they are all the same, but this is not a hindrance in practice.

Possible optimization: if there is only one numerical column in the dataframe, the user can directly pass the list of dimensions instead of a dictionary.
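
What the proposed call could look like; this argument does not exist yet, and the column names and shapes below are made up:

import tensorframes as tfs

# Proposed: declare that each cell of 'y' is a vector of size 2 and each
# cell of 'image' is a 28x28 matrix, so the deep scan can be skipped.
df2 = tfs.analyze(df, shapes={"y": [2], "image": [28, 28]})

# Proposed shortcut when there is a single numerical column:
df3 = tfs.analyze(df, shapes=[2])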

Install tensorframes from git without internet

Hello Databricks team,
You guys are doing a great job. I am not raising an issue, I am asking for help.
Please help me with the steps to install tensorframes on PySpark without internet access.
I have tried the following steps.

Try 1:

$SPARK_HOME/bin/pyspark --packages /opt/user1/tensorflow/tensorframes-0.2.3-s_2.10.jar
Error:
Python 3.5.1 |Anaconda 4.0.0 (64-bit)|
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: /opt/user1/tensorflow/tensorframes-0.2.3-s_2.10.jar
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:842)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:840)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.SparkSubmitUtils$.extractMavenCoordinates(SparkSubmit.scala:840)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1003)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/shell.py", line 43, in
sc = SparkContext(pyFiles=add_files)
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/context.py", line 112, in init
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number

Try 2: downloaded the package from git and unzipped it
$SPARK_HOME/bin/pyspark --packages /opt/user1/tensorflow/tensorframes-master

Error:
Python 3.5.1 |Anaconda 4.0.0 (64-bit)|
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: /opt/user1/tensorflow/tensorframes-master
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:842)
at org.apache.spark.deploy.SparkSubmitUtils$$anonfun$extractMavenCoordinates$1.apply(SparkSubmit.scala:840)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.SparkSubmitUtils$.extractMavenCoordinates(SparkSubmit.scala:840)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1003)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/shell.py", line 43, in
sc = SparkContext(pyFiles=add_files)
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/context.py", line 112, in init
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/user1/spark-1.6.1-bin-hadoop2.4/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number

Try 3: converted the zip file to a tar and tried installing tensorframes again with pip:

pip install /opt/user1/tensorflow/tensorframes.tar
Processing /opt/user1/tensorflow/tensorframes.tar
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/opt/user1/anaconda3/lib/python3.5/tokenize.py", line 454, in open
buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-nv05aw17-build/setup.py'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-nv05aw17-build/

Make the scala API more user-friendly

The Scala DSL currently requires importing various internal objects. This should be cleaner:

  • one import for all the transforms that mimic the official python API
  • one import to add all the implicits in scope
