dvgodoy / handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
License: MIT License
Hello
We have tried to install the library on our Linux server, which uses the Anaconda3 repository. It errors out asking for PyQt4 to be installed first. Since PyQt5 is now the standard for both Anaconda3 and Anaconda2, is there a way to install it without having to downgrade PyQt?
I used the code below:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
hdf_filled.cols['Age'].boxplot(ax=axs[2], k=3)
hdf_filled.cols['Fare'].boxplot(ax=axs[3], k=3)
plt.show()
The plots are shown in popup windows rather than inline.
How can I get the plots to render inline in Zeppelin?
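A hedged sketch of one way to get inline rendering in Zeppelin, assuming its python interpreter and the z.show() helper are available (the Agg backend suppresses popup windows):

import matplotlib
matplotlib.use('Agg')  # must run before pyplot is imported
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
z.show(plt)  # Zeppelin renders the current figure inline
plt.close(fig)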
Hi
I installed handyspark on a Google Dataproc cluster, but I am unable to import it.
When I import it using from handyspark import *, I get the error "only named arguments may follow *expression".
Can anyone please guide me on why this is happening?
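This looks like the same Python 2 SyntaxError reported further down: handyspark's plot module uses Python-3-only argument unpacking. As a sketch, a quick check is to confirm which Python version the cluster actually runs:

import sys
print(sys.version)  # handyspark needs Python 3; this SyntaxError is what Python 2 raises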
It's awesome that you're adding plots (databricks/koalas#293) to koalas, @dvgodoy! I've been using handyspark for plotting with Spark dataframes. Will you be continuing the development of handyspark, or would you recommend switching to koalas?
Since the Mahalanobis distances are compared to a critical value using a chi-squared distribution, should this method only be used if the columns are all normally distributed?
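For context, a minimal sketch of the chi-squared cutoff that underlies this comparison, assuming (approximately) multivariate normal columns; the column count and probability here are illustrative:

from scipy.stats import chi2

p = 3                              # number of columns used in the distance
critical = chi2.ppf(0.999, df=p)   # cutoff for the squared Mahalanobis distance
# a row is flagged as an outlier when its squared distance exceeds `critical`;
# if the columns are far from normal, this cutoff loses its probabilistic meaning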
I tried:
hdf.outliers(method='tukey', k=3.)
and I got this error with PySpark 3.0.1:
HANDY EXCEPTION SUMMARY
Location: "<string>"
Line : 3
Function: raise_from
Error : +- Relation[dayOfWeek#73,AIRLINE#74,FLIGHT_NUMBER#75,ORIGIN_AIRPORT#76,DESTINATION_AIRPORT#77,DISTANCE#78,SCHEDULED_TIME#79,plannedDepartTime#80,label#81] parquet
---------------------------------------------------------------------------
HandyException: cannot resolve 'approx_percentile(`FLIGHT_NUMBER`, CAST(0.25BD AS DOUBLE), 100.0BD)' due to data type mismatch: argument 3 requires integral type, however, '100.0BD' is of decimal(4,1) type.; line 1 pos 0;
might be related to #26
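A hedged workaround is to compute the Tukey fences manually with DataFrame.approxQuantile, which avoids the failing approx_percentile SQL call; sdf here stands for the plain Spark DataFrame:

q1, q3 = sdf.approxQuantile('FLIGHT_NUMBER', [0.25, 0.75], 0.01)
iqr = q3 - q1
k = 3.0
lower, upper = q1 - k * iqr, q3 + k * iqr
outliers = sdf.filter((sdf['FLIGHT_NUMBER'] < lower) | (sdf['FLIGHT_NUMBER'] > upper))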
When I call hist() on a pandas column (Series) containing integers, I get a proper histogram where the x axis is divided into bins of value ranges.
When I do the same using a handy DataFrame, I get a categorical histogram.
I dug into the code, and the reason handy acts this way is that the column of integers is not included in the self._continuous group of columns.
hist() uses that continuous list to decide when to fall back to a categorical plot. This is why a hist of integers in handy is not what one would expect from a hist of integers in pandas.
A workaround is to cast the integer column to floats. I think this is a bug (I couldn't find anything about it in the docs).
Here's a quick repro:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from handyspark import *

pdf = pd.DataFrame({'bobo': np.random.randint(0, 100, 5000)})
df = spark.createDataFrame(pdf).withColumn('float_bobo', F.col('bobo').astype('float'))
hdf = df.toHandy()
pdf.bobo.hist()
hdf.cols['bobo'].hist()
hdf.cols['float_bobo'].hist()
I forgot to congratulate you on this great lib, it really is cool!
Itamar
Using a Databricks 7.4 ML cluster, with Spark 3.0.1.
(Also, I think you are using MulticlassMetrics to get the AUC value, but that value seems to be high when one has an LR model. See:
https://stackoverflow.com/questions/60772315/how-to-evaluate-a-classifier-with-apache-spark-2-4-5-and-pyspark-python )
I get this error when trying to do the confusion matrix:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-3318334168327659> in <module>
----> 1 bcm.print_confusion_matrix(.572006)
/databricks/python/lib/python3.7/site-packages/handyspark/extensions/evaluation.py in print_confusion_matrix(self, threshold)
111 confusionMatrix: pd.DataFrame
112 """
--> 113 cm = self.confusionMatrix(threshold).toArray()
114 df = pd.concat([pd.DataFrame(cm)], keys=['Actual'], names=[])
115 df.columns = pd.MultiIndex.from_product([['Predicted'], df.columns])
/databricks/python/lib/python3.7/site-packages/handyspark/extensions/evaluation.py in confusionMatrix(self, threshold)
92 """
93 scoreAndLabels = self.call2('scoreAndLabels').map(lambda t: (float(t[0] > threshold), t[1]))
---> 94 mcm = MulticlassMetrics(scoreAndLabels)
95 return mcm.confusionMatrix()
96
/databricks/spark/python/pyspark/mllib/evaluation.py in __init__(self, predictionAndLabels)
254 sc = predictionAndLabels.ctx
255 sql_ctx = SQLContext.getOrCreate(sc)
--> 256 numCol = len(predictionAndLabels.first())
257 schema = StructType([
258 StructField("prediction", DoubleType(), nullable=False),
/databricks/spark/python/pyspark/rdd.py in first(self)
1491 ValueError: RDD is empty
1492 """
-> 1493 rs = self.take(1)
1494 if rs:
1495 return rs[0]
/databricks/spark/python/pyspark/rdd.py in take(self, num)
1473
1474 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1475 res = self.context.runJob(self, takeUpToNumLeft, p)
1476
1477 items += res
/databricks/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
1225 finally:
1226 os.remove(filename)
-> 1227 sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
1228 return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
1229
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
125 def deco(*a, **kw):
126 try:
--> 127 return f(*a, **kw)
128 except py4j.protocol.Py4JJavaError as e:
129 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2131.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2131.0 (TID 9926, ip-10-172-254-217.us-west-2.compute.internal, executor driver): java.lang.ClassCastException: scala.Tuple3 cannot be cast to scala.Tuple2
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:159)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:150)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:150)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:703)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:479)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:271)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2352)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2371)
at org.apache.spark.api.python.PythonRDD$.collectPartitions(PythonRDD.scala:197)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:217)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple3 cannot be cast to scala.Tuple2
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:159)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:150)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:150)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:703)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:479)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:271)
I also get deprecation warnings when doing the ROC plots.
Really nice work!
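A hedged workaround: build the thresholded confusion matrix with DataFrame operations instead of the scoreAndLabels RDD (the Tuple3-to-Tuple2 cast failure suggests that RDD carries an extra field on Spark 3). Here `predictions` is a hypothetical DataFrame holding the model's probability and label columns, with the threshold taken from the traceback above:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

threshold = 0.572006
preds = predictions.withColumn(
    'pred', (vector_to_array(F.col('probability'))[1] > threshold).cast('double'))
# rows are actual labels, columns are predicted labels
preds.groupBy('label').pivot('pred', [0.0, 1.0]).count().show()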
How do I compute the correct median value of a handyspark dataframe?
I tried to compute the median value of a column through pandas and I get the correct value, but when I compute the median for the same column and the same dataset through handyspark, I get a different value. Any clue as to why this may happen?
Thanks!
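A possible explanation, offered as an assumption: Spark computes percentiles approximately by default, whereas pandas computes them exactly (and interpolates between the two middle values for an even row count). With a relative error of 0.0, approxQuantile on the underlying Spark DataFrame returns an exact quantile; the column name below is illustrative:

# df is the plain Spark DataFrame; 0.0 relative error requests the exact quantile
exact_median = df.approxQuantile('Fare', [0.5], 0.0)[0]

Note that Spark returns an actual data point rather than interpolating, so a small difference from pandas can remain for even row counts.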
I'm trying to use this package in a production environment where pyspark is provided, since I'm submitting my application using spark-submit. HandySpark, however, requires pyspark to be installed, which results in a conflict of Spark versions (my cluster runs Spark 2.3, and HandySpark pulls in Spark 2.4 since that's the latest stable release).
I think this is a common scenario, and it would be better if HandySpark didn't depend on pyspark but simply assumed you'll use it in an environment where pyspark is available, either installed or added to the system path using findspark.
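One hedged workaround until then: install without dependency resolution, and let findspark pick up the cluster's own pyspark:

$ pip install handyspark --no-deps

import findspark
findspark.init()  # makes the cluster's pyspark importable
import handyspark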
I am using PySpark 3.0.1 on Databricks 7.4.
I am getting this error when trying to do a boxplot (histograms work fine). I have tried manually casting DISTANCE both as integer and as double, but both fail:
AnalysisException: cannot resolve 'approx_percentile(`DISTANCE`, CAST(0.25BD AS DOUBLE), 100.0BD)' due to data type mismatch: argument 3 requires integral type, however, '100.0BD' is of decimal(4,1) type.; line 1 pos 0;
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<command-2120656041886569> in <module>
5 hdf.cols["ORIGIN_AIRPORT"].hist(ax=axs[1,0])
6 hdf.cols["DESTINATION_AIRPORT"].hist(ax=axs[1,1])
----> 7 hdf.cols["DISTANCE"].boxplot(ax=axs[2,0])
8 hdf.cols["plannedDepartTime"].boxplot(ax=axs[2,1])
root
|-- dayOfWeek: string (nullable = true)
|-- AIRLINE: string (nullable = true)
|-- FLIGHT_NUMBER: integer (nullable = true)
|-- ORIGIN_AIRPORT: string (nullable = true)
|-- DESTINATION_AIRPORT: string (nullable = true)
|-- DISTANCE: double (nullable = true)
|-- SCHEDULED_TIME: integer (nullable = true)
|-- plannedDepartTime: integer (nullable = true)
|-- label: integer (nullable = true)
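For reference, the failing call appears to be Spark SQL's approx_percentile, whose third (accuracy) argument must be an integral literal on Spark 3.0.1, while the library passes 100.0. A sketch of the call with a valid literal, using a hypothetical df:

quartiles = df.selectExpr(
    "approx_percentile(DISTANCE, array(0.25, 0.5, 0.75), 100) AS q"
).first()['q']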
When I use Python 2, I get the following error:
$ pyspark
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
# some pyspark output omitted
Using Python version 2.7.5 (default, Aug 4 2017 00:39:18)
SparkSession available as 'spark'.
>>> import handyspark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/handyspark/__init__.py", line 1, in <module>
from handyspark.extensions.evaluation import BinaryClassificationMetrics
File "/usr/lib/python2.7/site-packages/handyspark/extensions/__init__.py", line 2, in <module>
from handyspark.extensions.evaluation import BinaryClassificationMetrics
File "/usr/lib/python2.7/site-packages/handyspark/extensions/evaluation.py", line 3, in <module>
from handyspark.plot import roc_curve, pr_curve
File "/usr/lib/python2.7/site-packages/handyspark/plot.py", line 53
splits = np.linspace(*sdf.agg(F.min(col), F.max(col)).rdd.map(tuple).collect()[0], n + 1)
SyntaxError: only named arguments may follow *expression
Python module info:
$ pip list |grep -E "spark"
handyspark (0.2.1a1)
pyspark (2.4.0)
As PEP 3132 says:
Only allow a starred expression as the last item in the exprlist. This would simplify the unpacking code a bit and allow for the starred expression to be assigned an iterator. This behavior was rejected because it would be too surprising.
This error only appears in Python 2.
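For illustration, the offending construct and a Python-2-compatible rewrite (a sketch; bounds and n stand in for the values handyspark computes):

import numpy as np

bounds = [0.0, 10.0]
n = 10
# Python 3 only: a starred expression followed by another positional argument;
# this is the line Python 2 rejects with the SyntaxError above
splits = np.linspace(*bounds, n + 1)
# Python 2 compatible equivalent:
splits = np.linspace(bounds[0], bounds[1], n + 1)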
Is there a way to rotate the x labels in grouped/stratified boxplots? I tried the 'rot' argument from the pandas boxplot, and I also tried assigning ax=axs so I could set the labels afterwards, but neither works for stratified boxplots.
(BTW, thank you for this great tool!!!)
This works (no rotation):
pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(figsize=(16, 5))
This does not work:
fig, axs = plt.subplots(1, 1, figsize=(16, 5))
pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(ax=axs)
Error: NotImplementedError: TransformNode instances can not be copied. Consider using frozen() instead.
This does not work either:
pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(figsize=(16, 5), rot=90)
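A hedged workaround, assuming the stratified boxplot draws on the current matplotlib figure: rotate the tick labels after plotting instead of passing rot:

import matplotlib.pyplot as plt

pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(figsize=(16, 5))
plt.xticks(rotation=90)  # rotate labels on whatever axes the plot just drew
plt.show()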
I'm getting this error when trying to get back bins and counts instead of a histogram plot...
module 'handyspark.plot' has no attribute 'stratified_histogram'
but other attributes of the module do exist:
help(handy.plot.histogram)
histogram(sdf, colname, bins=10, categorical=False, ax=None)
The doc pages do list this method in the plot module, however! Is there a recommended way of getting the bins and counts instead of a plot?
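In the meantime, a hedged workaround for getting bins and counts without plotting is PySpark's own RDD histogram; sdf is the plain Spark DataFrame and the column name is illustrative:

bins, counts = sdf.select('Fare').dropna().rdd.map(lambda row: row[0]).histogram(10)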
Thanks for the great work!! Not sure why every other such module converts to pandas by default; that defeats the purpose, IMO. I'm trying it out currently.
I have a question about filtering dataframe rows in Handy: is it currently possible to filter rows based on column values directly, instead of creating a column output and assigning it back as a new column in the dataframe? It would be great to have a direct filter capability in the API, or any workaround that doesn't require the user to use low-level Spark calls for filtering.
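A sketch of a possible workaround, assuming the HandyFrame can be unwrapped to a plain Spark DataFrame (handyspark's notHandy()) and re-wrapped afterwards; the column and threshold are illustrative:

from pyspark.sql import functions as F

filtered = hdf.notHandy().filter(F.col('Fare') > 100).toHandy()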
Can you make a Python 2.6.6 version?
I work with a Cloudera cluster that uses Python 2 and cannot be upgraded.
Really awesome work.
Requesting to add a qcut() method to handy.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.qcut.html
Much appreciated.
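Until then, a hedged approximation of qcut on the underlying Spark DataFrame, combining approxQuantile with Bucketizer (column names are illustrative; duplicate quantile edges would need de-duplication, since Bucketizer requires strictly increasing splits):

from pyspark.ml.feature import Bucketizer

splits = df.approxQuantile('Fare', [0.0, 0.25, 0.5, 0.75, 1.0], 0.01)
bucketizer = Bucketizer(splits=splits, inputCol='Fare', outputCol='Fare_qcut')
df_q = bucketizer.transform(df)  # bucket index plays the role of the quantile bin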
This method is failing because of a TypeError in evaluation.py, line 142.
I believe in lines 141 and 142,
select(scoreCol, labelCol).rdd.map(lambda row: (float(row[scoreCol][1]), float(row[labelCol])))
should change to the following:
select(scoreCol, labelCol).rdd.map(lambda row: (float(row[scoreCol]), float(row[labelCol])))
This is a really useful library, @dvgodoy! I have a question about the options available for the hist() plot. The command hdf.cols['Embarked'].hist(ax=axs[0]) does not accept a lot of the keywords that are usually available with the pandas hist() plot. For example, bins is accepted, but grid=True and range are not.
How do I find out what keyword arguments can be passed to handyspark dataframe plots? I'd greatly appreciate your feedback. Thanks!
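One hedged way to check is to inspect the method's signature at runtime, though keyword arguments swallowed by **kwargs won't show up there:

import inspect

print(inspect.signature(hdf.cols['Embarked'].hist))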