dvgodoy / handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
License: MIT License
Hello
We have tried to install the library on our Linux server, which uses the Anaconda3 repository. It errors out asking for PyQt4 to be installed first. Since PyQt5 is now the standard for both Anaconda3 and Anaconda2, is there a way to install it without having to downgrade PyQt?
I used the code below:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
hdf_filled.cols['Age'].boxplot(ax=axs[2], k=3)
hdf_filled.cols['Fare'].boxplot(ax=axs[3], k=3)
plt.show()
The plots are shown in popup windows rather than inline.
How can I get the plots to render inline in Zeppelin?
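A hedged sketch of one way to get inline rendering in Zeppelin, assuming its python interpreter and the z.show() helper are available (the Agg backend suppresses popup windows):

import matplotlib
matplotlib.use('Agg')  # must run before pyplot is imported
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
z.show(plt)  # Zeppelin renders the current figure inline
plt.close(fig)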
Hi
I installed handyspark on a Google Dataproc cluster, but I am unable to import it.
When I import it using from handyspark import *, I get the error "only named arguments may follow *expression".
Can anyone please guide me on why this is happening?
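This looks like the same Python 2 SyntaxError reported further down: handyspark's plot module uses Python-3-only argument unpacking. As a sketch, a quick check is to confirm which Python version the cluster actually runs:

import sys
print(sys.version)  # handyspark needs Python 3; this SyntaxError is what Python 2 raises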
It's awesome that you're adding plots (databricks/koalas#293) to koalas, @dvgodoy! I've been using handyspark for plotting with Spark dataframes. Will you be continuing the development of handyspark, or would you recommend switching to koalas?
Since the Mahalanobis distances are compared to a critical value using a chi-squared distribution, should this method only be used if the columns are all normally distributed?
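For context, a minimal sketch of the chi-squared cutoff that underlies this comparison, assuming (approximately) multivariate normal columns; the column count and probability here are illustrative:

from scipy.stats import chi2

p = 3                              # number of columns used in the distance
critical = chi2.ppf(0.999, df=p)   # cutoff for the squared Mahalanobis distance
# a row is flagged as an outlier when its squared distance exceeds `critical`;
# if the columns are far from normal, this cutoff loses its probabilistic meaning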
I tried:
hdf.outliers(method='tukey', k=3.)
and I got this error with PySpark 3.0.1:
HANDY EXCEPTION SUMMARY
Location: "<string>"
Line : 3
Function: raise_from
Error : +- Relation[dayOfWeek#73,AIRLINE#74,FLIGHT_NUMBER#75,ORIGIN_AIRPORT#76,DESTINATION_AIRPORT#77,DISTANCE#78,SCHEDULED_TIME#79,plannedDepartTime#80,label#81] parquet
---------------------------------------------------------------------------
HandyException: cannot resolve 'approx_percentile(`FLIGHT_NUMBER`, CAST(0.25BD AS DOUBLE), 100.0BD)' due to data type mismatch: argument 3 requires integral type, however, '100.0BD' is of decimal(4,1) type.; line 1 pos 0;
might be related to #26
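A hedged workaround is to compute the Tukey fences manually with DataFrame.approxQuantile, which avoids the failing approx_percentile SQL call; sdf here stands for the plain Spark DataFrame:

q1, q3 = sdf.approxQuantile('FLIGHT_NUMBER', [0.25, 0.75], 0.01)
iqr = q3 - q1
k = 3.0
lower, upper = q1 - k * iqr, q3 + k * iqr
outliers = sdf.filter((sdf['FLIGHT_NUMBER'] < lower) | (sdf['FLIGHT_NUMBER'] > upper))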
When I call hist() on a pandas column (Series) containing integers, I get a proper histogram where the x axis is divided into bins of value ranges.
When I do the same using a handy DataFrame, I get a categorical histogram.
I dug into the code, and the reason handy acts this way is that the column of integers is not included in the self._continuous group of columns.
hist() uses that continuous list to decide when to fall back to a categorical plot. This is why a hist of integers in handy is not what one would expect from a hist of integers in pandas.
A workaround is to cast the integer column to floats. I think this is a bug (I couldn't find anything about it in the docs).
Here's a quick repro:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from handyspark import *

pdf = pd.DataFrame({'bobo': np.random.randint(0, 100, 5000)})
df = spark.createDataFrame(pdf).withColumn('float_bobo', F.col('bobo').astype('float'))
hdf = df.toHandy()
pdf.bobo.hist()
hdf.cols['bobo'].hist()
hdf.cols['float_bobo'].hist()
I forgot to congratulate you on this great lib, it really is cool!
Itamar
Using a Databricks 7.4 ML cluster, with Spark 3.0.1.
(Also, I think you are using MulticlassMetrics to get the AUC value, but that value seems to be high when one has an LR model. See:
https://stackoverflow.com/questions/60772315/how-to-evaluate-a-classifier-with-apache-spark-2-4-5-and-pyspark-python )
I get this error when trying to do the confusion matrix:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-3318334168327659> in <module>
----> 1 bcm.print_confusion_matrix(.572006)
/databricks/python/lib/python3.7/site-packages/handyspark/extensions/evaluation.py in print_confusion_matrix(self, threshold)
111 confusionMatrix: pd.DataFrame
112 """
--> 113 cm = self.confusionMatrix(threshold).toArray()
114 df = pd.concat([pd.DataFrame(cm)], keys=['Actual'], names=[])
115 df.columns = pd.MultiIndex.from_product([['Predicted'], df.columns])
/databricks/python/lib/python3.7/site-packages/handyspark/extensions/evaluation.py in confusionMatrix(self, threshold)
92 """
93 scoreAndLabels = self.call2('scoreAndLabels').map(lambda t: (float(t[0] > threshold), t[1]))
---> 94 mcm = MulticlassMetrics(scoreAndLabels)
95 return mcm.confusionMatrix()
96
/databricks/spark/python/pyspark/mllib/evaluation.py in __init__(self, predictionAndLabels)
254 sc = predictionAndLabels.ctx
255 sql_ctx = SQLContext.getOrCreate(sc)
--> 256 numCol = len(predictionAndLabels.first())
257 schema = StructType([
258 StructField("prediction", DoubleType(), nullable=False),
/databricks/spark/python/pyspark/rdd.py in first(self)
1491 ValueError: RDD is empty
1492 """
-> 1493 rs = self.take(1)
1494 if rs:
1495 return rs[0]
/databricks/spark/python/pyspark/rdd.py in take(self, num)
1473
1474 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1475 res = self.context.runJob(self, takeUpToNumLeft, p)
1476
1477 items += res
/databricks/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
1225 finally:
1226 os.remove(filename)
-> 1227 sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
1228 return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
1229
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
125 def deco(*a, **kw):
126 try:
--> 127 return f(*a, **kw)
128 except py4j.protocol.Py4JJavaError as e:
129 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2131.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2131.0 (TID 9926, ip-10-172-254-217.us-west-2.compute.internal, executor driver): java.lang.ClassCastException: scala.Tuple3 cannot be cast to scala.Tuple2
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:159)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:150)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:150)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:703)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:479)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:271)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2352)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2371)
at org.apache.spark.api.python.PythonRDD$.collectPartitions(PythonRDD.scala:197)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:217)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple3 cannot be cast to scala.Tuple2
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:159)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:150)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:150)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:703)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:479)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:271)
I also get deprecation warnings when doing the ROC plots.
Really nice work!
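A hedged workaround: build the thresholded confusion matrix with DataFrame operations instead of the scoreAndLabels RDD (the Tuple3-to-Tuple2 cast failure suggests that RDD carries an extra field on Spark 3). Here `predictions` is a hypothetical DataFrame holding the model's probability and label columns, with the threshold taken from the traceback above:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

threshold = 0.572006
preds = predictions.withColumn(
    'pred', (vector_to_array(F.col('probability'))[1] > threshold).cast('double'))
# rows are actual labels, columns are predicted labels
preds.groupBy('label').pivot('pred', [0.0, 1.0]).count().show()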
How do I compute the correct median value of a handyspark dataframe?
I tried to compute the median value of a column through pandas and I get the correct value, but when I compute the median for the same column and the same dataset through handyspark, I get a different value. Any clue as to why this may happen?
Thanks!
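A possible explanation, offered as an assumption: Spark computes percentiles approximately by default, whereas pandas computes them exactly (and interpolates between the two middle values for an even row count). With a relative error of 0.0, approxQuantile on the underlying Spark DataFrame returns an exact quantile; the column name below is illustrative:

# df is the plain Spark DataFrame; 0.0 relative error requests the exact quantile
exact_median = df.approxQuantile('Fare', [0.5], 0.0)[0]

Note that Spark returns an actual data point rather than interpolating, so a small difference from pandas can remain for even row counts.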
I'm trying to use this package in a production environment where pyspark is provided, since I'm submitting my application using spark-submit. HandySpark, however, requires pyspark to be installed, which results in a conflict of Spark versions (my cluster runs Spark 2.3, and HandySpark pulls in Spark 2.4 since that's the latest stable release).
I think this is a common scenario, and it would be better if HandySpark didn't depend on pyspark but simply assumed you'll use it in an environment where pyspark is available, either installed or added to the system path using findspark.
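One hedged workaround until then: install without dependency resolution, and let findspark pick up the cluster's own pyspark:

$ pip install handyspark --no-deps

import findspark
findspark.init()  # makes the cluster's pyspark importable
import handyspark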
I am using PySpark 3.0.1 on Databricks 7.4.
I am getting this error when trying to do a boxplot (histograms work fine). I have tried manually casting DISTANCE both as integer and as double, but both fail:
AnalysisException: cannot resolve 'approx_percentile(`DISTANCE`, CAST(0.25BD AS DOUBLE), 100.0BD)' due to data type mismatch: argument 3 requires integral type, however, '100.0BD' is of decimal(4,1) type.; line 1 pos 0;
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<command-2120656041886569> in <module>
5 hdf.cols["ORIGIN_AIRPORT"].hist(ax=axs[1,0])
6 hdf.cols["DESTINATION_AIRPORT"].hist(ax=axs[1,1])
----> 7 hdf.cols["DISTANCE"].boxplot(ax=axs[2,0])
8 hdf.cols["plannedDepartTime"].boxplot(ax=axs[2,1])
root
|-- dayOfWeek: string (nullable = true)
|-- AIRLINE: string (nullable = true)
|-- FLIGHT_NUMBER: integer (nullable = true)
|-- ORIGIN_AIRPORT: string (nullable = true)
|-- DESTINATION_AIRPORT: string (nullable = true)
|-- DISTANCE: double (nullable = true)
|-- SCHEDULED_TIME: integer (nullable = true)
|-- plannedDepartTime: integer (nullable = true)
|-- label: integer (nullable = true)
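For reference, the failing call appears to be Spark SQL's approx_percentile, whose third (accuracy) argument must be an integral literal on Spark 3.0.1, while the library passes 100.0. A sketch of the call with a valid literal, using a hypothetical df:

quartiles = df.selectExpr(
    "approx_percentile(DISTANCE, array(0.25, 0.5, 0.75), 100) AS q"
).first()['q']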
When I use Python 2, I get the following error:
$ pyspark
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
# some pyspark output omitted
Using Python version 2.7.5 (default, Aug 4 2017 00:39:18)
SparkSession available as 'spark'.
>>> import handyspark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/handyspark/__init__.py", line 1, in <module>
from handyspark.extensions.evaluation import BinaryClassificationMetrics
File "/usr/lib/python2.7/site-packages/handyspark/extensions/__init__.py", line 2, in <module>
from handyspark.extensions.evaluation import BinaryClassificationMetrics
File "/usr/lib/python2.7/site-packages/handyspark/extensions/evaluation.py", line 3, in <module>
from handyspark.plot import roc_curve, pr_curve
File "/usr/lib/python2.7/site-packages/handyspark/plot.py", line 53
splits = np.linspace(*sdf.agg(F.min(col), F.max(col)).rdd.map(tuple).collect()[0], n + 1)
SyntaxError: only named arguments may follow *expression
Python module info:
$ pip list |grep -E "spark"
handyspark (0.2.1a1)
pyspark (2.4.0)
As PEP 3132 says:
Only allow a starred expression as the last item in the exprlist. This would simplify the unpacking code a bit and allow for the starred expression to be assigned an iterator. This behavior was rejected because it would be too surprising.
This error only appears in Python 2.
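For illustration, the offending construct and a Python-2-compatible rewrite (a sketch; bounds and n stand in for the values handyspark computes):

import numpy as np

bounds = [0.0, 10.0]
n = 10
# Python 3 only: a starred expression followed by another positional argument;
# this is the line Python 2 rejects with the SyntaxError above
splits = np.linspace(*bounds, n + 1)
# Python 2 compatible equivalent:
splits = np.linspace(bounds[0], bounds[1], n + 1)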
Is there a way to rotate the x labels in grouped/stratified boxplots? I tried the 'rot' argument from the pandas boxplot, and I also tried assigning ax=axs so I could set the labels afterwards, but neither works for stratified boxplots.
(BTW, thank you for this great tool!!!)
This works (no rotation):
pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(figsize=(16, 5))
This does not work:
fig, axs = plt.subplots(1, 1, figsize=(16, 5))
pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(ax=axs)
Error: NotImplementedError: TransformNode instances can not be copied. Consider using frozen() instead.
This does not work either:
pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(figsize=(16, 5), rot=90)
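A hedged workaround, assuming the stratified boxplot draws on the current matplotlib figure: rotate the tick labels after plotting instead of passing rot:

import matplotlib.pyplot as plt

pattern_time_ranges.stratify(['event_type']).cols['event_time_range'].boxplot(figsize=(16, 5))
plt.xticks(rotation=90)  # rotate labels on whatever axes the plot just drew
plt.show()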
I'm getting this error when trying to get back bins and counts instead of a histogram plot...
module 'handyspark.plot' has no attribute 'stratified_histogram'
but other attributes of the module do exist:
help(handy.plot.histogram)
histogram(sdf, colname, bins=10, categorical=False, ax=None)
The doc pages do list this method in the plot module, however! Is there a recommended way of getting the bins and counts instead of a plot?
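In the meantime, a hedged workaround for getting bins and counts without plotting is PySpark's own RDD histogram; sdf is the plain Spark DataFrame and the column name is illustrative:

bins, counts = sdf.select('Fare').dropna().rdd.map(lambda row: row[0]).histogram(10)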
Thanks for the great work!! Not sure why every other such module converts to pandas by default; that defeats the purpose, IMO. I'm trying it out currently.
I have a question about filtering dataframe rows in Handy: is it currently possible to filter rows based on column values directly, instead of creating a column output and assigning it back as a new column in the dataframe? It would be great to have a direct filter capability in the API, or any workaround that doesn't require the user to use low-level Spark calls for filtering.
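A sketch of a possible workaround, assuming the HandyFrame can be unwrapped to a plain Spark DataFrame (handyspark's notHandy()) and re-wrapped afterwards; the column and threshold are illustrative:

from pyspark.sql import functions as F

filtered = hdf.notHandy().filter(F.col('Fare') > 100).toHandy()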
Can you make a Python 2.6.6 version?
I work with a Cloudera cluster that uses Python 2 and cannot be upgraded.
Really awesome work.
Requesting to add a qcut() method to handy.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.qcut.html
Much appreciated.
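Until then, a hedged approximation of qcut on the underlying Spark DataFrame, combining approxQuantile with Bucketizer (column names are illustrative; duplicate quantile edges would need de-duplication, since Bucketizer requires strictly increasing splits):

from pyspark.ml.feature import Bucketizer

splits = df.approxQuantile('Fare', [0.0, 0.25, 0.5, 0.75, 1.0], 0.01)
bucketizer = Bucketizer(splits=splits, inputCol='Fare', outputCol='Fare_qcut')
df_q = bucketizer.transform(df)  # bucket index plays the role of the quantile bin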
This method is failing because of a TypeError in evaluation.py, line 142.
I believe in lines 141 and 142,
select(scoreCol, labelCol).rdd.map(lambda row: (float(row[scoreCol][1]), float(row[labelCol])))
should change to the following:
select(scoreCol, labelCol).rdd.map(lambda row: (float(row[scoreCol]), float(row[labelCol])))
This is a really useful library, @dvgodoy! I have a question about the options available for the hist() plot. The command hdf.cols['Embarked'].hist(ax=axs[0]) does not accept a lot of the keywords that are usually available with the pandas hist() plot. For example, bins is accepted, but grid=True and range are not.
How do I find out what keyword arguments can be passed to handyspark dataframe plots? I'd greatly appreciate your feedback. Thanks!
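One hedged way to check is to inspect the method's signature at runtime, though keyword arguments swallowed by **kwargs won't show up there:

import inspect

print(inspect.signature(hdf.cols['Embarked'].hist))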