zero-one-group / geni
A Clojure dataframe library that runs on Spark
License: Apache License 2.0
Is it feasible to use a Spark ML Java package like https://graphframes.github.io/graphframes/docs/_site/index.html with Geni? Or is this outside the scope?
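Since Geni dataframes are ordinary Spark Dataset objects, plain Java interop should work in principle. Below is a hypothetical sketch, assuming the graphframes artifact is on the classpath; the GraphFrame constructor and the id/src/dst column conventions come from the GraphFrames docs, not from Geni:
(require '[zero-one.geni.core :as g])
(import 'org.graphframes.GraphFrame)

;; GraphFrames expects a vertex dataframe with an :id column and an
;; edge dataframe with :src and :dst columns.
(def vertices (g/->dataset [{:id "a" :name "Alice"}
                            {:id "b" :name "Bob"}]))
(def edges (g/->dataset [{:src "a" :dst "b" :relationship "follows"}]))

(def graph (GraphFrame. vertices edges))
(-> graph .inDegrees g/show)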
Geni Version: 0.0.38
user=> (require '[zero-one.geni.core.dataset-creation :as g] :reload)
nil
user=> (g/->schema {:coords [{:x :int :y :int}]})
Execution error (IllegalArgumentException) at org.apache.spark.sql.types.DataTypes/createArrayType (DataTypes.java:114).
elementType should not be null.
The expected output, once nested element types are supported, would be:
user=> (g/->schema {:coords [{:x :int :y :int}]})
#object[org.apache.spark.sql.types.StructType 0x5cb6297e "StructType(StructField(coords,ArrayType(StructType(StructField(x,IntegerType,true), StructField(y,IntegerType,true)),true),true))"]
At the moment, array-type supports only the simple val-types listed in data-type->spark-type, e.g. :bool and :string. We can extend array-type to support any Spark SQL DataType, in the same fashion we are already doing in struct-field.
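A rough sketch of that extension, recursing on the element type instead of assuming a simple val-type. The names ->data-type and ->struct-type are illustrative, not Geni's actual internals:
(import '(org.apache.spark.sql.types DataTypes))

(defn ->data-type [x]
  (cond
    (keyword? x) (data-type->spark-type x)  ; e.g. :int, :string
    (map? x)     (->struct-type x)          ; nested struct (hypothetical helper)
    (vector? x)  (DataTypes/createArrayType (->data-type (first x)))
    :else        (throw (ex-info "unsupported data type" {:type x}))))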
A SparseVector has three pieces of information. The current code here:
geni/src/clojure/zero_one/geni/interop.clj
Line 142 in 5b570b5
only collects the values, but we need all three:
(seq (.indices sparse-vector))
(seq (.values sparse-vector))
(.size sparse-vector)
Maybe a map with all three should be returned instead.
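A minimal sketch of returning all three as a map; the function name is hypothetical:
(defn sparse-vector->map [^org.apache.spark.ml.linalg.SparseVector v]
  {:size    (.size v)
   :indices (seq (.indices v))
   :values  (seq (.values v))})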
Info | Value |
---|---|
Operating System | Nixos 21.09 |
Geni Version | 0.0.38 |
JDK | openjdk version 11.0.9 2020-10-20 |
Spark Version | 3.1.0 |
I am working through the cookbook and encountered an error on example 2.5
https://github.com/zero-one-group/geni/blob/develop/docs/cookbook/part_02_selecting_rows_and_columns.md#25-selecting-only-noise-complaints
I entered the following:
(-> complaints
    (g/filter (g/= :complaint-type (g/lit "Noise - Street/Sidewalk")))
    (g/select :complaint-type :borough :created-date :descriptor)
    (g/limit 3)
    g/show)
I receive the following error:
Execution error (NoClassDefFoundError) at org.apache.spark.sql.catalyst.parser.AbstractSQLParser/parse (ParseDriver.scala:90).
Could not initialize class org.apache.spark.sql.catalyst.parser.SqlBaseLexer
Info | Value |
---|---|
Geni Version | 0.0.38 |
Cannot use structs or lists of structs in g/->dataset, for example:
(-> (g/->dataset [{:id "bob"
                   :skills ["javascript" "clojure"]
                   :preference {:remote true}
                   :experience [{:name "CompanyA" :years 2}
                                {:name "CompanyB" :years 1}]}
                  {:id "alice" :skills ["python" "c++"] :experience []}])
    g/show)
leads to
; Execution error (ClassCastException) at org.apache.spark.sql.catalyst.expressions.CastBase/buildCast (Cast.scala:295).
; class clojure.lang.Keyword cannot be cast to class [B (clojure.lang.Keyword is in unnamed module of loader 'app'; [B is in module java.base of loader 'bootstrap')
Running g/print-schema right before g/show above shows an incorrect schema:
root
|-- id: string (nullable = true)
|-- skills: array (nullable = true)
| |-- element: string (containsNull = true)
|-- preference: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: binary (containsNull = true)
|-- experience: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: array (containsNull = true)
| | | |-- element: binary (containsNull = true)
g/show should output:
+-----+---------------------+----------+------------------------------+
|id |skills |preference|experience |
+-----+---------------------+----------+------------------------------+
|bob |[javascript, clojure]|{true} |[{CompanyA, 2}, {CompanyB, 1}]|
|alice|[python, c++] |null |[] |
+-----+---------------------+----------+------------------------------+
g/print-schema should output:
root
|-- id: string (nullable = true)
|-- skills: array (nullable = true)
| |-- element: string (containsNull = true)
|-- preference: struct (nullable = true)
| |-- remote: boolean (nullable = true)
|-- experience: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- years: long (nullable = true)
What needs to change:
- handle map? in infer-spark-type
- at table, transform maps to arrays of values
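A sketch of the first item; the elided cases stand for Geni's existing simple-type handling, and ->struct-type is a hypothetical helper:
(import '(org.apache.spark.sql.types DataTypes))

(defn infer-spark-type [value]
  (cond
    ;; ...existing simple val-type cases elided...
    (map? value)        (->struct-type value)  ; build a StructType from the map
    (sequential? value) (DataTypes/createArrayType (infer-spark-type (first value)))
    :else               DataTypes/BinaryType)) ; the current fallback, per the schema above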
Info | Value |
---|---|
Operating System | MacOS |
Geni Version | 0.3.8 |
JDK | 1.8 |
Spark Version | 3.0.2 |
It seems to be impossible to create a boolean column from all-false values using records->dataset, because the column gets recognized as a null column. Here is a failing test.
(fact "should work for bool columns"
(let [dataset (g/records->dataset
@tr/spark
[{:i 0 :s "A" :b false}
{:i 1 :s "B" :b false}
{:i 2 :s "C" :b false}])]
(instance? Dataset dataset) => true
(g/schema dataset) => (g/->schema {:i :long
:s :string
:b :bool})
(g/collect-vals dataset) => [[0 "A" false]
[1 "B" false]
[2 "C" false]]))
and here is the output.
FAIL On records->dataset - should work for bool columns at (dataset_creation_test.clj:143)
Expected:
#<org.apache.spark.sql.types.StructType@2e83f3f5 StructType(StructField(i,LongType,true), StructField(s,StringType,true), StructField(b,BooleanType,true))>
Actual:
#<org.apache.spark.sql.types.StructType@67b8b180 StructType(StructField(i,LongType,true), StructField(s,StringType,true), StructField(b,NullType,true))>
FAIL On records->dataset - should work for bool columns at (dataset_creation_test.clj:146)
Expected:
[[0 "A" false] [1 "B" false] [2 "C" false]]
Actual:
([0 "A" nil] [1 "B" nil] [2 "C" nil])
Diffs: in [0 2] expected false, was nil
in [1 2] expected false, was nil
in [2 2] expected false, was nil
The same behavior applies to map->dataset and table->dataset. If any of the booleans are true, then the schema is understood correctly.
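A plausible cause (an assumption; I have not confirmed Geni's internals) is type inference that relies on truthiness, so false looks like nil. A sketch of an inference step that keeps false:
;; truthiness-based dispatch loses `false`:
;;   (if value (infer-type value) NullType)  ; false takes the NullType branch
;; dispatching on the type keeps it:
(defn infer-value-type [v]
  (cond
    (boolean? v) :bool    ; matches both true and false
    (nil? v)     nil      ; only genuine nils map to NullType
    (string? v)  :string
    (integer? v) :long))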
This would be super useful, I believe. tech.ml.dataset has very good support for working with Arrow files (memory-mapped and larger than heap). This would nicely bring Geni and tech.ml.dataset together. I currently can convert between them using R...
Geni users would benefit from support for Spark User Defined Functions on dataframes, as documented here. UDFs are very useful for data analysis: from the simple classification of continuous values, to implementing models that operate on rows of values (e.g., modelling the impact on sales as a function of own and competitor price changes), to cleansing data using the values of multiple columns.
Info | Value |
---|---|
Operating System | Linux 5.19.0-38-generic #39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux |
Geni Version | [zero.one/geni "0.0.40"] |
JDK | openjdk version "1.8.0_362"; OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-0ubuntu1~22.04-b09); OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode) |
Spark Version | [org.apache.spark/spark-core_2.12 "3.3.2"] |
org.apache.spark.sql.functions/ltrim supports both 1- and 2-arity calls, but zero-one.geni.core.functions/ltrim only supports the 1-arity call. The same is true for rtrim.
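The 2-arity version could be a thin wrapper along these lines. This is a sketch assuming Geni's ->column coercion helper lives in zero-one.geni.core.column:
(require '[zero-one.geni.core.column :refer [->column]])
(import '(org.apache.spark.sql functions))

(defn ltrim
  ([expr] (functions/ltrim (->column expr)))
  ([expr trim-string] (functions/ltrim (->column expr) trim-string)))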
Let me know if you fancy a PR.
Info | Value |
---|---|
Geni Version | v0.0.38 |
The package [reply "0.4.4"] has a dependency on [commons-fileupload/commons-fileupload "1.3.3"], which contains the critical vulnerability CVE-2016-1000031.
Upgrade reply to 5.0.1.
WARNING: abs already refers to: #'clojure.core/abs in namespace: taoensso.encore, being replaced by: #'taoensso.encore/abs
WARNING: abs already refers to: #'clojure.core/abs in namespace: zero-one.geni.core.functions, being replaced by: #'zero-one.geni.core.functions/abs
WARNING: cat already refers to: #'clojure.core/cat in namespace: net.cgrand.parsley.fold, being replaced by: #'net.cgrand.parsley.fold/cat
WARNING: abs already refers to: #'clojure.core/abs in namespace: tech.v3.datatype.functional, being replaced by: #'tech.v3.datatype.functional/abs
WARNING: infinite? already refers to: #'clojure.core/infinite? in namespace: tech.v3.datatype.functional, being replaced by: #'tech.v3.datatype.functional/infinite?
WARNING: random-uuid already refers to: #'clojure.core/random-uuid in namespace: tech.v3.io.uuid, being replaced by: #'tech.v3.io.uuid/random-uuid
Updating dependencies can resolve some of these warnings, but updating them all may introduce new conflicts.
The second warning can be solved in Geni's code.
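For the second warning, excluding clojure.core/abs in the ns form silences it. A minimal sketch (Geni's real abs coerces its argument via ->column; a plain Column is taken here to keep the example self-contained):
(ns zero-one.geni.core.functions
  (:refer-clojure :exclude [abs])  ; clojure.core/abs arrived in Clojure 1.11
  (:import (org.apache.spark.sql Column functions)))

;; abs now shadows clojure.core/abs without a warning:
(defn abs [^Column col] (functions/abs col))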
What we need to do:
- Upgrade com.taoensso/nippy from 3.1.1 to 3.3.0.
- Upgrade techascent/tech.ml.dataset from 5.21 to 6.101. The upgrade currently emits reflection warnings:
Reflection warning, ham_fisted/api.clj:1144:12 - call to method expireAfterAccess on com.google.common.cache.CacheBuilder can't be resolved (no such method).
Reflection warning, ham_fisted/api.clj:1146:12 - call to method expireAfterWrite can't be resolved (target class is unknown).
Reflection warning, ham_fisted/api.clj:1148:12 - reference to field softValues can't be resolved.
Reflection warning, ham_fisted/api.clj:1150:12 - reference to field weakValues can't be resolved.
Reflection warning, ham_fisted/api.clj:1152:12 - call to method maximumSize can't be resolved (target class is unknown).
Reflection warning, ham_fisted/api.clj:1154:12 - reference to field recordStats can't be resolved.
- Upgrade midje from 1.10.3 to 1.10.9.
- Fix the abs shadowing in src/clojure/zero_one/geni/core.clj and src/clojure/zero_one/geni/core/functions.clj.
- Pin the parsley version at 0.9.3.
Following the minikube guide:
https://github.com/zero-one-group/geni/blame/develop/docs/kubernetes_basic.md
the verification at line 118 fails. It seems that I cannot change the Spark session by calling g/create-spark-session. I am pretty sure that it worked at one point.
Does Geni support reading directly from an HDFS path? Is there something akin to the following?
(def df (read-from-hdfs "/some/path/on/hdfs/to/a/subdir/"))
... where /some/path/on/hdfs/to/a/subdir/ is a path on HDFS that contains many files? Thanks in advance.
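Geni's read functions delegate to Spark's DataFrameReader, which accepts HDFS URIs directly, so something like the following should work. This is a sketch; the namenode host/port and the parquet format are assumptions:
(def df (g/read-parquet! "hdfs://namenode:9000/some/path/on/hdfs/to/a/subdir/"))
;; Spark treats the directory as one multi-file dataset and reads every
;; part file underneath it.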
Will Geni support Spark 3.5's remote connections to clusters via Spark Connect and Databricks Connect?
Creating histograms is a very common activity. Geni offers cut, which supports the creation of histograms as a function of bins, an array of values, but the user has to compute these bins manually. Geni provides qcut to help users determine how wide each bin should be.
It would be helpful to provide a function, (g/histogram :column {:n-bins n}) or (g/histogram :column {:bins-vector bins}), that would either compute the bins automatically when given an :n-bins parameter, or compute the histogram on the basis of the supplied :bins-vector.
Using the form with just the :n-bins argument is very useful for data analysis and review, while being able to provide a :bins-vector addresses the use case where histogram use is informed by business domain needs (e.g., binning populations into age brackets that align with survey methodology).
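A rough sketch of such a helper built on the existing cut. It assumes g/cut takes a column and a seq of bin edges, that g/first-vals returns the first row as a vector, and that g/count on grouped data produces a count column; check the actual signatures before relying on this:
(defn histogram [df col {:keys [n-bins bins-vector]}]
  (let [bins (or bins-vector
                 ;; derive n-bins equal-width edges from the column's min/max
                 (let [[mn mx] (-> df
                                   (g/agg (g/min col) (g/max col))
                                   g/first-vals)
                       width   (/ (- mx mn) n-bins)]
                   (map #(+ mn (* % width)) (range (inc n-bins)))))]
    (-> df
        (g/with-column :bin (g/cut col bins))
        (g/group-by :bin)
        g/count)))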
There is a pre-release of Spark NLP for Spark 3.0.0. This might make it possible to use it with Geni?
https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.0.0-rc8
First, I have to thank you for creating this library! I have been using Geni for about a week now and it is great to have a Clojure interface into Spark dataframes.
It seems like the dataframe reading/writing API doesn't support specifying a schema. Looking at the library source code, it seems like most map entries fall back on the reader's/writer's .option(key, val)
method. This works in any case where the option's value is a String, Boolean, Long, or Double because Spark has overloaded methods for this.
In the case of schemas, we must specify a StructType (or schema string), and thus Spark requires us to call the .schema method instead of .option. It would be great if Geni could treat the :schema key of the reader/writer options as a special case and pass the value to .schema instead of .option.
See the code snippet below as an example.
Info | Value |
---|---|
Operating System | MacOS Catalina |
Geni Version | 0.0.25 |
JDK | 1.8 |
Spark Version | 3.0 |
(require '[zero-one.geni.core :as g])

(g/read-csv! "test/resources/data/tiny_table.csv"
             {:header    true
              :delimiter ","
              :schema    (g/struct-type
                          (g/struct-field :i :long true)
                          (g/struct-field :s :string true))})
;; Execution error (IllegalArgumentException) at zero-one.geni.core.data-sources/configure-reader-or-writer$fn (data_sources.clj:19).
;; No matching method option found taking 2 args for class org.apache.spark.sql.DataFrameReader
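The special-casing could be as small as this. A sketch, assuming configure-reader-or-writer currently folds every option into .option, as the error message suggests:
(defn configure-reader-or-writer [unconfigured options]
  (reduce
   (fn [reader-or-writer [k v]]
     (if (= :schema k)
       (.schema reader-or-writer v)             ; accepts a StructType or a DDL string
       (.option reader-or-writer (name k) v)))
   unconfigured
   options))
;; note: .schema only exists on DataFrameReader, so the writer path would
;; need to ignore or reject the key.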
Info | Value |
---|---|
Operating System | rhel7 and windows |
Geni Version | 0.0.38 |
JDK | 11.0.10 |
Spark Version | 3.1.2 |
If zero-one.geni.core has been required, it creates a default Spark session, which impacts the behavior of calling g/create-spark-session. Specifically, geni.core loads geni.spark-context, which loads geni.defaults, which creates a Spark session in an atom; it should probably be a delay.
(def s (g/create-spark-session {:app-name "foo"}))
(g/spark-conf s)
;; => {... :spark.app.name "Geni app" ...}
;; which is the wrong name
If requiring zero-one.geni.spark directly instead (as g), the Spark session is correctly configured.
The incorrect behaviour takes effect if core is required at any point before creating the session, so it is a bit problematic. As above, maybe replacing the default with a delay will be sufficient to avoid this.
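A sketch of the delay-based default; hypothetical, not the current code:
;; in zero-one.geni.defaults: a delay is only forced on first deref, so
;; merely requiring zero-one.geni.core no longer builds a session
(defonce default-spark
  (delay (g/create-spark-session {:app-name "Geni app"})))

;; call sites would deref lazily, e.g.:
;; (defn default-session [] @default-spark)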
Thanks for your work on this library!
Delta Lake brings a lot of crucial features into the Spark ecosystem. Some of the highlights include:
In some toy projects, I have begun using Delta with Geni using Scala interop. I haven't worked out all the issues yet, but I think I am getting close to something that could be useful to others.
It would be great to get people's thoughts on the best way to manage Delta support. My specific question is: Would it be a good idea to add Delta Lake as an optional dep (like we do with xgboost) and implement the Delta Lake API directly in Geni?
If yes, I can move my code into a PR. If no, I can create a geni compatible Clojure lib specifically for Delta lake.
Some things to consider: Geni uses with-dynamic-import when reaching into optional dependencies, but that is usually for 1 or 2 functions. Adding Delta would create a set of functions mirroring the public API of an entire Scala library. It could be a little awkward to with-dynamic-import so much stuff; my editor has a hard time handling it, but maybe that is just me.
Personally, I think option 1 is best (integrate into Geni), but I don't want to create more work for other people if I am one of the only people who would benefit. :)
Hi Team,
Thanks for this wonderful project - I am really excited to see the integration with Spark 3!
I was trying out the latest 0.0.19 and ran into an import error while trying out the example:
Clojure 1.10.1
(ns spark.geni.core
(:require
[clojure.string]
[zero-one.geni.core :as g]
[zero-one.geni.ml :as ml]))
Syntax error (ClassNotFoundException) compiling at (zero_one\geni\interop.clj:1:1).
org.apache.spark.ml.linalg.DenseVector
Could you please help me out a bit here?
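Not certain, but a ClassNotFoundException on org.apache.spark.ml.linalg.DenseVector usually means the Spark MLlib artifact is missing from the classpath; Geni expects the application to provide the Spark dependencies. A project.clj along these lines may fix it (versions are assumptions matching Spark 3.x):
;; project.clj
:dependencies [[zero.one/geni "0.0.19"]
               [org.apache.spark/spark-core_2.12 "3.0.1"]
               [org.apache.spark/spark-sql_2.12 "3.0.1"]
               [org.apache.spark/spark-mllib_2.12 "3.0.1"]]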
I would like to use Geni against a Spark cluster running in Azure Kubernetes. It seems to be an amazing tool.
The good news is that it did work, but it was very tricky to get started on this. I believe it is mainly a question of missing documentation.
My main questions, which I was struggling with for a while, were:
Understanding, firstly, whether Geni should support this at all. I assumed it does, but as the documentation only showcases local use, I was not sure about it.
When and where to call my own g/create-spark-session. This was very tricky to discover, as shown by
geni/src/clojure/zero_one/geni/main.clj
Line 13 in 671510e
I would propose adding one example document on using Geni with a remote/cloud Spark instance, or maybe even showing the full setup using a local minikube Kubernetes cluster. At least with minikube it is possible to create a step-by-step recipe which can be copy-pasted into a shell.
(I know, Spark on Kubernetes is a very tricky issue by itself.)
Info | Value |
---|---|
Operating System | MacOS Catalina |
Geni Version | 0.0.33 |
JDK | openjdk8, openjdk11 |
Spark Version | 3.0.1 |
When I try to build an uberjar which depends on geni I get an error
Execution error (NoSuchMethodError) at io.netty.channel.SingleThreadEventLoop/<init> (SingleThreadEventLoop.java:65).
io.netty.util.concurrent.SingleThreadEventExecutor.<init>(Lio/netty/util/concurrent/EventExecutorGroup;Ljava/util/concurrent/Executor;ZLjava/util/Queue;Lio/netty/util/concurrent/RejectedExecutionHandler;)V
The repo https://github.com/klausharbo/geni-uberjar-example reproduces the issue on my machine when I do lein uberjar
Info | Value |
---|---|
Operating System | macOS Big Sur 11.2.1 |
Geni Version | 0.0.38 |
Spark Java lib Version | org.apache.spark/spark-XXX_2.12 "3.1.0" |
JDK | openjdk version "1.8.0_282" |
Spark Version | 3.0.2 |
Standard lein new geni …, bitnami/spark 3.0.2 docker, then used code from geni cookbook chapter 4.
The following code from cookbook example 4 fails with ArityException:
(def null-counts
  (-> raw-weather-mar-2012
      (g/agg (->> (g/column-names raw-weather-mar-2012)
                  (map #(vector % (g/null-count %)))
                  (into {})))
      (g/first)))
The exception is:
Execution error (AnalysisException) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt/failAnalysis (package.scala:42).
cannot resolve '`Precip. Amount (mm)`' given input columns: [Climate ID, Date/Time (LST), Day, Dew Point Temp (°C), Dew Point Temp Flag, Hmdx, Hmdx Flag, Latitude (y), Longitude (x), Month, Precip. Amount (mm), Precip. Amount Flag, Rel Hum (%), Rel Hum Flag, Station Name, Stn Press (kPa), Stn Press Flag, Temp (°C), Temp Flag, Time (LST), Visibility (km), Visibility Flag, Weather, Wind Chill, Wind Chill Flag, Wind Dir (10s deg), Wind Dir Flag, Wind Spd (km/h), Wind Spd Flag, Year];
'Aggregate [sum(cast(cast(isnull(Latitude (y)#13272) as int) as bigint)) AS Latitude (y)#16363L, sum(cast(cast(isnull(Stn Press Flag#13295) as int) as bigint)) AS Stn Press Flag#16364L, sum(cast(cast(isnull(Temp (°C)#13280) as int) as bigint)) AS Temp (°C)#16365L, sum(cast(cast(isnull(Wind Spd (km/h)#13290) as int) as bigint)) AS Wind Spd (km/h)#16366L, sum(cast(cast(isnull(Rel Hum Flag#13285) as int) as bigint)) AS Rel Hum Flag#16367L, sum(cast(cast(isnull(Date/Time (LST)#13275) as int) as bigint)) AS Date/Time (LST)#16368L, sum(cast(cast(isnull(Visibility Flag#13293) as int) as bigint)) AS Visibility Flag#16369L, sum(cast(cast(isnull(Visibility (km)#13292) as int) as bigint)) AS Visibility (km)#16370L, sum(cast(isnull('Precip. Amount (mm)) as int)) AS Precip. Amount (mm)#16371, sum(cast(cast(isnull(Dew Point Temp Flag#13283) as int) as bigint)) AS Dew Point Temp Flag#16372L, sum(cast(isnull('Precip. Amount Flag) as int)) AS Precip. Amount Flag#16373, sum(cast(cast(isnull(Station Name#13273) as int) as bigint)) AS Station Name#16374L, sum(cast(cast(isnull(Wind Chill Flag#13299) as int) as bigint)) AS Wind Chill Flag#16375L, sum(cast(cast(isnull(Longitude (x)#13271) as int) as bigint)) AS Longitude (x)#16376L, sum(cast(cast(isnull(Time (LST)#13279) as int) as bigint)) AS Time (LST)#16377L, sum(cast(cast(isnull(Dew Point Temp (°C)#13282) as int) as bigint)) AS Dew Point Temp (°C)#16378L, sum(cast(cast(isnull(Rel Hum (%)#13284) as int) as bigint)) AS Rel Hum (%)#16379L, sum(cast(cast(isnull(Wind Dir Flag#13289) as int) as bigint)) AS Wind Dir Flag#16380L, sum(cast(cast(isnull(Climate ID#13274) as int) as bigint)) AS Climate ID#16381L, sum(cast(cast(isnull(Wind Dir (10s deg)#13288) as int) as bigint)) AS Wind Dir (10s deg)#16382L, sum(cast(cast(isnull(Stn Press (kPa)#13294) as int) as bigint)) AS Stn Press (kPa)#16383L, sum(cast(cast(isnull(Year#13276) as int) as bigint)) AS Year#16384L, sum(cast(cast(isnull(Temp Flag#13281) as int) as bigint)) AS Temp Flag#16385L, sum(cast(cast(isnull(Hmdx#13296) as int) as bigint)) AS Hmdx#16386L, ... 6 more fields]
+- Project [Longitude (x)#13211 AS Longitude (x)#13271, Latitude (y)#13212 AS Latitude (y)#13272, Station Name#13213 AS Station Name#13273, Climate ID#13214 AS Climate ID#13274, Date/Time (LST)#13215 AS Date/Time (LST)#13275, Year#13216 AS Year#13276, Month#13217 AS Month#13277, Day#13218 AS Day#13278, Time (LST)#13219 AS Time (LST)#13279, Temp (°C)#13220 AS Temp (°C)#13280, Temp Flag#13221 AS Temp Flag#13281, Dew Point Temp (°C)#13222 AS Dew Point Temp (°C)#13282, Dew Point Temp Flag#13223 AS Dew Point Temp Flag#13283, Rel Hum (%)#13224 AS Rel Hum (%)#13284, Rel Hum Flag#13225 AS Rel Hum Flag#13285, Precip. Amount (mm)#13226 AS Precip. Amount (mm)#13286, Precip. Amount Flag#13227 AS Precip. Amount Flag#13287, Wind Dir (10s deg)#13228 AS Wind Dir (10s deg)#13288, Wind Dir Flag#13229 AS Wind Dir Flag#13289, Wind Spd (km/h)#13230 AS Wind Spd (km/h)#13290, Wind Spd Flag#13231 AS Wind Spd Flag#13291, Visibility (km)#13232 AS Visibility (km)#13292, Visibility Flag#13233 AS Visibility Flag#13293, Stn Press (kPa)#13234 AS Stn Press (kPa)#13294, ... 6 more fields]
+- Relation[Longitude (x)#13211,Latitude (y)#13212,Station Name#13213,Climate ID#13214,Date/Time (LST)#13215,Year#13216,Month#13217,Day#13218,Time (LST)#13219,Temp (°C)#13220,Temp Flag#13221,Dew Point Temp (°C)#13222,Dew Point Temp Flag#13223,Rel Hum (%)#13224,Rel Hum Flag#13225,Precip. Amount (mm)#13226,Precip. Amount Flag#13227,Wind Dir (10s deg)#13228,Wind Dir Flag#13229,Wind Spd (km/h)#13230,Wind Spd Flag#13231,Visibility (km)#13232,Visibility Flag#13233,Stn Press (kPa)#13234,... 6 more fields] csv
Also, a manual select with the column named "Precip. Amount (mm)" does not work. It seems that these columns have backticks around them internally.
I tried to rename all columns with
(g/to-df weather-data
"Longitude (x)" "Latitude (y)" "Station Name" "Climate ID" "Date/Time (LST)" "Year" "Month" "Day"
"Time (LST)" "Temp (°C)" "Temp Flag" "Dew Point Temp (°C)" "Dew Point Temp Flag" "Rel Hum (%)"
"Rel Hum Flag" "Precip. Amount (mm)" "Precip. Amount Flag" "Wind Dir (10s deg)" "Wind Dir Flag"
"Wind Spd (km/h)" "Wind Spd Flag" "Visibility (km)" "Visibility Flag" "Stn Press (kPa)" "Stn Press Flag"
"Hmdx" "Hmdx Flag" "Wind Chill" "Wind Chill Flag" "Weather"))
but the problem persists, still backticks.
Crashes: (g/select raw-weather-mar-2012 "Precip. Amount (mm)")
Works: (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")
(g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`"))
yields "Precip. Amount (mm)" without backticks.
This led me to believe that there is some issue in Geni or Spark with these column names.
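As a workaround until this is resolved, normalising the column names right after reading may sidestep the backtick problem. A sketch using g/with-column-renamed:
(require '[clojure.string :as str])

(defn sanitise-columns [df]
  (reduce (fn [acc col]
            (g/with-column-renamed acc col (str/replace col #"[^A-Za-z0-9]+" "-")))
          df
          (g/column-names df)))

(def weather (sanitise-columns raw-weather-mar-2012))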
Info | Value |
---|---|
Operating System | Pop_Os 20.10 |
Geni Version | 0.035 |
JDK | OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.20.10) |
Spark Version | 3.0.1 |
Incompatible library:
Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0
Steps:
lein new geni geni-cookbook
cd geni-cookbook && lein run
The log file:
{:clojure.main/message
"Syntax error (JsonMappingException) compiling at (/tmp/form-init3765768439809039245.clj:1:73).\nScala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0\n",
:clojure.main/triage
{:clojure.error/phase :compile-syntax-check,
:clojure.error/line 1,
:clojure.error/column 73,
:clojure.error/source "form-init3765768439809039245.clj",
:clojure.error/path "/tmp/form-init3765768439809039245.clj",
:clojure.error/class
com.fasterxml.jackson.databind.JsonMappingException,
:clojure.error/cause
"Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0"},
:clojure.main/trace
{:via
[{:type clojure.lang.Compiler$CompilerException,
:message
"Syntax error compiling at (/tmp/form-init3765768439809039245.clj:1:73).",
:data
{:clojure.error/phase :compile-syntax-check,
:clojure.error/line 1,
:clojure.error/column 73,
:clojure.error/source "/tmp/form-init3765768439809039245.clj"},
:at [clojure.lang.Compiler load "Compiler.java" 7648]}
{:type java.lang.ExceptionInInitializerError,
:at
[org.apache.spark.sql.execution.SparkPlan
executeQuery
"SparkPlan.scala"
210]}
{:type com.fasterxml.jackson.databind.JsonMappingException,
:message
"Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0",
:at
[com.fasterxml.jackson.module.scala.JacksonModule
setupModule
"JacksonModule.scala"
61]}],
:trace
[[com.fasterxml.jackson.module.scala.JacksonModule
setupModule
"JacksonModule.scala"
61]
[com.fasterxml.jackson.module.scala.JacksonModule
setupModule$
"JacksonModule.scala"
46]
[com.fasterxml.jackson.module.scala.DefaultScalaModule
setupModule
"DefaultScalaModule.scala"
17]
[com.fasterxml.jackson.databind.ObjectMapper
registerModule
"ObjectMapper.java"
816]
[org.apache.spark.rdd.RDDOperationScope$
"RDDOperationScope.scala"
82]
[org.apache.spark.rdd.RDDOperationScope$
"RDDOperationScope.scala"
-1]
[org.apache.spark.sql.execution.SparkPlan
executeQuery
"SparkPlan.scala"
210]
[org.apache.spark.sql.execution.SparkPlan
execute
"SparkPlan.scala"
171]
[org.apache.spark.sql.execution.QueryExecution
toRdd$lzycompute
"QueryExecution.scala"
122]
[org.apache.spark.sql.execution.QueryExecution
toRdd
"QueryExecution.scala"
121]
[org.apache.spark.sql.Dataset rdd$lzycompute "Dataset.scala" 3200]
[org.apache.spark.sql.Dataset rdd "Dataset.scala" 3198]
[org.apache.spark.ml.PredictorParams
extractInstances
"Predictor.scala"
80]
[org.apache.spark.ml.PredictorParams
extractInstances$
"Predictor.scala"
70]
[org.apache.spark.ml.Predictor
extractInstances
"Predictor.scala"
114]
[org.apache.spark.ml.classification.LogisticRegression
$anonfun$train$1
"LogisticRegression.scala"
488]
[org.apache.spark.ml.util.Instrumentation$
$anonfun$instrumented$1
"Instrumentation.scala"
191]
[scala.util.Try$ apply "Try.scala" 213]
[org.apache.spark.ml.util.Instrumentation$
instrumented
"Instrumentation.scala"
191]
[org.apache.spark.ml.classification.LogisticRegression
train
"LogisticRegression.scala"
487]
[org.apache.spark.ml.classification.LogisticRegression
train
"LogisticRegression.scala"
482]
[org.apache.spark.ml.classification.LogisticRegression
train
"LogisticRegression.scala"
281]
[org.apache.spark.ml.Predictor fit "Predictor.scala" 150]
[org.apache.spark.ml.Predictor fit "Predictor.scala" 114]
[org.apache.spark.ml.Pipeline $anonfun$fit$5 "Pipeline.scala" 151]
[org.apache.spark.ml.MLEvents withFitEvent "events.scala" 132]
[org.apache.spark.ml.MLEvents withFitEvent$ "events.scala" 125]
[org.apache.spark.ml.util.Instrumentation
withFitEvent
"Instrumentation.scala"
42]
[org.apache.spark.ml.Pipeline $anonfun$fit$4 "Pipeline.scala" 151]
[scala.collection.Iterator foreach "Iterator.scala" 941]
[scala.collection.Iterator foreach$ "Iterator.scala" 941]
[scala.collection.AbstractIterator foreach "Iterator.scala" 1429]
[scala.collection.IterableViewLike$Transformed
foreach
"IterableViewLike.scala"
47]
[scala.collection.IterableViewLike$Transformed
foreach$
"IterableViewLike.scala"
47]
[scala.collection.SeqViewLike$AbstractTransformed
foreach
"SeqViewLike.scala"
40]
[org.apache.spark.ml.Pipeline $anonfun$fit$2 "Pipeline.scala" 147]
[org.apache.spark.ml.MLEvents withFitEvent "events.scala" 132]
[org.apache.spark.ml.MLEvents withFitEvent$ "events.scala" 125]
[org.apache.spark.ml.util.Instrumentation
withFitEvent
"Instrumentation.scala"
42]
[org.apache.spark.ml.Pipeline $anonfun$fit$1 "Pipeline.scala" 133]
[org.apache.spark.ml.util.Instrumentation$
$anonfun$instrumented$1
"Instrumentation.scala"
191]
[scala.util.Try$ apply "Try.scala" 213]
[org.apache.spark.ml.util.Instrumentation$
instrumented
"Instrumentation.scala"
191]
[org.apache.spark.ml.Pipeline fit "Pipeline.scala" 133]
[jdk.internal.reflect.NativeMethodAccessorImpl
invoke0
"NativeMethodAccessorImpl.java"
-2]
[jdk.internal.reflect.NativeMethodAccessorImpl
invoke
"NativeMethodAccessorImpl.java"
62]
[jdk.internal.reflect.DelegatingMethodAccessorImpl
invoke
"DelegatingMethodAccessorImpl.java"
43]
[java.lang.reflect.Method invoke "Method.java" 566]
[clojure.lang.Reflector invokeMatchingMethod "Reflector.java" 167]
[clojure.lang.Reflector invokeInstanceMethod "Reflector.java" 102]
[zero_one.geni.ml$fit invokeStatic "ml.clj" 164]
[zero_one.geni.ml$fit invoke "ml.clj" 163]
[spark_app.core$_main invokeStatic "core.clj" 46]
[spark_app.core$_main doInvoke "core.clj" 44]
[clojure.lang.RestFn invoke "RestFn.java" 397]
[clojure.lang.Var invoke "Var.java" 380]
[user$eval140 invokeStatic "form-init3765768439809039245.clj" 1]
[user$eval140 invoke "form-init3765768439809039245.clj" 1]
[clojure.lang.Compiler eval "Compiler.java" 7177]
[clojure.lang.Compiler eval "Compiler.java" 7167]
[clojure.lang.Compiler load "Compiler.java" 7636]
[clojure.lang.Compiler loadFile "Compiler.java" 7574]
[clojure.main$load_script invokeStatic "main.clj" 475]
[clojure.main$init_opt invokeStatic "main.clj" 477]
[clojure.main$init_opt invoke "main.clj" 477]
[clojure.main$initialize invokeStatic "main.clj" 508]
[clojure.main$null_opt invokeStatic "main.clj" 542]
[clojure.main$null_opt invoke "main.clj" 539]
[clojure.main$main invokeStatic "main.clj" 664]
[clojure.main$main doInvoke "main.clj" 616]
[clojure.lang.RestFn applyTo "RestFn.java" 137]
[clojure.lang.Var applyTo "Var.java" 705]
[clojure.main main "main.java" 40]],
:cause
"Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0",
:phase :compile-syntax-check}}
Info | Value |
---|---|
Operating System | macOS Catalina |
Geni Version | 0.0.27 |
JDK | 1.8 |
Spark Version | 3.0.1 |
Today I had some time to sit down and get my fork set up for local development. It's been a while since I used lein, and I am new to midje, so I apologize in advance if I ask stupid questions! While following CONTRIBUTING.md, I was able to run make ci with no problems after I allowed the macOS Docker client to see the default directory of temporary files. In case you want to add a note about it in the documentation, I have pasted the initial error message below:
Successfully tagged zeroonetechnology/geni:latest
cp -r . /var/folders/zl/hlg5bvnj1y9753gvjsf0_86r0000gn/T/tmp.nV2X5lAq
docker run --rm -v /var/folders/zl/hlg5bvnj1y9753gvjsf0_86r0000gn/T/tmp.nV2X5lAq:/root/geni -w /root/geni -t zeroonetechnology/geni \
scripts/coverage
docker: Error response from daemon: Mounts denied:
The path /var/folders/zl/hlg5bvnj1y9753gvjsf0_86r0000gn/T/tmp.nV2X5lAq
is not shared from OS X and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> File Sharing.
See https://docs.docker.com/docker-for-mac/osxfs/#namespaces for more info.
The dockerized CI is great and easy to use, but I also attempted to use lein midje directly so that I could filter to facts and have a faster workflow. Is this recommended? I am getting 6 failed tests when running lein midje, but I know these failures are configuration differences on my machine, because the dockerized CI passes. If using midje directly is intended to be supported, then a few notes should be added to the documentation (or the tests tweaked).
Attached to this issue are the saved stdout and stderr produced by midje on the develop branch. I will summarize the issues I encountered below.
(fact "On XGB native" :slow ...) raises an exception, XGBoostError: XGBoostModel training failed, and also bizarrely causes a Clojure compilation error. The location of the compilation error is not the "On XGB native" fact, and yet when I comment out the fact the compilation error goes away. See the full stack trace here.
On JavaSparkContext methods checks (rdd/spark-home) => nil?. My development machine has a spark home set via the environment, and thus (rdd/spark-home) is not nil. Why should spark home be nil? Could this test be equally effective as (rdd/spark-home) => any?
The RDD saving and loading tests fail due to a file-not-found. I suspect the root cause of this issue is macOS's way of handling temporary files. These errors can be resolved by using the java.nio.file temporary-file utilities, which should be usable cross-platform. For example, these tests passed once I modified create-temp-file! to the following:
(import '(java.io File)
        '(java.nio.file Files)
        '(java.nio.file.attribute FileAttribute))

(def -tmp-dir-attr
  (into-array FileAttribute '()))

(defn create-temp-file! [extension]
  (let [temp-dir (.toFile (Files/createTempDirectory "tmp-dir" -tmp-dir-attr))]
    (File/createTempFile "temporary" extension temp-dir)))
There are failures in sql_functions_test.clj which are simply due to timezones. I was able to fix them by adding :spark.sql.session.timeZone "UTC" to the Spark session configuration in defaults.clj, although I don't think that is a good permanent fix. I would recommend creating both the "actual" and "expected" dates from the same Unix timestamp, but using Spark for the "actual" and java.time directly for the "expected". This way, both Spark and Java will be using the default timezone of the machine. For example:
(import '(java.time Instant ZoneId))

(-> (Instant/ofEpochMilli 1)
    (.atZone (ZoneId/systemDefault))
    (.toLocalDate))
Stdout and Stderr output for 2 - 4 can be found here:
geni-test-results.txt
I would love for the project to have a common style guide so that everyone is on the same page. E.g., we could be using different IDEs to edit the code, but in the end we can all agree on one common style to avoid stepping on each other.
Tools like cljfmt make this very simple. It works very well with Leiningen projects via the following plugin:
:plugins [[lein-cljfmt "0.7.0"]]
It can then be run very simply via two commands. To check:
lein cljfmt check
And to fix:
lein cljfmt fix
This can be run as part of the build/CI check.
If you are ok with it, then I can create a PR and you can take a look; for now I don't want to do this unless you are ok with it.
Thanks
One of the use cases I have in mind for Geni, and why I also developed #284, was to use Geni/Spark as a first step to transform arbitrary data into Arrow files (mainly for using them in TMD).
Ideally I would have a CLI tool for this, which does the following operation:
(-> (g/read-xxx! path) ; xxx -> "parquet", "csv", ...
    (g/repartition n)
    (g/collect-as-arrow m dir))
Maybe the geni CLI could become this tool, so it gets run as
"geni repl" -> as now
or alternatively like this:
"geni to-arrow xxxx.csv 10 50000 /tmp"
I would hope that this simple case is enough for most uses; see the sketch after this paragraph. Eventually the transform would need to be extended to allow two more things. The first would require extending #284 to allow writing several Arrow files, partitioned by the groups. I am not sure if this is even possible to do while assuming big data and therefore limited heap space. And to make it very useful, TMD would need multi-file dataset support for Arrow files in some form:
techascent/tech.ml.dataset#145
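A minimal -main for the to-arrow case could look like this. It is a sketch only: g/collect-as-arrow is the function proposed in #284, and the argument order mirrors the example invocation above:
(defn -main [command path n-partitions chunk-size out-dir & _]
  (case command
    "to-arrow" (-> (g/read-csv! path) ; or dispatch on the file extension
                   (g/repartition (Integer/parseInt n-partitions))
                   (g/collect-as-arrow (Integer/parseInt chunk-size) out-dir))
    (println "unknown command:" command)))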
20/10/05 17:44:56 WARN HttpChannel: /static/spark-logo-77x50px-hd.png
java.lang.NoSuchMethodError: 'boolean javax.servlet.http.HttpServletRequest.isAsyncSupported()'
at org.sparkproject.jetty.server.ResourceService.sendData(ResourceService.java:697)
at org.sparkproject.jetty.server.ResourceService.doGet(ResourceService.java:294)
at org.sparkproject.jetty.servlet.DefaultServlet.doGet(DefaultServlet.java:457)
I saw some reports on this somewhere else; it is about conflicting servlet implementations. I got rid of it by removing "reply", for example.
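If reply is indeed the culprit, dropping it from the project's :dependencies is one way out. A hypothetical project.clj tweak; the dependency list is illustrative:
;; project.clj
:dependencies [[zero.one/geni "0.0.38"]
               ;; [reply "0.4.4"]  ; removed: drags in an old javax.servlet
               [org.apache.spark/spark-core_2.12 "3.0.1"]]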
@behrica, that sounds awesome! Would you mind sharing what config options to pass, and we can add it to the docs or the README? 😄
Originally posted by @anthony-khong in #228 (comment)