zero-one-group / geni

A Clojure dataframe library that runs on Spark

License: Apache License 2.0

Clojure 95.57% Makefile 0.47% Dockerfile 0.26% Shell 0.94% Java 2.76%
big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark

geni's People

Contributors

agilecreativity, anthony-khong, behrica, bhougland18, burinc, erp12, matthewedwarddavidson, pragyanatvade, sfai05, skylee03, waqasaliabbasi


geni's Issues

`->schema` and `create-dataframe` should support fields of struct array

  • I have read through the quick start and installation sections of the README.

Info

Geni Version: 0.0.38

Problem / Steps to reproduce

user=> (require '[zero-one.geni.core.dataset-creation :as g] :reload)
nil
user=> (g/->schema {:coords [{:x :int :y :int}]})
Execution error (IllegalArgumentException) at org.apache.spark.sql.types.DataTypes/createArrayType (DataTypes.java:114).
elementType should not be null.

Expected results

user=> (g/->schema {:coords [{:x :int :y :int}]})
#object[org.apache.spark.sql.types.StructType 0x5cb6297e "StructType(StructField(coords,ArrayType(StructType(StructField(x,IntegerType,true), StructField(y,IntegerType,true)),true),true))"]

Proposed solution

At the moment, array-type supports only the simple value types listed in data-type->spark-type, e.g. :bool and :string.

We can extend array-type to support any Spark SQL DataType, in the same fashion as we already do in struct-field.
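
A rough sketch of what that could look like (illustrative only, not Geni's actual implementation; data-type->spark-type and ->schema are the existing Geni helpers, while ->element-type is a made-up name):

(import '(org.apache.spark.sql.types DataTypes))

(declare ->schema data-type->spark-type)

(defn ->element-type
  "Resolve an element spec to a Spark DataType: keywords go through the
  existing lookup table, maps become StructTypes and vectors become
  nested ArrayTypes."
  [spec]
  (cond
    (keyword? spec) (data-type->spark-type spec)
    (map? spec)     (->schema spec)
    (vector? spec)  (DataTypes/createArrayType (->element-type (first spec)) true)))

(defn array-type [element-spec nullable?]
  (DataTypes/createArrayType (->element-type element-spec) (boolean nullable?)))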

Possible issue with example 2.5 from cookbook

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System Nixos 21.09
Geni Version 0.0.38
JDK openjdk version 11.0.9 2020-10-20
Spark Version 3.1.0

Problem / Steps to reproduce

I am working through the cookbook and encountered an error on example 2.5
https://github.com/zero-one-group/geni/blob/develop/docs/cookbook/part_02_selecting_rows_and_columns.md#25-selecting-only-noise-complaints

I entered in the following:

(-> complaints
    (g/filter (g/= :complaint-type (g/lit "Noise - Street/Sidewalk")))
    (g/select :complaint-type :borough :created-date :descriptor)
    (g/limit 3)
    g/show)   

I receive the following error:

Execution error (NoClassDefFoundError) at org.apache.spark.sql.catalyst.parser.AbstractSQLParser/parse (ParseDriver.scala:90).
Could not initialize class org.apache.spark.sql.catalyst.parser.SqlBaseLexer

Allow structs in g/->dataset

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Geni Version 0.0.38

Problem / Steps to reproduce

Cannot use structs or list of structs in g/->dataset, example:

(-> (g/->dataset [{:id         "bob"
                   :skills     ["javascript" "clojure"]
                   :preference {:remote true}
                   :experience [{:name "CompanyA" :years 2}
                                {:name "CompanyB" :years 1}]}
                  {:id         "alice"
                   :skills     ["python" "c++"]
                   :experience []}])
    g/show)

leads to

; Execution error (ClassCastException) at org.apache.spark.sql.catalyst.expressions.CastBase/buildCast (Cast.scala:295).
; class clojure.lang.Keyword cannot be cast to class [B (clojure.lang.Keyword is in unnamed module of loader 'app'; [B is in module java.base of loader 'bootstrap')

Running g/print-schema right before g/show above shows incorrect schema:

root
 |-- id: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- preference: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: binary (containsNull = true)
 |-- experience: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: binary (containsNull = true)

Expected Result

g/show should output:

+-----+---------------------+----------+------------------------------+
|id   |skills               |preference|experience                    |
+-----+---------------------+----------+------------------------------+
|bob  |[javascript, clojure]|{true}    |[{CompanyA, 2}, {CompanyB, 1}]|
|alice|[python, c++]        |null      |[]                            |
+-----+---------------------+----------+------------------------------+

g/print-schema should output:

root
 |-- id: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- preference: struct (nullable = true)
 |    |-- remote: boolean (nullable = true)
 |-- experience: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- years: long (nullable = true)

Proposed Solution

  1. Check for map? in infer-spark-type, alongside the existing collection branch (see the sketch below):
    (coll? value) (ArrayType. (infer-spark-type (first value)) true)
  2. If any maps exist in the table, transform them to arrays of values when building the rows:
    rows (interop/->java-list (map interop/->spark-row table))
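
A rough sketch of step 1 (hypothetical, not Geni's actual code; infer-value-type stands in for the existing scalar branch):

(import '(org.apache.spark.sql.types DataTypes))

(declare infer-value-type)

(defn infer-spark-type [value]
  (cond
    ;; new branch: maps become StructTypes, one StructField per key
    (map? value) (DataTypes/createStructType
                  (mapv (fn [[k v]]
                          (DataTypes/createStructField (name k) (infer-spark-type v) true))
                        value))
    ;; existing branch: collections become ArrayTypes
    (coll? value) (DataTypes/createArrayType (infer-spark-type (first value)) true)
    ;; existing scalar handling
    :else (infer-value-type value)))

Step 2 would then need the row-building code to turn map values into Spark Rows (with field order matching the inferred StructType) before calling interop/->spark-row.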

Can't create boolean columns of all `false`

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System MacOS
Geni Version 0.3.8
JDK 1.8
Spark Version 3.0.2

Problem / Steps to reproduce

It seems like it is impossible to create a boolean column from all false values using records->dataset, because they get recognized as null columns. Here is a failing test.

(fact "should work for bool columns"
    (let [dataset (g/records->dataset
                    @tr/spark
                    [{:i 0 :s "A" :b false}
                     {:i 1 :s "B" :b false}
                     {:i 2 :s "C" :b false}])]
      (instance? Dataset dataset) => true
      (g/schema dataset) => (g/->schema {:i :long
                                         :s :string
                                         :b :bool})
      (g/collect-vals dataset) => [[0 "A" false]
                                   [1 "B" false]
                                   [2 "C" false]]))

and here is the output.

FAIL On records->dataset - should work for bool columns at (dataset_creation_test.clj:143)
Expected:
#<org.apache.spark.sql.types.StructType@2e83f3f5 StructType(StructField(i,LongType,true), StructField(s,StringType,true), StructField(b,BooleanType,true))>
Actual:
#<org.apache.spark.sql.types.StructType@67b8b180 StructType(StructField(i,LongType,true), StructField(s,StringType,true), StructField(b,NullType,true))>

FAIL On records->dataset - should work for bool columns at (dataset_creation_test.clj:146)
Expected:
[[0 "A" false] [1 "B" false] [2 "C" false]]
Actual:
([0 "A" nil] [1 "B" nil] [2 "C" nil])
Diffs: in [0 2] expected false, was nil
              in [1 2] expected false, was nil
              in [2 2] expected false, was nil

The same behavior applies to map->dataset and table->dataset. If any of the booleans are true, then the schema is understood correctly.
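
For what it's worth, here is a hypothetical illustration (not necessarily Geni's actual code) of how an all-false column can degrade to NullType: if the sample value used for type inference is picked with a truthiness filter, false is skipped just like nil and the column looks empty.

(defn first-non-missing-buggy [values]
  (first (filter identity values)))   ; drops false as well as nil

(defn first-non-missing-fixed [values]
  (first (remove nil? values)))       ; keeps false, drops only nil

(first-non-missing-buggy [false false false])  ;=> nil   -> column inferred as NullType
(first-non-missing-fixed [false false false])  ;=> false -> column inferred as BooleanType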

support for writing arrow files

This would be super useful, I believe.
Tech.ml.dataset has very good support for working with arrow files.

(memory mapped and larger than heap)

This would nicely bridge Geni and tech.ml.dataset.

I can currently convert between them using R...

Support Spark UDF

Geni users would benefit from support for Spark User Defined Functions on dataframes as documented here.

UDFs are very useful for data analysis: from simple classification of continuous values, to implementing models that operate on rows of values (e.g., modelling the impact on sales as a function of own and competitor price changes), to cleansing data using the values of multiple columns.
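
As a rough sketch of what such support might wrap, Spark's Java UDF API can already be reached via interop (this assumes a SparkSession bound to spark and a dataframe df; the price column and the price_band UDF name are made up for illustration):

(import '(org.apache.spark.sql.api.java UDF1)
        '(org.apache.spark.sql.types DataTypes))

;; register a 1-argument UDF returning a string
(-> spark
    .udf
    (.register "price_band"
               (reify UDF1
                 (call [_ price]
                   (if (< (double price) 100.0) "low" "high")))
               DataTypes/StringType))

;; use it through a SQL expression
(-> df
    (.selectExpr (into-array String ["*" "price_band(price) AS band"]))
    .show)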

ltrim/rtrim 2-arity functions are missing

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System Linux 5.19.0-38-generic #39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Geni Version [zero.one/geni "0.0.40"]
JDK openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-0ubuntu1~22.04-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
Spark Version [org.apache.spark/spark-core_2.12 "3.3.2"]

Problem / Steps to reproduce

org.apache.spark.sql.functions/ltrim supports both 1- and 2-arity calls, but zero-one.geni.core.functions/ltrim only supports the 1-arity call. The same is true for rtrim.
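
A possible shape for the missing arity (a sketch only; ->col stands for Geni's usual column-coercion helper, whatever it is called internally):

(import '(org.apache.spark.sql Column functions))

(defn ltrim
  ([expr]
   (functions/ltrim ^Column (->col expr)))
  ([expr trim-string]
   (functions/ltrim ^Column (->col expr) ^String trim-string)))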

Let me know if you fancy a PR.

Upgrade reply to 0.5.1

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Geni Version v0.0.38

Problem

The package [reply "0.4.4"] has a dependency on [commons-fileupload/commons-fileupload "1.3.3"], which contains the critical vulnerability CVE-2016-1000031.

Fix

Upgrade reply to 0.5.1
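
In project.clj terms the change is roughly the following (the version string and surrounding entries are placeholders):

(defproject zero.one/geni "<version>"
  :dependencies [;; ...other dependencies unchanged...
                 [reply "0.5.1"]])   ; previously [reply "0.4.4"]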

Fix `already refers` warnings in Clojure 1.11

WARNING: abs already refers to: #'clojure.core/abs in namespace: taoensso.encore, being replaced by: #'taoensso.encore/abs
WARNING: abs already refers to: #'clojure.core/abs in namespace: zero-one.geni.core.functions, being replaced by: #'zero-one.geni.core.functions/abs
WARNING: cat already refers to: #'clojure.core/cat in namespace: net.cgrand.parsley.fold, being replaced by: #'net.cgrand.parsley.fold/cat
WARNING: abs already refers to: #'clojure.core/abs in namespace: tech.v3.datatype.functional, being replaced by: #'tech.v3.datatype.functional/abs
WARNING: infinite? already refers to: #'clojure.core/infinite? in namespace: tech.v3.datatype.functional, being replaced by: #'tech.v3.datatype.functional/infinite?
WARNING: random-uuid already refers to: #'clojure.core/random-uuid in namespace: tech.v3.io.uuid, being replaced by: #'tech.v3.io.uuid/random-uuid

Updating dependencies can resolve some of these warnings, but updating them all may introduce new conflicts.

The second warning can be solved in Geni's code.

What we need to do:

  • Upgrade dependencies to versions where these warnings are solved
    • com.taoensso/nippy from 3.1.1 to 3.3.0
    • techascent/tech.ml.dataset from 5.21 to 6.101
      • Newer versions of TMD will introduce some new warnings.
        Reflection warning, ham_fisted/api.clj:1144:12 - call to method expireAfterAccess on com.google.common.cache.CacheBuilder can't be resolved (no such method).
        Reflection warning, ham_fisted/api.clj:1146:12 - call to method expireAfterWrite can't be resolved (target class is unknown).
        Reflection warning, ham_fisted/api.clj:1148:12 - reference to field softValues can't be resolved.
        Reflection warning, ham_fisted/api.clj:1150:12 - reference to field weakValues can't be resolved.
        Reflection warning, ham_fisted/api.clj:1152:12 - call to method maximumSize can't be resolved (target class is unknown).
        Reflection warning, ham_fisted/api.clj:1154:12 - reference to field recordStats can't be resolved.
        
    • midje from 1.10.3 to 1.10.9
    • Fixed in #352
  • Exclude abs in src/clojure/zero_one/geni/core.clj
  • Exclude abs in src/clojure/zero_one/geni/core/functions.clj (see the sketch after this list)
  • Wait for some dependencies to solve their own issues
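
A sketch of the ns-level exclusion for the second warning (the actual ns forms in Geni will differ; only the :refer-clojure line is the point here):

(ns zero-one.geni.core.functions
  (:refer-clojure :exclude [abs])
  (:import (org.apache.spark.sql Column functions)))

(defn abs [expr]
  (functions/abs ^Column expr))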

Does geni support reading directly from an HDFS path?

Does geni support reading directly from an HDFS path?

Is there something akin to the following?

(def df (read-from-hdfs "/some/path/on/hdfs/to/a/subdir/"))

... where /some/path/on/hdfs/to/a/subdir/ is a path on hdfs that contains many files?

Thanks in advance.

Add support for the creation of histograms

Creating histograms is a very common activity. Geni offers cut, which supports the creation of histograms from bins (an array of boundary values), but the user has to compute these bins manually.

Geni provides qcut to help users determine how wide each bin should be.

It would be helpful to provide a function, (g/histogram :column {:n-bins :bins-vector}), that would either compute the bins automatically when given an :n-bins parameter, or compute the histogram on the basis of the supplied :bins-vector.

Using the form with just the :n-bins argument is very useful for data analysis and review, while being able to provide a :bins-vector addresses the use case where histogram use is informed by business domain needs (e.g., bin populations into age brackets that align with survey methodology).
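
A rough sketch of what such a function could look like (illustrative only; it assumes g/cut, g/with-column, g/group-by, g/count, g/agg, g/min, g/max and g/collect-vals behave as they do elsewhere in the Geni API):

(defn histogram
  [dataframe col-name {:keys [n-bins bins-vector]}]
  (let [bins (or bins-vector
                 ;; naive equal-width bins derived from the column's min/max
                 (let [[[lo hi]] (g/collect-vals
                                  (g/agg dataframe (g/min col-name) (g/max col-name)))
                       width     (/ (- hi lo) n-bins)]
                   (mapv #(+ lo (* width %)) (range (inc n-bins)))))]
    (-> dataframe
        (g/with-column :bucket (g/cut col-name bins))
        (g/group-by :bucket)
        g/count)))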

Can't specify schema when reading/writing dataframes

  • I have read through the quick start and installation sections of the README.

First, I have to thank you for creating this library! I have been using Geni for about a week now and it is great to have a Clojure interface into Spark dataframes.

It seems like the dataframe reading/writing API doesn't support specifying a schema. Looking at the library source code, it seems like most map entries fall back on the reader's/writer's .option(key, val) method. This works in any case where the option's value is a String, Boolean, Long, or Double because Spark has overloaded methods for this.

In the case of schemas, we must specify a StructType (or schema string) and thus Spark requires us to call the .schema method instead of .option. It would be great if geni could treat the :schema key of the reader/writer options as a special case and pass the value to .schema instead of .option.

See the code snippet below as an example.

Info

Info Value
Operating System MacOS Catalina
Geni Version 0.0.25
JDK 1.8
Spark Version 3.0

Problem / Steps to reproduce

(require '[zero-one.geni.core :as g])

(g/read-csv! "test/resources/data/tiny_table.csv"
             {:header    true
              :delimiter ","
              :schema    (g/struct-type
                           (g/struct-field :i :long true)
                           (g/struct-field :s :string true))})
;; Execution error (IllegalArgumentException) at zero-one.geni.core.data-sources/configure-reader-or-writer$fn (data_sources.clj:19).
;; No matching method option found taking 2 args for class org.apache.spark.sql.DataFrameReader
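
A sketch of the proposed special case (not the actual Geni source; configure-reader-or-writer is the function named in the stack trace above):

(defn configure-reader-or-writer [reader-or-writer options]
  (reduce
   (fn [r [k v]]
     (if (= k :schema)
       (.schema r v)                ; v is a StructType or a DDL schema string
       (.option r (name k) v)))     ; everything else keeps the current behaviour
   reader-or-writer
   options))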

Loading geni.core creates a default spark session

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System rhel7 and windows
Geni Version 0.0.38
JDK 11.0.10
Spark Version 3.1.2

Problem / Steps to reproduce

If zero-one.geni.core has been required, it creates a default spark session, which impacts the behavior of calling g/create-spark-session.

Specifically, geni.core loads geni.spark-context, which loads geni.defaults, which creates a spark session in an atom; this should probably be a delay.

(def s (g/create-spark-session {:app-name "foo"}))
(g/spark-conf s)
; => {… :spark.app.name Geni app …}
; which is the wrong name

If zero-one.geni.spark is required directly instead (as g), the spark session is correctly configured.

The incorrect behaviour takes effect if core is required at any point before creating the session, so it is a bit problematic. As above, maybe replacing the default with a delay would be sufficient to avoid this.
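
A sketch of what the delay-based default could look like (illustrative, not the actual source; the var name default-session is made up):

(ns zero-one.geni.defaults
  (:require [zero-one.geni.spark :as spark]))

(defonce default-session
  (delay (spark/create-spark-session {:app-name "Geni app"})))

;; Call sites that used the atom would deref this delay instead, so merely
;; requiring zero-one.geni.core no longer spins up a session.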

Thanks for your work on this library!

Should Geni have support for Delta Lake or should that be a separate library?

Delta Lake brings a lot of crucial features into the Spark ecosystem. Some of the highlights include:

  1. ACID transactions.
  2. Time travel between data versions.
  3. Safe in-place updates, inserts, and upserts.
  4. Storage optimization.

In some toy projects, I have begun using Delta with Geni using Scala interop. I haven't worked out all the issues yet, but I think I am getting close to something that could be useful to others.

It would be great to get people's thoughts on the best way to manage Delta support. My specific question is: Would it be a good idea to add Delta Lake as an optional dep (like we do with xgboost) and implement the Delta Lake API directly in Geni?

If yes, I can move my code into a PR. If no, I can create a geni compatible Clojure lib specifically for Delta lake.
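
For context, the Scala-interop route looks roughly like this today (a sketch; it assumes the Delta Lake jar is on the classpath, a SparkSession bound to spark created with the Delta SQL extensions enabled, a dataframe df, and an example path):

(import '(org.apache.spark.sql SaveMode))

;; write a dataframe as a Delta table
(-> df
    .write
    (.format "delta")
    (.mode SaveMode/Overwrite)
    (.save "/tmp/events-delta"))

;; read it back
(def events
  (-> spark
      .read
      (.format "delta")
      (.load "/tmp/events-delta")))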

Some things to consider:

  • Delta is technically a separate project from Spark. It has separate versioning and their APIs don't move in sync. Adding Delta to Geni would be another dependency to manage.
  • Geni is already using with-dynamic-import when reaching into optional dependencies, but that is usually for 1 or 2 functions. Adding Delta would create a set of functions to mirror the public API of an entire Scala library. It could be a little awkward to with-dynamic-import so much stuff. My editor has a hard time handling it, but maybe that is just me.
  • The primary way to use Delta is via a new "delta" format for DataFrame readers/writers. Geni provides a nice data-driven way to create readers and writers. If Delta support were provided by a separate library, it wouldn't be possible to create Delta readers/writers using the Geni API, and the Delta lib wouldn't be able to benefit from Geni's utilities for configuring readers/writers.

Personally, I think option 1 is best (integrate into Geni) but I don't want to create more work for other people if I am one of the only people who would benefit. :)

Error while running the example

Hi Team,

Thanks for this wonderful project - I am really excited to see the integration with Spark 3!

I was trying out the latest 0.0.19 and ran into an import error while running the example:

Clojure 1.10.1
(ns spark.geni.core
  (:require
    [clojure.string]
    [zero-one.geni.core :as g]
    [zero-one.geni.ml :as ml]))


Syntax error (ClassNotFoundException) compiling at (zero_one\geni\interop.clj:1:1).
org.apache.spark.ml.linalg.DenseVector

Could you please help me out a bit here?

Documentation for usage with Cloud platforms

I would like to use geni against a Spark cluster running in Azure Kubernetes.
It seems to be an amazing tool.

The good news is that it did work, but it was very tricky to get started.

I believe it is mainly a question of missing documentation.

My main questions, which I struggled with for a while, were:

  1. Understanding, first of all, whether Geni should support this at all. I assumed so, but as the documentation only showcases local use, I was not sure.

  2. Knowing when and where to call my own g/create-spark-session. This was very tricky to work out, because:

  • it gets done "magically" somewhere, lazily
  • it uses a Clojure delay which then effectively acts as a global variable
  • the session cannot practically be changed inside a running REPL once it has been set up
  • the precise setup very likely differs between using the Geni CLI tool and working in a REPL
  • the documentation mentions it, but not very clearly: https://github.com/zero-one-group/geni/blob/develop/docs/spark_session.md . In particular, it does not say where and when this needs to be called and how it relates to the global,
    which is a "future" of a "delay"...

I would propose adding one documentation example of using geni with a remote/cloud Spark instance,
or maybe even showing the full setup using a local minikube Kubernetes cluster.
At least with minikube it is possible to create a "step-by-step" recipe which can be copy-pasted into a shell.

(I know, Spark on Kubernetes is a very tricky issue by itself.)
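
To make the request concrete, the documentation example could be as small as the following (a sketch; the :master and :configs keys and all the values shown are illustrative, not verified against a real cluster):

(require '[zero-one.geni.core :as g])

(def spark
  (g/create-spark-session
   {:app-name "geni-on-k8s"
    :master   "k8s://https://my-cluster:6443"
    :configs  {:spark.executor.instances         "2"
               :spark.kubernetes.container.image "my-registry/spark:3.1.2"}}))

;; As discussed above, this has to run before anything triggers the default
;; session; the resulting session can then be passed explicitly where Geni
;; accepts one.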

Error when building uberjar with geni dependencies

Info

Info Value
Operating System MacOS Catalina
Geni Version 0.0.33
JDK openjdk8, openjdk11
Spark Version 3.0.1

Problem / Steps to reproduce

When I try to build an uberjar which depends on geni I get an error

Execution error (NoSuchMethodError) at io.netty.channel.SingleThreadEventLoop/<init> (SingleThreadEventLoop.java:65).
io.netty.util.concurrent.SingleThreadEventExecutor.<init>(Lio/netty/util/concurrent/EventExecutorGroup;Ljava/util/concurrent/Executor;ZLjava/util/Queue;Lio/netty/util/concurrent/RejectedExecutionHandler;)V

The repo https://github.com/klausharbo/geni-uberjar-example reproduces the issue on my machine when I do lein uberjar

CI badge is out of sync with actual status

We should use

https://github.com/zero-one-group/geni/actions/workflows/continuous-integration.yml/badge.svg?branch=develop


instead of

https://github.com/zero-one-group/geni/workflows/Continuous%20Integration/badge.svg?branch=develop

ArityException with cookbook sample 4

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System macOS Big Sur 11.2.1
Geni Version 0.0.38
Spark Java lib Version org.apache.spark/spark-XXX_2.12 "3.1.0"
JDK openjdk version "1.8.0_282"
Spark Version 3.0.2

Problem / Steps to reproduce

Standard lein new geni …, bitnami/spark 3.0.2 docker, then used code from geni cookbook chapter 4.

The following code from cookbook example 4 fails with ArityException:

(def null-counts
    (-> raw-weather-mar-2012
        (g/agg (->> (g/column-names raw-weather-mar-2012)
                    (map #(vector % (g/null-count %)))
                    (into {})))
        (g/first)))

Exception is

Execution error (AnalysisException) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt/failAnalysis (package.scala:42).
cannot resolve '`Precip. Amount (mm)`' given input columns: [Climate ID, Date/Time (LST), Day, Dew Point Temp (°C), Dew Point Temp Flag, Hmdx, Hmdx Flag, Latitude (y), Longitude (x), Month, Precip. Amount (mm), Precip. Amount Flag, Rel Hum (%), Rel Hum Flag, Station Name, Stn Press (kPa), Stn Press Flag, Temp (°C), Temp Flag, Time (LST), Visibility (km), Visibility Flag, Weather, Wind Chill, Wind Chill Flag, Wind Dir (10s deg), Wind Dir Flag, Wind Spd (km/h), Wind Spd Flag, Year];
'Aggregate [sum(cast(cast(isnull(Latitude (y)#13272) as int) as bigint)) AS Latitude (y)#16363L, sum(cast(cast(isnull(Stn Press Flag#13295) as int) as bigint)) AS Stn Press Flag#16364L, sum(cast(cast(isnull(Temp (°C)#13280) as int) as bigint)) AS Temp (°C)#16365L, sum(cast(cast(isnull(Wind Spd (km/h)#13290) as int) as bigint)) AS Wind Spd (km/h)#16366L, sum(cast(cast(isnull(Rel Hum Flag#13285) as int) as bigint)) AS Rel Hum Flag#16367L, sum(cast(cast(isnull(Date/Time (LST)#13275) as int) as bigint)) AS Date/Time (LST)#16368L, sum(cast(cast(isnull(Visibility Flag#13293) as int) as bigint)) AS Visibility Flag#16369L, sum(cast(cast(isnull(Visibility (km)#13292) as int) as bigint)) AS Visibility (km)#16370L, sum(cast(isnull('Precip. Amount (mm)) as int)) AS Precip. Amount (mm)#16371, sum(cast(cast(isnull(Dew Point Temp Flag#13283) as int) as bigint)) AS Dew Point Temp Flag#16372L, sum(cast(isnull('Precip. Amount Flag) as int)) AS Precip. Amount Flag#16373, sum(cast(cast(isnull(Station Name#13273) as int) as bigint)) AS Station Name#16374L, sum(cast(cast(isnull(Wind Chill Flag#13299) as int) as bigint)) AS Wind Chill Flag#16375L, sum(cast(cast(isnull(Longitude (x)#13271) as int) as bigint)) AS Longitude (x)#16376L, sum(cast(cast(isnull(Time (LST)#13279) as int) as bigint)) AS Time (LST)#16377L, sum(cast(cast(isnull(Dew Point Temp (°C)#13282) as int) as bigint)) AS Dew Point Temp (°C)#16378L, sum(cast(cast(isnull(Rel Hum (%)#13284) as int) as bigint)) AS Rel Hum (%)#16379L, sum(cast(cast(isnull(Wind Dir Flag#13289) as int) as bigint)) AS Wind Dir Flag#16380L, sum(cast(cast(isnull(Climate ID#13274) as int) as bigint)) AS Climate ID#16381L, sum(cast(cast(isnull(Wind Dir (10s deg)#13288) as int) as bigint)) AS Wind Dir (10s deg)#16382L, sum(cast(cast(isnull(Stn Press (kPa)#13294) as int) as bigint)) AS Stn Press (kPa)#16383L, sum(cast(cast(isnull(Year#13276) as int) as bigint)) AS Year#16384L, sum(cast(cast(isnull(Temp Flag#13281) as int) as bigint)) AS Temp Flag#16385L, sum(cast(cast(isnull(Hmdx#13296) as int) as bigint)) AS Hmdx#16386L, ... 6 more fields]
+- Project [Longitude (x)#13211 AS Longitude (x)#13271, Latitude (y)#13212 AS Latitude (y)#13272, Station Name#13213 AS Station Name#13273, Climate ID#13214 AS Climate ID#13274, Date/Time (LST)#13215 AS Date/Time (LST)#13275, Year#13216 AS Year#13276, Month#13217 AS Month#13277, Day#13218 AS Day#13278, Time (LST)#13219 AS Time (LST)#13279, Temp (°C)#13220 AS Temp (°C)#13280, Temp Flag#13221 AS Temp Flag#13281, Dew Point Temp (°C)#13222 AS Dew Point Temp (°C)#13282, Dew Point Temp Flag#13223 AS Dew Point Temp Flag#13283, Rel Hum (%)#13224 AS Rel Hum (%)#13284, Rel Hum Flag#13225 AS Rel Hum Flag#13285, Precip. Amount (mm)#13226 AS Precip. Amount (mm)#13286, Precip. Amount Flag#13227 AS Precip. Amount Flag#13287, Wind Dir (10s deg)#13228 AS Wind Dir (10s deg)#13288, Wind Dir Flag#13229 AS Wind Dir Flag#13289, Wind Spd (km/h)#13230 AS Wind Spd (km/h)#13290, Wind Spd Flag#13231 AS Wind Spd Flag#13291, Visibility (km)#13232 AS Visibility (km)#13292, Visibility Flag#13233 AS Visibility Flag#13293, Stn Press (kPa)#13234 AS Stn Press (kPa)#13294, ... 6 more fields]
   +- Relation[Longitude (x)#13211,Latitude (y)#13212,Station Name#13213,Climate ID#13214,Date/Time (LST)#13215,Year#13216,Month#13217,Day#13218,Time (LST)#13219,Temp (°C)#13220,Temp Flag#13221,Dew Point Temp (°C)#13222,Dew Point Temp Flag#13223,Rel Hum (%)#13224,Rel Hum Flag#13225,Precip. Amount (mm)#13226,Precip. Amount Flag#13227,Wind Dir (10s deg)#13228,Wind Dir Flag#13229,Wind Spd (km/h)#13230,Wind Spd Flag#13231,Visibility (km)#13232,Visibility Flag#13233,Stn Press (kPa)#13234,... 6 more fields] csv

A manual select with the column named "Precip. Amount (mm)" does not work either. It seems that the names have backticks around them internally.

I tried to rename all columns with

(g/to-df weather-data
               "Longitude (x)" "Latitude (y)" "Station Name" "Climate ID" "Date/Time (LST)" "Year" "Month" "Day"
               "Time (LST)" "Temp (°C)" "Temp Flag" "Dew Point Temp (°C)" "Dew Point Temp Flag" "Rel Hum (%)"
               "Rel Hum Flag" "Precip. Amount (mm)" "Precip. Amount Flag" "Wind Dir (10s deg)" "Wind Dir Flag"
               "Wind Spd (km/h)" "Wind Spd Flag" "Visibility (km)" "Visibility Flag" "Stn Press (kPa)" "Stn Press Flag"
               "Hmdx" "Hmdx Flag" "Wind Chill" "Wind Chill Flag" "Weather"))

but the problem persists, still backticks.
Crashes: (g/select raw-weather-mar-2012 "Precip. Amount (mm)")
Works: (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")

(g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`"))
yields "Precip. Amount (mm)" without backticks.

This led me to believe that there is some issue in geni or spark with these column names.

Can't run example

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System Pop_Os 20.10
Geni Version 0.035
JDK OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.20.10)
Spark Version 3.0.1

Problem / Steps to reproduce

Incompatible library:

Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0

Steps:

lein new geni geni-cookbook
cd geni-cookbook && lein run

The log file:

{:clojure.main/message
"Syntax error (JsonMappingException) compiling at (/tmp/form-init3765768439809039245.clj:1:73).\nScala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0\n",
:clojure.main/triage
{:clojure.error/phase :compile-syntax-check,
:clojure.error/line 1,
:clojure.error/column 73,
:clojure.error/source "form-init3765768439809039245.clj",
:clojure.error/path "/tmp/form-init3765768439809039245.clj",
:clojure.error/class
com.fasterxml.jackson.databind.JsonMappingException,
:clojure.error/cause
"Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0"},
:clojure.main/trace
{:via
[{:type clojure.lang.Compiler$CompilerException,
:message
"Syntax error compiling at (/tmp/form-init3765768439809039245.clj:1:73).",
:data
{:clojure.error/phase :compile-syntax-check,
:clojure.error/line 1,
:clojure.error/column 73,
:clojure.error/source "/tmp/form-init3765768439809039245.clj"},
:at [clojure.lang.Compiler load "Compiler.java" 7648]}
{:type java.lang.ExceptionInInitializerError,
:at
[org.apache.spark.sql.execution.SparkPlan
executeQuery
"SparkPlan.scala"
210]}
{:type com.fasterxml.jackson.databind.JsonMappingException,
:message
"Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0",
:at
[com.fasterxml.jackson.module.scala.JacksonModule
setupModule
"JacksonModule.scala"
61]}],
:trace
[[com.fasterxml.jackson.module.scala.JacksonModule
setupModule
"JacksonModule.scala"
61]
[com.fasterxml.jackson.module.scala.JacksonModule
setupModule$
"JacksonModule.scala"
46]
[com.fasterxml.jackson.module.scala.DefaultScalaModule
setupModule
"DefaultScalaModule.scala"
17]
[com.fasterxml.jackson.databind.ObjectMapper
registerModule
"ObjectMapper.java"
816]
[org.apache.spark.rdd.RDDOperationScope$

"RDDOperationScope.scala"
82]
[org.apache.spark.rdd.RDDOperationScope$

"RDDOperationScope.scala"
-1]
[org.apache.spark.sql.execution.SparkPlan
executeQuery
"SparkPlan.scala"
210]
[org.apache.spark.sql.execution.SparkPlan
execute
"SparkPlan.scala"
171]
[org.apache.spark.sql.execution.QueryExecution
toRdd$lzycompute
"QueryExecution.scala"
122]
[org.apache.spark.sql.execution.QueryExecution
toRdd
"QueryExecution.scala"
121]
[org.apache.spark.sql.Dataset rdd$lzycompute "Dataset.scala" 3200]
[org.apache.spark.sql.Dataset rdd "Dataset.scala" 3198]
[org.apache.spark.ml.PredictorParams
extractInstances
"Predictor.scala"
80]
[org.apache.spark.ml.PredictorParams
extractInstances$
"Predictor.scala"
70]
[org.apache.spark.ml.Predictor
extractInstances
"Predictor.scala"
114]
[org.apache.spark.ml.classification.LogisticRegression
$anonfun$train$1
"LogisticRegression.scala"
488]
[org.apache.spark.ml.util.Instrumentation$
$anonfun$instrumented$1
"Instrumentation.scala"
191]
[scala.util.Try$ apply "Try.scala" 213]
[org.apache.spark.ml.util.Instrumentation$
instrumented
"Instrumentation.scala"
191]
[org.apache.spark.ml.classification.LogisticRegression
train
"LogisticRegression.scala"
487]
[org.apache.spark.ml.classification.LogisticRegression
train
"LogisticRegression.scala"
482]
[org.apache.spark.ml.classification.LogisticRegression
train
"LogisticRegression.scala"
281]
[org.apache.spark.ml.Predictor fit "Predictor.scala" 150]
[org.apache.spark.ml.Predictor fit "Predictor.scala" 114]
[org.apache.spark.ml.Pipeline $anonfun$fit$5 "Pipeline.scala" 151]
[org.apache.spark.ml.MLEvents withFitEvent "events.scala" 132]
[org.apache.spark.ml.MLEvents withFitEvent$ "events.scala" 125]
[org.apache.spark.ml.util.Instrumentation
withFitEvent
"Instrumentation.scala"
42]
[org.apache.spark.ml.Pipeline $anonfun$fit$4 "Pipeline.scala" 151]
[scala.collection.Iterator foreach "Iterator.scala" 941]
[scala.collection.Iterator foreach$ "Iterator.scala" 941]
[scala.collection.AbstractIterator foreach "Iterator.scala" 1429]
[scala.collection.IterableViewLike$Transformed
foreach
"IterableViewLike.scala"
47]
[scala.collection.IterableViewLike$Transformed
foreach$
"IterableViewLike.scala"
47]
[scala.collection.SeqViewLike$AbstractTransformed
foreach
"SeqViewLike.scala"
40]
[org.apache.spark.ml.Pipeline $anonfun$fit$2 "Pipeline.scala" 147]
[org.apache.spark.ml.MLEvents withFitEvent "events.scala" 132]
[org.apache.spark.ml.MLEvents withFitEvent$ "events.scala" 125]
[org.apache.spark.ml.util.Instrumentation
withFitEvent
"Instrumentation.scala"
42]
[org.apache.spark.ml.Pipeline $anonfun$fit$1 "Pipeline.scala" 133]
[org.apache.spark.ml.util.Instrumentation$
$anonfun$instrumented$1
"Instrumentation.scala"
191]
[scala.util.Try$ apply "Try.scala" 213]
[org.apache.spark.ml.util.Instrumentation$
instrumented
"Instrumentation.scala"
191]
[org.apache.spark.ml.Pipeline fit "Pipeline.scala" 133]
[jdk.internal.reflect.NativeMethodAccessorImpl
invoke0
"NativeMethodAccessorImpl.java"
-2]
[jdk.internal.reflect.NativeMethodAccessorImpl
invoke
"NativeMethodAccessorImpl.java"
62]
[jdk.internal.reflect.DelegatingMethodAccessorImpl
invoke
"DelegatingMethodAccessorImpl.java"
43]
[java.lang.reflect.Method invoke "Method.java" 566]
[clojure.lang.Reflector invokeMatchingMethod "Reflector.java" 167]
[clojure.lang.Reflector invokeInstanceMethod "Reflector.java" 102]
[zero_one.geni.ml$fit invokeStatic "ml.clj" 164]
[zero_one.geni.ml$fit invoke "ml.clj" 163]
[spark_app.core$_main invokeStatic "core.clj" 46]
[spark_app.core$_main doInvoke "core.clj" 44]
[clojure.lang.RestFn invoke "RestFn.java" 397]
[clojure.lang.Var invoke "Var.java" 380]
[user$eval140 invokeStatic "form-init3765768439809039245.clj" 1]
[user$eval140 invoke "form-init3765768439809039245.clj" 1]
[clojure.lang.Compiler eval "Compiler.java" 7177]
[clojure.lang.Compiler eval "Compiler.java" 7167]
[clojure.lang.Compiler load "Compiler.java" 7636]
[clojure.lang.Compiler loadFile "Compiler.java" 7574]
[clojure.main$load_script invokeStatic "main.clj" 475]
[clojure.main$init_opt invokeStatic "main.clj" 477]
[clojure.main$init_opt invoke "main.clj" 477]
[clojure.main$initialize invokeStatic "main.clj" 508]
[clojure.main$null_opt invokeStatic "main.clj" 542]
[clojure.main$null_opt invoke "main.clj" 539]
[clojure.main$main invokeStatic "main.clj" 664]
[clojure.main$main doInvoke "main.clj" 616]
[clojure.lang.RestFn applyTo "RestFn.java" 137]
[clojure.lang.Var applyTo "Var.java" 705]
[clojure.main main "main.java" 40]],
:cause
"Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0",
:phase :compile-syntax-check}}

Test failures when calling midje outside of dockerized CI

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System macOS Catalina
Geni Version 0.0.27
JDK 1.8
Spark Version 3.0.1

Problem / Steps to reproduce

Today I had some time to sit down and get my fork set up for local development. It's been a while since I used lein and I am new to midje, so I apologize in advance if I ask stupid questions! While following CONTRIBUTING.md, I was able to run make ci with no problems after I allowed the MacOS docker client to see the default directory of temporary files. In case you want to add a note about it in the documentation, I have pasted the initial error message below:

Successfully tagged zeroonetechnology/geni:latest
cp -r . /var/folders/zl/hlg5bvnj1y9753gvjsf0_86r0000gn/T/tmp.nV2X5lAq
docker run --rm -v /var/folders/zl/hlg5bvnj1y9753gvjsf0_86r0000gn/T/tmp.nV2X5lAq:/root/geni -w /root/geni -t zeroonetechnology/geni \
                scripts/coverage
docker: Error response from daemon: Mounts denied: 
The path /var/folders/zl/hlg5bvnj1y9753gvjsf0_86r0000gn/T/tmp.nV2X5lAq
is not shared from OS X and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> File Sharing.
See https://docs.docker.com/docker-for-mac/osxfs/#namespaces for more info.

The dockerized CI is great and easy to use, but I also attempted to use lein midje directly so that I could filter to facts and have a faster workflow. Is this recommended? I am getting 6 failed tests when running lein midje, but I know these failures are due to configuration differences on my machine, because the dockerized CI passes. If using midje directly is intended to be supported, then a few notes should be added to the documentation (or the tests tweaked).

Attached to this issue are the saved stdout and stderr produced by midje on the develop branch. I will summarize the issues I encountered below.

  1. (fact "On XGB native" :slow ...) Raises and exception XGBoostError: XGBoostModel training failed and also bizarely causes a Clojure compilation error. The location of the compilation error is not the "On XBG native" fact, and yet when I comment out the fact the compilation error goes away. See the full stack trace here.

  2. On JavaSparkContext methods checks (rdd/spark-home) => nil?. My development machine has a spark home set via an environment variable, and thus (rdd/spark-home) is not nil. Why should spark home be nil? Could this test be equally effective as (rdd/spark-home) => any?

  3. The RDD saving and loading tests fail due to a file-not-found error. I suspect the root cause is MacOS's way of handling temporary files. These errors can be resolved by using the java.nio.file temporary-file utilities, which should be usable cross-platform. For example, these tests passed once I modified create-temp-file! to the following.

;; requires (:import (java.io File)
;;                   (java.nio.file Files)
;;                   (java.nio.file.attribute FileAttribute))
(def -tmp-dir-attr
  (into-array FileAttribute '()))

(defn create-temp-file! [extension]
  (let [temp-dir (.toFile (Files/createTempDirectory "tmp-dir" -tmp-dir-attr))]
    (File/createTempFile "temporary" extension temp-dir)))

  4. There are 2 date/time failures in sql_functions_test.clj which are simply due to timezones. I was able to fix them by adding :spark.sql.session.timeZone "UTC" to the spark session configuration in defaults.clj, although I don't think that is a good permanent fix.

I would recommend creating both the "actual" and "expected" dates from the same unix timestamp, but use Spark for "actual" and use java.time directly for the "expected". This way, both Spark and Java will be using the default timezone of the machine. For example:

;; requires (:import (java.time Instant ZoneId))
(-> (Instant/ofEpochMilli 1)
    (.atZone (ZoneId/systemDefault))
    (.toLocalDate))

Stdout and Stderr output for 2 - 4 can be found here:
geni-test-results.txt

Add common style guide to format the code so that we could share.

Hi @anthony-khong

I would love for the project to have a common style guide so that everyone is on the same page.

E.g. we may be using different IDEs to edit the code, but in the end we can all agree on one common style to avoid stepping on each other.
Tools like cljfmt make this very simple.

It works very well with a Leiningen project via the following plugin:

:plugins [[lein-cljfmt "0.7.0"]]

It can be run very simply via two commands. To check:

lein cljfmt check

And it can fix formatting using something like:

lein cljfmt fix

This can be run as part of the build/CI-CD check.
If you are OK with it, then I can create a PR and you can take a look.
For now I don't want to do this unless you are OK with it.

Thanks

extend geni cli to "transform" data to arrow ?

One of the use cases I have in mind for "geni", and why I also developed #284, was to use geni/spark as a first step to transform "arbitrary data" into arrow files (mainly for using them in TMD).

Ideally I would have a cli tool for this, which does the following operation:

(->
 (g/read-xxx! path)        ; xxx -> "parquet" or "csv" or ...
 (g/repartition n)
 (g/collect-as-arrow m dir))

Maybe "geni" cli could become this tool.

So it gets run as
"geni repl" -> as now

or alternatively like this:

"geni to-arrow  xxxx.csv   10 50000 /tmp  

I would hope that this "simple" case is enough for most cases. Eventually the "transform" needs to be extended to allow 2 more things:

  • specify group-by columns and write arrow files partitioned
  • specify arbitrary "filter" criteria to shrink the data

The first would require extending #284 to allow writing several arrow files, partitioned by the groups.
I am not sure if this is even possible to do, assuming big data and therefore "limited heap space".

And to make it really useful, TMD needs to have "multi-file dataset support" for arrow files in some form:
techascent/tech.ml.dataset#145

Spark UI has css errors

[screenshot of the Spark UI]

20/10/05 17:44:56 WARN HttpChannel: /static/spark-logo-77x50px-hd.png
java.lang.NoSuchMethodError: 'boolean javax.servlet.http.HttpServletRequest.isAsyncSupported()'
at org.sparkproject.jetty.server.ResourceService.sendData(ResourceService.java:697)
at org.sparkproject.jetty.server.ResourceService.doGet(ResourceService.java:294)
at org.sparkproject.jetty.servlet.DefaultServlet.doGet(DefaultServlet.java:457)

I saw some reports on this somewhere else, and it is about conflicting servlet implementations.
I got rid of it by removing "reply", for example.
