logarithmotechnia's Introduction

Logarithmotechnia

This project is an implementation of a dataframe akin to Python's Pandas or R's tibble/dplyr/tidyverse. R's influence is significantly stronger for this project, although I do borrow ideas from Pandas as well.

The main advantages are decent data organization, full support for NA values, and good extensibility.

Dataframes and vectors (series in Pandas) are immutable.

Supported types are: Integer, Float, Complex, String, Boolean, Time, Any and Vector (each element of which is a vector).

Loading from CSV

iris, err := dataframe.FromCSVFile("iris.csv")

To skip the first line use CSVOptionSkipFirstLine(true).

iris, err := dataframe.FromCSVFile("iris.csv", dataframe.CSVOptionSkipFirstLine(true))

If you need to pass options for the new dataframe, use CSVOptionDataframeOptions(options...).

Loading from SQL

db, err := sql.Open("sqlite3", "./test_data/items.sqlite")
if err != nil {
	...
}

tx, err := db.Begin()
if err != nil {
	...
}

df, err := FromSQL(tx, "SELECT * FROM sku", []any{})

If you need to pass options for the new dataframe, use SQLOptionDataframeOptions(options...).

Filtering rows

Filtering is done with df.Filter(whicher). The two fundamental whichers are []int, which holds the indices of the elements to keep, and []bool. With a []bool, Filter() drops the elements that correspond to false. In most cases you do not need to pass []int or []bool directly, as they are returned by many column functions.

Important: in Logarithmotechnia the first index is 1! (This betrays Logarithmotechnia's R roots.)
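To make the two whicher kinds and the 1-based indexing concrete, here is a minimal standalone sketch in plain Go. It does not use Logarithmotechnia; filterByBools and filterByIndices are hypothetical helpers, and silently skipping out-of-range indices is an assumption of this sketch, not documented library behaviour.

```go
package main

import "fmt"

// filterByBools keeps the elements whose flag is true.
// It assumes len(keep) == len(values).
func filterByBools(values []string, keep []bool) []string {
	out := make([]string, 0, len(values))
	for i, v := range values {
		if keep[i] {
			out = append(out, v)
		}
	}
	return out
}

// filterByIndices picks elements by 1-based position.
// Out-of-range indices are skipped (an assumption of this sketch).
func filterByIndices(values []string, indices []int) []string {
	out := make([]string, 0, len(indices))
	for _, idx := range indices {
		if idx >= 1 && idx <= len(values) {
			out = append(out, values[idx-1]) // position 1 is Go index 0
		}
	}
	return out
}

func main() {
	species := []string{"setosa", "versicolor", "virginica", "setosa"}
	fmt.Println(filterByBools(species, []bool{true, false, false, true}))
	fmt.Println(filterByIndices(species, []int{1, 3}))
}
```

Both calls express the same idea: a whicher describes which rows survive, either as one flag per row or as a list of 1-based positions.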

Let's select elements with "setosa" in "species" vector (column).

filteredDf := iris.Filter(iris.C("species").Eq("setosa"))

Or those with sepal length greater than 5.

filteredDf := iris.Filter(iris.C("sepal_length").Gt(5))

The other comparison functions are Neq(), Lt(), Gte() and Lte().

It is also possible to filter using a vector's Which() function.

Filtering by several conditions

What if you need to filter by two conditions at the same time? Here are two ways to do this:

filteredIris := iris.Filter(iris.C("species").Eq("setosa")).Filter(iris.C("sepal_length").Gte(5))

Or

filteredIris := iris.Filter(vector.And(
    iris.C("species").Eq("setosa"),
    iris.C("sepal_length").Gt(5),
))

The second approach is more general. What if you need to select all elements which are either of "setosa" species or have sepal length more than 5? It is easy to do by changing vector.And to vector.Or.

filteredIris := iris.Filter(vector.Or(
    iris.C("species").Eq("setosa"),
    iris.C("sepal_length").Gt(5), 
))
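Conceptually, combining conditions comes down to element-wise logic on two boolean whichers of equal length. The following plain-Go sketch illustrates that idea only; andBools and orBools are hypothetical names, and NA handling, which the real vector.And and vector.Or may implement, is left out.

```go
package main

import "fmt"

// andBools returns a[i] && b[i] for every position; it assumes equal lengths.
func andBools(a, b []bool) []bool {
	out := make([]bool, len(a))
	for i := range a {
		out[i] = a[i] && b[i]
	}
	return out
}

// orBools returns a[i] || b[i] for every position; it assumes equal lengths.
func orBools(a, b []bool) []bool {
	out := make([]bool, len(a))
	for i := range a {
		out[i] = a[i] || b[i]
	}
	return out
}

func main() {
	isSetosa := []bool{true, true, false, false}
	isLong := []bool{true, false, true, false}
	fmt.Println(andBools(isSetosa, isLong)) // [true false false false]
	fmt.Println(orBools(isSetosa, isLong))  // [true true true false]
}
```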

Filtering by function

It is also possible to filter by passing a function to a column's Which().

filteredIris := iris.Filter(iris.C("sepal_length").Which(
	func(val float64) bool {
		return val >= 5 && val < 7
	},
))

The function must have a signature supported by the vector (column) type.

Selecting dataframe subset

Select rows 10 through 20 (inclusive).

subsetIris := iris.FromTo(10, 20)
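Because positions are 1-based and the upper bound is inclusive, the range 10 to 20 covers 11 rows. A rough plain-Go equivalent on a slice might look like this; fromTo is a hypothetical helper, and clamping out-of-range bounds is an assumption of the sketch rather than documented behaviour of FromTo().

```go
package main

import "fmt"

// fromTo returns a copy of the elements at positions from..to,
// where both bounds are 1-based and inclusive.
// Bounds outside the slice are clamped (an assumption of this sketch).
func fromTo(values []int, from, to int) []int {
	if from < 1 {
		from = 1
	}
	if to > len(values) {
		to = len(values)
	}
	if from > to {
		return nil
	}
	// 1-based inclusive [from, to] maps to the Go half-open range [from-1, to).
	return append([]int(nil), values[from-1:to]...)
}

func main() {
	rows := make([]int, 25)
	for i := range rows {
		rows[i] = i + 1 // row numbers 1..25
	}
	fmt.Println(fromTo(rows, 10, 20)) // [10 11 12 13 14 15 16 17 18 19 20]
}
```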

Sorting

To sort a dataframe, use the Arrange() function. For example:

sortedBySepalLength := iris.Arrange("sepal_length")

In reverse:

sortedBySepalLengthReverse := iris.Arrange("sepal_length", dataframe.OptionArrangeReverse(true))

By two columns:

sortedBySpeciesAndSepalLength := iris.Arrange("species", "sepal_length")
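Sorting by two columns means the first column decides the order and the second one only breaks ties. The sketch below shows that rule with the standard library's sort.SliceStable on a hypothetical row struct. It illustrates the ordering only and says nothing about how Arrange() is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical two-column record used only for this illustration.
type row struct {
	species     string
	sepalLength float64
}

// arrange returns a sorted copy, ordered by species and then by sepal length.
// The input is left untouched, mirroring the immutability of dataframes.
func arrange(rows []row) []row {
	sorted := append([]row(nil), rows...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].species != sorted[j].species {
			return sorted[i].species < sorted[j].species
		}
		return sorted[i].sepalLength < sorted[j].sepalLength
	})
	return sorted
}

func main() {
	rows := []row{
		{"virginica", 6.3},
		{"setosa", 5.1},
		{"setosa", 4.9},
	}
	fmt.Println(arrange(rows)) // [{setosa 4.9} {setosa 5.1} {virginica 6.3}]
}
```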

Adding new columns

Mutate() creates a new dataframe with new columns while preserving all columns of the old one. For example, let's add a column that assigns each row to one of two buckets based on the sepal length.

bucketed := iris.Mutate(dataframe.Column{
	"bucket",
	iris.Cn("sepal_length").Apply(
		func(val float64) int {
			if val < 5 {
				return 1
			}
			
			return 2
		},
	),
})

Here you can also see an example of the vector's Apply() function, which generates a new vector from an existing one.
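The idea behind Apply() is an element-wise map from one slice to another, possibly of a different type. A minimal generic sketch in plain Go (Go 1.18 or later) could look like the following; apply is a hypothetical helper and ignores NA values, which the real vectors track.

```go
package main

import "fmt"

// apply builds a new slice by calling fn on every element of in.
// The input slice is not modified.
func apply[T, U any](in []T, fn func(T) U) []U {
	out := make([]U, len(in))
	for i, v := range in {
		out[i] = fn(v)
	}
	return out
}

func main() {
	sepalLength := []float64{4.9, 5.0, 6.3}
	buckets := apply(sepalLength, func(val float64) int {
		if val < 5 {
			return 1
		}
		return 2
	})
	fmt.Println(buckets) // [1 2 2]
}
```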

Selecting and dropping columns

The Select() function allows selecting and dropping a dataframe's columns.

Let's select species and sepal length from the iris dataset.

compactIris := iris.Select("species", "sepal_length")

Or just drop petal_length and petal_width.

compactIris := iris.Select("-petal_length", "-petal_width")

It is also possible to use column indices instead of names.

compactIris := iris.Select(5, 1)

Changing order of columns

Make "species" column appear before "sepal_length":

relocated := iris.Relocate("species", dataframe.OptionBeforeColumn("sepal_length"))

Or "petal_length" and "petal_width" after "species":

relocated := iris.Relocate("petal_length", "petal_width", dataframe.OptionAfterColumn("species"))

Joining dataframes

There are several types of joins available: InnerJoin(), LeftJoin(), RightJoin(), FullJoin(), SemiJoin() and AntiJoin(). The last two are borrowed from dplyr. Here is an example of a left join:

joined := employee.LeftJoin(department, OptionJoinBy("DepType"))

More join examples can be found in the tests.
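For readers unfamiliar with the dplyr-style joins, the following plain-Go sketch shows the usual semantics of a left, semi and anti join on a single key. All types and functions here are hypothetical, and the right-hand side is a map, so it assumes at most one department per key. It is not how the library implements joins.

```go
package main

import "fmt"

type employee struct {
	name    string
	depType string
}

type joinedRow struct {
	name    string
	depType string
	depName string
	depNA   bool // true when no department matched (stands in for an NA value)
}

// leftJoin keeps every employee and attaches the matching department, if any.
func leftJoin(emps []employee, deps map[string]string) []joinedRow {
	out := make([]joinedRow, 0, len(emps))
	for _, e := range emps {
		depName, ok := deps[e.depType]
		out = append(out, joinedRow{name: e.name, depType: e.depType, depName: depName, depNA: !ok})
	}
	return out
}

// semiJoin keeps employees that have a match; no department columns are added.
func semiJoin(emps []employee, deps map[string]string) []employee {
	var out []employee
	for _, e := range emps {
		if _, ok := deps[e.depType]; ok {
			out = append(out, e)
		}
	}
	return out
}

// antiJoin keeps employees that have no match.
func antiJoin(emps []employee, deps map[string]string) []employee {
	var out []employee
	for _, e := range emps {
		if _, ok := deps[e.depType]; !ok {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	emps := []employee{{"Ann", "IT"}, {"Bob", "HR"}, {"Cid", "Ops"}}
	deps := map[string]string{"IT": "Technology", "HR": "People"}
	fmt.Println(leftJoin(emps, deps))
	fmt.Println(semiJoin(emps, deps))
	fmt.Println(antiJoin(emps, deps))
}
```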

Converting vectors to slices

Columns (and stand-alone vectors) can be converted to slices. For example:

data, na := iris.Cn("species").Strings()

If an element of na is true, the corresponding element of the column is an NA value.

The available conversion functions are:

  • Booleans() ([]bool, []bool)
  • Integers() ([]int, []bool)
  • Floats() ([]float64, []bool)
  • Complexes() ([]complex128, []bool)
  • Strings() ([]string, []bool)
  • Times() ([]time.Time, []bool)
  • Anies() ([]any, []bool)
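Each of the conversion functions listed above returns the data together with a parallel []bool of NA flags, so downstream code has to consult both slices. As a small sketch of that pattern, here is a mean that skips NA elements; meanSkippingNA is a hypothetical helper written in plain Go, not part of the library.

```go
package main

import "fmt"

// meanSkippingNA averages the elements whose NA flag is false.
// The second result is false when every element is NA (or the input is empty).
// It assumes len(na) == len(data).
func meanSkippingNA(data []float64, na []bool) (float64, bool) {
	sum, n := 0.0, 0
	for i, v := range data {
		if na[i] {
			continue // the stored value of an NA element is meaningless
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	data := []float64{1, 0, 3}
	na := []bool{false, true, false}
	mean, ok := meanSkippingNA(data, na)
	fmt.Println(mean, ok) // 2 true
}
```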

Converting vectors to other types

There are also similar functions for converting a vector to another type:

  • AsInteger(options ...Option) Vector
  • AsFloat(options ...Option) Vector
  • AsComplex(options ...Option) Vector
  • AsBoolean(options ...Option) Vector
  • AsString(options ...Option) Vector
  • AsTime(options ...Option) Vector
  • AsAny(options ...Option) Vector

Another way is to use the Apply() function, as shown before.

Renaming columns

To rename a column, use the Rename() function. There are several ways to specify which columns to rename and what to rename them to (check the function's comment). For example:

renamedIris := iris.Rename([]string{"sepal_width", "s_width"})

Summarization and analytical functions

Suppose you have the dataframe bucketed by "sepal_length" from the example above, and you want to find the max and min values of "petal_length" for every bucket. It is somewhat cumbersome for now, as this is a two-step operation. First, we group the bucketed dataframe by the necessary columns:

grouped := bucketed.GroupBy("bucket")

Then we summarize it:

stats := grouped.Summarize(
	grouped.C("petal_length").Min(),
	grouped.C("petal_length").Max(),
)

fmt.Println(stats)

And we get the result:

# of columns: 3, # of rows: 2

petal_length_min: [(float)]1.200, 1.000]
petal_length_max: [(float)]6.900, 4.500]
bucket: [(integer)]2, 1]
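The group-then-summarize flow boils down to collecting one aggregate per distinct group key. The plain-Go sketch below computes a min and max per bucket in a single pass. The loHi type and summarizeMinMax function are hypothetical, the input values are made up for illustration, and NA values are not handled.

```go
package main

import "fmt"

// loHi holds the smallest and largest value seen in one group.
type loHi struct {
	lo, hi float64
}

// summarizeMinMax returns the min and max of values for every distinct group key.
// It assumes len(groups) == len(values).
func summarizeMinMax(groups []int, values []float64) map[int]loHi {
	out := map[int]loHi{}
	for i, g := range groups {
		v := values[i]
		cur, seen := out[g]
		if !seen {
			out[g] = loHi{lo: v, hi: v}
			continue
		}
		if v < cur.lo {
			cur.lo = v
		}
		if v > cur.hi {
			cur.hi = v
		}
		out[g] = cur
	}
	return out
}

func main() {
	bucket := []int{1, 2, 1, 2}
	petalLength := []float64{1.4, 4.7, 1.1, 6.0}
	stats := summarizeMinMax(bucket, petalLength)
	fmt.Println(stats[1].lo, stats[1].hi) // 1.1 1.4
	fmt.Println(stats[2].lo, stats[2].hi) // 4.7 6
}
```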

logarithmotechnia's People

Contributors

tabellarius

logarithmotechnia's Issues

Factor payloads

FactorInt and FactorString payloads are needed to support categorical values. Ideally, generics should be used to choose the payload's storage type, from uint8 to uint64.
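One possible shape for such a payload, sketched in plain Go under the assumption that a factor stores distinct levels once and keeps a compact unsigned code per element. The names code, factor, newFactor and at are hypothetical and do not exist in the project.

```go
package main

import (
	"errors"
	"fmt"
)

// code lists the unsigned integer types that may back a factor.
type code interface {
	~uint8 | ~uint16 | ~uint32 | ~uint64
}

// factor stores each distinct string once and a compact code per element.
type factor[K code] struct {
	levels []string
	codes  []K
}

// newFactor encodes values, failing if K cannot address all distinct levels.
func newFactor[K code](values []string) (factor[K], error) {
	index := map[string]K{}
	f := factor[K]{codes: make([]K, 0, len(values))}
	for _, v := range values {
		c, ok := index[v]
		if !ok {
			// The next code equals the current level count; ^K(0) is the largest K.
			if uint64(len(f.levels)) > uint64(^K(0)) {
				return factor[K]{}, errors.New("too many levels for the chosen code type")
			}
			c = K(len(f.levels))
			index[v] = c
			f.levels = append(f.levels, v)
		}
		f.codes = append(f.codes, c)
	}
	return f, nil
}

// at returns the original string at 0-based position i.
func (f factor[K]) at(i int) string {
	return f.levels[f.codes[i]]
}

func main() {
	f, err := newFactor[uint8]([]string{"setosa", "virginica", "setosa"})
	if err != nil {
		panic(err)
	}
	fmt.Println(f.levels, f.codes, f.at(2)) // [setosa virginica] [0 1 0] setosa
}
```

With uint8 codes a column of three species costs one byte per row plus the level table, which is the kind of saving the issue is after.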

Function repository for Apply

Vector has an Apply() function, which applies a function with a compatible signature to a payload. It would be nice to have a repository of ready-to-use functions matching the math/string/integer... functions in the standard library, so a user could do something like this:

iris = iris.Mutate(dataframe.Column{
	"rounded_sepal_length",
	iris.Cn("sepal_length").Apply(repos.Round),
})

For functions like power, the repository function might be a constructor of functions, so it would look like this:

iris = iris.Mutate(dataframe.Column{
	"powered_sepal_length",
	iris.Cn("sepal_length").Apply(repos.Pow(2)),
})

Here repos.Pow should return a function that raises its argument to the power of 2.
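The constructor idea maps naturally onto a Go closure. The sketch below is only an illustration of the proposal: pow is a hypothetical stand-in for the not-yet-existing repos.Pow, and it assumes Apply() would accept a func(float64) float64.

```go
package main

import (
	"fmt"
	"math"
)

// pow returns a function that raises its argument to the given exponent.
// The exponent is captured by the closure, so the result takes a single value.
func pow(exponent float64) func(float64) float64 {
	return func(val float64) float64 {
		return math.Pow(val, exponent)
	}
}

func main() {
	square := pow(2)
	fmt.Println(square(3)) // 9
}
```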

Vector payload

Currently, cumulative functions do not work with grouped dataframes. The reason is that they return a vector with more than one element. The solution is to create a Vector payload, which will contain vectors. Cumulative functions will then be able to return a Vector payload whose elements hold the cumulative values of each group as a vector.
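To illustrate why a per-group cumulative function needs a vector per group, here is a plain-Go sketch of a grouped cumulative sum that yields one slice for each group key. cumSumByGroup is a hypothetical helper and the map of slices merely stands in for the proposed Vector payload.

```go
package main

import "fmt"

// cumSumByGroup returns, for every group key, the running sum of that group's
// values in their original order. It assumes len(groups) == len(values).
func cumSumByGroup(groups []string, values []float64) map[string][]float64 {
	out := map[string][]float64{}
	for i, g := range groups {
		prev := 0.0
		if n := len(out[g]); n > 0 {
			prev = out[g][n-1] // last running total of this group
		}
		out[g] = append(out[g], prev+values[i])
	}
	return out
}

func main() {
	groups := []string{"a", "b", "a", "a"}
	values := []float64{1, 2, 3, 4}
	sums := cumSumByGroup(groups, values)
	fmt.Println(sums["a"]) // [1 4 8]
	fmt.Println(sums["b"]) // [2]
}
```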

Unpack() for Vector and Any payloads

We should be able to unpack a dataframe by a column with a Vector payload (#86).

Unpackable interface for Any and Vector payloads.

This (and #86) will make it possible to load array types from Postgres and Elastic in the future.

Refactor built-in whichers into external ones

The way I went initially was to make a fat Vector interface that was supposed to contain all necessary functions and interfaces. For example, it contains Odd(). After some consideration, I decided to go the way I did in #94, making some whichers external to vectors and placing them in the "which" package. We will proceed in this direction, so it is necessary to move as many whichers as possible from Vector to the "which" package. The same will be done with summarizers as well.
