Git Product home page Git Product logo

explorer's Introduction

Explorer

CI Documentation Package

Explorer brings series (one-dimensional) and dataframes (two-dimensional) for fast data exploration to Elixir.

Getting started

Inside an Elixir script or Livebook:

Mix.install([
  {:explorer, "~> 0.5.0"}
])

Or in the mix.exs file of your application:

def deps do
  [
    {:explorer, "~> 0.5.0"}
  ]
end

A glimpse of the API

We have two ways to represent data with Explorer:

  • using a series, that is similar to a list, but is guaranteed to contain items of one data type only - or one dtype for short. Notice that nil values are permitted in series of any dtype.

  • using a dataframe, that is just a way to represent one or more series together, and work with them as a whole. The only restriction is that all the series shares the same size.

A series can be created from a list:

fruits = Explorer.Series.from_list(["apple", "mango", "banana", "orange"])

Your newly created series is going to look like:

#Explorer.Series<
  Polars[4]
  string ["apple", "mango", "banana", "orange"]
>

And you can, for example, sort that series:

Explorer.Series.sort(fruits)

Resulting in the following:

#Explorer.Series<
  Polars[4]
  string ["apple", "banana", "mango", "orange"]
>

Dataframes

Dataframes can be created in two ways:

  • by reading from files or memory using the IO functions. This is by far the most common way to load dataframes in Explorer. We accept Parquet, IPC, CSV, and NDJSON files.

  • by using the Explorer.DataFrame.new/2 function, that is neat for small experiments. We are going to use this function here.

You can pass either series or lists to it:

mountains = Explorer.DataFrame.new(name: ["Everest", "K2", "Aconcagua"], elevation: [8848, 8611, 6962])

Your dataframe is going to look like this:

#Explorer.DataFrame<
  Polars[3 x 2]
  name string ["Everest", "K2", "Aconcagua"]
  elevation integer [8848, 8611, 6962]
>

It's also possible to see a dataframe like a table, using the Explorer.DataFrame.table/2 function:

Explorer.DataFrame.table(mountains)

Prints:

+-------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 2] |
+---------------------+---------------------+
|        name         |      elevation      |
|      <string>       |      <integer>      |
+=====================+=====================+
| Everest             | 8848                |
+---------------------+---------------------+
| K2                  | 8611                |
+---------------------+---------------------+
| Aconcagua           | 6962                |
+---------------------+---------------------+

And now I want to show you how to filter our dataframe. But first, let's require the Explorer.DataFrame module and give a short name to it:

require Explorer.DataFrame, as: DF

The "require" is needed to load the macro features of that module. We give it a shorter name to simplify our examples.

Now let's go to the filter. I want to filter the mountains that are above the mean elevation in our dataframe:

DF.filter(mountains, elevation > mean(elevation))

You can see that we can refer to the columns using their names, and use functions without define them. This is possible due the powerful Explorer.Query features, and it's the main reason we need to "require" the Explorer.DataFrame module.

The result is going to look like this:

#Explorer.DataFrame<
  Polars[2 x 2]
  name string ["Everest", "K2"]
  elevation integer [8848, 8611]
>

There is an extensive guide that you can play with Livebook: Ten Minutes to Explorer

You can also check the Explorer.DataFrame and Explorer.Series docs for further details.

Features and design

Explorer high-level features are:

  • Simply typed series: :binary, :boolean, :category, :date, :datetime, :float, :integer, :string, and :time.

  • A powerful but constrained and opinionated API, so you spend less time looking for the right function and more time doing data manipulation.

  • Pluggable backends, providing a uniform API whether you're working in-memory or (forthcoming) on remote databases or even Spark dataframes.

  • The first (and default) backend is based on NIF bindings to the blazing-fast polars library.

The API is heavily influenced by Tidy Data and borrows much of its design from dplyr. The philosophy is heavily influenced by this passage from dplyr's documentation:

  • By constraining your options, it helps you think about your data manipulation challenges.

  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.

  • It uses efficient backends, so you spend less time waiting for the computer.

The aim here isn't to have the fastest dataframe library around (though it certainly helps that we're building on Polars, one of the fastest). Instead, we're aiming to bridge the best of many worlds:

  • the elegance of dplyr
  • the speed of polars
  • the joy of Elixir

That means you can expect the guiding principles to be 'Elixir-ish'. For example, you won't see the underlying data mutated, even if that's the most efficient implementation. Explorer functions will always return a new dataframe or series.

Contributing

Explorer uses Rust for its default backend implementation, and while Rust is not necessary to use Explorer as a package, you need Rust tooling installed on your machine if you want to compile from source, which is the case when contributing to Explorer.

We require Rust Nightly, which can be installed with Rustup.

Once you have made your changes, run mix ci, to lint and format both Elixir and Rust code.

Sponsors

Amplified

explorer's People

Contributors

benjamin-philip avatar blutack avatar cigrainger avatar cristineguadelupe avatar drowzy avatar ingvardsen avatar jeregrine avatar jhonatannunessilva avatar jonatanklosko avatar josevalim avatar joshuataylor avatar kimjoaoun avatar linusdm avatar michalslaski avatar mixwui avatar mlineen avatar nallwhy avatar nyo16 avatar pgeraghty avatar philss avatar polvalente avatar ricpruss avatar sabiwara avatar sasikumar87 avatar sobolevn avatar thbar avatar thmsmlr avatar tmr08c avatar treebee avatar xavirodri avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.