
daru's Introduction

SciRuby meta gem

Tools for Scientific Computing in Ruby

Description

This gem acts as a meta gem which collects and provides multiple scientific gems, including numeric and visualization libraries.

Getting started

Installation:

gem install sciruby
gem install sciruby-full

If you want to have a full-blown installation, install sciruby-full.

Start a notebook server:

iruby notebook

Enter commands:

require 'sciruby'
# Scientific gems are autoloaded; you can use them directly!
plot = Nyaplot::Plot.new
sc = plot.add(:scatter, [0,1,2,3,4], [-1,2,-3,4,-5])

Take a look at gems.yml or the list of gems to see which gems are included in sciruby-full.

License

Copyright (c) 2010 onward, The Ruby Science Foundation.

All rights reserved.

SciRuby is licensed under the BSD 3-clause license. See LICENSE for details.

Donations

Support a SciRuby Fellow via Pledgie.

daru's People

Contributors

ananyo2012, athityakumar, baarkerlounger, cyrillefr, dansbits, deepakkoli93, emrys-merlin, genya0407, jpaulgs, kojix2, lokeshh, matugm, mrkn, ncs1, paisible-wanderer, parthm, phitherekreborn, rainchen, rohitner, shahsaurabh0605, shekharrajak, sivagollapalli, takkanm, v0dro, weqopy, wlevine, yaoyuyang, yui-knk, yuki-inoue, zverok


daru's Issues

Daru::Vector#tail broken

> Daru::Vector.new([1,2,3]).tail(1)
=> TypeError: no implicit conversion from nil to integer

With Vector#head, you simply cannot request more elements than the Vector contains; for #tail, I have yet to find a Vector instance for which it works at all.
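
A minimal sketch of the kind of bounds clamp #tail presumably needs, written here as a standalone monkey patch rather than daru's actual code (the method name safe_tail is hypothetical):

require 'daru'

class Daru::Vector
  # Hypothetical helper, not daru's implementation: never request more
  # elements than the Vector holds.
  def safe_tail(q = 10)
    q = [q, size].min
    self[(size - q)..(size - 1)]
  end
end

Daru::Vector.new([1, 2, 3]).safe_tail(1).to_a # => [3]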

from_sql not working

DataFrame.from_sql is not working for me with the following:

conn = DBI.connect("DBI:SQLite3:daru_test")
df = Daru::DataFrame.from_sql(conn, 'select * from accounts')

I get this error:

RuntimeError: can't add a new key into hash during iteration
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/columninfo.rb:49:in `block in initialize'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/columninfo.rb:42:in `each_key'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/columninfo.rb:42:in `initialize'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/handles/statement.rb:185:in `new'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/handles/statement.rb:185:in `block in column_info'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/handles/statement.rb:185:in `collect'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/dbi-0.4.5/lib/dbi/handles/statement.rb:185:in `column_info'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/daru-0.1.0/lib/daru/io/io.rb:131:in `from_sql'
    from /Users/daniel/.rvm/gems/ruby-2.2.2/gems/daru-0.1.0/lib/daru/dataframe.rb:72:in `from_sql'
    from (irb):9
    from /Users/daniel/.rvm/rubies/ruby-2.2.2/bin/irb:11:in `<main>'

This appears to be an issue with the way the dbi column_info method is being used in IO.from_sql.
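
For context, the underlying Ruby error can be reproduced without DBI at all; a minimal, hypothetical repro:

h = { a: 1 }
h.each_key { |k| h[:b] = 2 }   # RuntimeError: can't add a new key into hash during iteration
h.keys.each { |k| h[:b] = 2 }  # iterating over a snapshot of the keys avoids the error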

I forked the repository and have a working version now but wanted to make sure this isn't a known issue before submitting a PR. Any guidance is appreciated.

Thanks!
Dan

Feature Request

Would you ever consider adding a singular value decomposition (SVD) method? It's useful in latent semantic indexing, among other things, and there isn't currently a good plain-Ruby SVD implementation anywhere (that I can find).

Sorting vectors and dataframes with nils

I was starting to work on figuring out a solution for sorting vectors and data frames with nil values. I noticed you've got a pending test - https://github.com/v0dro/daru/blob/master/spec/vector_spec.rb#L488.

I'd argue that nils should be placed at the beginning of the vector, for a few reasons (a minimal sketch of nils-first ordering follows the list):

  1. SQL sorts nulls at the beginning.
  2. Empty ('') and blank (' ') strings are functionally similar to nil when dealing with datasets, and they sort before all other strings. It would be strange to have empty/blank strings at the top and nils at the bottom.
  3. nil data is usually "special" and if it's at the top of a sort, you're more likely to notice.
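
A minimal sketch of nils-first ordering on a plain Ruby Array (this is not daru's sorting code, just the ordering argued for above):

values = ['b', nil, 'a', nil, '']
values.sort_by { |v| v.nil? ? [0] : [1, v] }
# => [nil, nil, "", "a", "b"] -- nils first, then empty strings, then everything else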

Daru::Index#[] not returning proper sub-Vector on Integer Index

On a Daru::Vector whose index holds Integer labels, the #[] method doesn't return the expected sub-Vector when a Range is passed as a parameter.

Here's an example

In [3]: require 'daru'
index = Daru::Index.new([5,4,3,2,1])
vector = Daru::Vector.new(['five','four','three','two','one'], name: "TestVector", index: index)
   ...: 
Out[3]: 

#<Daru::Vector:70133768745100 @name = TestVector @size = 5 >
           TestVector
         5       five
         4       four
         3      three
         2        two
         1        one


In [4]: vector[5..3]
   ...: 
Out[4]: 

#<Daru::Vector:70133765384060 @name = TestVector @size = 0 >
           TestVector

But passing the range 0..3 returns a Vector, treating the range positionally:

In [5]: vector[0..3]
   ...: 
Out[5]: 

#<Daru::Vector:70133765251080 @name = TestVector @size = 4 >
           TestVector
         5       five
         4       four
         3      three
         2        two

Daru::Index#[] assumes that, because the index is of type Integer, it follows the default positional indexing scheme starting from 0, but that is not the case here.

Is this kind of implementation intentional and necessary? @v0dro

recode_rows does not seem to work when Daru::DataFrame is indexed by DateTimeIndex

Greetings,

I would like to recode a DataFrame indexed by a DateTimeIndex. This produces a

NoMethodError: undefined method `keys' for nil:NilClass

error. A minimal example reproducing the error looks as follows:

require 'narray'
require 'daru'
include Daru

index = DateTimeIndex.date_range(start: "2016-02-11", periods: 3)
df = DataFrame.new({a: [1,2,3],
  b: [4,5,6]},
  index: index)

df.recode_rows do |row|
  row
end

I am using daru-0.1.1 with ruby-2.2.1p85.
If I am missing any crucial information, I apologize; I will provide it as soon as prompted.
Best regards.

Vector#[]= with an unknown index

v = Daru::Vector.new([1,2,3], index: [:a, :b, :c])
v[:a] = 999 # from docs, works 
v[:x] = 999 # fails with cryptic undefined method `each' for :x:Symbol

The error message displayed should precisely convey what went wrong.

Accessing multiple vectors in Daru::DataFrame with `#[]` or `#vector` is extremely slow

In a Daru::DataFrame with 1000 rows and 3 columns (named a, b and c) I have benchmarked
v1 = df.vector[:c] against
v2 = df.vector[:a..:b] and against
v3 = df.vector[:a, :c].
Accessing one vector takes a fraction of a millisecond, but accessing two vectors at a time takes seconds, or even minutes. The elapsed times don't change if I use df[...] instead of df.vector[...].

Here are the results:

*** Extract one vector :c ***
Elapsed time: 0.011007 milliseconds
*** Extract a range :a..:b of vectors ***
Elapsed time: 2400.41752 milliseconds
*** Extract multiple vectors, :a,:c ***
Elapsed time: 2414.85577 milliseconds

Here is the same experiment for a DataFrame with 10,000 rows and 3 columns. It took more than 8 minutes in total on my Intel Core i5 laptop with 12 GB RAM:

*** Extract one vector :c ***
Elapsed time: 0.027409000000000003 milliseconds
*** Extract a range :a..:b of vectors ***
Elapsed time: 256343.461203 milliseconds
*** Extract multiple vectors, :a,:c ***
Elapsed time: 251004.597329 milliseconds

This is possibly related to issue #12.

Also, please let me know if you know of a workaround, because this issue makes the user-friendly methods of mixed_models practically unusable for data with more than 10,000 rows.

(I am currently using the stable release of the daru gem, because gsl-nmatrix fails to install on my machine for some reason. So, I'm sorry if this is already fixed in the most recent development version.)
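
For reference, a minimal sketch of how timings like the above can be reproduced (assumed setup; the column names :a, :b, :c and the row count mirror the report, not the reporter's exact script):

require 'daru'
require 'benchmark'

n = 1_000
df = Daru::DataFrame.new(a: (1..n).to_a, b: (1..n).to_a, c: (1..n).to_a)

Benchmark.bm(26) do |x|
  x.report('single vector :c')        { df[:c] }
  x.report('range of vectors :a..:b') { df[:a..:b] }
  x.report('multiple vectors :a, :c') { df[:a, :c] }
end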

Vector metadata feature

I'd like to be able to associate arbitrary metadata with vectors in a way that as vectors get duped and passed around between dataframes, their metadata remains intact. A use case would be that I read a date from a CSV file. Some of the dates may be invalid or contain special characters, so I need to start out by just reading it as a string. Then after some cleaning and error handling, I want to enforce that the vector is actually meant to be parsed as a date with a format specified in the metadata.

Thoughts?

Faster loading from CSV files.

I'm really loving the idea of this project. My only concern is performance. Reading from a 4,000 line CSV file is taking 7s (WAY too long if I'm going to try to scale to even small data sizes on the order of 100k rows). I was going to try using NMatrix, but don't see how I could try that when reading from a CSV. For example, how could I convert something like this to use NMatrix?

df = Daru::DataFrame.from_csv 'myfile.txt', { headers: true, col_sep: "\t", encoding: "ISO-8859-1:UTF-8" }

Any other ideas on how to improve performance?
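
One workaround that may be faster (a hedged sketch, assuming the tab-separated layout from the call above): parse the file with Ruby's CSV library and build the DataFrame from plain Arrays, which avoids some per-value Vector overhead.

require 'csv'
require 'daru'

table = CSV.read('myfile.txt', headers: true, col_sep: "\t",
                 encoding: 'ISO-8859-1:UTF-8')
columns = Hash[table.headers.map { |h| [h, table[h]] }] # header => Array of values
df = Daru::DataFrame.new(columns, order: table.headers)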

DataFrame::delete_row seems to be broken

DataFrame::delete_row behaves in unexpected ways; see the nils in the example below.

I am currently using the daru gem, because gsl-nmatrix fails to get installed on my machine for some reason. So, I'm sorry if this is already fixed in the most recent development version.

[30] pry(main)> df = Daru::DataFrame.new([(1..10).to_a, (1..10).to_a, (1..10).to_a], 
                                                                    order: [:a, :b, :c])
=> 
#<Daru::DataFrame:70169385618440 @name = 331f90f7-5b92-4d7c-92c3-1fb572917d49 @size = 10>
                    a          b          c 
         0          1          1          1 
         1          2          2          2 
         2          3          3          3 
         3          4          4          4 
         4          5          5          5 
         5          6          6          6 
         6          7          7          7 
         7          8          8          8 
         8          9          9          9 
         9         10         10         10 

[31] pry(main)> df.delete_row(0)
=> 9
[32] pry(main)> df
=> 
#<Daru::DataFrame:70169385618440 @name = 331f90f7-5b92-4d7c-92c3-1fb572917d49 @size = 9>
                    a          b          c 
         1          3          3          3 
         2          4          4          4 
         3          5          5          5 
         4          6          6          6 
         5          7          7          7 
         6          8          8          8 
         7          9          9          9 
         8         10         10         10 
         9        nil        nil        nil 

[33] pry(main)> df.delete_row(2)
=> 8
[34] pry(main)> df
=> 
#<Daru::DataFrame:70169385618440 @name = 331f90f7-5b92-4d7c-92c3-1fb572917d49 @size = 8>
                    a          b          c 
         1          3          3          3 
         3          6          6          6 
         4          7          7          7 
         5          8          8          8 
         6          9          9          9 
         7         10         10         10 
         8        nil        nil        nil 
         9        nil        nil        nil

conflict with NArray's NMatrix

Hi @v0dro

Thank you for your great gem!
However, daru conflicts with Tanaka's NArray.

 Error:  /narray-0.6.1.1/nmatrix.rb:8:in `<top (required)>': uninitialized constant NArray (NameError)

daru.rb

lib_underscore = library.to_s.gsub(/-/, '\_').gsub('/', '\_')
create_has_library :'nmatrix/nmatrix'

daru/accessors/nmatrix_wrapper.rb

end if Daru.has_nmatrix_nmatrix?

This works, but it is not a clean solution.

Plotting timeseries

Hi @v0dro,

Cheers, this is amazing.

Is there a strong dependency on "gnuplotrb" to plot time series?

Consider the following example:

historical_data = [
  ["2015-08-30", 309],
  ["2015-09-06", 331],
  ["2015-09-13", 279],
  ["2015-09-20", 294],
  ["2015-09-25", 253]
].map { |element|
  Hash.new.tap do |hash|
    hash["date"]  = element[0]
    hash["total"]  = element[1]
  end
}

df = Daru::DataFrame.new(historical_data)
df.plot(type: :line, y: :date, x: :total) do |plot, diagrams|
 ...
end

This doesn't seem to plot the time series out of the box with the dates on the y axis. I've tried converting the date to Date or DateTime. Am I missing something, or does this only work with datetime ranges and via gnuplotrb?

Thanks

Generic solution for sorting with nils?

I've been working on a way to improve the performance of joins and have once again run into an issue with sorting data that contains nil values. I've worked out what I think is a generic solution for sorting nil data. It does involve monkey patching NilClass, but I think it's pretty tame. Do you see any issues if we monkey patch NilClass and use Sortable when sorting DataFrames & Vectors? (Alternatively, we could use refinements instead of monkey patches).

# Puts nils before anything else
class NilClass
  include Comparable

  def <=>(other)
    other.nil? ? 0 : -1
  end
end

class Sortable
  include Comparable

  def initialize(value)
    @value = value
  end

  attr_reader :value

  def <=>(other)
    # when @value is nil, use <=> from NilClass
    # when other.value is nil, reverse comparison order and then use <=> from NilClass
    @value <=> other.value || -(other.value <=> @value)
  end
end

numbers_and_nil = (0.upto(10).to_a + [nil]*10).shuffle
puts numbers_and_nil.sort_by { |v| Sortable.new(v) }
# => [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

strings_and_nil = (0.upto(10).to_a.map { |v| v.to_s } + [nil]*10).shuffle
puts strings_and_nil.sort_by { |v| Sortable.new(v) }
# => [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, "0", "1", "10", "2", "3", "4", "5", "6", "7", "8", "9"]

arrays_and_nil = (0.upto(10).to_a + [nil]*10).shuffle.map { |v| [v] }
puts arrays_and_nil.sort_by { |v| Sortable.new(v) }
# => [[nil], [nil], [nil], [nil], [nil], [nil], [nil], [nil], [nil], [nil], [0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

arrays_and_nil_in_2nd_position = (0.upto(10).to_a + [nil]*10).shuffle.map { |v| [(v || 1).to_s, v] }
puts arrays_and_nil_in_2nd_position.sort_by { |v| Sortable.new(v) }
# => [["0", 0], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", nil], ["1", 1], ["10", 10], ["2", 2], ["3", 3], ["4", 4], ["5", 5], ["6", 6], ["7", 7], ["8", 8], ["9", 9]]

Covariance calculation seems wrong

[52] pry(main)> dd.cov
=>
#<Daru::DataFrame:70361442947720 @name = ade7e257-db8a-467e-9a02-6d14e9ad824d @size = 3>
                   ax         ay         az
        ax 0.00014379 -1.0361854 -7.3389159
        ay -1.0361854 0.00012949 -3.6098704
        az -7.3389159 -3.6098704 7.21220195

pry(main)> dd.min
=>
#<Daru::Vector:70361441216360 @name = nil @size = 3 >
                                      nil
                  ax -0.03294628633705924
                  ay -0.02795710559369226
                  az -0.02536420404774847

pry(main)> dd.max
=>
#<Daru::Vector:70361441840300 @name = nil @size = 3 >
                                      nil
                  ax  0.03276209162313101
                  ay  0.03073589051661277
                  az 0.024760295165908133

pry(main)> dd.std
=>
#<Daru::Vector:70361441679800 @name = nil @size = 3 >
                                      nil
                  ax 0.011991490010508732
                  ay 0.011379778474542456
                  az 0.008492468398766329

I'm fairly certain the covariance for the third row/column (:az) is wrong here.
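
For what it's worth, a quick sanity check on the reported numbers (plain arithmetic, no daru calls): by the Cauchy-Schwarz inequality, |cov(x, y)| <= std(x) * std(y), so with standard deviations around 0.01 every covariance entry should be on the order of 1e-4 at most.

std_ax = 0.011991490010508732
std_az = 0.008492468398766329
bound  = std_ax * std_az  # => ~1.018e-04
# The reported cov(ax, az) is -7.3389159, which exceeds this bound by several
# orders of magnitude, so the off-diagonal entries cannot be covariances of the
# same data that produced these standard deviations.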

Vector.uniq not working when pulling vector from DataFrame

I'm getting an exception when I try to find unique values from a column in a data frame. My data frame is built from a csv similar to this:

PO_NAME STATE   ZCTA    ZIP ZIP_TYPE
Merizo  GU  96916   96916   Post Office or large volume customer
Inarajan    GU  96917   96917   Post Office or large volume customer
Agat    GU  96928   96928   Post Office or large volume customer

An excel version of the data can be found here: http://www.udsmapper.org/zcta-crosswalk.cfm

If I run the following code:

zip_to_zcta = Daru::DataFrame.from_csv('~/data/zip_to_zcta.csv')
zip_to_zcta['ZIP'].uniq

I get the following error:

IndexError: Expected index size >= vector size. Index size : 41038, vector size : 41270

Daru::Vector#head broken

Daru::Vector.new([1,2,3]).head
=> TypeError: no implicit conversion from nil to integer

This is probably because head (or more importantly, Vector#[]) doesn't check that the slice is a valid size.

group_by fails with nils

Do you think it would be possible to support group_by when the value is nil?

Thanks!

df = Daru::DataFrame.new(a: ["a",nil,"b","b","c"])
df.group_by([:a]).size
ArgumentError: comparison of Array with Array failed
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/daru-0.1.1/lib/daru/core/group_by.rb:21:in `sort'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/daru-0.1.1/lib/daru/core/group_by.rb:21:in `initialize'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/daru-0.1.1/lib/daru/dataframe.rb:1213:in `new'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/daru-0.1.1/lib/daru/dataframe.rb:1213:in `group_by'
<main>:1:in `<main>'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/backend.rb:44:in `eval'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/backend.rb:44:in `eval'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/backend.rb:12:in `eval'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/kernel.rb:87:in `execute_request'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/kernel.rb:47:in `dispatch'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/kernel.rb:37:in `run'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/command.rb:70:in `run_kernel'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/lib/iruby/command.rb:34:in `run'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/gems/iruby-0.2.7/bin/iruby:5:in `<top (required)>'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/bin/iruby:23:in `load'
/Users/gnilrets/git/itk/noodling/brdd/vendor/bundle/bin/iruby:23:in `<main>'
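
A possible interim workaround (hedged sketch; the sentinel string '(nil)' is arbitrary): replace nils with a sentinel value before grouping, since assigning a plain Array back into a DataFrame column works.

require 'daru'

df = Daru::DataFrame.new(a: ['a', nil, 'b', 'b', 'c'])
df[:a] = df[:a].to_a.map { |v| v.nil? ? '(nil)' : v } # nils become a sentinel value
df.group_by([:a]).size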

Poor scaling of filter functionality

I'm finding really poor non-linear scaling with the filter functionality in a DataFrame. Here's an example

require 'daru'
require 'benchmark'

def bench(name)
  time = Benchmark.realtime { yield }
  puts "#{name}: #{time}"
end


bench :filter_example do
  df = nil
  bench :create_df do
    to_df = 1.upto(10).inject({}) { |h, icol| h["col#{icol}".to_sym] = 1.upto(10000).to_a; h }
    df = Daru::DataFrame.new(to_df)
  end

  0.upto(4).each do |i|
    n = 100*2**i
    bench "filter #{n}" do
      df.filter(:row) { |r| r[:col1] < n }
    end
  end
end

Results:

create_df: 0.101626406001742    
filter 100: 0.4592461759893922  
filter 200: 0.6331805069930851  
filter 400: 1.573929047997808   
filter 800: 5.534721941992757   
filter 1600: 24.703956208002637 

When the number of records retained doubles from 800 to 1600, the time required to filter increases by almost a factor of 5!

Is there a way to access a value by row and column?

Hello, developers.

Very nice gem, my sincere congratulations.

Is there a way to access any value by row and column? Unfortunately, I cannot find it in the documentation; maybe I just missed it.

For example, I want to access a value as dataframe[:row][:column]:

  def foo
     dataset = create_test_dataset
     value = dataset[:req1][:DET]  # how to do it?
     # skipped ...
  end

  def create_test_dataset
    rows = []
    index = [:type, :DET, :RET, :FTR]
    rows << Daru::Vector.new([:ILF, 20, 2, nil], :name => :req1, :index => index)
    rows << Daru::Vector.new([:ILF, 10, 3, nil], :name => :req2, :index => index)
    rows << Daru::Vector.new([:EIF, 10, 3, nil], :name => :req3, :index => index)
    rows << Daru::Vector.new([:EIF, 20, 2, nil], :name => :req4, :index => index)
    rows << Daru::Vector.new([:EI,  20, nil, 2], :name => :req5, :index => index)
    rows << Daru::Vector.new([:EO,  10, nil, 1], :name => :req6, :index => index)
    rows << Daru::Vector.new([:EQ,  10, nil, 3], :name => :req7, :index => index)
    Daru::DataFrame.rows(rows, :name => :fpa)
  end
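
A hedged sketch of two access paths that should work (assuming daru's row accessor, dataframe.row[...], and the dataset built above; positional row access is used here because whether the vector names become row labels depends on how DataFrame.rows builds the index):

dataset = create_test_dataset
dataset[:DET]          # the :DET column as a Daru::Vector
dataset.row[0]         # the first row as a Daru::Vector
dataset.row[0][:DET]   # one cell: row first, then column => 20
dataset[:DET][0]       # the same cell: column first, then row => 20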

Default order of vectors

What are your thoughts on removing the alphabetic sort of vectors when initializing new data and just going with the order they come in?

https://github.com/v0dro/daru/blob/master/lib/daru/dataframe.rb#L2336

I read a lot of data from CSV files and would prefer the columns to show up in the same order that they appear in the file. Data typically comes in with the more important fields on the left (IDs and names, for example). We could probably make some changes that would only affect CSVs, but I'm wondering if it makes sense to make it more general.

How to compare data from two DataFrames

Hello, developers.

I am using Daru::DataFrame for a few tables that are generated by other code, and now I need to compare two tables by id.

Suppose I have two dataframes: one is loaded from a CSV file and the other is generated empty.

#<Daru::DataFrame:20391612 @name = old @size = 3>
           id field1 field2 field3 
     0   req1      1      2      3 
     1   req2      1      2      3 
     2   req3      1      2      3 

#<Daru::DataFrame:20391612 @name = new @size = 4>
           id field1 field2 field3 
     0   req1    nil    nil    nil 
     1   req2    nil    nil    nil 
     2   req3    nil    nil    nil 
     3   req4    nil    nil    nil 

Is there any way to transfer/copy the old [col][row] values from the loaded data to the newly generated frame? Something like

  new[:req1][:field1] = old[:req1][:field1] unless old[:req1][:field1].nil?

to get

#<Daru::DataFrame:20391612 @name = new @size = 4>
           id field1 field2 field3 
     0   req1      1      2      3 
     1   req2      1      2      3 
     2   req3      1      2      3 
     3   req4    nil    nil    nil 

P.S.
Before asking this question I had written code using [row][col] indexing, but now I see that the index is not written to the CSV, and I haven't found a way to set the index to the id column.

      def copy(source)
        each_row_col_val {|row, col, val|
          @dataframe[col][row] = source[col][row] unless source[col][row].nil?
        }
      end

      def each_row_col_val(&block)
        rows = @dataframe.index.map.to_a
        cols = @dataframe.vectors.map.to_a
        rows.each {|row|
          cols.each {|col| yield(row, col, @dataframe[col][row]) if block_given?}
        }
      end

problem with pivot table

Hi, @v0dro

Thanks for your great work! But I have a problem with pivot_table.

I have the following code:

vectors = { visits: [], country:[], city:[], week: [], locale: [], os: [], bounce: [] }
visits.each do |row|
  vectors[:visits].push row.user_count
  vectors[:week].push row.week.to_i
  vectors[:country].push row.country || 'Unknown'
  vectors[:city].push row.city || 'Unknown'
  vectors[:locale].push row.locale || 'Unknown'
  vectors[:os].push row.os || 'Unknown'
  vectors[:bounce].push rand(100)
end
df = Daru::DataFrame.new( vectors )
index = [:week, :country, :os]
table = df.pivot_table(index: index, values: [:visits, :bounce], agg: :sum)

result:
https://gist.github.com/vidok/2845e64b5832746159a8
or write_csv method result https://gist.github.com/vidok/1f61efcb17262bcfbd27

In the result log we can see that some keys contain a Daru::Vector rather than numerical values.
If I reduce the number of index dimensions from 3 to 2 (e.g. index = [:week, :country]), the pivot table is valid.

Why does this happen?

Nyaplot required?

If I try to require "daru", it gives me the following error message (twice):

cannot load such file -- nyaplot
cannot load such file -- nyaplot

Why is nyaplot required? Shouldn't it be possible to do analyses without it installed?

DataFrame.from_excel name of first column nil

To replicate the problem:

require 'daru'
# Create DataFrame
df = Daru::DataFrame.new({
  'col0' => [1,2,3,4,5,6],
  'col2' => ['a','b','c','d','e','f'],
  'col1' => [11,22,33,44,55,66]
  }, 
  index: ['one', 'two', 'three', 'four', 'five', 'six'], 
  order: ['col0', 'col1', 'col2']
)
# Check first column name
df['col0'].name
#"col0"

# Write an excel with DataFrame
df.write_excel('df.xls')
# Create DataFrame from excel
df_xls = Daru::DataFrame.from_excel('df.xls')

# Problem: check first column name
df_xls[:col0].name
# nil

# Other column names and vectors are OK
df_xls.vectors.to_a
# [:col0, :col1, :col2]

Daru::DataFrame.from_csv change original order of the columns

# Create csv with headers unordered
# Typical CSV example from another source
$> aux = Daru::DataFrame.new(
  {
    'XBeer' => ['Kingfisher', 'Snow', 'Bud Light', 'Tiger Beer', 'Budweiser'],
    'Gallons sold' => [500, 400, 450, 200, 250]
  },
  index: ['India', 'China', 'USA', 'Malaysia', 'Canada']
)
$> aux.vectors = Daru::Index.new(["Gallons sold", "Beer"])
# Now headers are unordered
$> aux
=>
#<Daru::DataFrame:40144608 @name = f6ec2382-11d8-4446-8325-802d1cdca032 @size = 5>
           Gallons so      Beer
     India        500 Kingfisher
     China        400       Snow
       USA        450  Bud Light
  Malaysia        200 Tiger Beer
    Canada        250  Budweiser

$> aux.write_csv('df_original_order.csv')
##############################################################
# Now with this file we can reproduce the problem
# Open the csv from other source
$> df = Daru::DataFrame.from_csv('df_original_order.csv')
# Doing some computation or concat new data
# ...
# Now saving the results (headers are ordered)
$> df.write_csv('df_other_order.csv')

Why can't I keep the original order?
How can I read a CSV without changing the order of the columns, or at least write the CSV in an arbitrary order?

Daru::DataFrame::reindex_vectors! seems to not work properly

Reindexing with Daru::DataFrame::reindex_vectors! does not seem to work. Here is an example:

(1) Define a DataFrame

> df = Daru::DataFrame.new([[1,2,3],[13,13,13]], order: ['a','b']) 
=> 
#<Daru::DataFrame:70332675355200 @name = 662bfdbb-8894-49c6-92fc-a1f4d58e9a3c @size = 3>
                    a          b 
         0          1         13 
         1          2         13 
         2          3         13 

(2) Reindex the vectors using df.reindex_vectors!([:b, :a])

(3) Now, what was 'a' is 'b', and vice versa:

> df
=> 
#<Daru::DataFrame:70332675355200 @name = 662bfdbb-8894-49c6-92fc-a1f4d58e9a3c @size = 3>
                    b          a 
         0          1         13 
         1          2         13 
         2          3         13 

(4) But for some reason, when I try to access the vectors, I get the results according to the original indexing (prior to the reindexing):

> df[:a]
=> 
#<Daru::Vector:70332675314520 @name = nil @size = 3 >
    nil
  0   1
  1   2
  2   3

> df[:b]
=> 
#<Daru::Vector:70332675310020 @name = nil @size = 3 >
    nil
  0  13
  1  13
  2  13

(Additional problem) Assigning other indices than :a and :b doesn't work:

> df.reindex_vectors!([:x, :y])
=> #<Daru::Index:0x007fef341f34e0
 @index_class=Symbol,
 @relation_hash={:x=>nil, :y=>nil},
 @size=2>
> df
=> #<Daru::DataFrame:0x3ff79a217240>

MultiIndex [] method working with invalid indexes

[1] pry(main)> d =        Daru::MultiIndex.from_tuples([
[1] pry(main)*     [:c,:one,:bar],            
[1] pry(main)*     [:c,:one,:baz],            
[1] pry(main)*     [:c,:two,:foo],            
[1] pry(main)*     [:c,:two,:bar]            
[1] pry(main)* ])     
=> Daru::MultiIndex:18229240 (levels: [[:c], [:one, :two], [:bar, :baz, :foo]]
labels: [[0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 0]])
[2] pry(main)> v = Daru::Vector.new([1,2,3,4], index: d)
=> 
#<Daru::Vector:18002080 @name = nil @size = 4 >
                              nil
[:c, :one, :bar]                1
[:c, :one, :baz]                2
[:c, :two, :foo]                3
[:c, :two, :bar]                4

[4] pry(main)> v[:x, :one]
=> 
#<Daru::Vector:17572840 @name = nil @size = 2 >
          nil
[:bar]      1
[:baz]      2

But :x is not a valid index.

Support categorical variables

Currently the only types of data supported by daru are numeric (covering all sorts of Numeric data: ordinal, nominal, categorical, etc.) and object (Strings, Symbols, and so on).

Having a separate data type for categorical data would be very helpful for statistical computations. Maybe something similar to pandas.Categorical.

'row, vector' Nomenclature is confusing

In several places you refer to 'rows' and 'vectors', and by default you seem to think of a data frame as organized in columns. I find this confusing, especially since both a row and a column in a DF are of class Daru::Vector. It would be nice if this nomenclature was agnostic and treated rows and columns equally. I.e., #filter currently defaults to columns, and #filter(:row) is how you filter across rows. I would vastly prefer either #filter_rows and #filter_columns or #filter(:col) and #filter(:row), etc. I don't see any particularly good reason to assume that rows or columns are primary, and if it comes to that I tend to assume row => record, so that by default I would filter across rows.

Is dup working properly?

Given

v1 = Daru::Vector.new(1.upto(5).to_a)
df1 = Daru::DataFrame.new(v1: v1)

I would expect that dup would create a new dataframe object with new vectors. However, if I recode the vector in df1, the vector in df2 is changed too.

df2 = df1.dup

df1[:v1].recode! { |v| v * 2 }
df2[:v1] == df1[:v1]

Returns true when I would expect it to return false.

I can get the result I want if I dup each vector individually. But this is cumbersome when I've got a larger dataframe.

df2 = df1.dup
df2[:v1] = df1[:v1].dup

df1[:v1].recode! { |v| v * 2 }
df2[:v1] == df1[:v1]

Returns false as expected.

Sorting a DataFrame and then creating a pivot_table for it produces wrong aggregate values

$> df = Daru::DataFrame.new({
    a: [1,2,3,4,5,6]*100,  
    b: ['a','b','c','d','e','f']*100,  
    c: [11,22,33,44,55,66]*100,  
    d: ['r']*600  
})    
=> 
#<Daru::DataFrame:54383340 @name = d132569b-0e1e-4516-ae37-fc41c07bde58 @size = 600>
                    a          b          c          d 
         0          1          a         11          r 
         1          2          b         22          r 
         2          3          c         33          r 
         3          4          d         44          r 
         4          5          e         55          r 
         5          6          f         66          r 
         6          1          a         11          r 
         7          2          b         22          r 
         8          3          c         33          r 
         9          4          d         44          r 
        10          5          e         55          r 
        11          6          f         66          r 
        12          1          a         11          r 
        13          2          b         22          r 
        14          3          c         33          r 
       ...        ...        ...        ...        ... 

$> a = df.dup.sort([:b]).dup.pivot_table(index: [:b, :d], values: [:a, :c], agg: :sum)
=> 
#<Daru::DataFrame:56265240 @name = f7fd6275-b6a4-4ad9-b456-7f2aad0fc3ac @size = 6>
                    a          c 
["a", "r"]        346       3806 
["b", "r"]        350       3850 
["c", "r"]        354       3894 
["d", "r"]        346       3806 
["e", "r"]        350       3850 
["f", "r"]        354       3894 

$> b = df.dup.pivot_table(index: [:b, :d], values: [:a, :c], agg: :sum)
=> 
#<Daru::DataFrame:57262440 @name = 49eb6d38-ed2f-42f0-a764-d026d94eb18c @size = 6>
                    a          c 
["a", "r"]        100       1100 
["b", "r"]        200       2200 
["c", "r"]        300       3300 
["d", "r"]        400       4400 
["e", "r"]        500       5500 
["f", "r"]        600       6600 

I'm still chewing on what the cause is. I think the problem is related to the difference between how GroupBy#get_group and GroupBy#select_groups_from work. GroupBy#select_groups_from produces incorrect dataframes that look like what pivot_table is making use of. #get_group produces correct results.

The code below is from a pry session triggered by a binding.pry inside GroupBy#apply_method.

$> self
=> #<Daru::Core::GroupBy:0x00000003d563c8
 @context=

#<Daru::DataFrame:32157200 @name = e15aa9fb-fbcf-45cd-ae01-44b8baf9fd0a @size = 600>
                    a          b          c          d 
       324          1          a         11          r 
       126          1          a         11          r 
       546          1          a         11          r 
       474          1          a         11          r 
       114          1          a         11          r 
       432          1          a         11          r 
       120          1          a         11          r 
       330          1          a         11          r 
       132          1          a         11          r 
       510          1          a         11          r 
       312          1          a         11          r 
       306          1          a         11          r 
       150          1          a         11          r 
       138          1          a         11          r 
       318          1          a         11          r 
       ...        ...        ...        ...        ... 
,
 @groups=
  {["a", "r"]=>
    [0,
     1,
     2,
     3,
     4,
     5,
     6,
     7,
     8,
     9,
     10,
     11,
     12,
     13,
     14,
$> first
=> 
#<Daru::DataFrame:26464220 @name = 9900ba68-a23c-454b-a821-4d04fab10eec @size = 6>
                    a          b          c          d 
         0          1          a         11          r 
       100          5          e         55          r 
       200          3          c         33          r 
       300          1          a         11          r 
       400          5          e         55          r 
       500          3          c         33          r 

$> get_group(['a', 'r'])
=> 
#<Daru::DataFrame:26321720 @name = f5fd645b-7769-4c75-a4b8-60c542d62a3e @size = 100>
                    a          b          c          d 
         0          1          a         11          r 
         1          1          a         11          r 
         2          1          a         11          r 
         3          1          a         11          r 
         4          1          a         11          r 
         5          1          a         11          r 
         6          1          a         11          r 
         7          1          a         11          r 
         8          1          a         11          r 
         9          1          a         11          r 
        10          1          a         11          r 
        11          1          a         11          r 
        12          1          a         11          r 
        13          1          a         11          r 
        14          1          a         11          r 
       ...        ...        ...        ...        ... 

Looking into it some more.

EDIT

I think I've pinpointed the issue to GroupBy#apply_method, see sebastianiorga@58b9dd1.

Basically, the indices that GroupBy stores in the values of @groups are positional. For dataframes with the default numerical index of 0 -> numRows, access behaves the same whether you use a position in the vector or a key in the vector (since the keys are 0 -> numRows as well). If you sort the dataframe, the index keys no longer run 0 -> numRows and things break down.

At least I think that's what's going on. The tests are still green on the fork and I'm now seeing correct behavior for pivot tables on sorted data frames, so I'll go with it.
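
A minimal illustration of why positional bookkeeping breaks after a sort (hedged sketch; the column name :a is arbitrary): sorting keeps the original index labels, so labels and positions no longer coincide.

require 'daru'

df = Daru::DataFrame.new(a: [3, 1, 2])
sorted = df.sort([:a])
sorted.index.to_a # => [1, 2, 0] -- the labels no longer run 0, 1, 2 in order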

Daru::Index creates buggy Index

Daru::Index creates a buggy index, as it accepts a Daru::Index object as one of the elements in the array.
Example:

In [47]: x = Daru::Index.new([1,2])
   ....:
Out[47]: #<Daru::Index:0x007faad30c4e98 @relation_hash={1=>0, 2=>1}, @keys=[1, 2], @size=2>

In [48]: y = Daru::Index.new([x,3,4])
   ....:
Out[48]: #<Daru::Index:0x007faad3066870 @relation_hash={#<Daru::Index:0x007faad30c4e98 @relation_hash={1=>0, 2=>1}, @keys=[1, 2], @size=2>=>0, 3=>1, 4=>2}, @keys=[#<Daru::Index:0x007faad30c4e98 @relation_hash={1=>0, 2=>1}, @keys=[1, 2], @size=2>, 3, 4], @size=3>

In [49]: v = Daru::Vector.new(['one','two','three'], name: "bug", index: y)
   ....:
Out[49]:

#<Daru::Vector:70185830660700 @name = bug @size = 3 >
                                      bug
#<Daru::Index:0x007f <table><tr><th colsp
                   3                  two
                   4                three

I think an error should be raised and the Index should not be created.
What do you say? @v0dro

Daru::Core::GroupBy#count broken

> frame = Daru::DataFrame.new [{label: 'a'}, {label: 'a'}, {label: 'b'}] 
> grouped = frame.group_by(:label).count
=> ArgumentError: comparison of NilClass with 36 failed

I may be using count incorrectly (there are no rdocs for the method), but this seems like a bug either way.

Sorting a sorted dataframe produces the wrong result

Applying a sort multiple times to a dataframe gives different results.

df = Daru::DataFrame.new({ a: [3,1,2] })
puts df.sort!([:a]).to_hash
puts df.sort!([:a]).to_hash

output:

{:a=>
#<Daru::Vector:70293990504740 @name = a @size = 3 >
      a
  1   1
  2   2
  0   3
}
{:a=>
#<Daru::Vector:70293990504740 @name = a @size = 3 >
      a
  2   2
  0   3
  1   1
}

Daru::DataFrame methods involving Daru::Vector extremely slow

I benchmarked DataFrame initialization from a Hash of Arrays

df = Daru::DataFrame.new({a: (1..10000).to_a, 
                          b: (1..10000).to_a,
                          c: (1..10000).to_a})

against an initialization from a Hash of Daru::Vectors

df = Daru::DataFrame.new({a: Daru::Vector.new((1..10000).to_a), 
                          b: Daru::Vector.new((1..10000).to_a),
                          c: Daru::Vector.new((1..10000).to_a)})

Then I appended a new column to the DataFrame, as an Array input and as a Daru::Vector input; i.e. df[:d] = Daru::Vector.new((1..10000).to_a) compared to df[:e] = (1..10000).to_a.

In both cases, the method involving a Daru::Vector is extremely slow. The above example with 10,000 rows and 3 columns takes over 8 minutes on my Intel Core i5 laptop with 12 GB RAM.

Here are the results:

*** Initialize a Daru::DataFrame ***
Using Array: 49.448592 milliseconds
Using Daru::Vector: 372070.893382 milliseconds
*** Append a new column ***
Using Daru::Vector: 123282.889606 milliseconds
Using Array: 12.778248000000001 milliseconds

(I am currently using the daru gem, because gsl-nmatrix fails to get installed on my machine for some reason. So, I'm sorry if this is already fixed in the most recent development version.)
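
A minimal sketch of how such a comparison can be benchmarked (assumed setup mirroring the report; reduce n if the Vector case takes too long on the affected version):

require 'daru'
require 'benchmark'

n = 10_000
arrays  = { a: (1..n).to_a, b: (1..n).to_a, c: (1..n).to_a }
vectors = Hash[arrays.map { |k, ary| [k, Daru::Vector.new(ary)] }]

Benchmark.bm(24) do |x|
  x.report('init from Arrays')        { Daru::DataFrame.new(arrays) }
  x.report('init from Daru::Vectors') { Daru::DataFrame.new(vectors) }
end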

DataFrame#sort mutates its object

df = Daru::DataFrame.new({
    a: [1,2,3,4,5,6]*100,
    b: ['a','b','c','d','e','f']*100,
    c: [11,22,33,44,55,66]*100,
    d: ['r']*600
})
=>
#<Daru::DataFrame:45513140 @name = 2703c737-8676-4450-9f75-5be3cb1773c1 @size = 600>
                    a          b          c          d
         0          1          a         11          r
         1          2          b         22          r
         2          3          c         33          r
         3          4          d         44          r
         4          5          e         55          r
         5          6          f         66          r
         6          1          a         11          r
         7          2          b         22          r
         8          3          c         33          r
         9          4          d         44          r
        10          5          e         55          r
        11          6          f         66          r
        12          1          a         11          r
        13          2          b         22          r
        14          3          c         33          r
       ...        ...        ...        ...        ...

df.sort([:b])
=>
#<Daru::DataFrame:65307920 @name = 2703c737-8676-4450-9f75-5be3cb1773c1 @size = 600>
                    a          b          c          d
       324          1          a         11          r
       126          1          a         11          r
       546          1          a         11          r
       474          1          a         11          r
       114          1          a         11          r
       432          1          a         11          r
       120          1          a         11          r
       330          1          a         11          r
       132          1          a         11          r
       510          1          a         11          r
       312          1          a         11          r
       306          1          a         11          r
       150          1          a         11          r
       138          1          a         11          r
       318          1          a         11          r
       ...        ...        ...        ...        ...

df
=>
#<Daru::DataFrame:45513140 @name = 2703c737-8676-4450-9f75-5be3cb1773c1 @size = 600>
                    a          b          c          d
         0          1          a         11          r
         1          1          a         11          r
         2          1          a         11          r
         3          1          a         11          r
         4          1          a         11          r
         5          1          a         11          r
         6          1          a         11          r
         7          1          a         11          r
         8          1          a         11          r
         9          1          a         11          r
        10          1          a         11          r
        11          1          a         11          r
        12          1          a         11          r
        13          1          a         11          r
        14          1          a         11          r
       ...        ...        ...        ...        ...

The problem is in DataFrame#sort: in Ruby, Object#dup only performs a shallow copy of the object, so the old object and the dupped object both reference the same @data object. That becomes a problem in DataFrame#partition (if using the quicksort algorithm), since, I think, it mutates the @data object.

The easy solution is probably something like ActiveSupport's deep_dup which, reasonably quickly, dups the object and all its child objects.
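
Until then, a generic deep-copy workaround (hedged sketch; the helper name deep_copy is hypothetical, not part of daru): rebuild the DataFrame from copied plain Arrays so the two frames share no underlying data.

require 'daru'

# Hypothetical helper: copy every column into a fresh Array before rebuilding.
def deep_copy(df)
  columns = Hash[df.vectors.to_a.map { |name| [name, df[name].to_a.dup] }]
  Daru::DataFrame.new(columns, order: df.vectors.to_a, index: df.index.to_a)
end

df  = Daru::DataFrame.new(a: [3, 1, 2])
df2 = deep_copy(df)
df.sort!([:a])
df2[:a].to_a # => [3, 1, 2] -- unaffected by the in-place sort of df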

Can't rename columns?

It seems to me that there is no way to rename columns, currently. You can access information about rownames and column names via #vectors and #index (which I think is unclear, see #19), both of which return a Daru::Index. However, there is no straightforward way to modify these values without remapping to an entirely new data frame, which seems a fairly large hole. Internally this information is stored both in @relation_hash, which is exposed but frozen, and in @keys, which is not exposed but seems redundant with @relation_hash.keys.

It seems like renaming rows or columns could be done by supporting an interface to rename keys in the Index, perhaps by keeping the @relation_hash unfrozen and exposing a []= method to remap a key.
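
For what it's worth, another issue in this tracker shows a rename path that already works today, at least for renaming all columns at once (hedged; per-column renaming would still need the Index interface described above):

require 'daru'

df = Daru::DataFrame.new([[1, 2, 3], [13, 13, 13]], order: ['a', 'b'])
df.vectors = Daru::Index.new(['x', 'y']) # renames 'a' -> 'x' and 'b' -> 'y'
df.vectors.to_a                          # => ["x", "y"]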

Vector describe function

Currently Daru has a describe function only for DataFrames and not for Vectors. A describe function can be implemented for Vectors as well; pandas, for example, implements it for both Series and DataFrames.
I have already written the code for this function, but thought it would be good to open an issue first.

Use NMatrix instead of Matrix

Hi Sameer, great job getting started on this problem.

However, one of the benefits of having a "DataFrame-like" gem is to improve performance in several other libraries that depend on it simultaneously. So NMatrix is a better choice than the pure-Ruby alternative, Matrix.

Another benefit of this approach is that you will probably uncover more bugs in NMatrix. :)

If you have any problems, please ask on the mailing list!

Daru::Core::GroupBy#apply_method sometimes chokes on nils

Using daru from the master branch on github.
Example:

$> d = Daru::DataFrame.new({ key: [:a, :b, :c, :b], val: [1, nil, nil, 2] })
=> 
#<Daru::DataFrame:42220140 @name = 2ff8be51-8af2-4a10-8ea9-248409523960 @size = 4>
                  key        val 
         0          a          1 
         1          b        nil 
         2          c        nil 
         3          b          2 

$> d.pivot_table(index: [:key])
NoMethodError: undefined method `mean' for nil:NilClass
from /home/seb/.rvm/gems/ruby-2.2.1/bundler/gems/daru-99db9f85ab6e/lib/daru/core/group_by.rb:228:in `block (2 levels) in apply_method'
[6] pry(#<#<Class:0x007fd28039b110>>)> 

Things that combine to cause the problem:

  1. Vector#[] returns a value when it receives a single integer index
  2. In GroupBy#apply_method, for this example, @groups is set to {[:a]=>[0], [:b]=>[1, 3], [:c]=>[2]}, so the call slice = vec[*indexes] sometimes passes just one value.
  3. The next line, daru/core/group_by.rb line 228, single_row << (slice.is_a?(Numeric) ? slice : slice.send(method)), presumably assumes that if slice is not Numeric it must be an Enumerable. Regardless, it explodes when it gets a single nil.

I'm not sure what the correct fix is. Some ideas:

  1. Vector#[] could always return a Vector. That's a consistency plus, but probably a usability hit since extracting the value easily is nice.
  2. Guard against nil in GroupBy#apply_method and just append nil to single_row if slice.nil?. Or something along those lines?
  3. It seems like the existing check at daru/core/group_by.rb line 228 is essentially a type check to see whether the slice is an Enumerable or not. Maybe just have it be single_row << (slice.is_a?(Enumerable) ? slice.send(method) : slice).

Just getting started with Daru, so I'm not sure if Vector#[] returning a bare nil causes issues elsewhere. I've added fix 3 to my fork and will see if it ends up breaking other things as I do more work based on pivot tables. The tests are green, for what that's worth.
