
Comments (6)

mrocklin commented on August 26, 2024

Hrm, running off of master this works fine for me.

At the same time, I don't recall any fix between the last anaconda channel release and now that would account for the difference. I suggest you try your problem from the development channel:

conda install into -c blaze

And if that doesn't work, then let's dive in more deeply.

In [1]: from into import into, discover

In [2]: d = into('sqlite:///../db.sqlite3::data_event', 'file.csv')
/home/mrocklin/tmp/file.csv:1: expected 5 columns but found 4 - extras ignored
/home/mrocklin/tmp/file.csv:4: expected 5 columns but found 1 - filling the rest with NULL
/home/mrocklin/tmp/file.csv:4: INSERT failed: NOT NULL constraint failed: data_event.Unnamed: 0

In [3]: d
Out[3]: Table('data_event', MetaData(bind=Engine(sqlite:///../db.sqlite3)), Column('Unnamed: 0', BigInteger(), table=<data_event>, nullable=False), Column('operation', Text(), table=<data_event>, nullable=False), Column('name', Text(), table=<data_event>, nullable=False), Column('aircraft', Text(), table=<data_event>, nullable=False), Column('datetime_nearest', DateTime(), table=<data_event>, nullable=False), schema=None)

In [4]: discover(d)
Out[4]: 
dshape("""var * {
  Unnamed: 0: int64,
  operation: string,
  name: string,
  aircraft: string,
  datetime_nearest: datetime
  }""")

In [6]: from blaze import Data

In [7]: d = Data('file.csv')

In [8]: d2 = into('sqlite:///../db.sqlite3::data_event2', d)

In [9]: d2
Out[9]: Table('data_event2', MetaData(bind=Engine(sqlite:///../db.sqlite3)), Column('Unnamed: 0', BigInteger(), table=<data_event2>, nullable=False), Column('operation', Text(), table=<data_event2>, nullable=False), Column('name', Text(), table=<data_event2>, nullable=False), Column('aircraft', Text(), table=<data_event2>, nullable=False), Column('datetime_nearest', DateTime(), table=<data_event2>, nullable=False), schema=None)

In [10]: discover(d2)
Out[10]: 
dshape("""var * {
  Unnamed: 0: int64,
  operation: string,
  name: string,
  aircraft: string,
  datetime_nearest: datetime
  }""")

from odo.

FRidh commented on August 26, 2024

I should note that I was trying to load the data into an already existing table. For this ticket I exported the data to a CSV file, and this inadvertently included the first column of integers (the pandas index).
Therefore, in this case, the data would not fit in the table anyway.

Nevertheless, I cannot manage to load this data into an already existing but empty table.
Loading the data into a new table works just fine.

In the table the field operation is called operation_id because it is the primary key. This, however, shouldn't affect the storing of data. At least, it worked fine in other cases, or am I making the wrong assumption here that into looks at column names and doesn't care about the order?

Judging from the error, into attaches the wrong column names to the values. Even with the data in the right layout, using

d = Data('tmp.csv', fields=['name', 'operation', 'datetime_nearest', 'aircraft'])

doesn't seem to make a difference.
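The behavior can be reproduced with plain pandas: passing names= relabels columns positionally rather than matching them by header, so values end up under the wrong labels (a minimal sketch; the file contents below are illustrative, not the actual data):

```python
import io
import pandas as pd

csv = io.StringIO(
    "operation,aircraft,name,datetime_nearest\n"
    "L14,E190,01_021_E190,2013-08-19 06:47:05\n"
)

# Overriding the header positionally, as Data(..., fields=) does:
# the new labels are applied in file order, not matched by name.
df = pd.read_csv(csv, header=0,
                 names=['name', 'operation', 'datetime_nearest', 'aircraft'])
print(df.iloc[0]['name'])  # 'L14' -- the operation value, under the wrong label
```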

In [1]: from blaze import *

In [2]: table = 'sqlite:///../db.sqlite3::data_event'

In [3]: d = Data('tmp.csv')

In [4]: t = Table(table)

In [5]: discover(d)
Out[5]: 
dshape("""var * {
  operation: string,
  aircraft: string,
  name: string,
  datetime_nearest: datetime
  }""")

In [6]: discover(t)
Out[6]: 
dshape("""var * {
  name: string,
  operation_id: string,
  datetime_nearest: datetime,
  aircraft: string
  }""")

In [7]: d.head()
Out[7]: 
  operation aircraft         name           datetime_nearest
0       L14     E190  01_021_E190 2013-08-19 06:47:05.050000
1       L14     B738  01_022_B738 2013-08-19 06:48:22.200000
2       L14     CRJ9  01_023_CRJ9 2013-08-19 06:50:04.250000
3       L14     CRJ2  01_024_CRJ2 2013-08-19 06:55:00.900000
4       L14     A333  01_025_A333        2013-08-19 06:56:47
5       L14     A320  01_026_A320 2013-08-19 06:58:52.350000
6       L14     A332  01_027_A332 2013-08-19 07:00:24.450000
7       L14     A320  01_030_A320 2013-08-19 07:12:49.700000
8       L14     RJ1H  01_034_RJ1H 2013-08-19 08:48:52.150000
9       L14     B733  01_036_B733 2013-08-19 08:53:24.700000

In [8]: t.head()
Out[8]: 
Empty DataFrame
Columns: [name, operation_id, datetime_nearest, aircraft]
Index: []

In [9]: into(table, d)
---------------------------------------------------------------------------
StatementError                            Traceback (most recent call last)
<ipython-input-9-c1c61a8491a1> in <module>()
----> 1 into(table, d)

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/blaze/interactive.py in into(a, b, **kwargs)
    258     result = compute(b, **kwargs)
    259     kwargs['dshape'] = b.dshape
--> 260     return into(a, result, **kwargs)
    261 
    262 

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/into/into.py in into_string(uri, b, **kwargs)
     72         pass
     73     a = resource(uri, **kwargs)
---> 74     return into(a, b, **kwargs)
     75 
     76 

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/into/into.py in into_object(a, b, **kwargs)
     58     except NotImplementedError:
     59         pass
---> 60     return append(a, b, **kwargs)
     61 
     62 

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/into/backends/sql.py in append_anything_to_sql_Table(t, o, **kwargs)
    234 @append.register(sa.Table, object)
    235 def append_anything_to_sql_Table(t, o, **kwargs):
--> 236     return append(t, convert(Iterator, o, **kwargs), **kwargs)
    237 
    238 

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/into/backends/sql.py in append_iterator_to_table(t, rows, **kwargs)
    227     with engine.connect() as conn:
    228         for chunk in partition_all(1000, rows):  # TODO: 1000 is hardcoded
--> 229             conn.execute(t.insert(), chunk)
    230 
    231     return t

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/engine/base.py in execute(self, object, *multiparams, **params)
    727                 type(object))
    728         else:
--> 729             return meth(self, multiparams, params)
    730 
    731     def _execute_function(self, func, multiparams, params):

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
    320 
    321     def _execute_on_connection(self, connection, multiparams, params):
--> 322         return connection._execute_clauseelement(self, multiparams, params)
    323 
    324     def unique_params(self, *optionaldict, **kwargs):

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
    824             compiled_sql,
    825             distilled_params,
--> 826             compiled_sql, distilled_params
    827         )
    828         if self._has_events or self.engine._has_events:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
    891             self._handle_dbapi_exception(e,
    892                                          util.text_type(statement), parameters,
--> 893                                          None, None)
    894 
    895         if context.compiled:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1157                 util.raise_from_cause(
   1158                     sqlalchemy_exception,
-> 1159                     exc_info
   1160                 )
   1161             else:

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
    186             exc_info = sys.exc_info()
    187         exc_type, exc_value, exc_tb = exc_info
--> 188         reraise(type(exception), exception, tb=exc_tb, cause=exc_value)
    189 else:
    190     exec("def reraise(tp, value, tb=None, cause=None):\n"

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    179             value.__cause__ = cause
    180         if value.__traceback__ is not tb:
--> 181             raise value.with_traceback(tb)
    182         raise value
    183 

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
    887                 conn = self._revalidate_connection()
    888 
--> 889             context = constructor(dialect, self, conn, *args)
    890         except Exception as e:
    891             self._handle_dbapi_exception(e,

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/engine/default.py in _init_compiled(cls, dialect, connection, dbapi_connection, compiled, parameters)
    571                 for key in self.compiled.positiontup:
    572                     if key in processors:
--> 573                         param.append(processors[key](compiled_params[key]))
    574                     else:
    575                         param.append(compiled_params[key])

/home/rifr/anaconda3/envs/sonair/lib/python3.4/site-packages/sqlalchemy/dialects/sqlite/base.py in process(value)
    370                 }
    371             else:
--> 372                 raise TypeError("SQLite DateTime type only accepts Python "
    373                                 "datetime and date objects as input.")
    374         return process

StatementError: SQLite DateTime type only accepts Python datetime and date objects as input. (original cause: TypeError: SQLite DateTime type only accepts Python datetime and date objects as input.) 'INSERT INTO data_event (name, operation_id, datetime_nearest, aircraft) VALUES (?, ?, ?, ?)' ({'aircraft': datetime.datetime(2013, 8, 19, 6, 47, 5, 50000), 'name': 'L14', 'datetime_nearest': '01_021_E190', 'operation_id': 'E190'}, {'aircraft': datetime.datetime(2013, 8, 19, 6, 48, 22, 200000), 'name': 'L14', 'datetime_nearest': '01_022_B738', 'operation_id': 'B738'}, {'aircraft': datetime.datetime(2013, 8, 19, 6, 50, 4, 250000), 'name': 'L14', 'datetime_nearest': '01_023_CRJ9', 'operation_id': 'CRJ9'}, {'aircraft': datetime.datetime(2013, 8, 19, 6, 55, 0, 900000), 'name': 'L14', 'datetime_nearest': '01_024_CRJ2', 'operation_id': 'CRJ2'}, {'aircraft': datetime.datetime(2013, 8, 19, 6, 56, 47), 'name': 'L14', 'datetime_nearest': '01_025_A333', 'operation_id': 'A333'}, {'aircraft': datetime.datetime(2013, 8, 19, 6, 58, 52, 350000), 'name': 'L14', 'datetime_nearest': '01_026_A320', 'operation_id': 'A320'}, {'aircraft': datetime.datetime(2013, 8, 19, 7, 0, 24, 450000), 'name': 'L14', 'datetime_nearest': '01_027_A332', 'operation_id': 'A332'}, {'aircraft': datetime.datetime(2013, 8, 19, 7, 12, 49, 700000), 'name': 'L14', 'datetime_nearest': '01_030_A320', 'operation_id': 'A320'}  ... displaying 10 of 1000 total bound parameter sets ...  {'aircraft': datetime.datetime(2013, 8, 23, 14, 38, 14, 300000), 'name': 'L14', 'datetime_nearest': '05_206_B737', 'operation_id': 'B737'}, {'aircraft': datetime.datetime(2013, 8, 23, 14, 39, 51, 550000), 'name': 'L14', 'datetime_nearest': '05_207_A320', 'operation_id': 'A320'})


mrocklin commented on August 26, 2024

At least, it worked fine in other cases, or am I making the wrong assumption here that into looks at column names and doesn't care about the order?

When you specify fields= you're overriding the column names found in the CSV file, but not re-ordering them; this causes the mismatch.

Loading into SQL is order-independent because SQL binds values by column name rather than position. Looking at your two datashapes, one side has a column named operation and the other operation_id. You could use the same Data(..., fields=) trick and supply the new name, but this time in an order that matches the CSV file:

In [1]: from blaze import *

In [2]: d = Data('file.csv', fields=['name', 'operation_id', 'datetime_nearest', 'aircraft'])

Or you could use relabel:

In [2]: d = Data('file.csv')

In [3]: d
Out[3]: 
  operation         name aircraft           datetime_nearest
0       S16  24_002_A320     A320 2014-05-15 07:46:15.100000
1       S16  24_005_A321     A321 2014-05-15 08:08:23.200000

In [4]: d.relabel(operation='operation_id')
Out[4]: 
  operation_id         name aircraft           datetime_nearest
0          S16  24_002_A320     A320 2014-05-15 07:46:15.100000
1          S16  24_005_A321     A321 2014-05-15 08:08:23.200000

In both cases your data now matches your column names, which should match what you find in your database.

If for some reason the order does need to be changed, the standard way is to index with a list of column names:

In [5]: d.relabel(operation='operation_id')[['name', 'operation_id', 'datetime_nearest', 'aircraft']]
Out[5]: 
          name operation_id           datetime_nearest aircraft
0  24_002_A320          S16 2014-05-15 07:46:15.100000     A320
1  24_005_A321          S16 2014-05-15 08:08:23.200000     A321


FRidh commented on August 26, 2024

Thank you Matthew. I had some other data where I didn't append the _id to the label and in those cases loading the data in an already existing table worked just fine.

I had difficulties with another table as well. All the labels were correct, including the _id part, but still the data wouldn't go into the database. Then, after changing the order of the fields to match the order in the database as shown by discover(table), it did work. I should note that in this case the source had a date and time represented by a string (according to discover(data)), while in the database this field was of type datetime.

Therefore, if I am correct, into first tries to match by label and type, and if that doesn't work, falls back to the order instead.


mrocklin commented on August 26, 2024

When using Python iterators, most of this logic is actually handled by SQLAlchemy, not into. The into project turns your data into a sequence of dictionaries of the form {'label1': value, 'label2': value, ...} for your various field names and then sends this to SQLAlchemy to transfer to the database.

(relevant into code here https://github.com/ContinuumIO/into/blob/master/into/backends/sql.py#L213-L234)
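The hand-off works roughly like this: each row becomes a dict keyed by field name, and the driver binds values by key, so the order of keys inside each dict is irrelevant, but the keys themselves must match the statement's column names. A sketch using the stdlib sqlite3 driver rather than SQLAlchemy (the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_event (name TEXT, operation_id TEXT)")

# Rows as dicts, the shape into produces; key order does not matter,
# but every key must match a column named in the INSERT statement.
rows = [
    {'operation_id': 'L14', 'name': '01_021_E190'},
    {'name': '01_022_B738', 'operation_id': 'L14'},
]
conn.executemany(
    "INSERT INTO data_event (name, operation_id) "
    "VALUES (:name, :operation_id)", rows)
print(conn.execute("SELECT name FROM data_event").fetchall())
# [('01_021_E190',), ('01_022_B738',)]
```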

It looks like we are using the target (sql database) to determine field names which is why you had to change the order. This should probably be fixed so that it uses the source datashape rather than the target's.

Regarding the coercion between datetime types and strings, SQLAlchemy is probably handling this. They have their own type system and coercion logic, on which into relies heavily.
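This matches the traceback above: the SQLite dialect refuses anything but Python datetime/date objects for a DateTime column, so string timestamps from a CSV must be parsed before insert. A minimal stdlib sketch of the kind of coercion involved (the helper name and format strings are assumptions based on the data shown, not odo's actual code):

```python
from datetime import datetime

def coerce_datetime(value):
    # SQLite's DateTime type only accepts datetime/date objects,
    # so string timestamps must be parsed before insert.
    if isinstance(value, datetime):
        return value
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise TypeError("SQLite DateTime type only accepts datetime values")

print(coerce_datetime("2013-08-19 06:47:05.050000"))
# 2013-08-19 06:47:05.050000
```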


mrocklin commented on August 26, 2024

Improved handling of mismatched column names in #21.

Two major changes:

  1. Column-name order is taken from the source if possible
  2. We give more informative errors before we send data on to SQLAlchemy
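A sketch of the kind of pre-flight check the second change describes, comparing source fields against target columns before any insert (the function name and message are illustrative, not odo's actual code):

```python
def validate_columns(source_fields, table_columns):
    # Fail early with a readable message instead of letting the
    # database driver raise a confusing type error mid-insert.
    missing = set(table_columns) - set(source_fields)
    extra = set(source_fields) - set(table_columns)
    if missing or extra:
        raise ValueError(
            "Column mismatch: missing from source %s, unexpected %s"
            % (sorted(missing), sorted(extra)))

try:
    validate_columns(
        ['operation', 'aircraft', 'name', 'datetime_nearest'],
        ['name', 'operation_id', 'datetime_nearest', 'aircraft'])
except ValueError as e:
    print(e)  # reports operation_id missing and operation unexpected
```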
