
doltpy's Introduction

Current State - Deprecated

DoltPy was created in an era where Dolt was a command-line tool. We've worked very hard to make Dolt a MySQL compatible database. To use Dolt with Python, start a Dolt SQL server using dolt sql-server and connect with any Python MySQL client. Use the exposed Dolt stored procedures or system tables to access version control functionality. Popular libraries like Pandas all support MySQL connectivity.

Dolt's MySQL client support works in any language, not just Python, so it is more efficient for us to focus on that interface.
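A minimal sketch of the recommended workflow: connect to a running `dolt sql-server` with any MySQL client and drive version control through SQL. The host/port/user/database values and the `commit_all` helper below are illustrative assumptions, not part of any published API.

```python
# Dolt exposes version control as system tables and stored procedures:
LOG_QUERY = "SELECT commit_hash, committer, message FROM dolt_log"
COMMIT_CALL = "CALL DOLT_COMMIT('-Am', %s)"

def commit_all(message, host="127.0.0.1", port=3306, user="root", database="my_db"):
    """Stage and commit all working-set changes through the SQL interface."""
    import mysql.connector  # any MySQL driver works; imported lazily here
    conn = mysql.connector.connect(host=host, port=port, user=user, database=database)
    try:
        cursor = conn.cursor()
        cursor.execute(COMMIT_CALL, (message,))
        return cursor.fetchall()  # the row contains the new commit hash
    finally:
        conn.close()
```

Libraries like pandas and SQLAlchemy can use the same connection, since the server speaks the MySQL wire protocol.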

DoltPy

This is DoltPy, the Python API for Dolt. Python is the language of choice for data science and data engineering, and thus we thought it would be wise to publish an API for building automated workflows on top of Dolt and DoltHub, a collaboration platform for Dolt databases.

Installation

You need to install Dolt, which is documented here. It's easy for *nix users:

$ sudo bash -c 'curl -L https://github.com/liquidata-inc/dolt/releases/latest/download/install.sh | bash'

We also distribute Dolt as a Homebrew formula:

$ brew install dolt

Finally, for Windows users our release page has .zip and .msi files.

Once Dolt is installed you can install Doltpy using pip:

$ pip install doltpy

Overview

Doltpy is broken up into modules.

doltpy.cli

This is the most important module. It wraps the Dolt command-line interface (CLI) exposed by the Go binary. The CLI is exposed more or less exactly as implemented, returning wrapper objects where appropriate.

Its implementation has moved to a separate repository here
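A minimal sketch of what "wrapping the CLI" means in practice: each method builds a `dolt` argument list and hands it to a subprocess. The helper names here (`dolt_args`, `run_dolt`) are illustrative, not doltpy's actual internals.

```python
import subprocess

def dolt_args(command, *args, **flags):
    """Build an argv list like ['dolt', 'branch', '--move', 'old', 'new']."""
    argv = ["dolt", command]
    for name, value in flags.items():
        if value is True:
            argv.append("--" + name.replace("_", "-"))
        elif value not in (None, False):
            argv.extend(["--" + name.replace("_", "-"), str(value)])
    argv.extend(args)
    return argv

def run_dolt(command, *args, **flags):
    """Execute dolt and return captured stdout, raising on failure."""
    result = subprocess.run(dolt_args(command, *args, **flags),
                            capture_output=True, text=True, check=True)
    return result.stdout
```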

doltpy.cli.read and doltpy.cli.write

These modules provide basic interfaces for reading and writing a variety of tabular data formats, including:

  • CSV files
  • pandas.DataFrame
  • columns, that is dictionaries of lists, i.e. {'col': [...vals...], ...}
  • rows, that is lists of dictionaries, i.e. [{'col': val, ...}, ...]
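For reference, the two in-memory layouts can be converted into each other with a few lines of plain Python (helper names are ours, not doltpy's):

```python
def columns_to_rows(columns):
    """{'col': [v1, v2]} -> [{'col': v1}, {'col': v2}]"""
    names = list(columns)
    return [dict(zip(names, vals)) for vals in zip(*columns.values())]

def rows_to_columns(rows):
    """[{'col': v1}, {'col': v2}] -> {'col': [v1, v2]}"""
    return {name: [row[name] for row in rows] for name in rows[0]} if rows else {}
```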

doltpy.sql

This module provides tools for interacting with Dolt via a Python based SQL connector. The most important class is DoltSQLContext, which has concrete subclasses DoltSQLServerContext and DoltSQLEngineContext. DoltSQLServerContext is for users that want to write Python scripts that use and manage the Dolt SQL Server instance as a child process. DoltSQLEngineContext is for users who want to interact with a remote Dolt SQL Server.

These classes have equivalents of the read and write functions in doltpy.cli.read and doltpy.cli.write for writing CSV files, pandas.DataFrame objects, rows, and columns.

doltpy.sql.sql_sync

This package provides tools for syncing data between Dolt and other relational databases. Currently MySQL, Postgres, and Oracle are supported. You can find a more detailed description of how to use the SQL Sync tools here.

doltpy.etl

This module provides a set of tools for scripting ETL/ELT workflows. At Liquidata we use it internally to push datasets onto DoltHub.

More Information

As alluded to above, you can find a more detailed description of Doltpy here.

doltpy's People

Contributors

andy-wm-arthur, captainstabs, coffeegoddd, dependabot[bot], itdependsnetworks, jycor, jzcruiser, lucab, max-hoffman, oscarbatori, tbantle22, timsehn, vinairachakonda


doltpy's Issues

`write_pandas` silently failing

Problem

This is happening in the context of the hospitals bounty.

write_pandas does not write to the database, but doesn't return an error message either.

Steps to reproduce

I tried to add the following data to the Dolt database. The data is in the form of a DataFrame called

df_cpt_hcpcs_no_dupes

which has had all of the codes removed that already exist in the database. (None of the codes in the DataFrame are present in the dolt database.)

The cell executes without a problem but we can still see that the code does not appear in the database when it is read back.


Things I have tried

  1. checking the database in the terminal. I can confirm that adding to the hospital table is working, because I am able to see my changes, but adding to the codes table is not. Or at least, it is failing in a way that is hard to detect.
  2. checking the datatypes of the elements being added to the database (all appear to match)

Mac and Windows test runs

We should run our test suite on Mac and Windows boxes to validate that it works across possible usage platforms. Currently we run on Ubuntu, but Python can break on OS-specific features (not super frequently, but I've seen it).

importing OrderedDict from typing breaks doltpy on Python 3.6

This is because doltpy/types/dolt.py imports OrderedDict from typing, which was only added in Python 3.7. I was able to get this to work locally by monkey-patching / editing that import out of the file.

Admittedly, 3.6 is a somewhat older version, but it's not something as ancient as Python 2, and Python 3.6 is still the default on Ubuntu 18.04.
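A guarded import would keep 3.6 working without monkey-patching; a sketch:

```python
try:
    from typing import OrderedDict  # available from Python 3.7 onward
except ImportError:
    # On 3.6, fall back to the runtime class; it still works in
    # annotations, just without subscripted type parameters.
    from collections import OrderedDict
```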

Untested functions in Doltpy

There are a number of untested functions in doltpy.core that were introduced in 1.0.0 to match the CLI. A list of them:

  • reset: no testing, hard and soft modes need to be tested
  • diff: no explicit testing, not used in testing, currently just prints output, we may want to rethink this to use some kind of objects or bind it to queryable diffs and populate the diff in a DataFrame type structure
  • blame: no explicit testing, not used in testing, currently just prints output, like diff might benefit from being stored in a programmatically accessible data structure
  • push: not tested, and actually missing a host of more esoteric options that exist in the CLI
  • pull: same as push
  • fetch: same as push and pull
  • clone: again, untested
  • creds_*: untested
  • config: untested
  • schema_*: no testing, and not clear what the interface should be given that it can do a number of things
  • table_*: no testing, and not clear what the interface should be given that it can do a number of things

Incomplete testing:

  • commit: used in testing but date and allow_empty not tested
  • sql: basic query execution is tested but none of the other modes (result_format, etc.)
  • sql_server: most basic execution of query tested but not other modes
  • branch: basic branch creation is tested, but force, start_point etc., are not.
  • checkout: used in basic testing of branch, but start_point not tested
  • ls: no testing of system and all switches

Others that are used in tests but not explicitly tested:

  • init: no explicit tests but relied on by other tests
  • add: no explicit testing, but used everywhere in tests

I think we can just use this ticket to hash out a plan for testing (what needs to be tested and the priorities) and interface design, as there are a few functions that would benefit from a programmatic interface rather than just printing a string.

Clean up server and connection management in Doltpy

Doltpy makes a number of bad design decisions with regard to interacting with the Dolt server:

  • it fails to clean up the server child process in the event of some failure, leaving zombie processes lurking around (that have caused confusing bugs)
  • before interacting with the server, it checks whether the server is running by looking for a process object; if the server was started from the shell this check fails, Doltpy tries to fire up its own server, and that fails because the port is in use
  • it requires the server to be running on port 3306
  • the connection management is handled in the library rather than exposed to the user, meaning the connection is restricted to the assumptions Doltpy makes (database name, port, host, etc.)

Several enhancements need to be made:

  • expose connection management to users where functions require a connection
  • expose more configurations from the connection to the get_connection helper function
  • properly clean up when sql_server is called so zombie processes are not lying around
  • augment documentation to reflect the changes
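The server-detection problem above could be fixed by probing the port directly instead of checking for a child-process handle; a sketch (names illustrative):

```python
import socket

def is_port_in_use(port, host="127.0.0.1"):
    """True if something is listening on host:port, however it was started."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0

# Demonstration: bind a throwaway listener and probe it.
_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
_listener.bind(("127.0.0.1", 0))  # OS picks a free port
_listener.listen(1)
_port = _listener.getsockname()[1]
PORT_WAS_IN_USE = is_port_in_use(_port)
_listener.close()
```

This detects a server started from the shell just as well as one started by the library, which a process-object check cannot.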

pydoc3 no worky

timsehn$ pydoc3 doltpy | cat
Help on package doltpy:

NAME
    doltpy

PACKAGE CONTENTS
    core (package)
    etl (package)

FILE
    /usr/local/lib/python3.7/site-packages/doltpy/__init__.py


Database error

I am receiving a database error when calling the sync_to_dolt function from doltpy.sql.sync, where it almost looks like the database name is null, but I don't seem to have the option of setting it in the ServerConfiguration instance or Dolt instance.

  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\cursor.py", line 569, in execute     
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
mysql.connector.errors.DatabaseError: 1105 (HY000): syntax error at position 25 near 'FROM'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 16, in <module>
    main()
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 12, in main
    args.func(args)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 137, in sync
    database.sync(config.dolt_config)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 58, in sync
    sync_to_dolt(reader, writer, self.mappings)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 37, in sync_to_dolt
    _sync_helper(source_reader, target_writer, table_map)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 66, in _sync_helper
    target_writer(remapped)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\doltpy\sql\sync\dolt.py", line 35, in inner
    metadata.reflect()
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\sql\schema.py", line 4670, in reflect     
    available = util.OrderedSet(insp.get_table_names(schema))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\reflection.py", line 266, in get_table_names
    return self.dialect.get_table_names(
  File "<string>", line 2, in get_table_names
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\reflection.py", line 55, in cache  
    ret = fn(self, con, *args, **kw)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\dialects\mysql\base.py", line 2956, in get_table_names
    rp = connection.exec_driver_sql(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1575, in exec_driver_sql
    return self._exec_driver_sql(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1484, in _exec_driver_sql
    ret = self._execute_context(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1748, in _execute_context
    self._handle_dbapi_exception(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\util\compat.py", line 198, in raise_      
    raise exception
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\cursor.py", line 569, in execute     
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
sqlalchemy.exc.DatabaseError: (mysql.connector.errors.DatabaseError) 1105 (HY000): syntax error at position 25 near 'FROM'
[SQL: SHOW FULL TABLES FROM ``]
(Background on this error at: http://sqlalche.me/e/14/4xp6)

doltpy version 2.0.0 doesn't include core

If you install doltpy via pip it will download version 2, which doesn't have the doltpy.core module.

I had to install doltpy version 1.1.10 in order to get the core module

Fix broken CI

Currently the GitHub Action running tests fails for opaque reasons, since the tests pass locally; see here for examples. The most probable cause seems to be some strange MySQL connector/Docker interaction, though the root cause is not clear. Furthermore, the failures are not deterministic, in the sense that they don't consistently produce the same failure message.

let get_df_table_writer infer primary keys from df.index

The function get_df_table_writer expects a list of primary keys which are columns of the data frame. I, however, have a MultiIndex whose names should be used for the compound primary key; this is very much how df.to_sql works as well. But I need a way to "REPLACE INTO" instead of inserting. I had hoped the get_df_table_writer function could do that.
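Both requests can be sketched in a few lines; `replace_into_sql` and `primary_keys_from_index` are hypothetical helpers, not doltpy API:

```python
def replace_into_sql(table, columns):
    """Build a parameterized REPLACE INTO statement for MySQL/Dolt."""
    cols = ", ".join("`%s`" % c for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return "REPLACE INTO `{}` ({}) VALUES ({})".format(table, cols, placeholders)

def primary_keys_from_index(index_names):
    """Infer primary keys from df.index.names, ignoring unnamed levels."""
    return [name for name in index_names if name is not None]
```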

unknown error: row read wait bigger than connection timeout

More complicated queries result in a timeout:

engine.execute("SELECT o.symbol, coalesce(max(y.eod), '1900-01-06') - INTERVAL 5 DAY FROM optionable o LEFT OUTER JOIN yahoo y on y.symbol = o.symbol GROUP BY o.symbol")

results in:

Traceback (most recent call last):
  File ".venv/lib/python3.7/site-packages/mysql/connector/connection_cext.py", line 489, in cmd_query
    raw_as_string=raw_as_string)
_mysql_connector.MySQLInterfaceError: unknown error: row read wait bigger than connection timeout

Deleting branch reports error even when it succeeds

repo.branch("foo")
# branch foo now exists
repo.branch("foo", delete=True)
# branch foo is deleted

However, doltpy reports:

doltpy.core.dolt.DoltException: (['dolt', 'branch', '--delete', 'foo', 'f', 'o', 'o'], b'', b"fatal: branch 'foo' not found\n", 1)
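The trailing 'f', 'o', 'o' arguments in the reported command are the signature of a Python string being extended into an argv list character by character; a minimal reproduction (not doltpy's actual code):

```python
branch = "foo"

buggy = ["dolt", "branch", "--delete", branch]
buggy.extend(branch)        # bug: a str is iterable, so it splats into 'f', 'o', 'o'

fixed = ["dolt", "branch", "--delete"]
fixed.extend([branch])      # fix: wrap the single name in a list
```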

Tests should run via setuptools (or some other way)

Currently we maintain a requirements.txt file for installing our requirements before running tests, yet use setup.py for configuration management. We want a unified dependency-management mechanism that allows us to run the tests from scratch (including installing dependencies).

README woefully out of date

I think we should focus on doltpy.core over doltpy.etl to get people eased into the interface. At the start, it's more likely that people will come for simple shit like:

  1. I want to read a table into a dataframe
  2. I want to import a CSV
  3. I want to run a SQL query and get a dictionary of the results

It's much less likely that they'll want a full service importer that they can integrate with Airflow easily. We should ease them up to that.

DB error when empty commit

The sync_to_dolt function doesn't catch empty commit errors properly. Also, if there aren't any changes to commit, why is the function trying to commit them?

Traceback (most recent call last):
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\cursor.py", line 569, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
mysql.connector.errors.DatabaseError: 1105 (HY000): Cannot commit an empty commit. See the --allow-empty if you want to.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 16, in <module>
    main()
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 12, in main
    args.func(args)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 169, in sync
    database.sync_data(config.dolt_config)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 78, in sync_data
    sync_to_dolt(reader, writer, self.mappings)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 37, in sync_to_dolt
    _sync_helper(source_reader, target_writer, table_map)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 66, in _sync_helper
    target_writer(remapped)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sync\dolt.py", line 45, in inner
    return dsc.commit_tables(message or "Executed SQL sync", tables, True)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sql.py", line 103, in commit_tables
    result = [dict(row) for row in conn.execute(f"SELECT DOLT_COMMIT({dolt_commit_args})")]
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1185, in execute
    return self._exec_driver_sql(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1484, in _exec_driver_sql
    ret = self._execute_context(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1748, in _execute_context
    self._handle_dbapi_exception(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\util\compat.py", line 198, in raise_
    raise exception
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\cursor.py", line 569, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
sqlalchemy.exc.DatabaseError: (mysql.connector.errors.DatabaseError) 1105 (HY000): Cannot commit an empty commit. See the --allow-empty if you want to.
[SQL: SELECT DOLT_COMMIT('-m', 'Executed SQL sync')]
(Background on this error at: http://sqlalche.me/e/14/4xp6)

Expose logging configurations to users

Currently Doltpy implements logging in a suboptimal way: each file has a logging object with its name passed. This means the library provides no way to configure the log level without importing the logger and manually setting the level (it's not that bad, but it requires knowing about the variable, and it's not documented).

Instead, we should implement a global log level variable for doltpy.core and expose it in the user facing API with a sensible default. Loggers should then be instantiated from a shared factory method that uses this shared configuration. We can also use this to enable library users to specify handlers etc.

More interface options for Dolt.import_df

I think we need to blow out the import interface, as cool as Pandas DataFrames are.

I think we should accept at least three formats:

  1. Dataframes
  2. [{col: val, ...}]
  3. [[val, val, ...]] - with and without header row.

I also think the import type should be part of the import call itself, not a parameter you pass, if we can:

  1. dolt.import.create()
  2. dolt.import.replace()
  3. dolt.import.update()

Those can be underscores not . if having an import class is not "python-y". I'm a python noob.

Fix doltpy.etl interfaces

The current ETL interfaces are clunky and hard to work with. They are designed primarily to solve a problem that comes out of the way in which we manage Airflow. Specifically there are too many layers of delayed execution implemented by passing functions down the call stack that make the code clunky to work with.

Dolt.branch() options `delete`, `copy` and `move` broken

If I want to rename a branch, for example renaming master into main:

repo.branch('master', new_branch='main', force=True, move=True)

results in this being executed: dolt branch --force master instead of dolt branch --force --move master main.

I'm pretty sure the issue is in core/dolt.py, line 612

if branch_name and not(delete and copy and move):

should be

if branch_name and not(delete or copy or move):
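The effect of the two conditions can be checked directly. The `plain_create_*` functions below are illustrative stand-ins for the guard around the plain branch-creation path: with `and`, a lone --move flag still routes to plain creation (dropping the new name); with `or`, any of the three flags correctly bypasses it.

```python
def plain_create_buggy(branch_name, delete=False, copy=False, move=False):
    # Only skips plain creation when ALL three flags are set.
    return bool(branch_name and not (delete and copy and move))

def plain_create_fixed(branch_name, delete=False, copy=False, move=False):
    # Skips plain creation when ANY flag is set -- the intended behavior.
    return bool(branch_name and not (delete or copy or move))
```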

Dolt.config(...) should support local and global configurations

Currently config(...) is defined on the Dolt object, or instances thereof. This is not quite right, as it requires the creation of a repository to work, which is a problem in contexts such as Docker where the dolt configurations are unlikely to be set when Dolt.init(...) is run.

We likely want a static method called config_global and a method called config_local, with the commonalities factored out.

Merge commit breaks log() parsing

Having a merge commit in the dolt log breaks log() parsing, since there's an extra "Merge:" line between "commit" and "Author:" in the output.

Stacktrace (sanitized)

  File "venv/lib/python3.8/site-packages/doltpy/core/dolt.py", line 500, in log
    date = datetime.strptime(output[i + date_offset].split(':', maxsplit=1)[1].lstrip(),
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data 'author <[email protected]>' does not match format '%a %b %d %H:%M:%S %z %Y'

dolt log output (sanitized)

commit 53lmlmpumo499i5puf12ip875taql2ad
Merge: 48l4ovo7c2mrqkagi9io1g3eam1s5m0b 0puirbrqe5sgtjrr8j2ka3dabeo6nk58
Author: author <[email protected]>
Date:   Thu Oct 01 21:17:20 +0200 2020

	Merged PR #1

[...]
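A parsing approach that tolerates the optional Merge: line, dispatching on line prefixes instead of fixed offsets, might look like this (a sketch, not doltpy's actual implementation):

```python
from datetime import datetime

def parse_log_entry(lines):
    """Parse one `dolt log` entry given its lines; returns a dict."""
    entry = {"commit": lines[0].split()[1]}
    for line in lines[1:]:
        if line.startswith("Merge:"):
            entry["merge"] = line.split(":", 1)[1].split()
        elif line.startswith("Author:"):
            entry["author"] = line.split(":", 1)[1].strip()
        elif line.startswith("Date:"):
            raw = line.split(":", 1)[1].strip()
            entry["date"] = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
    return entry
```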

Go API?

Hi,

Nice work on Dolt and this Python API.

Is there a clean and simple Go API for performing Dolt operations and getting results back?

Obviously, the backend is Go, but I don't see an easily accessible API for a client program to utilise. Perhaps I overlooked something?

Apologies if this is not the best place to ask this question.

Thanks!

Enhancements to writing of Python data structures

In Doltpy 1.0.0 we introduced a number of new interfaces for writing data in doltpy.core.write:

  • import_dict(...): imports a dictionary of lists, i.e. data laid out "column major" in memory, keyed on the column name
  • import_list(...): imports a list of dictionaries that model data "row major"

The requirements for this are very strict, in the sense that the rows need to have the same set of keys as the table, and writing is only supported in update and create modes. Furthermore, update doesn't work very well, as it will fail on duplicate primary key errors rather than updating existing rows with new values.

We want to enhance this:

  • add a strict option that allows users to pass dicts with only a subset of the columns (in both interfaces)
  • expose the ability for the user to specify the connection rather than having the import functions manage it themselves
  • implement something akin to UPDATE ON DUPLICATE semantics for updating values when duplicate primary keys are encountered on the update path
  • remove primary key requirement for update
  • augment documentation and examples with these changes
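The "UPDATE ON DUPLICATE" item corresponds to MySQL's INSERT ... ON DUPLICATE KEY UPDATE, which Dolt also supports; a sketch of a statement builder (helper name and layout are illustrative):

```python
def upsert_sql(table, columns, primary_keys):
    """Insert rows, updating non-key columns when the primary key exists."""
    cols = ", ".join("`%s`" % c for c in columns)
    vals = ", ".join(["%s"] * len(columns))
    updates = ", ".join("`{0}` = VALUES(`{0}`)".format(c)
                        for c in columns if c not in primary_keys)
    return ("INSERT INTO `{}` ({}) VALUES ({}) "
            "ON DUPLICATE KEY UPDATE {}").format(table, cols, vals, updates)
```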

Use SQL Alchemy for table schema sync

Syncing data is one thing, but in fact the most frustrating thing about SQL sync might be having to translate the schemas. If done by hand, it requires the user to map types, constraints, and syntax differences across SQL implementations. SQLAlchemy provides a Python-based DSL for expressing SQL manipulations that abstracts across SQL implementations (to the degree that is possible). We have a private POC that uses it to copy the schema of a SQL database into a MySQL equivalent and sync the remaining data.

The main issues to deal with are:

  • constraints and column properties that are supported/not supported in various SQL implementations
  • type mappings (for example arrays in Postgres not being supported in MySQL)

The following snippet is a crude first draft of the type mapping when translating from Postgres to MySQL:

from sqlalchemy import Column, VARCHAR
from sqlalchemy.dialects import mysql

def clean_column(column):
    # Postgres uses nextval in autoincrement, and MySQL does not. If the column
    # autoincrements and has a default, remove it. This likely has other use
    # cases it would not be desirable for, and should be adjusted as those are found.
    if hasattr(column, "server_default") and column.autoincrement is True:
        delattr(column, "server_default")
    type_name = str(column.type)
    # Postgres has native IP address types, which should become VARCHAR(43)
    if type_name in ("CIDR", "INET", "MACADDR"):
        column = Column(column.name, VARCHAR(length=43),
                        autoincrement=column.autoincrement, nullable=column.nullable)
    # JSONB, BYTEA, and arrays (including typed arrays such as SMALLINT[])
    # have no MySQL equivalent; fall back to LONGTEXT
    elif type_name in ("JSONB", "ARRAY", "BYTEA") or type_name.endswith("[]"):
        column = Column(column.name, mysql.LONGTEXT,
                        autoincrement=column.autoincrement, nullable=column.nullable)
    elif type_name == "UUID":
        column = Column(column.name, VARCHAR(length=36),
                        autoincrement=column.autoincrement, nullable=column.nullable)
    return column

Renaming branch fails

If I want to rename a branch, for example renaming master into main:

repo.branch("master", new_branch="main", move=True)

it fails

doltpy.core.dolt.DoltException: (['dolt', 'branch', '--move', 'master', 'm', 'a', 'i', 'n'], b'usage: dolt branch [--list] [-v] [-a] [-r]\n   or: dolt branch [-f] <branchname> [<start-point>]\n   or: dolt branch -m [-f] [<oldbranch>] <newbranch>\n   or: dolt branch -c [-f] [<oldbranch>] <newbranch>\n   or: dolt branch -d [-f] [-r] <branchname>...\n\nSpecific dolt branch options\n    <start-point>\n      A commit that a new branch should point at.\n    \n    --list\n      List branches\n    \n    -f, --force\n      Reset <branchname> to <startpoint>, even if <branchname> exists already. Without -f, dolt branch refuses to change an\n      existing branch. In combination with -d (or --delete), allow deleting the branch irrespective of its merged status. In\n      combination with -m (or --move), allow renaming the branch even if the new branch name already exists, the same applies\n      for -c (or --copy).\n    \n    -c, --copy\n      Create a copy of a branch.\n    \n    -m, --move\n      Move/rename a branch\n    \n    -d, --delete\n      Delete a branch. The branch must be fully merged in its upstream branch.\n    \n    --D\n      Shortcut for --delete --force.\n    \n    -v, --verbose\n      When in list mode, show the hash and commit subject line for each head\n    \n    -a, --all\n      When in list mode, shows remote tracked branches\n    \n    -r, --remote\n      When in list mode, show only remote tracked branches. When with -d, delete a remote tracking branch.\n    \n    --show-current\n      Print the name of the current branch\n    \n', b'', 1)

Make replace mode work in import interfaces

Currently the interfaces for importing Python data structures do not support replace mode. We should ensure they do.

Since they use the SQL server this is a bit trickier than just using the CLI correctly.

Support Python 3.8

Is Python 3.8 supported? https://www.dolthub.com/docs/tutorials/installation/#doltpy says:

We currently do not support Python versions higher than 3.7 due to an incompatibility with the MySQL Server connectors we are using.

Well, this is really bad, and this means I can't use this package. It's not a sustainable approach. The beta of 3.9 is already out.

In any case, shouldn't setup.py document which versions of Python are required for this package?

More helpful error messages

Passing True to Dolt.checkout(...) treats True as a branch name and eventually errors out via a subprocess. It should fail earlier, with a clearer message.
