
doltpy's Introduction

Current State - Deprecated

DoltPy was created in an era where Dolt was a command-line tool. We've worked very hard to make Dolt a MySQL compatible database. To use Dolt with Python, start a Dolt SQL server using dolt sql-server and connect with any Python MySQL client. Use the exposed Dolt stored procedures or system tables to access version control functionality. Popular libraries like Pandas all support MySQL connectivity.

Dolt's MySQL client support works in any language, not just Python, so it is more efficient for us to focus on that interface.
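A minimal sketch of the recommended workflow: connect to a running `dolt sql-server` with any MySQL client and drive version control through SQL. The host/port/user/database values and the `commit_all` helper below are illustrative assumptions, not part of any published API.

```python
# Dolt exposes version control as system tables and stored procedures:
LOG_QUERY = "SELECT commit_hash, committer, message FROM dolt_log"
COMMIT_CALL = "CALL DOLT_COMMIT('-Am', %s)"

def commit_all(message, host="127.0.0.1", port=3306, user="root", database="my_db"):
    """Stage and commit all working-set changes through the SQL interface."""
    import mysql.connector  # any MySQL driver works; imported lazily here
    conn = mysql.connector.connect(host=host, port=port, user=user, database=database)
    try:
        cursor = conn.cursor()
        cursor.execute(COMMIT_CALL, (message,))
        return cursor.fetchall()  # the row contains the new commit hash
    finally:
        conn.close()
```

Libraries like pandas and SQLAlchemy can use the same connection, since the server speaks the MySQL wire protocol.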

DoltPy

This is DoltPy, the Python API for Dolt. Python is the language of choice for data science and data engineering, and thus we thought it would be wise to publish an API for building automated workflows on top of Dolt and DoltHub, a collaboration platform for Dolt databases.

Installation

You need to install Dolt, which is documented here. It's easy for *nix users:

$ sudo bash -c 'curl -L https://github.com/liquidata-inc/dolt/releases/latest/download/install.sh | bash'

We also distribute Dolt as a Homebrew formula:

$ brew install dolt

Finally, for Windows users our release page has .zip and .msi files.

Once Dolt is installed you can install Doltpy using pip:

$ pip install doltpy

Overview

Doltpy is broken up into modules.

doltpy.cli

This is the most important module. It wraps the Dolt command-line interface (CLI) exposed by the Go binary. The CLI is exposed more or less exactly as implemented, returning wrapper objects where appropriate.

Its implementation has moved to a separate repository here
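A minimal sketch of what "wrapping the CLI" means in practice: each method builds a `dolt` argument list and hands it to a subprocess. The helper names here (`dolt_args`, `run_dolt`) are illustrative, not doltpy's actual internals.

```python
import subprocess

def dolt_args(command, *args, **flags):
    """Build an argv list like ['dolt', 'branch', '--move', 'old', 'new']."""
    argv = ["dolt", command]
    for name, value in flags.items():
        if value is True:
            argv.append("--" + name.replace("_", "-"))
        elif value not in (None, False):
            argv.extend(["--" + name.replace("_", "-"), str(value)])
    argv.extend(args)
    return argv

def run_dolt(command, *args, **flags):
    """Execute dolt and return captured stdout, raising on failure."""
    result = subprocess.run(dolt_args(command, *args, **flags),
                            capture_output=True, text=True, check=True)
    return result.stdout
```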

doltpy.cli.read and doltpy.cli.write

These modules provide basic interfaces for reading and writing a variety of tabular data formats, including:

  • CSV files
  • pandas.DataFrame
  • columns, that is dictionaries of lists, i.e. {'col': [...vals...], ...}
  • rows, that is lists of dictionaries, i.e. [{'col': val, ...}, ...]
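For reference, the two in-memory layouts can be converted into each other with a few lines of plain Python (helper names are ours, not doltpy's):

```python
def columns_to_rows(columns):
    """{'col': [v1, v2]} -> [{'col': v1}, {'col': v2}]"""
    names = list(columns)
    return [dict(zip(names, vals)) for vals in zip(*columns.values())]

def rows_to_columns(rows):
    """[{'col': v1}, {'col': v2}] -> {'col': [v1, v2]}"""
    return {name: [row[name] for row in rows] for name in rows[0]} if rows else {}
```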

doltpy.sql

This module provides tools for interacting with Dolt via a Python based SQL connector. The most important class is DoltSQLContext, which has concrete subclasses DoltSQLServerContext and DoltSQLEngineContext. DoltSQLServerContext is for users that want to write Python scripts that use and manage the Dolt SQL Server instance as a child process. DoltSQLEngineContext is for users who want to interact with a remote Dolt SQL Server.

These classes have equivalents of the read and write functions in doltpy.cli.read and doltpy.cli.write for writing CSV files, pandas.DataFrame objects, rows, and columns.

doltpy.sql.sql_sync

This package provides tools for syncing data between Dolt and other relational databases. Currently MySQL, Postgres, and Oracle are supported. You can find a more detailed description of how to use the SQL Sync tools here.

doltpy.etl

This module provides a set of tools for scripting ETL/ELT workflows. At Liquidata we use it internally to push datasets onto DoltHub.

More Information

As alluded to above, you can find a more detailed description of Doltpy here.

doltpy's People

Contributors

andy-wm-arthur, captainstabs, coffeegoddd, dependabot[bot], itdependsnetworks, jycor, jzcruiser, lucab, max-hoffman, oscarbatori, tbantle22, timsehn, vinairachakonda


doltpy's Issues

`write_pandas` silently failing

Problem

This is happening in the context of the hospitals bounty.

write_pandas does not write to the database, but doesn't return an error message either.

Steps to reproduce

I tried to add the following data to the Dolt database. The data is in the form of a DataFrame called

df_cpt_hcpcs_no_dupes

which has had all of the codes removed that already exist in the database. (None of the codes in the DataFrame are present in the dolt database.)

The cell executes without a problem but we can still see that the code does not appear in the database when it is read back.


Things I have tried

  1. checking the database in the terminal. I can confirm that adding to the hospital table is working, because I am able to see my changes, but adding to the codes table is not. Or at least, it is failing in a way that is hard to detect.
  2. checking the datatypes of the elements being added to the database (all appear to match)

Mac and Windows test runs

We should run our test suite on Mac and Windows boxes to validate that it works across possible usage platforms. Currently we run on Ubuntu, but Python can break on OS-specific features (not super frequently, but I've seen it).

importing OrderedDict from typing breaks doltpy on Python 3.6

This is because doltpy/types/dolt.py imports OrderedDict from typing, which was only added in Python 3.7. I was able to get this to work locally by monkey-patching / editing that import out of the file.

Admittedly, 3.6 is a somewhat older version, but it's not something as ancient as Python 2, and Python 3.6 is still the default on Ubuntu 18.04.
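A guarded import would keep 3.6 working without monkey-patching; a sketch:

```python
try:
    from typing import OrderedDict  # available from Python 3.7 onward
except ImportError:
    # On 3.6, fall back to the runtime class; it still works in
    # annotations, just without subscripted type parameters.
    from collections import OrderedDict
```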

Untested functions in Doltpy

There are a number of untested functions in doltpy.core that were introduced in 1.0.0 to match the CLI. A list of them:

  • reset: no testing, hard and soft modes need to be tested
  • diff: no explicit testing, not used in testing, currently just prints output, we may want to rethink this to use some kind of objects or bind it to queryable diffs and populate the diff in a DataFrame type structure
  • blame: no explicit testing, not used in testing, currently just prints output, like diff might benefit from being stored in a programmatically accessible data structure
  • push: not tested, and actually missing a host of more esoteric options that exist in the CLI
  • pull: same as push
  • fetch: same as push and pull
  • clone: again, untested
  • creds_*: untested
  • config: untested
  • schema_*: no testing, and not clear what the interface should be given that it can do a number of things
  • table_*: no testing, and not clear what the interface should be given that it can do a number of things

Incomplete testing:

  • commit: used in testing but date and allow_empty not tested
  • sql: basic query execution is tested but none of the other modes (result_format, etc.)
  • sql_server: most basic execution of query tested but not other modes
  • branch: basic branch creation is tested, but force, start_point etc., are not.
  • checkout: used in basic testing of branch, but start_point not tested
  • ls: no testing of system and all switches

Others that are used in tests but not explicitly tested:

  • init: no explicit tests but relied on by other tests
  • add: no explicit testing, but used everywhere in tests

I think we can just use this ticket to hash out a plan for testing (what needs to be tested and the priorities) and interface design, as there are a few functions that would benefit from a programmatic interface rather than just printing a string.

Clean up server and connection management in Doltpy

Doltpy makes a number of bad design decisions with regard to interacting with the Dolt server:

  • it fails to clean up the server child process in the event of some failure, leaving zombie processes lurking around (that have caused confusing bugs)
  • before interacting with the server, it checks whether the server is running by looking for a process object; if the server was started from the shell this check fails, Doltpy tries to fire up its own server, and that fails because the port is in use
  • it requires the server to be running on port 3306
  • the connection management is handled in the library rather than exposed to the user, meaning the connection is restricted to the assumptions Doltpy makes (database name, port, host, etc.)

Several enhancements need to be made:

  • expose connection management to users where functions require a connection
  • expose more configurations from the connection to the get_connection helper function
  • properly clean up when sql_server is called so zombie processes are not lying around
  • augment documentation to reflect the changes
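The server-detection problem above could be fixed by probing the port directly instead of checking for a child-process handle; a sketch (names illustrative):

```python
import socket

def is_port_in_use(port, host="127.0.0.1"):
    """True if something is listening on host:port, however it was started."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0

# Demonstration: bind a throwaway listener and probe it.
_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
_listener.bind(("127.0.0.1", 0))  # OS picks a free port
_listener.listen(1)
_port = _listener.getsockname()[1]
PORT_WAS_IN_USE = is_port_in_use(_port)
_listener.close()
```

This detects a server started from the shell just as well as one started by the library, which a process-object check cannot.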

pydoc3 no worky

timsehn$ pydoc3 doltpy | cat
Help on package doltpy:

NAME
    doltpy

PACKAGE CONTENTS
    core (package)
    etl (package)

FILE
    /usr/local/lib/python3.7/site-packages/doltpy/__init__.py


Database error

I am receiving a database error when calling the sync_to_dolt function from doltpy.sql.sync, where it almost looks like the database name is null, but I don't seem to have the option of setting it in the ServerConfiguration instance or Dolt instance.

  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\cursor.py", line 569, in execute     
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
mysql.connector.errors.DatabaseError: 1105 (HY000): syntax error at position 25 near 'FROM'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 16, in <module>
    main()
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 12, in main
    args.func(args)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 137, in sync
    database.sync(config.dolt_config)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 58, in sync
    sync_to_dolt(reader, writer, self.mappings)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 37, in sync_to_dolt
    _sync_helper(source_reader, target_writer, table_map)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 66, in _sync_helper
    target_writer(remapped)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\doltpy\sql\sync\dolt.py", line 35, in inner
    metadata.reflect()
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\sql\schema.py", line 4670, in reflect     
    available = util.OrderedSet(insp.get_table_names(schema))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\reflection.py", line 266, in get_table_names
    return self.dialect.get_table_names(
  File "<string>", line 2, in get_table_names
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\reflection.py", line 55, in cache  
    ret = fn(self, con, *args, **kw)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\dialects\mysql\base.py", line 2956, in get_table_names
    rp = connection.exec_driver_sql(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1575, in exec_driver_sql
    return self._exec_driver_sql(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1484, in _exec_driver_sql
    ret = self._execute_context(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1748, in _execute_context
    self._handle_dbapi_exception(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\util\compat.py", line 198, in raise_      
    raise exception
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\cursor.py", line 569, in execute     
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\Programs\Python\Python39\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
sqlalchemy.exc.DatabaseError: (mysql.connector.errors.DatabaseError) 1105 (HY000): syntax error at position 25 near 'FROM'
[SQL: SHOW FULL TABLES FROM ``]
(Background on this error at: http://sqlalche.me/e/14/4xp6)

doltpy version 2.0.0 doesn't include core

If you install doltpy via pip it will download version 2, which doesn't have the doltpy.core module.

I had to install doltpy version 1.1.10 in order to get the core module

Fix broken CI

Currently the GitHub Action running tests fails for opaque reasons, since the tests pass locally; see here for examples. The most probable cause seems to be some strange MySQL connector/Docker interaction, though the root cause is not clear. Furthermore, the failures are not deterministic, in the sense that they don't consistently produce the same failure message.

let get_df_table_writer infer primary keys from df.index

The function get_df_table_writer expects a list of primary keys which are columns of the data frame. I, however, have a MultiIndex whose names should be used for the compound primary key; this is very much how df.to_sql works as well. But I need a way to "REPLACE INTO" instead of inserting. I had hoped the get_df_table_writer function could do that.
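Both requests can be sketched in a few lines; `replace_into_sql` and `primary_keys_from_index` are hypothetical helpers, not doltpy API:

```python
def replace_into_sql(table, columns):
    """Build a parameterized REPLACE INTO statement for MySQL/Dolt."""
    cols = ", ".join("`%s`" % c for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return "REPLACE INTO `{}` ({}) VALUES ({})".format(table, cols, placeholders)

def primary_keys_from_index(index_names):
    """Infer primary keys from df.index.names, ignoring unnamed levels."""
    return [name for name in index_names if name is not None]
```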

unknown error: row read wait bigger than connection timeout

More complicated queries result in a timeout:

engine.execute("SELECT o.symbol, coalesce(max(y.eod), '1900-01-06') - INTERVAL 5 DAY FROM optionable o LEFT OUTER JOIN yahoo y on y.symbol = o.symbol GROUP BY o.symbol")

results in:

Traceback (most recent call last):
  File ".venv/lib/python3.7/site-packages/mysql/connector/connection_cext.py", line 489, in cmd_query
    raw_as_string=raw_as_string)
_mysql_connector.MySQLInterfaceError: unknown error: row read wait bigger than connection timeout

Deleting branch reports error even when it succeeds

repo.branch("foo")
# branch foo now exists
repo.branch("foo", delete=True)
# branch foo is deleted

However, doltpy reports:

doltpy.core.dolt.DoltException: (['dolt', 'branch', '--delete', 'foo', 'f', 'o', 'o'], b'', b"fatal: branch 'foo' not found\n", 1)
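The trailing 'f', 'o', 'o' arguments in the reported command are the signature of a Python string being extended into an argv list character by character; a minimal reproduction (not doltpy's actual code):

```python
branch = "foo"

buggy = ["dolt", "branch", "--delete", branch]
buggy.extend(branch)        # bug: a str is iterable, so it splats into 'f', 'o', 'o'

fixed = ["dolt", "branch", "--delete"]
fixed.extend([branch])      # fix: wrap the single name in a list
```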

Tests should run via setuptools (or some other way)

Currently we maintain a requirements.txt file for installing our requirements before running tests, yet use setup.py for configuration management. We want a unified dependency-management mechanism that allows us to run the tests from scratch (including installing dependencies).

README woefully out of date

I think we should focus on doltpy.core over doltpy.etl to get people eased into the interface. At the start, it's more likely that people will come for simple shit like:

  1. I want to read a table into a dataframe
  2. I want to import a CSV
  3. I want to run a SQL query and get a dictionary of the results

It's much less likely that they'll want a full service importer that they can integrate with Airflow easily. We should ease them up to that.

DB error when empty commit

The sync_to_dolt function doesn't catch empty commit errors properly. Also, if there aren't any changes to commit, why is the function trying to commit them?

Traceback (most recent call last):
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\cursor.py", line 569, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
mysql.connector.errors.DatabaseError: 1105 (HY000): Cannot commit an empty commit. See the --allow-empty if you want to.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 16, in <module>
    main()
  File "C:\Users\evani\git\metamonster\metamonster\main.py", line 12, in main
    args.func(args)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 169, in sync
    database.sync_data(config.dolt_config)
  File "C:\Users\evani\git\metamonster\metamonster\sync.py", line 78, in sync_data
    sync_to_dolt(reader, writer, self.mappings)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 37, in sync_to_dolt
    _sync_helper(source_reader, target_writer, table_map)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sync\sync_tools.py", line 66, in _sync_helper
    target_writer(remapped)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sync\dolt.py", line 45, in inner
    return dsc.commit_tables(message or "Executed SQL sync", tables, True)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\doltpy\sql\sql.py", line 103, in commit_tables
    result = [dict(row) for row in conn.execute(f"SELECT DOLT_COMMIT({dolt_commit_args})")]
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1185, in execute
    return self._exec_driver_sql(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1484, in _exec_driver_sql
    ret = self._execute_context(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1748, in _execute_context
    self._handle_dbapi_exception(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\util\compat.py", line 198, in raise_
    raise exception
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\sqlalchemy\engine\default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\cursor.py", line 569, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 599, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "C:\Users\evani\AppData\Local\pypoetry\Cache\virtualenvs\metamonster-upuCDUwE-py3.9\lib\site-packages\mysql\connector\connection.py", line 487, in _handle_result
    raise errors.get_exception(packet)
sqlalchemy.exc.DatabaseError: (mysql.connector.errors.DatabaseError) 1105 (HY000): Cannot commit an empty commit. See the --allow-empty if you want to.
[SQL: SELECT DOLT_COMMIT('-m', 'Executed SQL sync')]
(Background on this error at: http://sqlalche.me/e/14/4xp6)

Expose logging configurations to users

Currently Doltpy implements logging in a suboptimal way: each file has a logging object with its name passed. This means the library provides no way to configure the log level without importing the logger and manually setting the level (it's not that bad, but it requires knowing about the variable, and it's not documented).

Instead, we should implement a global log level variable for doltpy.core and expose it in the user facing API with a sensible default. Loggers should then be instantiated from a shared factory method that uses this shared configuration. We can also use this to enable library users to specify handlers etc.

More interface options for Dolt.import_df

I think we need to blow out the import interface, as cool as Pandas DataFrames are.

I think we should accept at least three formats:

  1. Dataframes
  2. [{col: val, ...}]
  3. [[val, val, ...]] - with and without header row.

I also think the import type should be part of the import call itself, not a parameter you pass, if we can:

  1. dolt.import.create()
  2. dolt.import.replace()
  3. dolt.import.update()

Those can be underscores not . if having an import class is not "python-y". I'm a python noob.

Fix doltpy.etl interfaces

The current ETL interfaces are clunky and hard to work with. They are designed primarily to solve a problem that comes out of the way in which we manage Airflow. Specifically there are too many layers of delayed execution implemented by passing functions down the call stack that make the code clunky to work with.

Dolt.branch() options `delete`, `copy` and `move` broken

If I want to rename a branch, for example renaming master into main:

repo.branch('master', new_branch='main', force=True, move=True)

results in this being executed: dolt branch --force master instead of dolt branch --force --move master main.

I'm pretty sure the issue is in core/dolt.py, line 612

if branch_name and not(delete and copy and move):

should be

if branch_name and not(delete or copy or move):
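The effect of the two conditions can be checked directly. The `plain_create_*` functions below are illustrative stand-ins for the guard around the plain branch-creation path: with `and`, a lone --move flag still routes to plain creation (dropping the new name); with `or`, any of the three flags correctly bypasses it.

```python
def plain_create_buggy(branch_name, delete=False, copy=False, move=False):
    # Only skips plain creation when ALL three flags are set.
    return bool(branch_name and not (delete and copy and move))

def plain_create_fixed(branch_name, delete=False, copy=False, move=False):
    # Skips plain creation when ANY flag is set -- the intended behavior.
    return bool(branch_name and not (delete or copy or move))
```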

Dolt.config(...) should support local and global configurations

Currently config(...) is defined on the Dolt object, or instances thereof. This is not quite right, as it requires the creation of a repository to work, which is a problem in contexts such as Docker where the dolt configurations are unlikely to be set when Dolt.init(...) is run.

We likely want a static method called config_global and a method called config_local, with the commonalities factored out.

Merge commit breaks log() parsing

Having a merge commit in the dolt log breaks log() parsing, since there's an extra "Merge:" line between "commit" and "Author:" in the output.

Stacktrace (sanitized)

  File "venv/lib/python3.8/site-packages/doltpy/core/dolt.py", line 500, in log
    date = datetime.strptime(output[i + date_offset].split(':', maxsplit=1)[1].lstrip(),
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data 'author <[email protected]>' does not match format '%a %b %d %H:%M:%S %z %Y'

dolt log output (sanitized)

commit 53lmlmpumo499i5puf12ip875taql2ad
Merge: 48l4ovo7c2mrqkagi9io1g3eam1s5m0b 0puirbrqe5sgtjrr8j2ka3dabeo6nk58
Author: author <[email protected]>
Date:   Thu Oct 01 21:17:20 +0200 2020

	Merged PR #1

[...]
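A parsing approach that tolerates the optional Merge: line, dispatching on line prefixes instead of fixed offsets, might look like this (a sketch, not doltpy's actual implementation):

```python
from datetime import datetime

def parse_log_entry(lines):
    """Parse one `dolt log` entry given its lines; returns a dict."""
    entry = {"commit": lines[0].split()[1]}
    for line in lines[1:]:
        if line.startswith("Merge:"):
            entry["merge"] = line.split(":", 1)[1].split()
        elif line.startswith("Author:"):
            entry["author"] = line.split(":", 1)[1].strip()
        elif line.startswith("Date:"):
            raw = line.split(":", 1)[1].strip()
            entry["date"] = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
    return entry
```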

Go API?

Hi,

Nice work on Dolt and this Python API.

Is there a clean and simple Go API for performing Dolt operations and getting results back?

Obviously, the backend is Go, but I don't see an easily accessible API for a client program to utilise. Perhaps I overlooked something?

Apologies if this is not the best place to ask this question.

Thanks!

Enhancements to writing of Python data structures

In Doltpy 1.0.0 we introduced a number of new interfaces for writing data in doltpy.core.write:

  • import_dict(...): imports a dictionary of lists, i.e. data laid out "column major" in memory, keyed on the column name
  • import_list(...): imports a list of dictionaries that model data "row major"

The requirements for this are very strict, in the sense that the rows need to have the same set of keys as the table, and writing is only supported in update and create modes. Furthermore, update doesn't work very well, as it will fail on duplicate primary key errors rather than updating existing rows with new values.

We want to enhance this:

  • add a strict option that allows users to pass dicts with only a subset of the columns (in both interfaces)
  • expose the ability for the user to specify the connection rather than having the import functions manage it themselves
  • implement something akin to UPDATE ON DUPLICATE semantics for updating values when duplicate primary keys are encountered on the update path
  • remove primary key requirement for update
  • augment documentation and examples with these changes
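The "UPDATE ON DUPLICATE" item corresponds to MySQL's INSERT ... ON DUPLICATE KEY UPDATE, which Dolt also supports; a sketch of a statement builder (helper name and layout are illustrative):

```python
def upsert_sql(table, columns, primary_keys):
    """Insert rows, updating non-key columns when the primary key exists."""
    cols = ", ".join("`%s`" % c for c in columns)
    vals = ", ".join(["%s"] * len(columns))
    updates = ", ".join("`{0}` = VALUES(`{0}`)".format(c)
                        for c in columns if c not in primary_keys)
    return ("INSERT INTO `{}` ({}) VALUES ({}) "
            "ON DUPLICATE KEY UPDATE {}").format(table, cols, vals, updates)
```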

Use SQL Alchemy for table schema sync

Syncing data is one thing, but in fact the most frustrating thing about SQL sync might be having to translate the schemas. If done by hand, it requires the user to map types, constraints, and syntax differences across SQL implementations. SQLAlchemy provides a Python-based DSL for expressing SQL manipulations that abstracts across SQL implementations (to the degree that is possible). We have a private POC that uses it to copy the schema of a SQL database into a MySQL equivalent and sync the remaining data.

The main issues to deal with are:

  • constraints and column properties that are supported/not supported in various SQL implementations
  • type mappings (for example arrays in Postgres not being supported in MySQL)

The following snippet is a crude first draft of the type mapping when translating from Postgres to MySQL:

from sqlalchemy import Column, VARCHAR
from sqlalchemy.dialects import mysql

def clean_column(column):
    # Postgres uses nextval in autoincrement, and MySQL does not. If the column
    # autoincrements and has a default, remove it. This likely has other use
    # cases it would not be desirable for, and should be adjusted as those are found.
    if hasattr(column, "server_default") and column.autoincrement is True:
        delattr(column, "server_default")
    type_name = str(column.type)
    # Postgres has native IP address types, which should become VARCHAR(43)
    if type_name in ("CIDR", "INET", "MACADDR"):
        column = Column(column.name, VARCHAR(length=43),
                        autoincrement=column.autoincrement, nullable=column.nullable)
    # JSONB, BYTEA, and arrays (including typed arrays such as SMALLINT[])
    # have no MySQL equivalent; fall back to LONGTEXT
    elif type_name in ("JSONB", "ARRAY", "BYTEA") or type_name.endswith("[]"):
        column = Column(column.name, mysql.LONGTEXT,
                        autoincrement=column.autoincrement, nullable=column.nullable)
    elif type_name == "UUID":
        column = Column(column.name, VARCHAR(length=36),
                        autoincrement=column.autoincrement, nullable=column.nullable)
    return column

Renaming branch fails

If I want to rename a branch, for example renaming master into main:

repo.branch("master", new_branch="main", move=True)

it fails

doltpy.core.dolt.DoltException: (['dolt', 'branch', '--move', 'master', 'm', 'a', 'i', 'n'], b'usage: dolt branch [--list] [-v] [-a] [-r]\n   or: dolt branch [-f] <branchname> [<start-point>]\n   or: dolt branch -m [-f] [<oldbranch>] <newbranch>\n   or: dolt branch -c [-f] [<oldbranch>] <newbranch>\n   or: dolt branch -d [-f] [-r] <branchname>...\n\nSpecific dolt branch options\n    <start-point>\n      A commit that a new branch should point at.\n    \n    --list\n      List branches\n    \n    -f, --force\n      Reset <branchname> to <startpoint>, even if <branchname> exists already. Without -f, dolt branch refuses to change an\n      existing branch. In combination with -d (or --delete), allow deleting the branch irrespective of its merged status. In\n      combination with -m (or --move), allow renaming the branch even if the new branch name already exists, the same applies\n      for -c (or --copy).\n    \n    -c, --copy\n      Create a copy of a branch.\n    \n    -m, --move\n      Move/rename a branch\n    \n    -d, --delete\n      Delete a branch. The branch must be fully merged in its upstream branch.\n    \n    --D\n      Shortcut for --delete --force.\n    \n    -v, --verbose\n      When in list mode, show the hash and commit subject line for each head\n    \n    -a, --all\n      When in list mode, shows remote tracked branches\n    \n    -r, --remote\n      When in list mode, show only remote tracked branches. When with -d, delete a remote tracking branch.\n    \n    --show-current\n      Print the name of the current branch\n    \n', b'', 1)

Make replace mode work in import interfaces

Currently the interfaces for importing Python data structures do not support replace mode. We should ensure they do.

Since they use the SQL server this is a bit trickier than just using the CLI correctly.

Support Python 3.8

Is Python 3.8 supported? https://www.dolthub.com/docs/tutorials/installation/#doltpy says:

We currently do not support Python versions higher than 3.7 due to an incompatibility with the MySQL Server connectors we are using.

Well, this is really bad, and this means I can't use this package. It's not a sustainable approach. The beta of 3.9 is already out.

In any case, shouldn't setup.py document which versions of Python are required for this package?

More helpful error messages

Passing True to Dolt.checkout(...) treats True as a branch name and eventually errors out via a subprocess. It should fail earlier, with a clearer message.
