
civis-python's People

Contributors

aripollak, beckermr, byndcivilization, channiemills, dependabot[bot], drew-pappas, elliothevel, elsander, ggarcia-civis, gkcivis, hinnefe2, izzy-zelichenko, jacksonlee-civis, jacksonllee, jclairelopez, jkulzick, jseabold, jsfalk, kasiahinkson, kbenker, keithing, menglewis, mheilman, mikesaelim, peterskipper, seancquinn, skomanduri, uttercm, waltaskew, zacharydfriedlander

civis-python's Issues

Consolidate dependencies across requirements.txt and setup.py

I'm paraphrasing some ideas @stephen-hoover had and sent over email:

Repeating dependencies in both setup.py and requirements.txt is problematic. There are a few solutions to this:

  1. remove requirements.txt
  2. Replace the contents of requirements.txt with
--index-url https://pypi.python.org/simple/

-e .

as recommended in this article.

We should also relax the dependencies in civis-python even further, putting version requirements only where we know a specific version is required. For example, we require requests>=2.12, but are all of the other strict requirements necessary? I think we could remove the cloudpickle version restrictions entirely, for example, and maybe joblib's as well.

Parsing <schema.table> correctly where schema/table may contain periods

Apparently, the period is a valid character in a (standard) identifier on Redshift (docs). This creates parsing issues when there's a period in either "schema" or "table" for a given table schema.table and the Civis Python client attempts to parse out "schema" and "table" with code like this:

schema, table = name.split('.')

It looks like there are three places in the Civis Python client code base where this strategy is used:

schema, name = table.split('.')

name = table.split('.')[-1]

schema, table = table.split(".", 1)

I've confirmed that one can successfully run SQL queries on Civis Platform for a table whose name is in the form of foo."bar.baz" (with double quotes as the necessary delimiters because of a period in the table bar.baz). We should update the client to handle periods (and delimiters) correctly.
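A quote-aware split could look something like the following sketch. This is illustrative only, not the client's actual implementation; it assumes standard SQL double-quote delimiters, with `""` escaping a literal quote inside an identifier.

```python
import re

def quote_aware_split(name):
    """Split 'schema.table' on the first period that falls outside double
    quotes, so quoted identifiers like foo."bar.baz" survive intact.
    Illustrative sketch only, not the client's implementation."""
    # Match either a quoted identifier (doubled quotes escape a quote)
    # or a bare run of characters that are not quotes or periods.
    token = re.compile(r'"(?:[^"]|"")*"|[^".]+')
    parts, current, pos = [], '', 0
    while pos < len(name):
        if name[pos] == '.':
            parts.append(current)
            current, pos = '', pos + 1
            continue
        m = token.match(name, pos)
        if m is None:
            raise ValueError("cannot parse %r" % name)
        current += m.group(0)
        pos = m.end()
    parts.append(current)

    def unquote(ident):
        if ident.startswith('"') and ident.endswith('"'):
            return ident[1:-1].replace('""', '"')
        return ident

    if len(parts) == 1:
        return None, unquote(parts[0])  # bare table name, no schema
    if len(parts) == 2:
        return unquote(parts[0]), unquote(parts[1])
    raise ValueError("expected at most one unquoted period in %r" % name)
```

With this approach, `foo."bar.baz"` splits into schema `foo` and table `bar.baz`, while plain `foo.bar` behaves as before.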

document serialization of client variable in civis.ml.ModelPipeline

civis.ml.ModelPipeline has a client init argument that lets users specify an APIClient object. This could be used, for example, to run with restricted permissions. When ModelPipeline is serialized, however, the member variable _client is discarded, and a new one with the default API key is created during deserialization (code).

We probably don't want to serialize the API client objects, since they hold API keys (i.e., credentials), but it'd be helpful to document this behavior a little better so that users don't make wrong assumptions about which API keys will get used.
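The documented behavior can be illustrated with a stand-in class; the names and details below are hypothetical, not the actual ModelPipeline implementation.

```python
import pickle

class ModelPipelineSketch:
    """Illustrative stand-in for the documented pickling behavior of
    civis.ml.ModelPipeline; names and details here are hypothetical."""

    def __init__(self, client=None):
        # In the real class this would default to APIClient().
        self._client = client if client is not None else 'default-key-client'

    def __getstate__(self):
        # Drop the API client so credentials are never serialized.
        state = self.__dict__.copy()
        state['_client'] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # A fresh client with the default API key replaces the old one.
        self._client = 'default-key-client'

model = ModelPipelineSketch(client='restricted-key-client')
restored = pickle.loads(pickle.dumps(model))
# restored._client is a default client, not the restricted one
```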

Buggy error in click parameter type

This line is wrong

https://github.com/civisanalytics/civis-python/blob/master/civis/cli/__main__.py#L54

It needs to read

self.fail("Could not load YAML from path: %s" % value, param, ctx)

See this example from the click docs given in the comments in the code.

import click

class BasedIntParamType(click.ParamType):
    name = 'integer'

    def convert(self, value, param, ctx):
        try:
            if value[:2].lower() == '0x':
                return int(value[2:], 16)
            elif value[:1] == '0':
                return int(value, 8)
            return int(value, 10)
        except ValueError:
            self.fail('%s is not a valid integer' % value, param, ctx)

BASED_INT = BasedIntParamType()

Test in Python 3.7

Now that Python v3.7 is out, we should add a 3.7 test to Travis and update our documentation to indicate that we're compatible.

NoneType error for scripts.delete_containers_shares_groups

I am getting the following error when trying to call this method though it seems as if my arguments are correct:

>>> client.scripts.delete_containers_shares_groups(4909147, 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/civisemployee/Developer/civis-python/civis/resources/_resources.py", line 239, in f
    return self._call_api(verb, url, query, body, iterator=iterator)
  File "/Users/civisemployee/Developer/civis-python/civis/base.py", line 88, in _call_api
    return_type=self._return_type)
  File "/Users/civisemployee/Developer/civis-python/civis/response.py", line 84, in convert_response_data_type
    return Response(data, headers=headers)
  File "/Users/civisemployee/Developer/civis-python/civis/response.py", line 116, in __init__
    for key, v in json_data.items():
AttributeError: 'NoneType' object has no attribute 'items'

cc: @keithing

Connections Not Closed when No Client Passed to civis.ml.ModelPipeline

I've got some code like:

for i in range(500):
    model = civis.ml.ModelPipeline(clf)
    model.train()

Running this on my system eventually results in errors due to running out of file handles (my system's default limit is 1024).
Each call to model.train() results in three connections to Civis Platform which are never closed, consuming file handles (you can observe this using lsof -i and strace on Linux).

This similar bit of code results in connections being closed and no file handle leaks:

for i in range(500):
    client = civis.APIClient()
    model = civis.ml.ModelPipeline(clf, client=client)
    model.train()

My vague suspicion is that the first case creates its own APIClient object, since one was not supplied, and that the way it gets garbage-collected somehow prevents its connections from being closed.

dependency on scikit-learn in tests

I think we have an accidental development dependency on scikit-learn in our tests. It only affects the tests, so I think regular users can use the client without sklearn.

I get the following error from test_setup_remote_backend:

AttributeError: <module 'civis.parallel' from '/.../civis-python/civis/parallel.pyc'> does not have the attribute '_sklearn_reg_para_backend'

I think the issue is that this check is False when using a mock.

This doesn't cause CI failures because the CI environment has sklearn.

One option would be to add sklearn to dev-requirements.txt, but I think fixing the mocking wouldn't be all that difficult.
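One likely fix for the mocking (an assumption on my part, since the failing check isn't shown here): `mock.patch` raises exactly this "does not have the attribute" AttributeError when the attribute being patched doesn't exist, and passing `create=True` tells it to create the attribute for the patch's duration.

```python
import types
from unittest import mock

# A stand-in module lacking the sklearn-dependent attribute, mimicking
# civis.parallel in an environment without scikit-learn installed.
fake_parallel = types.ModuleType('fake_parallel')

# Without create=True, patching a missing attribute raises the same
# "does not have the attribute" AttributeError seen in the test failure.
try:
    with mock.patch.object(fake_parallel, '_sklearn_reg_para_backend'):
        pass
    patched_ok = True
except AttributeError:
    patched_ok = False

# With create=True, the attribute is created for the patch's duration
# and removed again afterwards.
with mock.patch.object(fake_parallel, '_sklearn_reg_para_backend',
                       create=True) as backend:
    attr_exists_inside_patch = (
        fake_parallel._sklearn_reg_para_backend is backend)
```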

Caching confusion in the client

Heard of a use case recently where a user had a notebook instance open for a while and ran into an issue with caching. The user called APIClient(resources='all') but did not see new endpoints they expected to see (because they already called it earlier and the results were cached before the endpoints were created).

My impression is that it's not trivial to refresh the lru_cache in a smart way to avoid this, and since this is a rare occurrence, we should address it with documentation.
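Assuming the endpoint lookup is cached with `functools.lru_cache` (an assumption about the client's internals), the documentation could at least point users at `cache_clear()` for an explicit refresh. A minimal sketch of the stale-cache behavior and the workaround:

```python
import functools

calls = []

@functools.lru_cache(maxsize=None)
def fetch_endpoints(resources):
    """Stand-in for a cached API-spec lookup inside APIClient."""
    calls.append(resources)
    # Pretend the server gains an endpoint between real fetches.
    return ('endpoint_a',) if len(calls) == 1 else ('endpoint_a', 'endpoint_b')

first = fetch_endpoints('all')
stale = fetch_endpoints('all')   # served from the cache: no new endpoints
fetch_endpoints.cache_clear()    # explicit refresh
fresh = fetch_endpoints('all')   # now sees the newly created endpoint
```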

Model polling fails intermittently

For some reason, when one of the models submitted for parallel training using the civis.futures module has cross_validation_parameters specified, polling with futures.wait does not work.

I get the following error message:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/Users/dch024/virtualenvs/scipy3/lib/python3.5/site-packages/civis/polling.py", line 41, in run
    if self.poller(*self.poller_args).state in DONE:
AttributeError: 'NoneType' object has no attribute 'state'

This can be replicated by running the parallel training example.

Encoding Error

When using read_civis() on a table that contains accented characters, the function will fail with a UnicodeDecodeError even if you specify the encoding argument.

So something like

civis.io.read_civis(table='schema.table_with_accented_characters', database='Database Name', use_pandas=False, encoding='latin-1')

Will fail with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 106622: unexpected end of data

It sounds like we do some utf-8 encoding before the client reads the kwargs of read_civis() and that's where the error is surfacing.

Pandas Column Typing

Overview: Add functionality to get column types, including date, and sortkeys from Redshift meta info for Pandas IO

I modified the client to allow me to pass arguments to the pd.read_csv call on line 227 of civis/io/_tables.py. I gather the column types using client.tables.get, after which I map the SQL types to their corresponding Python types. Doing this eliminates the automatic str-to-int conversion (e.g., for zip codes) and automatically identifies the date columns. Is there any interest in adding this in a more thoughtful way to the code?
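A rough sketch of the idea: build `dtype`/`parse_dates` arguments for `pd.read_csv` from the column metadata. The type mapping below is hypothetical and incomplete, and the `(name, sql_type)` pair format stands in for whatever client.tables.get actually returns.

```python
import io
import pandas as pd

# Hypothetical mapping from Redshift column types to pandas dtypes;
# the real mapping would be fuller.
SQL_TO_DTYPE = {
    'integer': 'int64',
    'bigint': 'int64',
    'double precision': 'float64',
    'character varying': str,
}

def read_csv_kwargs(columns):
    """Build dtype/parse_dates arguments for pd.read_csv from a list of
    (column_name, sql_type) pairs."""
    dtype, parse_dates = {}, []
    for name, sql_type in columns:
        if sql_type in ('date', 'timestamp'):
            parse_dates.append(name)
        else:
            dtype[name] = SQL_TO_DTYPE.get(sql_type, 'object')
    return {'dtype': dtype, 'parse_dates': parse_dates}

cols = [('zipcode', 'character varying'), ('n', 'integer'), ('day', 'date')]
csv = io.StringIO("zipcode,n,day\n02134,3,2018-01-01\n")
df = pd.read_csv(csv, **read_csv_kwargs(cols))
# 'zipcode' stays a string ("02134") instead of the int 2134,
# and 'day' is parsed as a date.
```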

Add A CLI for Workflows

@beckermr wrote

I wrote a similar CLI for workflows ... Are we open to a similar PR for that?

in #197. I'm moving the discussion here.

If you have written code for a workflows CLI, I think it would be a great addition. Some thoughts before we do it:

  1. Are we going to run into the civis workflow and civis workflows confusion we did with notebooks? I like the solution we did in #197, but the CLI right now feels overloaded with both high level and low level functions in the same namespace.

  2. We should make sure to write documentation. It's easier to add the documentation for new features up front rather than waiting and forgetting all the intricacies that need to be documented.

get_table_id does not correctly handle quoted schemas and tablenames

The get_table_id function throws a ValueError if the schema and tablename are quoted, which doesn't seem to be the correct behavior.

Civis library version

1.9.2

How to reproduce the bug

In [1]: import civis

In [2]: client = civis.APIClient()

In [3]: client.get_table_id(table='foo.bar', database='baz')
Out[3]: 123

In [4]: client.get_table_id(table='"foo"."bar"', database='baz')
ValueError: Given table "foo"."bar" is not an exact match for returned table foo.bar.

Desired behavior

In [1]: import civis

In [2]: client = civis.APIClient()

In [3]: client.get_table_id(table='foo.bar', database='baz')
Out[3]: 123

In [4]: client.get_table_id(table='"foo"."bar"', database='baz')
Out[4]: 123

Likely cause/fix

While get_table_id derives the schema and tablename via civis.io.split_schema_tablename:

schema, name = civis.io.split_schema_tablename(table)

which (correctly) gives us 'foo' as the schema and 'bar' as the tablename (both without quotes) with '"foo"."bar"' (with quotes) as the input, it looks like the following bit (whose purpose is unclear to me) can be removed for the desired behavior:

civis-python/civis/civis.py

Lines 277 to 280 in eb5d46c

found_table = ".".join((tables[0].schema, tables[0].name))
if table != found_table:
msg = "Given table {} is not an exact match for returned table {}."
raise ValueError(msg.format(table, found_table))

That being said, perhaps there's more to it, and the current maintainers would probably know more.

Asynchronous File Downloads

It would be really nice to have asynchronous file downloads. I came across this because I was using civis_to_multifile_csv which produces multiple files.
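One way this could be done without changing the API surface much is a thread pool over the manifest entries. The download function below is a stand-in for the real per-file fetch (e.g. something like civis.io.civis_to_file), just to show the concurrency pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def download_entry(entry):
    """Stand-in for fetching one file; here it just fabricates the
    file's contents from its id."""
    return entry['id'], ('contents-of-%d' % entry['id']).encode()

entries = [{'id': i, 'name': 'part-%d.csv' % i} for i in range(4)]

# Fetch the parts concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(download_entry, entries))
```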

Compatibility with joblib >0.11

Our custom joblib backend relies on some internal joblib features which changed in the v0.12 release. The most recent joblib release is now 0.13. We should update our code so that we're compatible with v0.11 through v0.13.

Civis joblib backend can ignore n_jobs restriction

In v1.7, the Civis joblib backend gained the ability to set itself as the backend inside child jobs. This means that child jobs which use joblib could also use the Civis joblib backend to parallelize themselves in different container scripts. These child jobs would be unaware of any job limit which the user set on the original joblib call, and wouldn't have a way to coordinate among themselves in any case. This could lead to users consuming drastically more resources than they expected.

A further two points:

  • The addition of this feature in v1.7 defaulted to setting the remote backend to "civis", which altered the previous behavior of the Civis joblib backend (it didn't launch jobs from inside jobs before).
  • This behavior differs from the way that joblib treats parallel computation with multiprocessing: attempting to multiprocess from inside a multiprocessed job results in joblib defaulting to running serially.

I think that in principle, the Civis joblib backend is doing what it should -- resources in the cloud aren't limited in the same way that resources on a local machine are. But the inability to control resource use isn't good, and the hard-to-notice change in default behavior also isn't good. We should switch to "threading" as the default remote_backend so that users need to request the new, more aggressive behavior (and manage the total number of jobs through n_jobs parameters inside whatever they're calling). It would be much more difficult to try to make the n_jobs parameter on the Civis joblib backend actually restrict the total number of jobs when using remote_backend = "civis", so for now I think we should settle for documenting.

Make _robust_schema_table_split public

The private function civis.io._tables._robust_schema_table_split is pretty useful. I find myself needing to do that in other pieces of code. I suggest making it public, perhaps with a name like split_schema_table. We may also want to handle the case where a table name is provided without a schema, e.g. for a temporary table.

Clear up examples in civis_to_multifile_csv docs

@gdj0nes mentioned that the docs for civis_to_multifile_csv could be a little clearer on how to download all the files in the manifest. Currently the example is just demonstrating how to download the first id, but we should loop over all of the ids.
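A fuller example might look like the sketch below. The manifest layout assumed here ('entries' with 'id' and 'name' keys) and the download callable's signature are assumptions about the API; with the real client, `download` would be something like civis.io.civis_to_file.

```python
import os

def download_manifest_entries(manifest, download, dest_dir='.'):
    """Download every file listed in a civis_to_multifile_csv manifest,
    not only the first.  `download` is a callable in the style of
    civis.io.civis_to_file, taking (file_id, file_object)."""
    paths = []
    for entry in manifest['entries']:
        path = os.path.join(dest_dir, entry['name'])
        with open(path, 'wb') as f:
            download(entry['id'], f)
        paths.append(path)
    return paths
```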

Surface the documentation for civis.find and civis.find_one

The docstrings for civis.find and civis.find_one have been recently added at #224, but the two functions aren't explicitly mentioned anywhere in the current Sphinx documentation. Currently there isn't a page dedicated to utility functions of this sort. Given that their major use case is to filter civis.response.Response objects, how about adding a section like "Utility Functions" or something in responses.rst?

@stephen-hoover @keithing

cache last response

It might be nice to save the last response in the APIClient class so that users could retrieve it if they forget to assign it when using the client in a REPL or notebook.

e.g.,

> client = civis.APIClient()
> client.scripts.post_containers(...)
> client.last_response
{"id": ..., }

This idea came from a comment by @skomanduri. I can't think of drawbacks to this, but perhaps there are.
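A minimal sketch of the proposed behavior (the class and method names here are hypothetical, not the client's internals):

```python
class RememberingClient:
    """Sketch of the proposal: the client keeps the most recent
    response so a REPL user can recover it after the fact."""

    def __init__(self):
        self.last_response = None

    def _call_api(self, make_request):
        # In the real client this would issue the HTTP request.
        response = make_request()
        self.last_response = response
        return response

client = RememberingClient()
client._call_api(lambda: {'id': 123, 'state': 'success'})
# client.last_response now holds the response even though the return
# value was never assigned.
```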

parameterize index in civis.io.dataframe_to_civis

@keithing I use civis.io.dataframe_to_civis to write to Redshift, and had meaningful indexes, but these weren't written, and there doesn't appear to be any way to write them. We may want to make this a parameter (which would get passed to to_csv).

Examples: Parallel Training -- Shouldn't use if/else when reassigning labels

In the parallel training example, the labels from the Wisconsin breast cancer data set are transformed from 4,2 to 1,0.

Although this doesn't impact the result in this instance due to the data set, using a single if/else check is bad practice, and in an example that people will, for better or worse, learn from, it should probably be changed to something like:

df['is_cancer'] = [1 if val == 4 else (0 if val == 2 else '?') for val in df['is_cancer']]

This would prevent a null value in the dependent variable from being mapped to 0 which could impact the results and further illustrate that null values should be passed through to any algorithm that accepts them due to possible underlying issues with data collection.
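An alternative worth considering (my suggestion, not from the original example) is an explicit mapping with pandas' `.map`, which leaves any label outside {4, 2}, including nulls, as NaN instead of relying on nested conditionals:

```python
import pandas as pd

df = pd.DataFrame({'is_cancer': [4, 2, 4, None]})

# Map the labels explicitly: anything outside {4, 2} (including nulls)
# becomes NaN instead of being silently coerced to 0.
df['is_cancer'] = df['is_cancer'].map({4: 1, 2: 0})
```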

Make the Response object immutable

The civis.response.Response object subclasses dict, but it also copies its dictionary keys into attributes. I've seen code alternate between accessing its contents as attributes and as items. This is mostly okay, but as soon as you try to alter the contents, the two ways of accessing data get out of sync. If we want to keep the dual object / dictionary identity (we may not), then we should make the object, or at least the attributes which are part of the dictionary, immutable.

See this confusing exchange:

In [25]: resp = civis.response.Response({'spam': 'eggs'})

In [26]: resp.spam
Out[26]: 'eggs'

In [27]: resp['spam']
Out[27]: 'eggs'

In [28]: resp['spam'] = 'foo'

In [29]: resp['spam']
Out[29]: 'foo'

In [30]: resp.spam
Out[30]: 'eggs'

In [31]: resp.spam = 'bar'

In [32]: resp['spam']
Out[32]: 'foo'

In [33]: resp
Out[33]: {'spam': 'foo'}
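One possible shape for an immutable Response (a hypothetical sketch, not the current implementation): route attribute access through the dictionary and make both mutation paths raise, so the two views can never drift apart.

```python
class FrozenResponse(dict):
    """Sketch of an immutable Response: attribute and item access both
    work, and mutation through either raises instead of letting the
    two views drift out of sync."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        raise TypeError("Response objects are read-only")

    def __setitem__(self, key, value):
        raise TypeError("Response objects are read-only")

resp = FrozenResponse({'spam': 'eggs'})
# resp.spam and resp['spam'] agree, and both assignments raise TypeError.
```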

Endpoints with hyphens are not parsed properly

(Issue originally raised by @stephen-hoover on #156)

Although we are replacing hyphens here:

return re.sub("-", "_", method_name)

Hyphens are still not being parsed correctly as they need to also be replaced here:

base_path = path.split('/')[0].lower()

If hyphens are not replaced then endpoints with hyphens will be parsed as classes and functions that contain hyphens (which is invalid Python syntax).
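The fix is probably as simple as applying the same replacement to the base path; a sketch (function name and path format here are illustrative):

```python
import re

def parse_endpoint_names(path, method_name):
    """Sketch of the fix: replace hyphens in the base path (used for
    generated class names) as well as in the method name."""
    base_path = path.split('/')[0].lower().replace('-', '_')
    method = re.sub('-', '_', method_name)
    return base_path, method
```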

Retry file I/O in joblib backend helper function

In the Civis joblib backend, the remote helper function downloads the function to run and uploads results. API calls to the Civis platform automatically retry, but the communications with S3 do not retry. We occasionally have errors when communicating with S3, so we should add retries to the file I/O.
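A minimal retry wrapper for those S3 calls might look like the following sketch (a real version would likely use exponential backoff and a narrower exception list):

```python
import functools
import time

def retry(n_tries=3, delay=0.0, exceptions=(IOError,)):
    """Minimal retry decorator for intermittent I/O failures."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(n_tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == n_tries - 1:
                        raise  # out of retries: re-raise the last error
                    time.sleep(delay)
        return wrapper
    return decorator

attempts = []

@retry(n_tries=3)
def flaky_download():
    """Fails twice with an intermittent error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("intermittent S3 error")
    return b"payload"
```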

Appveyor tests don't show up in PRs

We have an Appveyor Github App, and the tests trigger for commits to open PRs, but they don't show up with the Travis tests at the bottom of the PR.

Add Python version to User-Agent header

The API client is currently recording its name ("civis-python"), its version number, and the version of requests which it's using. We should add the Python version to this. It will be useful to know how many people are using different language versions, so we can get an idea of which versions are most important to support.
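The resulting header might look something like the sketch below. The exact format and the version numbers are placeholders; the real client would use its own `__version__` and `requests.__version__` rather than arguments.

```python
import platform

def build_user_agent(client_version='1.9.2', requests_version='2.12.4'):
    """Sketch of a User-Agent string that also records the Python
    version alongside the client and requests versions."""
    return 'civis-python/{} python/{} requests/{}'.format(
        client_version, platform.python_version(), requests_version)

ua = build_user_agent()
# e.g. 'civis-python/1.9.2 python/3.7.0 requests/2.12.4'
```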

Coalesce README and index.rst

@stephen-hoover what do you think about replacing index.rst (which is the landing page on readthedocs) with the README? There's a lot of overlapping documentation, and users have expressed confusion to me about why certain docs appear in the README vs. readthedocs.

Convenience function for persisting dataframe to files endpoint

civis.io has a function file_to_dataframe, but no function going in the opposite direction. I find myself rewriting this simple function for myself a lot, and I thought it would make a nice little convenience for others. @keithing, would you be interested in including something like this in the civis.io function suite (I would make a PR)?
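The convenience function could be as small as the sketch below. The upload callable's signature (file object plus name, returning a file id, in the style of civis.io.file_to_civis) is an assumption here; the real PR would call the client directly.

```python
import io

def dataframe_to_file(df, upload, name='data.csv', **to_csv_kwargs):
    """Sketch of the proposed inverse of civis.io.file_to_dataframe:
    serialize a dataframe to CSV in memory and push it to the files
    endpoint via `upload`, returning the new file id."""
    buf = io.BytesIO(df.to_csv(**to_csv_kwargs).encode('utf-8'))
    return upload(buf, name)
```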

`civis.io.dataframe_to_civis` doesn't auto-detect headers

I used dataframe_to_civis on a dataframe with column names. The resulting table on Redshift added the header as an additional row in the table and gave the table default column names (column_0, column_1). I think the expected behavior would be for the function to auto-detect the header and use it as the column names in the table. The real bug here is that the function writes the header to file by default but does not enforce the header being read as column names in Platform. In case it's relevant, the R client automatically handled the header correctly for this same dataset, so there's also a discrepancy between the API clients.

Add the `cancel_timeout` parameter to executors and joblib backend

Container scripts take a cancel_timeout parameter which provides a delay between a user's cancel request and the system killing the container. This is useful to let code running in the container perform cleanup tasks. We should add this option to the futures._ContainerShellExecutor and the joblib backend.

Retry on Intermittent S3 Errors

Recently hit an error here:

response.raise_for_status()

where S3 returned an intermittent 500 on our GET to retrieve an object, causing the read_civis_sql call to raise an exception.
It'd be nice if we had a retry loop for S3 interactions to avoid these intermittent failures.

`civis.find` has unintuitive behavior with boolean kwargs

The civis.find (and civis.find_one) function allows users to filter a list of Response objects for user-supplied attribute/value pairs. However, the behavior when given a keyword argument with a boolean value is different than when given a keyword argument with any other value. If for example, one calls civis.find(objs, spam='eggs'), then the returned list contains all objects which have an attribute "spam" with value "eggs". However, civis.find(objs, spam=True) will return all objects which have an attribute "spam", regardless of value.

Perhaps it's worthwhile to have a way to filter on presence or absence of an attribute, but this means that there's no way to filter objects based on attributes with boolean values. I suggest that we treat booleans the same way as any other value type.

Regardless of whether we treat booleans the same way as other values, there's also a bug in the way we handle them now. If a user supplies a keyword argument with a value of False, then the returned list will always be empty, no matter what's in the objects.

I noticed this issue when reviewing #224 .
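Treating booleans like any other value could look like the sketch below (illustrative only, not the actual civis.find implementation): `spam=True` matches only attributes equal to True, and `spam=False` matches attributes equal to False rather than matching nothing.

```python
def find(objects, **kwargs):
    """Sketch of civis.find with booleans treated like other values."""
    missing = object()  # sentinel so absent attributes never match
    return [obj for obj in objects
            if all(getattr(obj, k, missing) == v for k, v in kwargs.items())]

class Obj:
    """Minimal stand-in for a Response object with attributes."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

objs = [Obj(spam=True), Obj(spam=False), Obj(spam='eggs')]
```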

Release v1.5.0

There have been extensive changes to the client between now and v1.4.0 (36 commits right now). We should push another minor version.

Retry on Sharing Endpoint 404s?

The sharing endpoints (put_shares_users, put_shares_groups) intermittently return 404s. Sometimes these represent a permanent failure, but sometimes they are intermittent and may be retried. We should consider retrying on 404s in case they are intermittent, but it's hard to know what the best course here would be.

Pushing Tables with Index

I think it would be good to raise a warning when a user pushes a table with io.dataframe_to_civis that has a named index, i.e., they set the index from a column, so it is likely needed to understand the data. The warning would say that they need to reset the index for those columns to appear.
