civisanalytics / civis-python
Civis API Client for Python: https://civisanalytics.com/products/civis-platform/
License: BSD 3-Clause "New" or "Revised" License
I'm paraphrasing some ideas @stephen-hoover had and sent over email:
Repeating dependencies in both setup.py and requirements.txt is problematic. There are a few solutions to this. One is to have requirements.txt contain only:
--index-url https://pypi.python.org/simple/
-e .
as recommended in this article.
We should also relax the dependencies in civis-python even further. We should put version requirements in places only where we know we require that version. For example, we require requests>=2.12. But are all of the other strict requirements necessary? I think we could remove the cloudpickle version restrictions entirely, for example. Maybe also for joblib.
Apparently, the period is a valid character in a (standard) identifier on Redshift (docs). This creates parsing issues when there's a period in either "schema" or "table" for a given table schema.table, and the Civis Python client attempts to parse out "schema" and "table" with code like this:
schema, table = name.split('.')
It looks like there are three places in the Civis Python client code base where this strategy is used:
Line 211 in 5becc24
civis-python/civis/io/_tables.py
Line 632 in 4b94eb8
civis-python/civis/io/_tables.py
Line 813 in 4b94eb8
I've confirmed that one can successfully run SQL queries on Civis Platform for a table whose name is in the form of foo."bar.baz" (with double quotes as the necessary delimiters because of the period in the table name bar.baz). We should update the client to handle periods (and delimiters) correctly.
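A quote-aware splitter could be sketched with the standard csv module. This is only an illustration, not the client's actual implementation, and the function name split_schema_table is hypothetical:

```python
import csv
import io

def split_schema_table(name):
    """Split 'schema.table' on the first unquoted period.

    Double-quoted identifiers may contain periods, e.g.
    'foo."bar.baz"' -> ('foo', 'bar.baz').
    """
    # csv.reader handles the double-quote delimiters for us
    reader = csv.reader(io.StringIO(name), delimiter='.',
                        doublequote=True, quotechar='"')
    parts = next(reader)
    if len(parts) == 1:
        return None, parts[0]  # bare table name, no schema
    if len(parts) != 2:
        raise ValueError("Expected schema.table, got: %s" % name)
    return parts[0], parts[1]
```

Treating csv's quoting rules as the identifier-quoting rules is a simplification, but it handles the foo."bar.baz" case above.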
The issue in #8 might have been caught if we were testing Windows platforms. We might be able to use something like https://www.appveyor.com/ for this.
civis.ml.ModelPipeline has a client init argument that lets users specify an APIClient object. This could be used, for example, to run with restricted permissions. When ModelPipeline is serialized, however, the member variable _client is discarded, and a new one with the default API key is created during deserialization (code).
We probably don't want to serialize the API client objects since they have API keys (i.e., credentials), but it'd be helpful to document this behavior a little better so that users don't have wrong assumptions about what API keys will get used.
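The behavior being described could be documented with a toy example like the one below. The class and attribute names only mirror the real ModelPipeline; this is an illustrative sketch, not the actual implementation:

```python
import pickle

class PipelineSketch:
    """Toy stand-in for ModelPipeline's serialization behavior."""
    def __init__(self, client=None):
        # _client may hold an API key, so it must not be pickled
        self._client = client if client is not None else 'default client'

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_client'] = None  # drop credentials before serializing
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if self._client is None:
            # a new client with the default API key is created here
            self._client = 'default client'

p = PipelineSketch(client='restricted client')
restored = pickle.loads(pickle.dumps(p))
# restored._client is the default client, not the restricted one
```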
This would help to eliminate the dreaded CivisJobFailure: None.
This line is wrong:
https://github.com/civisanalytics/civis-python/blob/master/civis/cli/__main__.py#L54
It needs to read
self.fail("Could not load YAML from path: %s" % value, param, ctx)
See this example from the click docs given in the comments in the code.
import click

class BasedIntParamType(click.ParamType):
    name = 'integer'

    def convert(self, value, param, ctx):
        try:
            if value[:2].lower() == '0x':
                return int(value[2:], 16)
            elif value[:1] == '0':
                return int(value, 8)
            return int(value, 10)
        except ValueError:
            self.fail('%s is not a valid integer' % value, param, ctx)

BASED_INT = BasedIntParamType()
Initially pointed out by @cmmalone. Let's try to speed up read_civis by avoiding a write to disk followed by a read from disk.
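One approach is to stream the export straight into an in-memory buffer. In this sketch, download_to is a stand-in for the platform download call (an assumption, not the client's actual API):

```python
import csv
import io

def read_rows(download_to, encoding='utf-8'):
    """Read CSV rows without touching disk.

    download_to(buf) is assumed to write raw CSV bytes into buf;
    in the real client this would be the platform file download.
    """
    buf = io.BytesIO()
    download_to(buf)  # stream from the API directly into memory
    buf.seek(0)
    return list(csv.reader(io.TextIOWrapper(buf, encoding=encoding)))
```

For large tables, a memory-backed buffer trades disk I/O for RAM, so a size cutoff (spooling to disk past some threshold, as tempfile.SpooledTemporaryFile does) might be worth considering.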
Now that Python v3.7 is out, we should add a 3.7 test to Travis and update our documentation to indicate that we're compatible.
We should probably have a development section in the README explaining cassettes and the cached API specs used for the tests.
I am getting the following error when trying to call this method though it seems as if my arguments are correct:
>>> client.scripts.delete_containers_shares_groups(4909147, 4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/civisemployee/Developer/civis-python/civis/resources/_resources.py", line 239, in f
return self._call_api(verb, url, query, body, iterator=iterator)
File "/Users/civisemployee/Developer/civis-python/civis/base.py", line 88, in _call_api
return_type=self._return_type)
File "/Users/civisemployee/Developer/civis-python/civis/response.py", line 84, in convert_response_data_type
return Response(data, headers=headers)
File "/Users/civisemployee/Developer/civis-python/civis/response.py", line 116, in __init__
for key, v in json_data.items():
AttributeError: 'NoneType' object has no attribute 'items'
cc: @keithing
I've got some code like:
for i in range(500):
    model = civis.ml.ModelPipeline(clf)
    model.train()
Running this on my system eventually results in errors due to running out of file handles (my system's default limit is 1024).
Each call to model.train() results in three connections to Civis Platform which are never closed, consuming file handles (you can observe this using lsof -i and strace on Linux).
This similar bit of code results in connections being closed and no file handle leaks:
for i in range(500):
    client = civis.APIClient()
    model = civis.ml.ModelPipeline(clf, client=client)
    model.train()
I have a vague suspicion that in the first case, ModelPipeline creates its own APIClient object since one was not supplied, and that its getting garbage-collected somehow prevents the connections from being closed.
I think we have an accidental development dependency on scikit-learn in our tests. It only affects the tests, so I think regular users can use the client without sklearn.
I get the following error from test_setup_remote_backend:
AttributeError: <module 'civis.parallel' from '/.../civis-python/civis/parallel.pyc'> does not have the attribute '_sklearn_reg_para_backend'
I think the issue is that this check is False when using a mock.
This doesn't cause CI failures because the CI environment has sklearn.
One option would be to add sklearn to dev-requirements.txt, but I think fixing the mocking wouldn't be all that difficult.
Heard of a use case recently where a user had a notebook instance open for a while and ran into an issue with caching. The user called APIClient(resources='all') but did not see new endpoints they expected to see (because they had already called it earlier, and the results were cached before the endpoints were created).
My impression is that it's not trivial to refresh lru_cache in a smart way to avoid this, and since this is a rare occurrence, we should address it with documentation.
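For the documentation, the relevant fact is that functions decorated with functools.lru_cache expose a cache_clear() method. The fetch_spec function below is a hypothetical stand-in for the client's internal cached spec fetch, used only to illustrate the workaround:

```python
import functools

calls = {'n': 0}

@functools.lru_cache(maxsize=None)
def fetch_spec(resources):
    # stand-in for the client's cached API-spec download
    calls['n'] += 1
    return 'spec for %s' % resources

fetch_spec('all')        # fetches (calls['n'] becomes 1)
fetch_spec('all')        # served from cache (calls['n'] stays 1)
fetch_spec.cache_clear() # invalidate the cache
fetch_spec('all')        # refetched (calls['n'] becomes 2)
```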
For some reason, when one of the models submitted for parallel training using the civis.futures module has cross_validation_parameters specified, polling with futures.wait does not work.
I get the following error message:
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/Users/dch024/virtualenvs/scipy3/lib/python3.5/site-packages/civis/polling.py", line 41, in run
if self.poller(*self.poller_args).state in DONE:
AttributeError: 'NoneType' object has no attribute 'state'
Can be replicated by running the parallel training example
When using read_civis() on a table that contains accented characters, the function will fail with a UnicodeDecodeError even if you specify the encoding argument.
So something like
civis.io.read_civis(table='schema.table_with_accented_characters', database='Database Name', use_pandas=False, encoding='latin-1')
will fail with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 106622: unexpected end of data
It sounds like we do some utf-8 encoding before the client reads the kwargs of read_civis() and that's where the error is surfacing.
Overview: Add functionality to get column types, including date, and sortkeys from Redshift meta info for Pandas IO
I modified the client to allow me to pass arguments to pd.read_csv on line 227 in civis/io/_tables.py. I gather the column types using client.tables.get, after which I map the SQL types to their corresponding Python types. Doing this eliminates the automatic str-to-int conversion (e.g., for zip codes) and automatically identifies the date columns. Is there any interest in adding this in a more thoughtful way to the code?
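A rough illustration of the idea follows. The type names in the mapping and the (name, sql_type) input shape are assumptions for the sketch, not the actual output of client.tables.get:

```python
# Hypothetical mapping from Redshift column types to pandas-friendly dtypes.
SQL_TO_DTYPE = {
    'smallint': 'int64', 'integer': 'int64', 'bigint': 'int64',
    'real': 'float64', 'double precision': 'float64',
    'boolean': 'bool', 'character varying': 'str',
}
DATE_TYPES = {'date', 'timestamp', 'timestamp without time zone'}

def read_csv_kwargs(columns):
    """Build pd.read_csv keyword arguments from (name, sql_type) pairs."""
    return {
        'dtype': {n: SQL_TO_DTYPE[t] for n, t in columns if t in SQL_TO_DTYPE},
        'parse_dates': [n for n, t in columns if t in DATE_TYPES],
    }

kwargs = read_csv_kwargs([('zip', 'character varying'),
                          ('created_at', 'timestamp')])
# 'zip' stays a string (no str-to-int conversion);
# 'created_at' is parsed as a date.
```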
@beckermr wrote
I wrote a similar CLI for workflows ... Are we open to a similar PR for that?
in #197. I'm moving the discussion here.
If you have written code for a workflows CLI, I think it would be a great addition. Some thoughts before we do it:
Are we going to run into the civis workflow and civis workflows confusion we did with notebooks? I like the solution we did in #197, but the CLI right now feels overloaded with both high-level and low-level functions in the same namespace.
We should make sure to write documentation. It's easier to add the documentation for new features up front rather than waiting and forgetting all the intricacies that need to be documented.
The get_table_id function throws a ValueError if the schema and tablename are quoted, which doesn't seem to be the correct behavior.
Actual behavior (v1.9.2):
In [1]: import civis
In [2]: client = civis.APIClient()
In [3]: client.get_table_id(table='foo.bar', database='baz')
Out[3]: 123
In [4]: client.get_table_id(table='"foo"."bar"', database='baz')
ValueError: Given table "foo"."bar" is not an exact match for returned table foo.bar.
Expected behavior:
In [1]: import civis
In [2]: client = civis.APIClient()
In [3]: client.get_table_id(table='foo.bar', database='baz')
Out[3]: 123
In [4]: client.get_table_id(table='"foo"."bar"', database='baz')
Out[4]: 123
While get_table_id derives the schema and tablename via civis.io.split_schema_tablename:
Line 271 in eb5d46c
which (correctly) gives us 'foo' as the schema and 'bar' as the tablename (both without quotes) with '"foo"."bar"' (with quotes) as the input, it looks like the following bit (whose purpose is unclear to me) can be removed for the desired behavior:
Lines 277 to 280 in eb5d46c
That being said, perhaps there's more to it, and the current maintainers would probably know more.
The civis.find function currently has no documentation. That's all.
It would be really nice to have asynchronous file downloads. I came across this because I was using civis_to_multifile_csv which produces multiple files.
Our custom joblib backend relies on some internal joblib features which changed in the v0.12 release. The most recent joblib release is now 0.13. We should update our code so that we're compatible with v0.11 through v0.13.
In v1.7, the Civis joblib backend gained the ability to set itself as the backend inside child jobs. This means that child jobs which use joblib could also use the Civis joblib backend to parallelize themselves in different container scripts. These child jobs would be unaware of any job limit which the user set on the original joblib call, and wouldn't have a way to coordinate between themselves in any case. This could lead to users consuming drastically more resources than they expected.
Two further points:
I think that in principle, the Civis joblib backend is doing what it should -- resources in the cloud aren't limited in the same way that resources on a local machine are. But the inability to control resource use isn't good, and the hard-to-notice change in default behavior also isn't good. We should switch to "threading" as the default remote_backend so that users need to request the new, more aggressive behavior (and manage the total number of jobs through n_jobs parameters inside whatever they're calling). It would be much more difficult to make the n_jobs parameter on the Civis joblib backend actually restrict the total number of jobs when using remote_backend = "civis", so for now I think we should settle for documenting this.
The private function civis.io._tables._robust_schema_table_split is pretty useful. I find myself needing to do that in other pieces of code. I suggest making it public, perhaps with a name like split_schema_table. We may also want to handle the case where a table name is provided without a schema, e.g. for a temporary table.
Sphinx v1.7.0 (released on PyPI in Feb 2018) has made changes (removing sphinx.util.compat at v1.7.0b1) that break the current PyPI release of numpydoc v0.7.0 (released in Jun 2017). As a workaround for now, the easiest fix is to pin Sphinx < 1.7.0 (a change made in #224).
When numpydoc makes a new release on PyPI, we should update the dependencies for both Sphinx and numpydoc.
@gdj0nes mentioned that the docs for civis_to_multifile_csv could be a little clearer on how to download all the files in the manifest. Currently the example only demonstrates how to download the first id; we should loop over all of the ids.
In this block, Python 2.7's os.path.getmtime appears to raise OSError (not FileNotFoundError or ValueError) when _CACHED_SPEC_PATH does not exist. As a result, the API spec isn't downloaded.
I discovered this using the CLI (civis files download ...) in a container with a 2.7 install.
The docstrings for civis.find and civis.find_one were recently added in #224, but the two functions aren't explicitly mentioned anywhere in the current Sphinx documentation. Currently there isn't a page dedicated to utility functions of this sort. Given that their major use case is to filter civis.response.Response objects, how about adding a section like "Utility Functions" or something in responses.rst?
It might be nice to save the last response in the APIClient
class so that users could retrieve it if they forget to assign it when using the client in a REPL or notebook.
e.g.,
> client = civis.APIClient()
> client.scripts.post_containers(...)
> client.last_response
{"id": ..., }
This idea came from a comment by @skomanduri. I can't think of drawbacks to this, but perhaps there are.
@keithing I use civis.io.dataframe_to_civis to write to Redshift. I had meaningful indexes, but these weren't written, and it doesn't appear there's any ability to write them at all. We may want to make this a parameter (which would get passed to to_csv).
It looks like the CLI for Python 2.7 is choking when trying to get a cached API spec. In Python 2.7, the file not existing raises an OSError, which is not caught in the right spot:
https://github.com/civisanalytics/civis-python/blob/master/civis/cli/__main__.py#L204
@stephen-hoover I wonder if you've mentioned this before? It seems like it might be useful to return the file_id from a SQL query without requiring the user to download the result (as in read_civis_sql).
cc @jkulzick
In the parallel training example, the labels from the Wisconsin breast cancer data set are transformed from 4,2 to 1,0.
Although this doesn't impact the result in this instance due to the data set, it is bad practice to use a single if/else mapping, and in an example which people, for better or worse, learn from, it should probably be changed to something like:
df['is_cancer'] = [1 if val == 4 else (0 if val == 2 else '?') for val in df['is_cancer']]
This would prevent a null value in the dependent variable from being mapped to 0 (which could impact the results) and further illustrate that null values should be passed through to any algorithm that accepts them, due to possible underlying issues with data collection.
The civis.response.Response object subclasses dict, but it also copies its dictionary keys into attributes. I've seen code alternate between accessing its contents as attributes and as items. This is mostly okay, but as soon as you try to alter the contents, the two ways of accessing data get out of sync. If we want to keep the dual object/dictionary identity (we may not), then we should make the object, or at least the attributes which are part of the dictionary, immutable.
See this confusing exchange:
In [25]: resp = civis.response.Response({'spam': 'eggs'})
In [26]: resp.spam
Out[26]: 'eggs'
In [27]: resp['spam']
Out[27]: 'eggs'
In [28]: resp['spam'] = 'foo'
In [29]: resp['spam']
Out[29]: 'foo'
In [30]: resp.spam
Out[30]: 'eggs'
In [31]: resp.spam = 'bar'
In [32]: resp['spam']
Out[32]: 'foo'
In [33]: resp
Out[33]: {'spam': 'foo'}
(Issue originally raised by @stephen-hoover on #156)
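If we keep the dual identity, one direction is to block mutation entirely. A minimal sketch (the class name is hypothetical, and methods like update/pop would need the same treatment in a real implementation):

```python
class FrozenResponse(dict):
    """Dict whose keys are also readable as attributes, but immutable."""

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails, so dict
        # methods and internals are unaffected.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def _readonly(self, *args, **kwargs):
        raise TypeError('Response contents are read-only')

    __setattr__ = _readonly
    __setitem__ = _readonly
    __delitem__ = _readonly
```

With this, both resp.spam and resp['spam'] read the same value, and either style of assignment raises instead of silently diverging.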
Although we are replacing hyphens here:
civis-python/civis/resources/_resources.py
Line 372 in 30d4591
hyphens are still not being parsed correctly, as they also need to be replaced here:
civis-python/civis/resources/_resources.py
Line 399 in 30d4591
If hyphens are not replaced there, endpoints with hyphens will be parsed into class and function names that contain hyphens (which is invalid Python syntax).
In the Civis joblib backend, the remote helper function downloads the function to run and uploads results. API calls to the Civis platform automatically retry, but the communications with S3 do not retry. We occasionally have errors when communicating with S3, so we should add retries to the file I/O.
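A minimal retry wrapper could look like the sketch below (in practice we might instead reuse the retry machinery in requests/urllib3; the exception tuple here is a placeholder for whatever the S3 calls actually raise):

```python
import time

def retry(func, tries=5, base_delay=1.0, retry_on=(IOError,)):
    """Call func(), retrying with exponential backoff on retry_on."""
    for attempt in range(tries):
        try:
            return func()
        except retry_on:
            if attempt == tries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)
```

Usage would wrap just the S3 upload/download calls, leaving the existing platform API retries untouched.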
We have an Appveyor Github App, and the tests trigger for commits to open PRs, but they don't show up with the Travis tests at the bottom of the PR.
It'd be nice to have a progress bar show up for civis files upload and civis files download.
possibilities:
The API client currently records its name ("civis-python"), its version number, and the version of requests which it's using. We should add the Python version to this. It will be useful to know how many people are using different language versions, so we can get an idea of which versions are most important to support.
@stephen-hoover what do you think about replacing index.rst (which is the landing page on readthedocs) with the README? It looks like there is a lot of overlapping documentation, and I've had users express confusion to me about why certain docs appear in the README vs. readthedocs.
It'd be nice to use range headers to avoid re-downloading already-downloaded bytes when an S3 connection error happens. See #273.
civis.io has a function file_to_dataframe, but no function going in the opposite direction. I find myself rewriting this simple function a lot, and I thought it would make a nice little convenience for others. @keithing, would you be interested in including something like this in the civis.io function suite (I would make a PR)?
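The proposed helper could be as small as the sketch below. The name dataframe_to_file is hypothetical, and upload is a stand-in for the actual file-upload call (e.g., civis.io.file_to_civis):

```python
import io

def dataframe_to_file(df, upload, name='data.csv', **to_csv_kwargs):
    """Serialize df to CSV in memory and upload it; return the result
    of upload (e.g., a file id). upload(fileobj, name) is a stand-in
    for the real upload call."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, **to_csv_kwargs)
    buf.seek(0)
    return upload(buf, name)
```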
I used dataframe_to_civis on a dataframe with column names. The resulting table on Redshift added the header as an additional row in the table, and gave the table default column names (column_0, column_1). I think the expected behavior would be for the function to auto-detect the header and add it as the column names in the table. The real bug here is the fact that the function writes the header to file by default but does not enforce the header being read as column names in Platform. In case it's relevant, the R client did automatically handle the header correctly for this same dataset, so there's also a discrepancy between the API clients.
Container scripts take a cancel_timeout parameter which provides a delay between a user's cancel request and the system killing the container. This is useful to let code running in the container perform cleanup tasks. We should add this option to the futures._ContainerShellExecutor and the joblib backend.
Recently hit an error here:
civis-python/civis/io/_tables.py
Line 346 in 7a08b8c
I think the latest pyyaml release is 5.1, but requirements.txt lists only 3.99. This causes incompatibility with the newest version of the googleads module, for example, which depends on 5.1. Possible to update?
The civis.find (and civis.find_one) function allows users to filter a list of Response objects for user-supplied attribute/value pairs. However, the behavior when given a keyword argument with a boolean value is different than when given a keyword argument with any other value. If, for example, one calls civis.find(objs, spam='eggs'), then the returned list contains all objects which have an attribute "spam" with value "eggs". However, civis.find(objs, spam=True) will return all objects which have an attribute "spam", regardless of value.
Perhaps it's worthwhile to have a way to filter on the presence or absence of an attribute, but this means that there's no way to filter objects based on attributes with boolean values. I suggest that we treat booleans the same way as any other value type.
Regardless of whether we treat booleans the same way as other values, there's also a bug in the way we handle them now. If a user supplies a keyword argument with a value of False
, then the returned list will always be empty, no matter what's in the objects.
I noticed this issue when reviewing #224 .
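A sketch of find treating booleans like any other value (for illustration; this is the proposed behavior, not the current implementation):

```python
_SENTINEL = object()  # distinguishes "attribute missing" from any real value

def find(objects, **kwargs):
    """Return objects whose attributes equal every keyword value,
    including values of True and False."""
    return [obj for obj in objects
            if all(getattr(obj, k, _SENTINEL) == v
                   for k, v in kwargs.items())]
```

With this, find(objs, spam=False) returns exactly the objects whose spam attribute is False, rather than always returning an empty list.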
There have been extensive changes to the client between now and v1.4.0 (36 commits right now). We should push another minor version.
The sharing endpoints (put_shares_users, put_shares_groups) intermittently return 404s. Sometimes they represent a permanent failure, but sometimes they represent an intermittent failure which may be retried. We should consider retrying on 404s in case they are intermittent, but it's hard to know what the best course here would be.
I think it would be good to raise a warning when a user tries to push a table with io.dataframe_to_civis that has a named index (i.e., they set the index from a column, so it is likely necessary for understanding the data). The warning would say that they need to reset the index for those columns to appear.
civis-python requires click version 6, but there's now a v7. We should test with it and update the requirements.txt file.