
qds-sdk-py's Introduction

Qubole Data Service Python SDK


A Python module that provides the tools you need to authenticate with, and use, the Qubole Data Service API.

Installation

From PyPI

The SDK is available on PyPI.

$ pip install qds-sdk

From source

  • Get the source code:
  • Run the following command (you may need to do this as root):

    $ python setup.py install
  • Alternatively, if you use virtualenv, you can do this:

    $ cd qds-sdk-py
    $ virtualenv venv
    $ source venv/bin/activate
    $ python setup.py install

This should place a command-line utility, qds.py, somewhere in your path:

$ which qds.py
/usr/bin/qds.py

CLI

qds.py allows running Hive, Hadoop, Pig, Presto and Shell commands against QDS. Users can run commands synchronously, or submit a command and check its status later.

$ qds.py -h  # will print detailed usage

Examples:

  1. Run a Hive query and print the results

    $ qds.py --token 'xxyyzz' hivecmd run --query "show tables"
    $ qds.py --token 'xxyyzz' hivecmd run --script_location /tmp/myquery
    $ qds.py --token 'xxyyzz' hivecmd run --script_location s3://my-qubole-location/myquery
  2. Pass in the API token from a bash environment variable

    $ export QDS_API_TOKEN=xxyyzz
  3. Run the example Hadoop command

    $ qds.py hadoopcmd run streaming -files 's3n://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3n://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py' -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input 's3n://paid-qubole/default-datasets/gutenberg' -output 's3n://example.bucket.com/wcout'
  4. Check the status of command #12345678

    $ qds.py hivecmd check 12345678
    {"status": "done", ... }
  5. If you are hitting an API endpoint other than api.qubole.com, you can pass it on the command line as --url or set it as an environment variable

    $ qds.py --token 'xxyyzz' --url https://<env>.qubole.com/api hivecmd ...
    
    or
    
    $ export QDS_API_URL=https://<env>.qubole.com/api

SDK API

An example Python application needs to do the following:

  1. Set the api_token and api_url (if the api_url is other than api.qubole.com):

    from qds_sdk.qubole import Qubole
    
    Qubole.configure(api_token='ksbdvcwdkjn123423')
    
    # or
    
    Qubole.configure(api_token='ksbdvcwdkjn123423', api_url='https://<env>.qubole.com/api')
  2. Use the Command classes defined in commands.py to execute commands. To run a Hive command:

    from qds_sdk.commands import *
    
    hc = HiveCommand.create(query='show tables')
    print("Id: %s, Status: %s" % (str(hc.id), hc.status))

example/mr_1.py contains a Hadoop Streaming example.
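Commands created this way typically need to be polled until they reach a terminal state. The loop below is a sketch with the status fetcher and terminal-state check injected, so it runs without a live QDS connection; with the SDK it might be wired to Command.find and Command.is_done (names that appear elsewhere in this repo), but that wiring is an assumption, not an official helper.

```python
import time

def wait_for_command(fetch_status, is_done, poll_interval=5, sleep=time.sleep):
    """Poll fetch_status() until is_done(status) is true; return the final status.

    fetch_status and is_done are injected so the loop can be exercised
    without a live QDS connection.
    """
    status = fetch_status()
    while not is_done(status):
        sleep(poll_interval)
        status = fetch_status()
    return status
```

With the SDK, this might be called as wait_for_command(lambda: Command.find(hc.id).status, Command.is_done, Qubole.poll_interval) (hypothetical wiring).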

Reporting Bugs and Contributing Code

  • Want to report a bug or request a feature? Please open an issue.
  • Want to contribute? Fork the project and create a pull request with your changes against the unreleased branch.

Where are the maintainers?

Qubole was acquired, and all the maintainers of this repo have moved on. Some of the employees founded ClearFeed; others are on big data teams at Microsoft, Amazon, et al.


qds-sdk-py's Issues

Automatic retries

Hi,

Say I have a (e.g. Hive) command that I prepare and submit using the library. Is there any way to have it automatically retry the query if it fails, for a given number of times?
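The SDK has no built-in resubmission, but a thin wrapper can provide it. A minimal sketch, assuming the caller supplies a submit-and-wait function and a success predicate (both hypothetical names, not SDK API):

```python
import time

def run_with_retries(submit, succeeded, max_attempts=3, backoff=1.0,
                     sleep=time.sleep):
    """Re-run submit() until succeeded(result) is true or attempts run out;
    return the last result either way."""
    result = None
    for attempt in range(1, max_attempts + 1):
        result = submit()
        if succeeded(result):
            break
        if attempt < max_attempts:
            sleep(backoff * attempt)  # linear backoff between resubmissions
    return result
```

Here submit() could, for instance, create a HiveCommand and block until it finishes, with succeeded checking its final status.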

DbTap - Additional Required Parameters for on-premise Location

The Qubole docs for creating a DbTap say this:

Gateway parameters (gateway_ip, gateway_username, gateway_private_key) can be specified only if db_location is ‘on-premise’.
Though the gateway parameters are optional, if any one of the gateway parameters is specified then all three must be specified.

The SDK supports adding "on-premise" as a location:

edit.add_argument("--location", dest="location", choices=["us-east-1", "us-west-2", "ap-southeast-1", "eu-west-1", "on-premise"])

But I'm not sure how to pass the additional required parameters:

  • gateway_ip
  • gateway_port
  • gateway_username
  • gateway_private_key

My Issue Request:

  1. If it is possible, maybe add some code comments or documentation around this?
  2. If it's not possible, I would suggest a two-part resolution: (1) remove "on-premise" as an option until (2) the additional required parameters can be added to the SDK

semantics of cmd.is_done() and cmd.is_success() seems incorrect

When querying for the status of a cmd, I was using cmd.is_done("done") and cmd.is_success("done") to see if the cmd had completed in the done state AND completed successfully. However, these methods simply tell whether the status passed in is a 'done' status or a 'success' status. This is an incorrect implementation. Ideally no status would need to be passed, and the cmd.is_xxx() methods would return whether the cmd has reached that particular status in its transition or not.

This is the related code block in qds-sdk-py

commands.py
    @staticmethod
    def is_done(status):
        """
        Does the status represent a completed command
        Args:
            ``status``: a status string

        Returns:
            True/False
        """
        return (status == "cancelled" or status == "done" or status == "error")

    @staticmethod
    def is_success(status):
        return (status == "done")
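The semantics the issue asks for can be illustrated with a toy wrapper (a sketch, not the SDK's actual classes): instance methods that consult the command's own status, so the caller never passes a status string.

```python
class CommandStatus:
    """Toy stand-in for a qds_sdk command object, illustrating the
    instance-level semantics the issue asks for."""
    DONE_STATES = {"cancelled", "done", "error"}

    def __init__(self, status):
        self.status = status

    def is_done(self):
        # completed in any terminal state, successful or not
        return self.status in self.DONE_STATES

    def is_success(self):
        # completed *and* succeeded
        return self.status == "done"
```

With this shape, cmd.is_done() and cmd.is_success() mean what a caller would naturally expect.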

bump up version to v1.3

The new cluster API will work only on the v1.3 route. Since all the other cmds in CmdArgs work with v1.3 now, it is redundant to specify --version=v1.2 for the cluster APIs and use defaults otherwise. This change is required post-merge of pull 27.

Cluster actions always go through the v1.2 of the API

When creating clusters with v1.3 of the API we still use the v1.2 endpoint during cluster creation. The _parse_create_update method accepts the api-version only to distinguish the request parameters, but the create, update and clone methods still use the default configuration of Qubole.agent(..), which is set to v1.2 unless all the actions in the session are configure(..)d with v1.3 of the API.

In my case I want to perform some of the unsupported actions with v1.3 of the cluster API but fall back to v1.2 for the commands APIs. One way to do this is to explicitly switch the version in the create(..)/update(..)/clone(..) calls, which alters the version for that API call alone and leaves the cached_agent as is for the v1.2 APIs.

utf-8 errors with qds.py when retrieving results

I just run into this problem:

$ qds.py --token=$QUBOLE_KEY_AVAZQUEZ hivecmd getresult 28319250

Traceback (most recent call last):
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 603, in <module>
    sys.exit(main())
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 556, in main
    return cmdmain(a0, args)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 194, in cmdmain
    return globals()[action + "action"](cmdclass, args)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 160, in getresultaction
    return _getresult(cmdclass, cmd)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 118, in _getresult
    cmd.get_results(sys.stdout, delim='\t')
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 207, in get_results
    skip_data_avail_check=isinstance(self, PrestoCommand))
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 1286, in _download_to_local
    _read_iteratively(one_path, fp, delim=delim)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 1177, in _read_iteratively
    fp.buffer.write(data.decode('utf-8').replace(chr(1), delim).encode('utf8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 8191: unexpected end of data
Fatal Python error: GC object already tracked

Current thread 0x00007fff7c065000 (most recent call first):
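The byte offset in the error (position 8191 of what looks like an 8 KB read) points at a multi-byte UTF-8 character split across a chunk boundary. An incremental decoder carries the partial sequence over to the next chunk; the following is a sketch of how a download loop could decode safely, not the SDK's actual fix:

```python
import codecs

def decode_chunks(chunks, delim='\t'):
    """Decode byte chunks as UTF-8 without choking on multi-byte characters
    that straddle a chunk boundary, replacing the ^A column separator."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    out = []
    for chunk in chunks:
        out.append(decoder.decode(chunk).replace(chr(1), delim))
    out.append(decoder.decode(b'', final=True))
    return ''.join(out)
```

The decoder buffers any trailing partial sequence internally, so a character like é split between two chunks decodes cleanly once the second chunk arrives.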

Usage for non-cluster subcommands hard to find.

I tried a few things:

qds.py group --help
qds.py --help group
qds.py group -h
qds.py -h group

Only the cluster subcommand help shows (always), the others don't.

Oh, it works if you add a token, even if the token is bad.

qds.py --token=xyz group --help

ClusterInfo v1.3 is not very wieldy and extendible

As a part of this commit the ClusterInfoV13 class was introduced to deal with the newer version of the cluster API. The previous form, ClusterInfo, had setters for each property (spot, security settings, ec2 settings, etc.) to update specific parts of the cluster. E.g., if I needed to update only the spot instance properties, I had only to call set_spot_instance_settings(..). Now I have to call the long-form set_cluster_info(..) on each update. I cannot call the double-underscored __set_spot_instance_settings(..) anymore, since python's name mangling turns it into _ClusterInfoV13__set_spot_instance_settings(..). This is supremely frustrating and breaks a lot of the test code we have built up, which now needs to be rewritten to use the mangled name or the long-form Cluster.set_cluster_info(..) for each update call to a cluster object.

Is it possible to convert these setters to @property-ies, since that is how setters are exposed in Python? Was there a strong reason to hide the v1.3 setters, deviating from the norm in v1.2 where setters were public?

Updating cluster to modify cluster composition does not work

For pushing configs, I use the ClusterInfoV22 class to modify cluster composition. When updating the cluster with ClusterV2.update(), it uses api_version v2, as in the code below:

class ClusterV2(Resource):
    rest_entity_path = "clusters"
    api_version = "v2"

which causes the pushing of composition to not work, as api v2.2 is required for it.

Error: Status code 449 (RetryWithDelay)

I just started running into this error for some of my queries

qds.py --token=$QUBOLE_KEY_AVAZQUEZ hivecmd getresult 28597419 > my_data.csv
Error: Status code 449 (RetryWithDelay) from url https://api.qubole.com/api/v1.2/commands/28597419/results?inline=True

I am having this problem in Python 2.7.x and Python 3.5.x.

Improve semantics of Qubole.configure

Currently the user can run Qubole.configure as many times as they like. It modifies auth_token, url, api_version, etc., but these changes are not reflected in the connection agent, because it is cached once and reused thereafter. So the semantics of Qubole.configure are incorrect. We should either

  1. Disallow the user to run Qubole.configure more than once
    or
  2. If we allow it, then we should change cached_agent accordingly.
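Option 2 can be sketched with a toy client (hypothetical names, not the SDK's internals): reconfiguring drops the cached agent, so the next agent() call rebuilds it with the new settings.

```python
class Client:
    """Toy sketch of option 2: re-running configure() invalidates the
    cached connection agent so later calls pick up the new settings."""
    _config = None
    _cached_agent = None

    @classmethod
    def configure(cls, api_token, api_url="https://api.qubole.com/api"):
        cls._config = (api_token, api_url)
        cls._cached_agent = None  # drop the stale agent

    @classmethod
    def agent(cls):
        # rebuild lazily from the current config
        if cls._cached_agent is None:
            cls._cached_agent = {"token": cls._config[0], "url": cls._config[1]}
        return cls._cached_agent
```

The key line is setting _cached_agent back to None inside configure(); everything else matches the existing lazy-caching pattern.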

Keep running into utf-8 errors in Python 3.x

qds.py --token=$QUBOLE_KEY_AVAZQUEZ hivecmd getresult "37309793" > my_output

Fails with:

Traceback (most recent call last):
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 604, in <module>
    sys.exit(main())
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 557, in main
    return cmdmain(a0, args)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 195, in cmdmain
    return globals()[action + "action"](cmdclass, args)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 161, in getresultaction
    return _getresult(cmdclass, cmd)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 119, in _getresult
    cmd.get_results(sys.stdout, delim='\t')
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 245, in get_results
    skip_data_avail_check=isinstance(self, PrestoCommand))
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 1345, in _download_to_local
    _read_iteratively(one_path, fp, delim=delim)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 1236, in _read_iteratively
    fp.buffer.write(data.decode('utf-8').replace(chr(1), delim).encode('utf8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 8191: unexpected end of data
$ pip freeze | grep qds
qds-sdk==1.9.4
$ python --version
Python 3.5.2 :: Continuum Analytics, Inc.

Missing functionality in Scheduler class

I've noticed there are some missing features in some of the classes. For example, the qds_sdk.scheduler.Scheduler class doesn't have an edit/update or create method. I was able to add this functionality fairly easily in a class that extends qds_sdk.scheduler.Scheduler:

class MyScheduler(Scheduler):
    def create(self):
        conn = Qubole.agent()
        return conn.post(Scheduler.rest_entity_path, self.attributes)

    def edit(self, args):
        conn = Qubole.agent()
        return conn.put(self.element_path(self.id), args)

It would be useful if this functionality exists in the SDK by default.

hivecmd --query doesn't consume the entire option string

When going through the examples in the README, the show tables example fails.

$ qds.py --vv hivecmd run -q "SHOW TABLES"
https://api.qubole.com/api/v1.2/commands
INFO:qds_connection:[POST] https://api.qubole.com/api/v1.2/commands
INFO:qds_connection:Payload: {
    "sample_size": null, 
    "label": null, 
    "macros": null, 
    "query": "SHOW", 
    "command_type": "HiveCommand", 
    "can_notify": false, 
    "script_location": null
}
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): api.qubole.com
DEBUG:requests.packages.urllib3.connectionpool:"POST /api/v1.2/commands HTTP/1.1" 200 682
https://api.qubole.com/api/v1.2/commands/1690268
INFO:qds_connection:[GET] https://api.qubole.com/api/v1.2/commands/1690268
INFO:qds_connection:Payload: null
DEBUG:requests.packages.urllib3.connectionpool:"GET /api/v1.2/commands/1690268 HTTP/1.1" 200 694
ERROR:qds:Cannot fetch results - command Id: 1690268 failed with status: error
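The payload shows the query reduced to its first word ("query": "SHOW"), which is what happens when the quoting around "SHOW TABLES" is lost before argument parsing, for instance when the command passes through an extra shell layer that re-splits it. A quick illustration of the splitting behavior (an explanation sketch, not a claim about where this user's quoting was lost):

```python
import shlex

quoted = 'qds.py hivecmd run -q "SHOW TABLES"'
unquoted = 'qds.py hivecmd run -q SHOW TABLES'

# With quotes intact, -q receives the whole query as one argument.
assert shlex.split(quoted)[-1] == "SHOW TABLES"

# Without them, -q consumes only "SHOW"; "TABLES" becomes a stray token.
assert shlex.split(unquoted)[-2:] == ["SHOW", "TABLES"]
```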

S3ResponseError: S3ResponseError: 400 Bad Request when using V1.12.0

TL;DR
S3ResponseError: S3ResponseError: 400 Bad Request when making a hivecmd getresult command using V1.12.0. This behavior does not happen when I build & install the earlier releases V1.11.1 and V1.11.0.

Longer version
I have built & installed V1.12.0 from the release branch and tried to execute the following hivecmd getresult command:

/usr/local/bin/qds.py --token=<TOKEN>  --url=https://<QUBOLE_HOST>/api  -vv hivecmd getresult <COMMAND_ID>  > results.txt

I got S3ResponseError: S3ResponseError: 400 Bad Request.

Please refer below to the full error output:

INFO:qds_connection:[GET] https://<QUBOLE_HOST>/api/v1.2/commands/<COMMAND_ID>
INFO:qds_connection:Payload: null
INFO:qds_connection:Params: None
INFO:qds:Fetching results for HiveCommand, Id: <COMMAND_ID>
INFO:qds_connection:[GET] https://<QUBOLE_HOST>/api/v1.2/commands/<COMMAND_ID>/results
INFO:qds_connection:Payload: null
INFO:qds_connection:Params: {'include_headers': 'false', 'inline': True}
INFO:qds_connection:[GET] https://<QUBOLE_HOST>/api/v1.2/accounts/get_creds
INFO:qds_connection:Payload: null
INFO:qds_connection:Params: None
INFO:qds_commands:Starting download from result locations: [s3://<S3_HOST>/account_id/<ACCOUNT_ID>/tmp/<DATE>/<ACCOUNT_ID>/<COMMAND_ID>,s3://<S3_HOST>/account_id/<ACCOUNT_ID>/tmp/<DATE>/<ACCOUNT_ID>/<COMMAND_ID>.dir/]
INFO:qds_connection:[GET] https://<QUBOLE_HOST>/api/v1.2/commands/<COMMAND_ID>
INFO:qds_connection:Payload: null
INFO:qds_connection:Params: None
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/qds_sdk-1.12.0-py2.7.egg/EGG-INFO/scripts/qds.py", line 690, in <module>
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/qds_sdk-1.12.0-py2.7.egg/EGG-INFO/scripts/qds.py", line 637, in main
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/qds_sdk-1.12.0-py2.7.egg/EGG-INFO/scripts/qds.py", line 225, in cmdmain
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/qds_sdk-1.12.0-py2.7.egg/EGG-INFO/scripts/qds.py", line 191, in getresultaction
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/qds_sdk-1.12.0-py2.7.egg/EGG-INFO/scripts/qds.py", line 130, in _getresult
  File "build/bdist.macosx-10.12-x86_64/egg/qds_sdk/commands.py", line 327, in get_results
    _download_to_local(boto_conn, s3_path, fp, num_result_dir, delim=delim)
  File "build/bdist.macosx-10.12-x86_64/egg/qds_sdk/commands.py", line 1432, in _download_to_local
    bucket = boto_conn.get_bucket(bucket_name)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/boto-2.49.0-py2.7.egg/boto/s3/connection.py", line 509, in get_bucket
    return self.head_bucket(bucket_name, headers=headers)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/boto-2.49.0-py2.7.egg/boto/s3/connection.py", line 556, in head_bucket
    response.status, response.reason, body)
S3ResponseError: S3ResponseError: 400 Bad Request

need macro support for hadoopcmd and other cmds

It looks like macro replacement only works when using hivecmd. I was trying to do the same with shellcmd and the macros were not replaced as specified on the cmd line. The --macros option is also missing for shellcmd.

qds.py --skip_ssl_cert_check --token=$AUTH_TOKEN --url=https://test.qubole.net/api/ --version=latest shellcmd submit -q  "hadoop dfs -ls '$scheme$://$base_bucket$/*/' > level_one.txt;" --macros='[{"scheme": "s3", "base_bucket" : "some-dir-test"}]'

Cannot pass named arguments to shell command

If my shell command uses getopts for named arguments in a bash script, the script cannot be called via the qds Python API.

For example,

My bash script expects a named parameter (--date or -d). When I pass --date=2016-01-01 in the parameters argument for the shell command, the invocation fails.

get_results() does not seem to be working

Hey, so I launched a query:

hive_command_response = HiveCommand.create(query=sql)

then I wait for the query to finish:

cmd = Command.find(hive_command_response.id)
while not Command.is_done(cmd.status):
    time.sleep(Qubole.poll_interval)
    cmd = Command.find(hive_command_response.id)
    print('current jobs status is: ' + cmd.status)

then once it is finished, I want to get the results:

res = hive_command_response.get_results()

except that variable is empty, but if i do:

res = hive_command_response.get_log()

I see the correct log, and if I log in to the web app I can see my results.

Why am I not seeing them with the Python call?
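Elsewhere in this tracker, get_results is called with a file object (cmd.get_results(writer, delim=...)), which suggests it writes results to a stream rather than returning them, so an empty return value would be expected. A hedged sketch of capturing that stream output in memory:

```python
import io

def capture_results(write_results):
    """Capture the output of a function that writes to a file-like object
    (as qds_sdk's get_results appears to do) into a string."""
    buf = io.StringIO()
    write_results(buf)
    return buf.getvalue()
```

With the SDK, this might look like capture_results(lambda fp: hive_command_response.get_results(fp, delim='\t')) (hypothetical wiring).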

Spark SQL commands

I am hoping to use this library to run Spark SQL commands on a Qubole cluster. I see some references to Spark in the code, but not in the documentation (README.rst). Any guidance on this topic would be great.

skip_ssl_cert_check is inverted

When using the cmd line it is passed correctly.

From cmd line:

qds.py --skip_ssl_cert_check --token=$AUTH_TOKEN --url=https://api.qubole.com/api/ --version=latest cluster list --label default

But when called using the SDK's method, the argument must be skip_ssl_cert_check=False to skip the verification and skip_ssl_cert_check=True to apply the verification. This is inverted. Maybe calling the variable ssl_verify=True|False would make more sense.

From SDK (to apply SSL check):

Qubole.configure(
        api_token=cfg.get(config.option.environment, 'auth_token'),
        api_url=cfg.get(config.option.environment, 'api_url'),
        skip_ssl_cert_check=True
    )

Because the request args sent are inverted below:

kwargs = {'headers': self._headers, 'auth': self.auth, 'verify': not self.skip_ssl_cert_check}
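The double negative at the heart of this report can be isolated in a helper (a sketch of the intended mapping, using the ssl_verify naming the issue suggests; not an actual SDK change):

```python
def requests_verify(skip_ssl_cert_check):
    """Map the SDK's *skip* flag to requests' verify argument.

    skip_ssl_cert_check=True is meant to disable certificate verification,
    so verify is the negation; exposing ssl_verify=True|False directly
    would remove the double negative entirely.
    """
    return not skip_ssl_cert_check
```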

Properly reading results from Hive queries in Pandas in Python 3

What is the best way to read the output from disk with Pandas after using cmd.get_results ? (e.g. from a Hive command).

For example, consider the following:

out_file = 'results.csv'
delimiter = chr(1)
....

Qubole.configure(qubole_key)
hc_params = ['--query', query]
hive_args = HiveCommand.parse(hc_params)
cmd = HiveCommand.run(**hive_args)
if HiveCommand.is_success(cmd.status):
    with open(out_file, 'wt') as writer:
        cmd.get_results(writer, delim=delimiter, inline=False)

If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following:

$ head -c 300 results.csv
b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n

When I try to open this in Pandas:

df = pd.read_csv('results.csv')

it obviously doesn't work (I get an empty DataFrame), since it isn't properly formatted as a csv file.

While I could try to open results.csv and post-process it (to remove b', etc.) before I open it in Pandas, this would be a quite hacky way to load it.

Am I using the interface correctly? This is using the very latest version of qds_sdk, 1.4.2, from three hours ago.
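The stray b'...' prefixes suggest byte strings were written through a text-mode handle in that SDK version, and the \x01 bytes are Hive's default field separator. One workaround sketch, once the file contains clean text, is to parse with that delimiter directly instead of post-processing; parse_qds_results below is an illustrative helper, not SDK API, and with pandas the equivalent would be read_csv(path, sep=chr(1)).

```python
import csv
import io

def parse_qds_results(text, delim=chr(1)):
    """Parse ^A-delimited result text into rows (header row first)."""
    return list(csv.reader(io.StringIO(text), delimiter=delim))
```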

Getting "qds_sdk.exception.ResourceNotFound" from Airflow

Hi Team,

We are getting the error "qds_sdk.exception.ResourceNotFound" from Airflow for tasks that use a Qubole connection (which in turn uses qds-sdk), but the command itself is either running or successful on the Qubole backend.

Description of Airflow and QDS-SDK used:
apache-airflow==2.0.1
apache-airflow-providers-qubole==1.0.2
qds-sdk==1.16.1

Unable to install qds-sdk-py. Error: invalid command 'bdist_wheel'

  • Here is the list of packages in my environment. Note that it includes the latest wheel (0.26.0)
  • Inspecting my environment:
In [1]: import wheel

In [2]: wheel.__file__
Out[2]: '/Users/avazquez/anaconda3/envs/py35/lib/python3.5/site-packages/wheel/__init__.py'

In [3]: wheel.__version__
Out[3]: '0.26.0'
  • All using Python 3.5 on OSX from a conda install:
$ python --version
Python 3.5.0 :: Continuum Analytics, Inc.
  • And finally the full trace of $ pip install qds-sdk

    Building wheels for collected packages: qds-sdk
    Running setup.py bdist_wheel for qds-sdk
    Complete output from command /Users/avazquez/anaconda3/envs/py35/bin/python3 -c "import setuptools;__file__='/private/var/folders/yl/bqcgdj3d6ss_mrfkpzxlh89m0000gp/T/pip-build-k8fufyo7/qds-sdk/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /var/folders/yl/bqcgdj3d6ss_mrfkpzxlh89m0000gp/T/tmpmprq00c1pip-wheel-:
    usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
       or: -c --help [cmd1 cmd2 ...]
       or: -c --help-commands
       or: -c cmd --help
    
    error: invalid command 'bdist_wheel'
    
    ----------------------------------------
    Failed building wheel for qds-sdk
    Failed to build qds-sdk
    Installing collected packages: qds-sdk
       Running setup.py install for qds-sdk
    Successfully installed qds-sdk-1.7.0
    

Add ability to download results more than 10MB

Right now we just skip non-inline results:

commands.py: get_results()

    if r.get('inline'):
        return r['results']
    else:
        # TODO - this will be implemented in future
        log.error("Unable to download results, please fetch from S3")

Fatal Python error when using the CLI

I just run into this error when running:

qds.py --token=$QUBOLE_KEY_AVAZQUEZ hivecmd run --script_location $(pwd)/my_query.sql 
--tags "Team=opt-team" --cluster-label "default"
...
1466703091183   -1058627497_GEN_EX      http://xxx.yyy
1466703126387   -1058627497_GEN_EX      http://xxx.yyy
1466703161399   -1058627497_GEN_EX      http://xxx.yyy
1466703197279   -1058627497_GEN_EX      http://xxx.yyy
1466703232580   -1058627497_GEN_EX      http://xxx.yyy

Fatal Python error: GC object already tracked

Current thread 0x00007fff7c065000 (most recent call first):
[1]    29397 abort      qds.py --token=$QUBOLE_KEY_AVAZQUEZ hivecmd run --script_location  --tags

The command ID was 28319250, and the query completed correctly as far as I can tell.

Unicode decoding errors with qds-sdk

Hi, I have started running into Unicode decoding errors recently. When running:

    delimiter = chr(9)
    hc_params = ['--query', query]
    hc_params += ['--tags', 'Team=opt']

    hive_args = HiveCommand.parse(hc_params)
    cmd = HiveCommand.run(**hive_args)
    if HiveCommand.is_success(cmd.status):
        with open(out_file, 'wt') as writer:
            cmd.get_results(writer, delim=delimiter, inline=False)

I ended up with:

  File "/some_path/log-index/logindex/ qubole_query.py", line 54, in run_query
    cmd.get_results(writer, delim=delimiter, inline=False)
  File "/some_path/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 206, in get_results
    _download_to_local(boto_conn, s3_path, fp, num_result_dir, delim=delim)
  File "/some_path/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 1179, in _download_to_local
    _read_iteratively(one_path, fp, delim=delim)
  File "/some_path/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 1071, in _read_iteratively
    fp.buffer.write(data.decode('utf-8').replace(chr(1), delim).encode('utf8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 8191: unexpected end of data

Here is the Job ID with the query mentioned above.

unreleased versions are not installed when setup has 1.4.3 as stable

The setup.py setup() method indicates the stable version to be installed. Because the unreleased branch of qds-sdk has 1.4.3 as the stable version, downstream tools that want the unreleased versions, by specifying a github dependency pointed at the unreleased branch, will fail to install them: setuptools finds 1.4.3 to be stable over unreleased and always overrides with what is found on PyPI. In order to install unreleased versions, the "stable" version in the unreleased branch should be marked unreleased.

Timeout errors with the CLI

I am running into timeout errors when downloading data (the resulting my_data.csv file is also empty). Why is it failing?

> qds.py --token=$QUBOLE_KEY_AVAZQUEZ hivecmd getresult 28091451 > my_data.csv

Traceback (most recent call last):
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 385, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 387, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/http/client.py", line 1174, in getresponse
    response.begin()
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/http/client.py", line 282, in begin
    version, status, reason = self._read_status()
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/http/client.py", line 243, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/ssl.py", line 924, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/ssl.py", line 786, in read
    return self._sslobj.read(len, buffer)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/ssl.py", line 570, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/adapters.py", line 403, in send
    timeout=timeout
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/util/retry.py", line 255, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/packages/six.py", line 310, in reraise
    raise value
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
    chunked=chunked)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 389, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 314, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.qubole.com', port=443): Read timed out. (read timeout=300)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 603, in <module>
    sys.exit(main())
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 556, in main
    return cmdmain(a0, args)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 194, in cmdmain
    return globals()[action + "action"](cmdclass, args)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 160, in getresultaction
    return _getresult(cmdclass, cmd)
  File "/Users/amelio/anaconda/envs/py35/bin/qds.py", line 118, in _getresult
    cmd.get_results(sys.stdout, delim='\t')
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/commands.py", line 177, in get_results
    r = conn.get(result_path, {'inline': inline})
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/retry.py", line 22, in f_retry
    return f(*args, **kwargs)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/connection.py", line 52, in get
    return self._api_call("GET", path, params=params)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/connection.py", line 97, in _api_call
    return self._api_call_raw(req_type, path, data=data, params=params).json()
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/qds_sdk/connection.py", line 83, in _api_call_raw
    r = x.get(url, timeout=300, **kwargs)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/sessions.py", line 487, in get
    return self.request('GET', url, **kwargs)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/Users/amelio/anaconda/envs/py35/lib/python3.5/site-packages/requests/adapters.py", line 479, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.qubole.com', port=443): Read timed out. (read timeout=300)

hivecmd getresult changes delimiters for results >= 20 MB

For smaller results, the column delimiter is \t.
For larger results, the column delimiter is \001, likely due to direct download from S3 without post-processing.

This is an awkward result to consume. I looked into patching get_results in commands.py, but _download_to_local is complicated (it may handle multiple files, possibly a whole directory, and so on).
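Until that changes, consumers can normalize rows themselves. A minimal sketch (`normalize_row` is a hypothetical helper, not part of the SDK):

```python
def normalize_row(row):
    # Results >= 20 MB come back \x01-delimited (raw S3 download);
    # smaller results are tab-delimited. Normalize both to tabs.
    return row.replace("\x01", "\t")
```

Applied line by line while streaming the result file, this makes both size classes look identical to downstream parsers.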

SyntaxWarning and DeprecationWarning for invalid escape sequences

./qds_sdk/role.py:23: DeprecationWarning: invalid escape sequence \]
  help="Policy Statement example '[{\"access\":\"deny\", \"resource\": \"all\", \"action\": \"[\"create\",\"update\",\"delete\"\]\"}]'")
./qds_sdk/role.py:42: DeprecationWarning: invalid escape sequence \]
  help="Policy Statement example '[{\"access\":\"deny\", \"resource\": \"all\", \"action\": \"[\"create\",\"update\",\"delete\"\]\"}]'")
./qds_sdk/commands.py:1124: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if options.mode is "1":
./qds_sdk/commands.py:1137: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if options.db_update_mode is "updateonly":
./qds_sdk/commands.py:1424: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if (total is 0) or (downloaded == total):
./tests/test_command.py:1950: DeprecationWarning: invalid escape sequence \$
  sys.argv = ['qds.py', 'dbtapquerycmd', 'submit', '--query', "select * from table_1 limit  \$limit\$",
./tests/test_command.py:1958: DeprecationWarning: invalid escape sequence \$
  'query': "select * from table_1 limit  \$limit\$",
./tests/test_command.py:1966: DeprecationWarning: invalid escape sequence \$
  sys.argv = ['qds.py', 'dbtapquerycmd', 'submit', '--query', "select * from table_1 limit  \$limit\$",
./tests/test_command.py:1974: DeprecationWarning: invalid escape sequence \$
  'query': "select * from table_1 limit  \$limit\$",
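Both kinds of warning have mechanical fixes; a sketch with illustrative names, not a patch against the actual files:

```python
# Double the backslash (or use a raw string) so "\]" is a literal
# backslash-bracket rather than an invalid escape sequence:
help_text = "action list example: [\"create\",\"update\",\"delete\"\\]"

def is_update_only(db_update_mode):
    # "is" tests object identity, which only passes for string
    # literals by CPython interning accident; "==" tests the value,
    # which is what the code intends.
    return db_update_mode == "updateonly"
```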

ConnectionError handling

The SDK should handle ConnectionError from the underlying requests library more gracefully. These errors pop up when we poll the endpoint aggressively and a connection is refused.
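The SDK already ships a retry decorator (qds_sdk/retry.py appears in the tracebacks above); the behaviour being asked for is something like the following backoff wrapper around the polling call (a sketch; names are illustrative):

```python
import time

def call_with_retry(fn, attempts=5, backoff=0.5,
                    exceptions=(ConnectionError,), sleep=time.sleep):
    """Call fn(), retrying on refused/dropped connections with
    exponential backoff between attempts."""
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(backoff * (2 ** i))
```

The injectable `sleep` keeps the wrapper testable; in real use it would wrap the `conn.get(...)` poll.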

installation issue

Trying to install qds-sdk in a Python 2.7.5 virtualenv, I encounter this issue:

~/workspace/ [master]   source ~/workspace/qbl/__venv__/bin/activate
(__venv__) ~/workspace/ [master]   pip install qds-sdk              
Downloading/unpacking qds-sdk
  Downloading qds_sdk-1.1.1.tar.gz
  Running setup.py egg_info for package qds-sdk
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/Users/tsp/workspace/qbl/__venv__/build/qds-sdk/setup.py", line 23, in <module>
        long_description=read('README.rst')
      File "/Users/tsp/workspace/qbl/__venv__/build/qds-sdk/setup.py", line 10, in read
        return open(os.path.join(os.path.dirname(__file__), fname)).read()
    IOError: [Errno 2] No such file or directory: '/Users/tsp/workspace/qbl/__venv__/build/qds-sdk/README.rst'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 16, in <module>

  File "/Users/tsp/workspace/qbl/__venv__/build/qds-sdk/setup.py", line 23, in <module>

    long_description=read('README.rst')

  File "/Users/tsp/workspace/qbl/__venv__/build/qds-sdk/setup.py", line 10, in read

    return open(os.path.join(os.path.dirname(__file__), fname)).read()

IOError: [Errno 2] No such file or directory: '/Users/tsp/workspace/qbl/__venv__/build/qds-sdk/README.rst'

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /Users/tsp/workspace/qbl/__venv__/build/qds-sdk
Storing complete log in /Users/tsp/.pip/pip.log
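The crash comes from setup.py reading README.rst unconditionally while the sdist does not include that file. One fix is shipping README.rst in the MANIFEST; another is guarding the read. A sketch of the latter, following the `read` helper named in the traceback above:

```python
import os

def read(fname):
    # Fall back to an empty long_description instead of crashing
    # when README.rst is missing from the source distribution.
    path = os.path.join(os.path.dirname(__file__), fname)
    try:
        with open(path) as f:
            return f.read()
    except IOError:
        return ""
```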

Can you use underscores instead of hyphens in your releases?

In this pip issue we see that hyphens are converted to underscores during wheel building, but during lookup 1.0.5-beta does not compare equal to 1.0.5_beta. pip is going to be retooled, but it sounds like (you can read the rest of the issue for more context) hyphens in version strings are simply problematic and possibly out of spec.
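PEP 440 sidesteps this by normalizing separators before comparison, so `1.0.5-beta` and `1.0.5_beta` both become `1.0.5b`. A minimal sketch of just that rule (`normalize_version` is illustrative, not the real packaging-library implementation):

```python
def normalize_version(v):
    # Collapse "-" and "_" separators, then contract the pre-release
    # tag, per the PEP 440 normalization rules for this case.
    v = v.lower().replace("-", ".").replace("_", ".")
    v = v.replace(".beta", "b").replace(".alpha", "a")
    return v
```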

get_results into a file

Hi,

The comments provided in the code are not enough to understand how to retrieve the results into 'fp'. It is mentioned that get_results() can be redirected to 'fp', but an example of how to supply this file stream 'fp' would be really appreciated.

Best Regards,
Sam
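For what it's worth, 'fp' is just any writable file object, judging from the `cmd.get_results(sys.stdout, delim='\t')` call visible in the tracebacks above. A sketch of redirecting results into a local file (`results_to_file` is a hypothetical helper wrapping that call):

```python
def results_to_file(cmd, path, delim="\t"):
    """Stream a finished command's results into a local file.

    `cmd` is any object exposing get_results(fp, delim=...), e.g. a
    qds_sdk command object (signature taken from the tracebacks above).
    """
    with open(path, "w") as fp:
        cmd.get_results(fp, delim=delim)
```

In real use, `cmd` would come from something like `HiveCommand.run(query=...)` after configuring the API token.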

Add MANIFEST with LICENSE reference

Hey-lo,

I was going to add an explicit link to qds-sdk's license file in the conda build for conda-forge. Doing so, however, requires an explicit MANIFEST file that declares the license included in the source distribution. Would y'all consider adding a MANIFEST to future releases?
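A minimal MANIFEST.in that would cover this (assuming the license file sits at the repo root under the name LICENSE):

```
include LICENSE
include README.rst
```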

Any plans to port the library to Python 3?

From what I see the only errors are:

    changing mode of build/scripts-3.3/qds.py from 644 to 755
      File "/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/qds_sdk/actions.py", line 115
        print action.logs()
                   ^
    SyntaxError: invalid syntax

      File "/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/qds_sdk/commands.py", line 225
        except IOError, e:
                      ^
    SyntaxError: invalid syntax

      File "/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/qds_sdk/retry.py", line 16
        except ExceptionToCheck, e:
                               ^
    SyntaxError: invalid syntax
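Both are mechanical Python 3 syntax changes: `print` becomes a function, and exception binding uses `as` instead of a comma. An illustrative snippet, not a patch against the actual files:

```python
def read_or_default(path, default=""):
    try:
        with open(path) as f:
            return f.read()
    except IOError as e:                 # Python 2 wrote: except IOError, e:
        print("read failed: %s" % e)     # print is now a function
        return default
```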
