happybase's People

Contributors

aisk, artofhuman, bierbarbar, bitdeli-chef, bwo, clarksun, defcube, dhellmann, dhermes, ecederstrand, mathbugua, oldpanda, paolostivanin, rogerhu, seanpmorgan, timgates42, wbolster

happybase's Issues

Use TScan.batchSize (available in newer Thrift versions)

Relevant Thrift API:

/**
 * A Scan object is used to specify scanner parameters when opening a scanner.
 */
struct TScan {
  1:optional Text startRow,
  2:optional Text stopRow,
  3:optional i64 timestamp,
  4:optional list<Text> columns,
  5:optional i32 caching,
  6:optional Text filterString,
  7:optional i32 batchSize,
  8:optional bool sortColumns
}
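A minimal sketch of what using batchSize could look like through the generated Thrift bindings, assuming the newer bindings are present and importable as happybase.hbase.ttypes (an assumption; only the struct above is confirmed):

from happybase.hbase.ttypes import TScan

# Build a TScan mirroring the struct above; batchSize caps the number
# of columns returned per row per batch, while caching is the number
# of rows prefetched per Thrift round trip.
scan = TScan(
    startRow='row-000',
    stopRow='row-999',
    caching=1000,
    batchSize=10,
)
scan_id = connection.client.scannerOpenWithScan('mytable', scan)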

Does happybase support filters?

I noticed that in the document:

scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, limit=None)

The filter argument may be a filter string that will be applied at the server by the region servers.

Does the filter here mean column filters? What does "filter string" mean? In the HBase shell, you use a filter like this:

scan 'table', { COLUMNS => 'family:qualifier',
  FILTER => SingleColumnValueFilter.new(
    Bytes.toBytes('family'),
    Bytes.toBytes('qualifier'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    SubstringComparator.new('somevalue')) }

How does this work using happybase?
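For reference, the filter argument takes a string in HBase's textual filter language, which the region servers parse and apply server-side. A sketch mirroring the shell example above (the table and column names are placeholders):

# Same filter as the shell example: match rows where family:qualifier
# contains 'somevalue' ('substring:' selects a SubstringComparator).
filter_string = ("SingleColumnValueFilter ('family', 'qualifier', =, "
                 "'substring:somevalue')")
for row_key, row_data in table.scan(columns=['family:qualifier'],
                                    filter=filter_string):
    print row_key, row_data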

10 warnings when installing on Mac OS X 10.9

Not sure if these warnings are significant. Please advise. Thanks!

Environment:
Clean Mac OS 10.9.1 (Mavericks)
Enthought Canopy (Enthought Canopy Python 2.7.3 | 64-bit | (default, Aug 8 2013, 05:37:06) )
Xcode [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

$ pip install happybase
Downloading/unpacking happybase
Downloading happybase-0.7.tar.gz (59kB): 59kB downloaded
Running setup.py egg_info for package happybase

Downloading/unpacking thrift>=0.8.0 (from happybase)
Downloading thrift-0.9.1.tar.gz
Running setup.py egg_info for package thrift

Installing collected packages: happybase, thrift
Running setup.py install for happybase

Running setup.py install for thrift
building 'thrift.protocol.fastbinary' extension
gcc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -DNDEBUG -g -O3 -arch x86_64 -I/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/include/python2.7 -c src/protocol/fastbinary.c -o build/temp.macosx-10.6-x86_64-2.7/src/protocol/fastbinary.o
src/protocol/fastbinary.c:227:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->element_type)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:244:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->ktag)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:249:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->vtag)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:287:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->type)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:740:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(got)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:787:15: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (etype == -1) {
~~~~~ ^ ~~
src/protocol/fastbinary.c:809:15: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (ktype == -1) {
~~~~~ ^ ~~
src/protocol/fastbinary.c:814:15: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (vtype == -1) {
~~~~~ ^ ~~
src/protocol/fastbinary.c:836:16: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (type == -1) {
~~~~ ^ ~~
src/protocol/fastbinary.c:888:14: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (type == -1) {
~~~~ ^ ~~
10 warnings generated.
gcc -bundle -undefined dynamic_lookup -g -arch x86_64 -L/tmp/_py/libraries/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib -headerpad_max_install_names -arch x86_64 build/temp.macosx-10.6-x86_64-2.7/src/protocol/fastbinary.o -o build/lib.macosx-10.6-x86_64-2.7/thrift/protocol/fastbinary.so
ld: warning: directory not found for option '-L/tmp/_py/libraries/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib'
/Users/XXXXXX/Library/Enthought/Canopy_64bit/User/bin/python -O /var/folders/pg/k32xh5g96t7378pl__p7khd00000gn/T/tmp2DDIYN.py
removing /var/folders/pg/k32xh5g96t7378pl__p7khd00000gn/T/tmp2DDIYN.py

Successfully installed happybase thrift
Cleaning up...

timestamp check does not allow long

if not (timestamp is None or isinstance(timestamp, int)):
    raise TypeError("'timestamp' must be an integer or None")

This causes issues for me in Python 2.7.3 on Windows because this value is a long:

timestamp=1369168852994

Suggested fix:

if not (timestamp is None or isinstance(timestamp, (int, long))):
    raise TypeError("'timestamp' must be an integer or None")
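A version-agnostic variant of the suggested fix (a sketch; note that long no longer exists on Python 3, where the two-type tuple would fail):

import numbers

# numbers.Integral matches int and long on Python 2, and int on Python 3
if not (timestamp is None or isinstance(timestamp, numbers.Integral)):
    raise TypeError("'timestamp' must be an integer or None")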

Entity mapper

Is it planned to provide a Pythonic entity mapper (like an ORM for a relational database) for HBase in the future? I think it would be great if it made mapping model classes easy.

Multiget

Is it possible to fetch data for a list of row keys in a single request (similar to multiget in memcached)?

e.g. multiget(row_key1, row_key4, row_key6) gives
{row_key1: value_1, row_key4: value_4, row_key6: value_6}

I tried using scan:

scan 'my_table', { FILTER => "RowFilter (=, 'binary:row_key1') OR RowFilter (=, 'binary:row_key2')" }

But the scan takes too much time. I am wondering if there is a more efficient way to do a multiget.
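For what it's worth, happybase's Table.rows() fetches several row keys in a single Thrift call, which is much closer to a multiget than a filtered scan. A minimal sketch (keys that do not exist are simply absent from the result):

# One round trip for all three keys; returns a list of (key, data) tuples
rows = table.rows(['row_key1', 'row_key4', 'row_key6'])
result = dict(rows)  # {'row_key1': {...}, 'row_key4': {...}, ...}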

Unsupported framed Thrift transport results in connection resets

happybase 0.3 with hbase 0.94.0 is getting its connection closed consistently:

>>> import happybase
>>> conn = happybase.Connection("localhost")
>>> t = conn.table("test")
>>> list(t.scan())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "happybase/api.py", line 535, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "happybase/hbase/Hbase.py", line 1728, in recv_scannerOpenWithScan
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
    sz = self.readI32()
  File "thrift/protocol/TBinaryProtocol.py", line 203, in readI32
    buff = self.trans.readAll(4)
  File "thrift/transport/TTransport.py", line 58, in readAll
    chunk = self.read(sz-have)
  File "thrift/transport/TTransport.py", line 160, in read
    self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "thrift/transport/TSocket.py", line 94, in read
    buff = self.handle.recv(sz)
socket.error: [Errno 104] Connection reset by peer

and in the thrift server log:

2012-06-29 13:06:01,330 ERROR org.apache.thrift.server.TNonblockingServer: Read an invalid frame size of -2147418111. Are you using TFramedTransport on the client side?

Connection pooling

Connection pooling should use a separate pool API, not be completely embedded inside the happybase.Connection class.

Goal:

When using happybase in the context of a web application, it would be useful to re-use connections between page requests. A connection pooling solution should take a MIN, MAX and IDLE count as parameters, and open connections as needed by the application.

Inspiration:

API:

import happybase

pool = happybase.ConnectionPool(
    'hostname', 
    port=9090, 
    timeout=None, 
    min=0, 
    max=3, 
    idle=1, 
    autoconnect=False,
    compat='0.92', 
    transport='buffered')

# block == wait until a connection is available 
# versus raise an exception
connection = pool.get_connection(
    block=True,
    table_prefix=None, 
    table_prefix_separator='_')

The pool could be instantiated manually per-process in the setup flow of a web server framework. For example, in Django, this could be done in settings.py with AUTOCONNECT=False so that connections are not established until the first calls to get_connection().

Retries:

If a connection cannot be established, or is terminated (i.e. by a timeout), it would attempt to re-establish after RETRY_MS milliseconds.

Errors:

ConnectionPool could throw an error right away if it can't establish MIN connections immediately. Otherwise, a call to pool.get_connection will raise various exceptions for things like pool exhaustion (if BLOCK=False), inability to connect to the Thrift endpoint, etc.

Other thoughts:

I can't see how we could support connection pooling between multiple python processes except by implementing a separate process to connect through, similar to pgpool.

Batched gets

HappyBase's batch interface currently supports puts and deletes, but does not allow for gets.

Performance of Happybase

Hi, I am writing my data to HBase using the HappyBase API. The data is 28,969,265 lines, 6 GB in total.

If I write the data using one single script, it takes forever (more than 30 minutes). So I wrote a MapReduce job on Hadoop and divided the work among 1000 reducers (~29,000 lines per reducer). The job took around 30 minutes to finish.

The code uses Table.batch() as the documentation suggests. Table.batch().send() is the only expensive function.

I think my dataset is quite small and shouldn't take such a long time to write. Could you please offer some advice on how to optimize the writing process? Thanks a lot.
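One knob worth checking is the batch_size argument to Table.batch(), which makes the batch flush every N mutations instead of accumulating everything into one huge send. A sketch (parse() is a placeholder; the right size depends on row size and cluster capacity):

# Flush a mutateRows call every 1000 puts instead of one giant send
with table.batch(batch_size=1000) as b:
    for line in lines:
        key, data = parse(line)  # placeholder for your own parsing
        b.put(key, data)
# leaving the 'with' block sends any remaining mutations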

API to get only row keys

Is there a way to get just the row keys (without the column data) from HBase using happybase?

Use case:
I am trying to implement pagination on rows. My row keys are random integers; they are unique but not sequential.

The closest thing to efficient pagination I could come up with is (see the sketch after this list):

a. Get all the row keys
b. Loop through the row keys (in batches of 100) and get the column data, when needed
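A sketch of one page of that loop using HBase's key-only filters, so that almost no column data crosses the wire (whether your HBase/Thrift combination accepts these filter names is an assumption):

# FirstKeyOnlyFilter() returns only the first cell per row and
# KeyOnlyFilter() strips its value, so the scan yields keys cheaply.
page = table.scan(
    row_start=last_seen_key,  # hypothetical: last key of the previous
                              # page; note the start key is included again
    limit=100,
    filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()',
)
row_keys = [key for key, _ in page]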

Connection timeout setting

I was looking at happybase documentation here for Connection class ( http://happybase.readthedocs.org/en/latest/api.html ), and was searching for timeout parameter but couldn't find one.

Is a happybase.Connection() to an HBase Thrift server a persistent connection? If not, does it have a default timeout (does it fall back to the timeout defined in the Python Thrift bindings) that can be changed?

Python 3 support

This would be trivial for HappyBase, but the underlying Thrift library needs to be Python 3 compatible first.

A simple '2to3' on the source code seems to work, but this needs to be properly supported in a stable Thrift release first.

Status/todo for this issue:

  • Merge #78
  • Regenerate Thrift bindings using a newer Thrift version (maybe not needed)
  • Update docs (TODO.rst)

table_prefix assumes underscore separator

The table_prefix parameter on Connection is incredibly useful, except that it mandates the underscore character as a separator. This is a problem in situations where you don't want any character after your prefix (i.e. it has a set length), or indeed want to use something else.

I propose a change that requires users to specify the separator of their choice as part of the prefix. The effort required for users to update their code would be minimal, and the code would benefit from not having a "table_prefix_separator"!

-d

AttributeError: 'module' object has no attribute 'Connection'

I don't know if this is the right place to ask questions, but I don't know where else to go. Anyway...
I downloaded the happybase package from PyPI and installed it using "python setup.py install --prefix=~/.local".
Then I tried python -c "import happybase"; that's OK.
But when I run a script it shows the error below:

Traceback (most recent call last):
  File "happybase.py", line 1, in <module>
    import happybase
  File "/home/linqili/sh/thrift_hbase/happybase.py", line 3, in <module>
    connection = happybase.Connection('192.168.19.107', 39090)
AttributeError: 'module' object has no attribute 'Connection'

here is my test code:

import happybase

connection = happybase.Connection('192.168.19.107', 39090)
table = connection.table("AD_DSP")
for key, data in table.rows(['1003_64133', '1_1']):
    print key, data

Did I do something wrong?

Does table.cells support variable column names in one program?

Hi wbolster,
I want to see the versions of data in HBase, so I'm using table.cells(). I can pass the row key as a variable, but for that row key I have many column names and I don't know what they are. Can I iterate over the column names as well?
I have only one column family, cf1.
I have pasted my code below; can you help me?

Thank you

#!/usr/bin/env python

from happybase import Connection
from time import time

TABLE_NAME = 'test'
connect = Connection('localhost')
table_conn = connect.table(TABLE_NAME)

def versioning(rowkey, columnname):
    # cells() returns a list of cell values, newest version first
    versions = table_conn.cells(rowkey, columnname, versions=100000)
    init = 0
    for value in versions:
        print "Data : %s" % value
        init = init + 1

    print "Columnname : %s " % columnname
    print "Data count : %s" % init
    return init

if __name__ == "__main__":
    # For command line test
    rowkey = 'string'
    column = 'cf1'
    start = time()
    count = versioning(rowkey, column)
    print count
    end = time()
    print end - start

Thanks a lot.

Support for Thrift Types

There doesn't seem to be any mention of how best to handle typed data with Thrift. I've seen several examples so far of people storing integers as strings. I understand that using struct would effectively give you this, but are Thrift types really so bad? It seems like using struct.pack would lead to code that has to be very verbose to work in other languages, especially when Thrift already has many predefined data types.
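For the record, the struct.pack approach mentioned above looks like this (a sketch; the big-endian signed 64-bit format matches what HBase's own Bytes.toBytes(long) produces, so other clients can read it back):

import struct

encoded = struct.pack('>q', 42)            # int -> 8-byte big-endian string
decoded = struct.unpack('>q', encoded)[0]  # and back again
table.put('row-1', {'cf:count': encoded})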

What may cause TTransportException: Transport not open?

My code runs well against one HBase instance, but after redirecting to another, all write methods raise this exception:

File "/opt/python27/lib/python2.7/site-packages/happybase/table.py", line 370, in put
batch.put(row, data)
File "/opt/python27/lib/python2.7/site-packages/happybase/batch.py", line 116, in exit
self.send()
File "/opt/python27/lib/python2.7/site-packages/happybase/batch.py", line 54, in send
self._table.connection.client.mutateRows(self._table.name, bms)
File "/opt/python27/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1449, in mutateRows
self.send_mutateRows(tableName, rowBatches)
File "/opt/python27/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1459, in send_mutateRows
self._oprot.trans.flush()
File "/opt/python27/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 169, in flush
self.__trans.write(out)
File "/opt/python27/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 124, in write

message='Transport not open')

If I connect and put immediately, it may succeed. If I connect and put after a while, I get the error. How can I make the connection never break?

Or, how can I check whether the connection is available?

In a Python shell, the connection still works after a long while, but in my code, the connections created in threads terminate soon.
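There is no built-in keepalive in this era of happybase, so the usual workaround is a catch-and-reopen retry around each write. A minimal sketch (safe_put and its arguments are hypothetical names):

from socket import error as socket_error
from thrift.transport.TTransport import TTransportException

def safe_put(connection, table, row, data):
    try:
        table.put(row, data)
    except (TTransportException, socket_error):
        connection.open()     # re-establish the Thrift transport
        table.put(row, data)  # retry once; still fails if the server is down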

Scanner limitations?

Hi, I am writing a solution using HBase (MapR M7 currently) and python for the backend part (data generation, admin interface ... )

I use Java for processing large files into HBase but I need to create some dashboards based on the data I get in Hbase.

I will write some MapReduce jobs to create aggregated batches of data because my main table contains more than 100 million lines. Currently, I am encountering an issue with HappyBase: when I count the number of lines from a table scanner, I always get 234 or 236, and if I loop over the scanner I cannot get more results.

Do you know if I am doing something wrong, or could this be a known issue with Thrift or something like that?

Thanks a lot in advance, and keep up with the good work !

Problem upgrading HBase

I am upgrading HBase from 0.90 to 0.94. It seems OK: the HBase shell shows me the version is 0.94, but happybase does not think so.

co = hb.Connection( '127.0.0.1' )  # or Connection( '127.0.0.1', compat='0.92' )
x = co.table( 'Message' ).scan()
x.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/api.py", line 567, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

Using compat='0.90' succeeds as before:

co.close()
co = hb.Connection( '127.0.0.1', compat='0.90' )
x = co.table( 'Message' ).scan()
x.next()
('52255fa14844acf3cb2e3e5e', {'Duration:': '1', 'Result:': '1', 'Content:': 'content'})

Is HBase upgraded properly?

Cannot filter a table in HBase 0.96 or 0.98

Hi, I have just upgraded HBase from 0.94 to 0.96. happybase works fine except for filtering a table.
Below is my code:

connection = happybase.Connection(host='localhost', port=9090, autoconnect=False, transport='buffered')
connection.open()
filter_string = "SingleColumnValueFilter ('info', 'user_name', =, 'binary:user1')"
table = connection.table('my_table')
data_scan = table.scan(filter=filter_string)

And below is the HBase log:

 2014-02-25 10:16:14,097 ERROR [RpcServer.handler=11,port=58866] ipc.RpcServer: Unexpected throwable object 
org.apache.hadoop.hbase.filter.IncompatibleFilterException: Cannot set batch on a scan using a filter that returns true for filter.hasFilterRow.

Can you help me?

Storing data in HBase (bytes conversion)

Hi,

I have gone through your documentation and have now stumbled upon storing data in HBase. I have read that strings need to be converted to byte strings before they are sent.

I did something like the code below, but it keeps giving me an IOError.

row = str("row")
col1 = str("col1")
value1 = str("value1")

table.put( row , {col1: value1})

Am I doing this right? If I am wrong, can you please give me an example of doing this on Python 2.7+?

Thanks

Using batch in threads loses versions

I am writing a lot of lines to HBase via threads.

thread.run():

msg = queue.get_nowait()
batch.put( msg.key, {...} )
if condi:
    batch.send()

A msg may have two versions; that is to say, there may be two msgs with the same key, and I want to store them in the same row.

The code runs well for a few msgs. When the data grows larger, only one version is stored to HBase for most messages.

Using table.put instead of batch, no versions are lost.

What can I do with batch?
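One plausible explanation is that both versions of a msg end up in the same mutateRows call and receive the same server-assigned timestamp, so the earlier one is overwritten. A sketch of working around that with explicit, distinct timestamps (msg.seq is a hypothetical per-message sequence number; Table.put accepts a timestamp, while mutations in a Batch share the batch's timestamp):

import time

def store(table, msg):
    # distinct timestamps keep both versions of the same key
    ts = int(time.time() * 1000) + msg.seq  # msg.seq is hypothetical
    table.put(msg.key, msg.data, timestamp=ts)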

Cleanup the batch size and caching scanner flags

The current implementation mixes up these options affecting scanner behaviour and performance:

  • batch size for fetching rows, used between the Python process and the Thrift server
  • batch size passed as a scanner option, used between the Thrift server and region servers
  • caching size passed as a scanner option, also used between the Thrift server and region servers

Column Name regex filter

Is it possible to apply a regex filter on column names and get only those columns which match the regex, using happybase?

I tried 'filter': 'regexstring:my_regex_here' in scan_args, but it didn't work.
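The 'regexstring:' prefix is a comparator, so it has to appear inside a comparison filter rather than stand alone, which may be why the bare string didn't work. A sketch using QualifierFilter, which matches on column qualifiers (whether your Thrift gateway parses this is an assumption):

# Match any column whose qualifier matches the regex, within family cf1
regex_filter = "QualifierFilter (=, 'regexstring:my_regex_here')"
for key, data in table.scan(columns=['cf1'], filter=regex_filter):
    print key, data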

Get a list of keys

Is it possible to use table.scan() without getting any columns? I just want to get the row keys, like this pseudocode:

onlyRowKeys = hBaseTable.scan(columns=[''])

With happybase, I wrote the following code:

datacollection = hBaseTable.scan(columns=['columnX:fieldX'])
scannedCollection = list(datacollection)
onlyKeys, bogusData = zip(*scannedCollection)

Is there a better way?

[I can't seem to add a label to this issue]
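A more direct version of the snippet above, iterating once and keeping only the keys; the key-only filter names are an assumption about what the HBase side accepts, but they stop the column data from being shipped at all:

# FirstKeyOnlyFilter() keeps one cell per row, KeyOnlyFilter() drops values
onlyRowKeys = [
    key for key, _ in hBaseTable.scan(
        filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()')
]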

Invalid method scannerOpenWithScan - Thrift Error

This error seems to have appeared once I upgraded to version 0.3. I have a 0.2 virtual environment that works just fine. See the logs below.

=== Version 0.3 ===
(env)ricky-mbp:base ricky$ pip install happybase
Downloading/unpacking happybase
Downloading happybase-0.3.tar.gz (43Kb): 43Kb downloaded
Running setup.py egg_info for package happybase

Downloading/unpacking thrift (from happybase)
Downloading thrift-0.8.0.tar.gz
Running setup.py egg_info for package thrift
.......
$ python scan.py ca-
Traceback (most recent call last):
  File "scan.py", line 12, in <module>
    scan(sys.argv[1])
  File "scan.py", line 8, in scan
    for key, row in table.scan(row_prefix=prefix, columns=['name:first', 'name:last']):
  File "/Users/ricky/Sandbox/base/env/lib/python2.7/site-packages/happybase/api.py", line 535, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "/Users/ricky/Sandbox/base/env/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "/Users/ricky/Sandbox/base/env/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

=== Version 0.2 ===
(env2)ricky-mbp:base ricky$ pip install happybase==0.2
Downloading/unpacking happybase==0.2
Downloading happybase-0.2.tar.gz (42Kb): 42Kb downloaded
Running setup.py egg_info for package happybase

Downloading/unpacking thrift (from happybase==0.2)
Downloading thrift-0.8.0.tar.gz
Running setup.py egg_info for package thrift
......
$ python scan.py ca- | wc -l
1894
^^ works just fine

Here is the scan code:

def scan(prefix):
    connection = happybase.Connection("192.168.42.201")
    table = connection.table("donors")

    for key , row in table.scan(row_prefix=prefix, columns=['name:first', 'name:last']):
        print(', '.join(map(str.capitalize, [row['name:first'], row['name:last']])))

Invalid method name: 'scannerOpenWithScan'

The sample code in the tutorial results in a thrift.Thrift.TApplicationException:

Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import happybase
>>> connection = happybase.Connection('localhost')
>>> table = connection.table("authors")
>>> for k, v in table.scan():
...     print k, v
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/happybase/api.py", line 546, in scan
  File "build/bdist.linux-x86_64/egg/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
  File "build/bdist.linux-x86_64/egg/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

HBase and Hadoop are installed as part of CDH3 distribution.

ubuntu@hadoop1:~$ thrift -version
Thrift version 0.8.0

What is the fastest way of writing?

I have to write gigabytes of lines per day, each about 1 KB.

I use batch in threads and put about 30,000 lines/second. I want it to be faster.

I tried running under gevent, but it seems much slower than threads. Is there a better way?

How to store an Array / List / Dict / Struct in a column of a Column Family

I am trying to store an array in a column of a column family and get the following error:
TypeError: must be string or read-only character buffer, not list

The workaround I used was to convert it into a string, and then I was able to perform the put command.

However, the challenge is that when I create a Hive table referencing the HBase table and perform an explode, it treats my array as a string and does not break it into multiple rows, the reason being that I converted the array to a string in my put command.

Need some help or example to solve this issue.

Thanks in advance.

Anupam
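Since HBase cells are plain byte strings, a list has to be serialized before put() and deserialized after get(); JSON is one common choice, sketched below. Note this will not make Hive's explode work either, since Hive still sees an opaque string:

import json

table.put('row-1', {'cf:tags': json.dumps(['a', 'b', 'c'])})
row = table.row('row-1')
tags = json.loads(row['cf:tags'])  # back to a Python list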

scan and batch.send fail while put and row succeed

Do I need to do any configuration for HBase?

>>> import happybase as hb
>>> conn = hb.Connection( '10.8.210.182' )
>>> conn.tables()
['Recommendation', 'StatsResult', 'hbase_t1', 'test', 'total_score']
>>> t = conn.table( 'test' )
>>> t.scan()
<generator object scan at 0x4e1a4b0>
>>> for x in t.scan(): print x
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/api.py", line 567, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

Potentially counter-intuitive behavior on rows() when some results are null

First, thank you for creating happybase! So far it's been a great and very pythonic way to connect my Django project to HBase.

While building a component today, I noticed that table.rows() acts in a somewhat surprising way:

Say I pass (row1, row2, row3) in as the rows parameter and (cf1,) in as columns. If row3,cf1 doesn't exist then rows() returns an empty list, even if row1,cf1 and row2,cf1 do exist.

To me, a more intuitive and useful result would be [result1, result2, None] as the return instead of just []. Also, the happybase API doesn't provide an .exists? method for keys, meaning that it's pretty dangerous to call .rows() unless you're 100% sure all the KeyValues exist. Obviously I could make a loop and make an individual .row() call for each row, but I'm making live calls to HBase to render a webpage, so I'm very concerned about minimizing latency (I'm assuming that .rows() is faster.)

I looked into the code to potentially try to tweak it on my deployment, but it looks like this is happening inside the Thrift code (which I don't understand.) Is this behavior some sort of inherent limit to Thrift, or could we potentially change this?

Thanks!
-George

Incompletely iterated scanners going out of scope are not cleaned

If the iterator returned by Table.scan() is not completely exhausted (i.e. due to a break in a for loop), HappyBase does not close the server-side scanner, even if the Table.scan() result (currently a generator function) has gone out of scope. The (leaked) associated server-side resources are freed only after the scanner times out.

A possible solution would be to make the return type from Table.scan() a real class with __iter__() and next() functions and a __del__() to make sure resources are freed as soon as possible. Additionally, this class can be made a context manager, so that something like this can be used for scanners that may not be fully iterated over:

with table.scan() as scan:
    for row_key, row_data in scan:
        pass  # do something, possibly breaking out of the loop

Storing data in HBase

I'm storing data in HBase and I keep receiving the same error, and I don't understand why...

Traceback (most recent call last):
  File "dump.py", line 107, in <module>
    table.put(idate+'_'+str(count), {'info:latitude': dados['latitude'], 'info:longitude': dados['longitude'], 'info:velocidade': dados['velocidade'], 'info:direcao': dados['direcao']})
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/api.py", line 618, in put
    batch.put(row, data)
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/api.py", line 841, in __exit__
    self.send()
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/api.py", line 779, in send
    self.table.client.mutateRows(self.table.name, bms)
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/hbase/Hbase.py", line 1450, in mutateRows
    self.recv_mutateRows()
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/hbase/Hbase.py", line 1472, in recv_mutateRows
    raise result.io
happybase.hbase.ttypes.IOError: IOError(_message='Connection refused')

table.scan() with filter set will always fail in happybase >= 0.7

Description:
batchSize should not be set on scans with a filter.

happybase v0.7 introduced a new argument, batchSize, for TScan in the happybase.table.scan() method. When used with a filter, this parameter causes all scan operations to fail:

happybase always passes batch_size to TScan, even if a filter_string is present.
There is no way to set batch_size to None, since the scan() method validates the batch_size value:
https://github.com/wbolster/happybase/blob/0.7/happybase/table.py#L259

See the corresponding HBase code:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.94.9/org/apache/hadoop/hbase/client/Scan.java?av=f#311

Steps to reproduce:

import happybase
conn = happybase.Connection(host='localhost', port=9090)
conn.create_table('project', {'f': dict()})
table = conn.table('project')

table.put('row1', {'f:qual1': 'val1'})
table.put('row2', {'f:qual1': 'val2'})
table.put('row3', {'f:qual1': 'val1'})

# this operation always fails
for k, v in table.scan(filter="SingleColumnValueFilter ('f', 'qual1', =, 'binary:val1')"): 
    print v
