happybase's People

Contributors

aisk, artofhuman, bierbarbar, bitdeli-chef, bwo, clarksun, defcube, dhellmann, dhermes, ecederstrand, mathbugua, oldpanda, paolostivanin, rogerhu, seanpmorgan, timgates42, wbolster

happybase's Issues

Use TScan.batchSize (available in newer Thrift versions)

Relevant Thrift API:

/**
 * A Scan object is used to specify scanner parameters when opening a scanner.
 */
struct TScan {
  1:optional Text startRow,
  2:optional Text stopRow,
  3:optional i64 timestamp,
  4:optional list<Text> columns,
  5:optional i32 caching,
  6:optional Text filterString,
  7:optional i32 batchSize,
  8:optional bool sortColumns
}
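A minimal sketch of what using batchSize could look like through the generated Thrift bindings, assuming the newer bindings are present and importable as happybase.hbase.ttypes (an assumption; only the struct above is confirmed):

from happybase.hbase.ttypes import TScan

# Build a TScan mirroring the struct above; batchSize caps the number
# of columns returned per row per batch, while caching is the number
# of rows prefetched per Thrift round trip.
scan = TScan(
    startRow='row-000',
    stopRow='row-999',
    caching=1000,
    batchSize=10,
)
scan_id = connection.client.scannerOpenWithScan('mytable', scan)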

Does happybase support filters?

I noticed that in the document:

scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, limit=None)

The filter argument may be a filter string that will be applied at the server by the region servers.

Does the filter here mean column filters? What does "filter string" mean? In the HBase shell, you use a filter like this:

scan 'table', { COLUMNS => 'family:qualifier',
  FILTER => SingleColumnValueFilter.new(
    Bytes.toBytes('family'),
    Bytes.toBytes('qualifier'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    SubstringComparator.new('somevalue')) }

How does this work using happybase?
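For reference, the filter argument takes a string in HBase's textual filter language, which the region servers parse and apply server-side. A sketch mirroring the shell example above (the table and column names are placeholders):

# Same filter as the shell example: match rows where family:qualifier
# contains 'somevalue' ('substring:' selects a SubstringComparator).
filter_string = ("SingleColumnValueFilter ('family', 'qualifier', =, "
                 "'substring:somevalue')")
for row_key, row_data in table.scan(columns=['family:qualifier'],
                                    filter=filter_string):
    print row_key, row_data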

10 warnings when installing on Mac OS X 10.9

Not sure if these warnings are significant. Please advise. Thanks!

Environment:
Clean Mac OS 10.9.1 (Mavericks)
Enthought Canopy (Enthought Canopy Python 2.7.3 | 64-bit | (default, Aug 8 2013, 05:37:06) )
Xcode [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

$ pip install happybase
Downloading/unpacking happybase
Downloading happybase-0.7.tar.gz (59kB): 59kB downloaded
Running setup.py egg_info for package happybase

Downloading/unpacking thrift>=0.8.0 (from happybase)
Downloading thrift-0.9.1.tar.gz
Running setup.py egg_info for package thrift

Installing collected packages: happybase, thrift
Running setup.py install for happybase

Running setup.py install for thrift
building 'thrift.protocol.fastbinary' extension
gcc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -DNDEBUG -g -O3 -arch x86_64 -I/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/include/python2.7 -c src/protocol/fastbinary.c -o build/temp.macosx-10.6-x86_64-2.7/src/protocol/fastbinary.o
src/protocol/fastbinary.c:227:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->element_type)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:244:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->ktag)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:249:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->vtag)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:287:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(dest->type)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:740:7: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (INT_CONV_ERROR_OCCURRED(got)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/protocol/fastbinary.c:124:43: note: expanded from macro 'INT_CONV_ERROR_OCCURRED'
#define INT_CONV_ERROR_OCCURRED(v) ( ((v) == -1) && PyErr_Occurred() )
~~~ ^ ~~
src/protocol/fastbinary.c:787:15: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (etype == -1) {
~~~~~ ^ ~~
src/protocol/fastbinary.c:809:15: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (ktype == -1) {
~~~~~ ^ ~~
src/protocol/fastbinary.c:814:15: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (vtype == -1) {
~~~~~ ^ ~~
src/protocol/fastbinary.c:836:16: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (type == -1) {
~~~~ ^ ~~
src/protocol/fastbinary.c:888:14: warning: comparison of constant -1 with expression of type 'TType' (aka 'enum TType') is always false [-Wtautological-constant-out-of-range-compare]
if (type == -1) {
~~~~ ^ ~~
10 warnings generated.
gcc -bundle -undefined dynamic_lookup -g -arch x86_64 -L/tmp/_py/libraries/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib -headerpad_max_install_names -arch x86_64 build/temp.macosx-10.6-x86_64-2.7/src/protocol/fastbinary.o -o build/lib.macosx-10.6-x86_64-2.7/thrift/protocol/fastbinary.so
ld: warning: directory not found for option '-L/tmp/_py/libraries/Applications/Canopy.app/appdata/canopy-1.1.0.1371.macosx-x86_64/Canopy.app/Contents/lib'
/Users/XXXXXX/Library/Enthought/Canopy_64bit/User/bin/python -O /var/folders/pg/k32xh5g96t7378pl__p7khd00000gn/T/tmp2DDIYN.py
removing /var/folders/pg/k32xh5g96t7378pl__p7khd00000gn/T/tmp2DDIYN.py

Successfully installed happybase thrift
Cleaning up...

timestamp check does not allow long

if not (timestamp is None or isinstance(timestamp, int)):
    raise TypeError("'timestamp' must be an integer or None")

This causes issues for me in Python 2.7.3 on Windows because this value is a long:

timestamp=1369168852994

Suggested fix:

if not (timestamp is None or isinstance(timestamp, (int, long))):
    raise TypeError("'timestamp' must be an integer or None")
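A version-agnostic variant of the suggested fix (a sketch; note that long no longer exists on Python 3, where the two-type tuple would fail):

import numbers

# numbers.Integral matches int and long on Python 2, and int on Python 3
if not (timestamp is None or isinstance(timestamp, numbers.Integral)):
    raise TypeError("'timestamp' must be an integer or None")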

Entity mapper

Is it planned to provide a Pythonic entity mapper (like an ORM for a relational database) for HBase in the future? I think it would be great if it made mapping model classes easy.

Multiget

Is it possible to fetch data for a list of row keys in a single request (similar to multiget in memcached)?

e.g. multiget(row_key1, row_key4, row_key6) gives
{row_key1: value_1, row_key4: value_4, row_key6: value_6}

I tried using scan:

scan 'my_table', { FILTER => "RowFilter (=, 'binary:row_key1') OR RowFilter (=, 'binary:row_key2')" }

But the scan takes too much time. I am wondering if there is a more efficient way to do a multiget.
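For what it's worth, happybase's Table.rows() fetches several row keys in a single Thrift call, which is much closer to a multiget than a filtered scan. A minimal sketch (keys that do not exist are simply absent from the result):

# One round trip for all three keys; returns a list of (key, data) tuples
rows = table.rows(['row_key1', 'row_key4', 'row_key6'])
result = dict(rows)  # {'row_key1': {...}, 'row_key4': {...}, ...}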

Unsupported framed Thrift transport results in connection resets

happybase 0.3 with hbase 0.94.0 is getting its connection closed consistently:

>>> import happybase
>>> conn = happybase.Connection("localhost")
>>> t = conn.table("test")
>>> list(t.scan())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "happybase/api.py", line 535, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "happybase/hbase/Hbase.py", line 1728, in recv_scannerOpenWithScan
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
    sz = self.readI32()
  File "thrift/protocol/TBinaryProtocol.py", line 203, in readI32
    buff = self.trans.readAll(4)
  File "thrift/transport/TTransport.py", line 58, in readAll
    chunk = self.read(sz-have)
  File "thrift/transport/TTransport.py", line 160, in read
    self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "thrift/transport/TSocket.py", line 94, in read
    buff = self.handle.recv(sz)
socket.error: [Errno 104] Connection reset by peer

and in the thrift server log:

2012-06-29 13:06:01,330 ERROR org.apache.thrift.server.TNonblockingServer: Read an invalid frame size of -2147418111. Are you using TFramedTransport on the client side?

Connection pooling

Connection pooling should use a separate pool API, not be completely embedded inside the happybase.Connection class.

Goal:

When using happybase in the context of a web application, it would be useful to re-use connections between page requests. A connection pooling solution should take a MIN, MAX and IDLE count as parameters, and open connections as needed by the application.

Inspiration:

API:

import happybase

pool = happybase.ConnectionPool(
    'hostname', 
    port=9090, 
    timeout=None, 
    min=0, 
    max=3, 
    idle=1, 
    autoconnect=False,
    compat='0.92', 
    transport='buffered')

# block == wait until a connection is available 
# versus raise an exception
connection = pool.get_connection(
    block=True,
    table_prefix=None, 
    table_prefix_separator='_')

The pool could be instantiated manually per-process in the setup flow of a web server framework. For example, in Django, this could be done in settings.py with AUTOCONNECT=False so that connections are not established until the first calls to get_connection().

Retries:

If a connection cannot be established, or is terminated (i.e. by a timeout), it would attempt to re-establish after RETRY_MS milliseconds.

Errors:

ConnectionPool could throw an error right away if it can't establish MIN connections immediately. Otherwise, a call to pool.get_connection will raise various exceptions for things like pool exhaustion (if BLOCK=False), inability to connect to the Thrift endpoint, etc.

Other thoughts:

I can't see how we could support connection pooling between multiple python processes except by implementing a separate process to connect through, similar to pgpool.

Batched gets

HappyBase's batch interface currently supports puts and deletes, but does not allow for gets.

Performance of Happybase

Hi, I am writing my data to HBase using the HappyBase API. The data is 28,969,265 lines, 6 GB in total.

If I write the data using one single script, it takes forever (more than 30 minutes). So I wrote a MapReduce job on Hadoop and divided the work among 1000 reducers (~29,000 lines per reducer). The job took around 30 minutes to finish.

The code uses Table.batch() as the documentation suggests. Table.batch().send() is the only expensive function.

I think my dataset is quite small and shouldn't take such a long time to write. Could you please offer some advice on how to optimize the writing process? Thanks a lot.
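One knob worth checking is the batch_size argument to Table.batch(), which makes the batch flush every N mutations instead of accumulating everything into one huge send. A sketch (parse() is a placeholder; the right size depends on row size and cluster capacity):

# Flush a mutateRows call every 1000 puts instead of one giant send
with table.batch(batch_size=1000) as b:
    for line in lines:
        key, data = parse(line)  # placeholder for your own parsing
        b.put(key, data)
# leaving the 'with' block sends any remaining mutations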

API to get only row keys

Is there a way to get just the row keys (without the column data) from HBase using happybase?

Use case:
I am trying to implement pagination on rows. My row keys are random integers; they are unique but not sequential.

The closest thing to efficient pagination I could come up with is (see the sketch after this list):

a. Get all the row keys
b. Loop through the row keys (in batches of 100) and get the column data, when needed
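A sketch of one page of that loop using HBase's key-only filters, so that almost no column data crosses the wire (whether your HBase/Thrift combination accepts these filter names is an assumption):

# FirstKeyOnlyFilter() returns only the first cell per row and
# KeyOnlyFilter() strips its value, so the scan yields keys cheaply.
page = table.scan(
    row_start=last_seen_key,  # hypothetical: last key of the previous
                              # page; note the start key is included again
    limit=100,
    filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()',
)
row_keys = [key for key, _ in page]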

Connection timeout setting

I was looking at happybase documentation here for Connection class ( http://happybase.readthedocs.org/en/latest/api.html ), and was searching for timeout parameter but couldn't find one.

Is a happybase.Connection() to an HBase Thrift server a persistent connection? If not, does it have a default timeout (does it fall back to the timeout defined in the Python Thrift bindings) that can be changed?

Python 3 support

This would be trivial for HappyBase, but the underlying Thrift library needs to be Python 3 compatible first.

A simple '2to3' on the source code seems to work, but this needs to be properly supported in a stable Thrift release first.

Status/todo for this issue:

  • Merge #78
  • Regenerate Thrift bindings using a newer Thrift version (maybe not needed)
  • Update docs (TODO.rst)

table_prefix assumes underscore separator

The table_prefix parameter on Connection is incredibly useful, except that it mandates the underscore character as a separator. This is a problem in situations where you don't want any character after your prefix (i.e. it has a set length), or indeed want to use something else.

I propose a change that requires users to specify the separator of their choice as part of the prefix. The effort required for users to update their code would be minimal, and the code would benefit from not having a "table_prefix_separator"!

-d

AttributeError: 'module' object has no attribute 'Connection'

I don't know if this is the right place to ask questions, but I don't know where else to go. Anyway...
I downloaded the happybase package from PyPI and installed it using "python setup.py install --prefix=~/.local".
Then I tried python -c "import happybase"; that's OK.
But when I run a script it shows the error below:

Traceback (most recent call last):
  File "happybase.py", line 1, in <module>
    import happybase
  File "/home/linqili/sh/thrift_hbase/happybase.py", line 3, in <module>
    connection = happybase.Connection('192.168.19.107', 39090)
AttributeError: 'module' object has no attribute 'Connection'

here is my test code:

import happybase

connection = happybase.Connection('192.168.19.107', 39090)
table = connection.table("AD_DSP")
for key, data in table.rows(['1003_64133', '1_1']):
    print key, data

Did I do something wrong?

Does table.cells support variable column names in one program?

Hi wbolster,
I want to see the versions of data in HBase, so I'm using table.cells(). I can pass the row key as a variable, but for that row key I have many column names and I don't know what they are. Can I iterate over the column names as well?
I have only one column family, cf1.
I have pasted my code below; can you help me?

Thank you

#!/usr/bin/env python

from happybase import Connection
from time import time

TABLE_NAME = 'test'
connect = Connection('localhost')
table_conn = connect.table(TABLE_NAME)

def versioning(rowkey, columnname):
    # cells() returns a list of cell values, newest version first
    versions = table_conn.cells(rowkey, columnname, versions=100000)
    init = 0
    for value in versions:
        print "Data : %s" % value
        init = init + 1

    print "Columnname : %s " % columnname
    print "Data count : %s" % init
    return init

if __name__ == "__main__":
    # For command line test
    rowkey = 'string'
    column = 'cf1'
    start = time()
    count = versioning(rowkey, column)
    print count
    end = time()
    print end - start

Thanks a lot.

Support for Thrift Types

There doesn't seem to be any mention of how best to handle typed data with Thrift. I've seen several examples so far of people storing integers as strings. I understand that using struct would effectively give you this, but are Thrift types really so bad? It seems like using struct.pack would lead to code that has to be very verbose to work in other languages, especially when Thrift already has many predefined data types.
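For the record, the struct.pack approach mentioned above looks like this (a sketch; the big-endian signed 64-bit format matches what HBase's own Bytes.toBytes(long) produces, so other clients can read it back):

import struct

encoded = struct.pack('>q', 42)            # int -> 8-byte big-endian string
decoded = struct.unpack('>q', encoded)[0]  # and back again
table.put('row-1', {'cf:count': encoded})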

What may cause TTransportException: Transport not open?

My code runs well against one HBase instance, but after redirecting to another, all write methods raise this exception:

File "/opt/python27/lib/python2.7/site-packages/happybase/table.py", line 370, in put
batch.put(row, data)
File "/opt/python27/lib/python2.7/site-packages/happybase/batch.py", line 116, in exit
self.send()
File "/opt/python27/lib/python2.7/site-packages/happybase/batch.py", line 54, in send
self._table.connection.client.mutateRows(self._table.name, bms)
File "/opt/python27/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1449, in mutateRows
self.send_mutateRows(tableName, rowBatches)
File "/opt/python27/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1459, in send_mutateRows
self._oprot.trans.flush()
File "/opt/python27/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 169, in flush
self.__trans.write(out)
File "/opt/python27/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 124, in write

message='Transport not open')

If I connect and put immediately, it may succeed. If I connect and put after a while, I get the error. How can I make the connection never break?

Or, how can I check whether the connection is available?

In a Python shell, the connection still works after a long while, but in my code, the connections created in threads terminate soon.
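There is no built-in keepalive in this era of happybase, so the usual workaround is a catch-and-reopen retry around each write. A minimal sketch (safe_put and its arguments are hypothetical names):

from socket import error as socket_error
from thrift.transport.TTransport import TTransportException

def safe_put(connection, table, row, data):
    try:
        table.put(row, data)
    except (TTransportException, socket_error):
        connection.open()     # re-establish the Thrift transport
        table.put(row, data)  # retry once; still fails if the server is down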

Scanner limitations?

Hi, I am writing a solution using HBase (MapR M7 currently) and python for the backend part (data generation, admin interface ... )

I use Java for processing large files into HBase but I need to create some dashboards based on the data I get in Hbase.

I will write some MapReduce jobs to create aggregated batches of data because my main table contains more than 100 million lines. Currently, I am encountering an issue with HappyBase: when I count the number of lines from a table scanner, I always get 234 or 236, and if I loop over the scanner I cannot get more results.

Do you know if I am doing something wrong, or could this be a known issue with Thrift or something like that?

Thanks a lot in advance, and keep up with the good work !

Problem upgrading HBase

I am upgrading HBase from 0.90 to 0.94. It seems OK: the HBase shell shows me the version is 0.94, but happybase does not think so.

co = hb.Connection( '127.0.0.1' )  # or Connection( '127.0.0.1', compat='0.92' )
x = co.table( 'Message' ).scan()
x.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/api.py", line 567, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

Using compat='0.90' succeeds as before:

co.close()
co = hb.Connection( '127.0.0.1', compat='0.90' )
x = co.table( 'Message' ).scan()
x.next()
('52255fa14844acf3cb2e3e5e', {'Duration:': '1', 'Result:': '1', 'Content:': 'content'})

Is HBase upgraded properly?

Cannot filter a table in HBase 0.96 or 0.98

Hi, I have just upgraded HBase from 0.94 to 0.96. happybase works fine except for filtering a table.
Below is my code:

connection = happybase.Connection(host='localhost', port=9090, autoconnect=False, transport='buffered')
connection.open()
filter_string = "SingleColumnValueFilter ('info', 'user_name', =, 'binary:user1')"
table = connection.table('my_table')
data_scan = table.scan(filter=filter_string)

And below is the HBase log:

 2014-02-25 10:16:14,097 ERROR [RpcServer.handler=11,port=58866] ipc.RpcServer: Unexpected throwable object 
org.apache.hadoop.hbase.filter.IncompatibleFilterException: Cannot set batch on a scan using a filter that returns true for filter.hasFilterRow.

Can you help me?

Storing data in HBase (bytes conversion)

Hi,

I have gone through your documentation and have now stumbled upon storing data in HBase. I have read that strings need to be converted to byte strings before they are sent.

I did something like the code below, but it keeps giving me an IOError.

row = str("row")
col1 = str("col1")
value1 = str("value1")

table.put( row , {col1: value1})

Am I doing this right? If I am wrong, can you please give me an example of doing this on Python 2.7+?

Thanks

Using batch in threads loses versions

I am writing a lot of lines to HBase via threads.

thread.run():

msg = queue.get_nowait()
batch.put( msg.key, {...} )
if condi:
    batch.send()

A msg may have two versions; that is to say, there may be two msgs with the same key, and I want to store them in the same row.

The code runs well for a few msgs. When the data grows larger, only one version is stored to HBase for most messages.

Using table.put instead of batch, no versions are lost.

What can I do with batch?
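One plausible explanation is that both versions of a msg end up in the same mutateRows call and receive the same server-assigned timestamp, so the earlier one is overwritten. A sketch of working around that with explicit, distinct timestamps (msg.seq is a hypothetical per-message sequence number; Table.put accepts a timestamp, while mutations in a Batch share the batch's timestamp):

import time

def store(table, msg):
    # distinct timestamps keep both versions of the same key
    ts = int(time.time() * 1000) + msg.seq  # msg.seq is hypothetical
    table.put(msg.key, msg.data, timestamp=ts)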

Cleanup the batch size and caching scanner flags

The current implementation mixes up these options affecting scanner behaviour and performance:

  • batch size for fetching rows, used between the Python process and the Thrift server
  • batch size passed as a scanner option, used between the Thrift server and region servers
  • caching size passed as a scanner option, also used between the Thrift server and region servers

Column Name regex filter

Is it possible to apply a regex filter on column names and get only those columns which match the regex, using happybase?

I tried 'filter': 'regexstring:my_regex_here' in scan_args, but it didn't work.
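The 'regexstring:' prefix is a comparator, so it has to appear inside a comparison filter rather than stand alone, which may be why the bare string didn't work. A sketch using QualifierFilter, which matches on column qualifiers (whether your Thrift gateway parses this is an assumption):

# Match any column whose qualifier matches the regex, within family cf1
regex_filter = "QualifierFilter (=, 'regexstring:my_regex_here')"
for key, data in table.scan(columns=['cf1'], filter=regex_filter):
    print key, data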

Get a list of keys

Is it possible to use table.scan() without getting any columns? I just want to get the row keys, like this pseudocode:

onlyRowKeys = hBaseTable.scan(columns=[''])

With happybase, I wrote the following code:

datacollection = hBaseTable.scan(columns=['columnX:fieldX'])
scannedCollection = list(datacollection)
onlyKeys, bogusData = zip(*scannedCollection)

Is there a better way?

[I can't seem to add a label to this issue]
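A more direct version of the snippet above, iterating once and keeping only the keys; the key-only filter names are an assumption about what the HBase side accepts, but they stop the column data from being shipped at all:

# FirstKeyOnlyFilter() keeps one cell per row, KeyOnlyFilter() drops values
onlyRowKeys = [
    key for key, _ in hBaseTable.scan(
        filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()')
]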

Invalid method scannerOpenWithScan - Thrift Error

This error seems to have appeared once I upgraded to version 0.3. I have a 0.2 virtual environment that works just fine. See the logs below.

=== Version 0.3 ===
(env)ricky-mbp:base ricky$ pip install happybase
Downloading/unpacking happybase
Downloading happybase-0.3.tar.gz (43Kb): 43Kb downloaded
Running setup.py egg_info for package happybase

Downloading/unpacking thrift (from happybase)
Downloading thrift-0.8.0.tar.gz
Running setup.py egg_info for package thrift
.......
$ python scan.py ca-
Traceback (most recent call last):
  File "scan.py", line 12, in <module>
    scan(sys.argv[1])
  File "scan.py", line 8, in scan
    for key, row in table.scan(row_prefix=prefix, columns=['name:first', 'name:last']):
  File "/Users/ricky/Sandbox/base/env/lib/python2.7/site-packages/happybase/api.py", line 535, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "/Users/ricky/Sandbox/base/env/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "/Users/ricky/Sandbox/base/env/lib/python2.7/site-packages/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

=== Version 0.2 ===
(env2)ricky-mbp:base ricky$ pip install happybase==0.2
Downloading/unpacking happybase==0.2
Downloading happybase-0.2.tar.gz (42Kb): 42Kb downloaded
Running setup.py egg_info for package happybase

Downloading/unpacking thrift (from happybase==0.2)
Downloading thrift-0.8.0.tar.gz
Running setup.py egg_info for package thrift
......
$ python scan.py ca- | wc -l
1894
^^ works just fine

Here is the scan code:

def scan(prefix):
    connection = happybase.Connection("192.168.42.201")
    table = connection.table("donors")

    for key , row in table.scan(row_prefix=prefix, columns=['name:first', 'name:last']):
        print(', '.join(map(str.capitalize, [row['name:first'], row['name:last']])))

Invalid method name: 'scannerOpenWithScan'

The sample code in the tutorial results in a thrift.Thrift.TApplicationException:

Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import happybase
>>> connection = happybase.Connection('localhost')
>>> table = connection.table("authors")
>>> for k, v in table.scan():
...     print k, v
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/happybase/api.py", line 546, in scan
  File "build/bdist.linux-x86_64/egg/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
  File "build/bdist.linux-x86_64/egg/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

HBase and Hadoop are installed as part of CDH3 distribution.

ubuntu@hadoop1:~$ thrift -version
Thrift version 0.8.0

What is the fastest way of writing?

I have to write gigabytes of lines per day, each about 1 KB.

I use batch in threads and put about 30,000 lines/second. I want it to be faster.

I tried running under gevent, but it seems much slower than threads. Is there a better way?

How to store an Array / List / Dict / Struct in a column of a Column Family

I am trying to store an array in a column of a column family and get the following error:
TypeError: must be string or read-only character buffer, not list

The workaround I used was to convert it into a string, and then I was able to perform the put command.

However, the challenge is that when I create a Hive table referencing the HBase table and perform an explode, it treats my array as a string and does not break it into multiple rows, the reason being that I converted the array to a string in my put command.

Need some help or example to solve this issue.

Thanks in advance.

Anupam
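Since HBase cells are plain byte strings, a list has to be serialized before put() and deserialized after get(); JSON is one common choice, sketched below. Note this will not make Hive's explode work either, since Hive still sees an opaque string:

import json

table.put('row-1', {'cf:tags': json.dumps(['a', 'b', 'c'])})
row = table.row('row-1')
tags = json.loads(row['cf:tags'])  # back to a Python list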

scan and batch.send fail while put and row succeed

Do I need to do any configuration for HBase?

>>> import happybase as hb
>>> conn = hb.Connection( '10.8.210.182' )
>>> conn.tables()
['Recommendation', 'StatsResult', 'hbase_t1', 'test', 'total_score']
>>> t = conn.table( 'test' )
>>> t.scan()
<generator object scan at 0x4e1a4b0>
>>> for x in t.scan(): print x
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/api.py", line 567, in scan
    scan_id = client.scannerOpenWithScan(self.name, scan)
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1716, in scannerOpenWithScan
    return self.recv_scannerOpenWithScan()
  File "/usr/local/python273/lib/python2.7/site-packages/happybase-0.4-py2.7.egg/happybase/hbase/Hbase.py", line 1733, in recv_scannerOpenWithScan
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'scannerOpenWithScan'

Potentially counter-intuitive behavior on rows() when some results are null

First, thank you for creating happybase! So far it's been a great and very pythonic way to connect my Django project to HBase.

While building a component today, I noticed that table.rows() acts in a somewhat surprising way:

Say I pass (row1, row2, row3) in as the rows parameter and (cf1,) in as columns. If row3,cf1 doesn't exist then rows() returns an empty list, even if row1,cf1 and row2,cf1 do exist.

To me, a more intuitive and useful result would be [result1, result2, None] as the return instead of just []. Also, the happybase API doesn't provide an .exists? method for keys, meaning that it's pretty dangerous to call .rows() unless you're 100% sure all the KeyValues exist. Obviously I could make a loop and make an individual .row() call for each row, but I'm making live calls to HBase to render a webpage, so I'm very concerned about minimizing latency (I'm assuming that .rows() is faster.)

I looked into the code to potentially try to tweak it on my deployment, but it looks like this is happening inside the Thrift code (which I don't understand.) Is this behavior some sort of inherent limit to Thrift, or could we potentially change this?

Thanks!
-George

Incompletely iterated scanners going out of scope are not cleaned

If the iterator returned by Table.scan() is not completely exhausted (i.e. due to a break in a for loop), HappyBase does not close the server-side scanner, even if the Table.scan() result (currently a generator function) has gone out of scope. The (leaked) associated server-side resources are freed only after the scanner times out.

A possible solution would be to make the return type from Table.scan() a real class with __iter__() and next() functions and a __del__() to make sure resources are freed as soon as possible. Additionally, this class can be made a context manager, so that something like this can be used for scanners that may not be fully iterated over:

with table.scan() as scan:
    for row_key, row_data in scan:
        pass  # do something, possibly breaking out of the loop

Storing data in HBase

I'm storing data in HBase and I keep receiving the same error, and I don't understand why...

Traceback (most recent call last):
  File "dump.py", line 107, in <module>
    table.put(idate+'_'+str(count), {'info:latitude': dados['latitude'], 'info:longitude': dados['longitude'], 'info:velocidade': dados['velocidade'], 'info:direcao': dados['direcao']})
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/api.py", line 618, in put
    batch.put(row, data)
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/api.py", line 841, in __exit__
    self.send()
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/api.py", line 779, in send
    self.table.client.mutateRows(self.table.name, bms)
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/hbase/Hbase.py", line 1450, in mutateRows
    self.recv_mutateRows()
  File "/home/Larissa/Downloads/ENV/lib/python2.6/site-packages/happybase/hbase/Hbase.py", line 1472, in recv_mutateRows
    raise result.io
happybase.hbase.ttypes.IOError: IOError(_message='Connection refused')

table.scan() with filter set will always fail in happybase >= 0.7

Description:
batchSize should not be set on scans with a filter.

happybase v0.7 introduced a new argument, batchSize, for TScan in the happybase.table.scan() method. When used with a filter, this parameter causes all scan operations to fail:

happybase always passes batch_size to TScan, even if a filter_string is present.
There is no way to set batch_size to None, since the scan() method validates the batch_size value:
https://github.com/wbolster/happybase/blob/0.7/happybase/table.py#L259

See the corresponding HBase code:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.94.9/org/apache/hadoop/hbase/client/Scan.java?av=f#311

Steps to reproduce:

import happybase
conn = happybase.Connection(host='localhost', port=9090)
conn.create_table('project', {'f': dict()})
table = conn.table('project')

table.put('row1', {'f:qual1': 'val1'})
table.put('row2', {'f:qual1': 'val2'})
table.put('row3', {'f:qual1': 'val1'})

# this operation always fails
for k, v in table.scan(filter="SingleColumnValueFilter ('f', 'qual1', =, 'binary:val1')"): 
    print v
