pycassa / pycassa

Python Thrift driver for Apache Cassandra

Home Page: http://pycassa.github.io/pycassa/

License: Other

Python 99.91% Shell 0.09%

pycassa's Introduction

pycassa

pycassa is a Thrift-based Python client library for Apache Cassandra.

pycassa does not support CQL or Cassandra's native protocol, which are a replacement for the Thrift interface that pycassa is based on. If you are starting a new project, it is highly recommended that you use the newer DataStax python driver instead of pycassa.

pycassa is open source under the MIT license.

Documentation

Documentation can be found here:

http://pycassa.github.com/pycassa/

It includes installation instructions, a tutorial, API documentation, and a change log.

Getting Help

IRC:

Use the #cassandra channel on irc.freenode.net. If you don't have an IRC client, you can use freenode's web based client.

Mailing List:

User list: http://groups.google.com/group/pycassa-discuss
Developer list: http://groups.google.com/group/pycassa-devel

Installation

If pip is available, you can install the latest pycassa release with:

pip install pycassa

If you want to install from a source checkout, make sure you have Thrift installed, and run setup.py as a superuser:

pip install thrift
python setup.py install

Basic Usage

To get a connection pool, pass a Keyspace and an optional list of servers:

>>> import pycassa
>>> pool = pycassa.ConnectionPool('Keyspace1') # Defaults to connecting to the server at 'localhost:9160'
>>>
>>> # or, we can specify our servers:
>>> pool = pycassa.ConnectionPool('Keyspace1', server_list=['192.168.2.10'])

To use the standard interface, create a ColumnFamily instance.

>>> pool = pycassa.ConnectionPool('Keyspace1')
>>> cf = pycassa.ColumnFamily(pool, 'Standard1')
>>> cf.insert('foo', {'column1': 'val1'})
>>> cf.get('foo')
{'column1': 'val1'}

insert() will also update existing columns:

>>> cf.insert('foo', {'column1': 'val2'})
>>> cf.get('foo')
{'column1': 'val2'}

You may insert multiple columns at once:

>>> cf.insert('bar', {'column1': 'val3', 'column2': 'val4'})
>>> cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
>>> cf.get_count('bar')
2

get_range() returns an iterable. You can use list() to convert it to a list:

>>> list(cf.get_range())
[('bar', {'column1': 'val3', 'column2': 'val4'}), ('foo', {'column1': 'val2'})]
>>> list(cf.get_range(row_count=1))
[('bar', {'column1': 'val3', 'column2': 'val4'})]

You can remove entire keys or just a certain column:

>>> cf.remove('bar', columns=['column1'])
>>> cf.get('bar')
{'column2': 'val4'}
>>> cf.remove('bar')
>>> cf.get('bar')
Traceback (most recent call last):
...
pycassa.NotFoundException: NotFoundException()

See the tutorial for more details.


pycassa's Issues

Keep connection alive only for a set time, for better load balancing

To ensure better load balancing, it would be useful to allow a connection to remain alive for a maximum T seconds before being returned to the pool.
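
A minimal sketch of the idea, not pycassa's actual pool: each pooled connection records when it was opened, and the pool closes it instead of keeping it once it is older than a maximum lifetime. The names `PooledConnection`, `TimeLimitedPool`, `max_lifetime`, and `checkin` are hypothetical.

```python
import time


class PooledConnection(object):
    """Hypothetical stand-in for a real connection; it only tracks its age."""

    def __init__(self, server, now=None):
        self.server = server
        self.created_at = time.time() if now is None else now
        self.closed = False

    def close(self):
        self.closed = True


class TimeLimitedPool(object):
    """Keeps idle connections, but retires any that outlived max_lifetime."""

    def __init__(self, max_lifetime):
        self.max_lifetime = max_lifetime  # seconds (the "T" in the request)
        self.idle = []

    def checkin(self, conn, now=None):
        now = time.time() if now is None else now
        if now - conn.created_at >= self.max_lifetime:
            # Too old: close it so the next checkout has to open a fresh
            # connection, possibly to a different server.
            conn.close()
        else:
            self.idle.append(conn)


pool = TimeLimitedPool(max_lifetime=60)
young = PooledConnection('host1:9160', now=1000.0)
old = PooledConnection('host2:9160', now=900.0)
pool.checkin(young, now=1030.0)  # 30 seconds old: kept
pool.checkin(old, now=1030.0)    # 130 seconds old: closed
```

Passing `now` explicitly only makes the sketch deterministic; a real pool would read the clock itself.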

Few questions

Is there any ORM using pycassa? I would like to use Cassandra with Django, and pycassa seems to provide a sufficiently low-level interface.
A new feature of Cassandra 0.7 is secondary indexes created by Cassandra itself (without having to maintain them manually). Does pycassa support them?
Thanks a lot for your answers.

import pycassa

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pycassa/__init__.py", line 4, in <module>
    from pycassa.columnfamily import *
  File "pycassa/columnfamily.py", line 3, in <module>
    pkg_resources.require('Thrift')
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.py", line 620, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.py", line 518, in resolve
    raise DistributionNotFound(req) # XXX put more info here
pkg_resources.DistributionNotFound: Thrift

create_index doesn't seem to work

Hello,

In pycassa 1.0.3, I'm using the create_index method of system_manager. It doesn't seem to have any effect, even though other methods of the same class do. When I check with cassandra-cli or with describe_column_family, the index has not been created. However, I can create it in cassandra-cli, and it is then duly displayed by both describe and the CLI. TIA.

Ability to specify consistency levels per-request

It would be nice if the consistency levels specified per ColumnFamily client were defaults that an argument to the actual request methods (get, put, etc.) could override.
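
A sketch of the requested behaviour, assuming nothing about pycassa internals: the instance holds a default level, and each request method accepts an optional argument that takes precedence when given. `Client`, `_rcl`, and the string levels are hypothetical placeholders.

```python
class Client(object):
    """Hypothetical client; consistency levels are plain strings here."""

    def __init__(self, read_consistency_level='ONE'):
        self.read_consistency_level = read_consistency_level

    def _rcl(self, override):
        # Fall back to the instance default when no override is given.
        if override is None:
            return self.read_consistency_level
        return override

    def get(self, key, read_consistency_level=None):
        level = self._rcl(read_consistency_level)
        # A real client would send the request here; the sketch only
        # reports which level would be used.
        return (key, level)


client = Client(read_consistency_level='ONE')
default_call = client.get('foo')
override_call = client.get('foo', read_consistency_level='QUORUM')
```

Using `None` as the "not specified" marker keeps the per-instance default intact for every call that does not pass the argument.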

ColumnFamily.get() NotFoundException is ambiguous

A NotFoundException is raised when:

1. A key is requested that does not exist
2. The column slice requested is empty
3. column_count is 0
4. None of the columns passed in the 'columns' argument exist

This makes the cause of the exception ambiguous. get() should only raise a NotFoundException in cases where Cassandra itself raises a NotFoundException, namely cases 1 and 4.

Strange error when creating a pool

I spent time looking into that but I just can't get through, please help:

Python 2.6.6 (r266:84292, Apr 29 2011, 11:49:08)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pycassa
>>> import thrift
>>> hosts = ['cassandra01:9160','cassandra02:9160','cassandra03:9160']
>>> pool = pycassa.ConnectionPool('PANDA', hosts, max_retries=10, timeout=10.0, pool_timeout=10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/test/lib/python2.6/site-packages/pycassa-1.0.8-py2.6.egg/pycassa/pool.py", line 622, in __init__
    self._q.put(self._create_connection(), False)
  File "/home/test/lib/python2.6/site-packages/pycassa-1.0.8-py2.6.egg/pycassa/pool.py", line 118, in _create_connection
    wrapper = self._get_new_wrapper(server)
  File "/home/test/lib/python2.6/site-packages/pycassa-1.0.8-py2.6.egg/pycassa/pool.py", line 652, in _get_new_wrapper
    credentials=self.credentials)
  File "/home/test/lib/python2.6/site-packages/pycassa-1.0.8-py2.6.egg/pycassa/pool.py", line 313, in __init__
    super(ConnectionWrapper, self).__init__(*args, **kwargs)
  File "/home/test/lib/python2.6/site-packages/pycassa-1.0.8-py2.6.egg/pycassa/connection.py", line 50, in __init__
    self.set_keyspace(keyspace)
  File "/home/test/lib/python2.6/site-packages/pycassa-1.0.8-py2.6.egg/pycassa/connection.py", line 58, in set_keyspace
    if not self.keyspace or keyspace != self.keyspace:
AttributeError: 'ConnectionWrapper' object has no attribute 'keyspace'

TypeError: get() takes exactly 4 arguments (1 given)

This error crops up using both the git pull, and the v1.0.0 release of pycassa with cassandra 0.7 rc1

here is a traceback:

Traceback (most recent call last):
  File "/home/tomfarvour/NetBeansProjects/PythonCassandraTest/src/pythoncassandratest.py", line 137, in <module>
    main()
  File "/home/tomfarvour/NetBeansProjects/PythonCassandraTest/src/pythoncassandratest.py", line 128, in main
    col_fam = create_column_family(pycassa, connection, ksName, cfName)
  File "/home/tomfarvour/NetBeansProjects/PythonCassandraTest/src/pythoncassandratest.py", line 47, in create_column_family
    return pyc_o.ColumnFamily(pyc_conn, cfName)
  File "/usr/home/tomfarvour/NetBeansProjects/PythonCassandraTest/src/pycassa/columnfamily.py", line 115, in __init__
    self.client = self.pool.get()
  File "/usr/home/tomfarvour/NetBeansProjects/PythonCassandraTest/src/pycassa/pool.py", line 402, in new_f
    result = getattr(super(ConnectionWrapper, self), f.__name__)(*args, **kwargs)
TypeError: get() takes exactly 4 arguments (1 given)

Thrift errors leak

TypeError: expected string or Unicode object, int found

There is no apparent way to figure out which column is causing the problem.

Publish on Cheeseshop

It'd be handy to have Pycassa published in the Cheeseshop (http://pypi.python.org).

alter_column for super column families broken?

It looks like alter_column for supercolumn families is broken in 1.0.6 (but not 1.0.4). I create a supercolumn family with a time UUID comparator and a UTF8 subcomparator. When I call alter_column('keyspace', 'column_family', 'foo_column', pycassa.UTF8_TYPE), pycassa complains that 'foo_column' is not valid for a time UUID.

Will try to provide more detail and sample code tonight.

Running sys.create_index() fails with a non-string column name.

I can't seem to create an index on a column named numerically.

schematool has been deprecated

Please update the documentation.

OverflowError: mktime argument out of range

This exception (see the title) is raised when I try to save a date as a pycassa date object (pycassa.DateTime()). The date comes from a date-of-birth value, e.g. 1956-05-03. How do I avoid this issue?
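
One possible workaround, assuming the overflow comes from `time.mktime` (which on some platforms rejects dates before 1970): compute the offset from the epoch with plain `datetime` arithmetic instead. The helper name `to_epoch_seconds` is hypothetical, and the sketch treats the naive datetime as UTC.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def to_epoch_seconds(dt):
    """Seconds since 1970-01-01 for a naive datetime, treated as UTC.

    This is plain timedelta arithmetic, so dates before 1970 simply
    give a negative number instead of raising OverflowError.
    """
    delta = dt - EPOCH
    return delta.days * 86400 + delta.seconds


birth = datetime.datetime(1956, 5, 3)
seconds = to_epoch_seconds(birth)

# Round trip back to a datetime to confirm nothing was lost.
restored = EPOCH + datetime.timedelta(seconds=seconds)
```

Whether a negative timestamp is acceptable for the column's validator is a separate question that this sketch does not answer.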

Unable to get values of RF and replication strategy for given keyspace.

It seems I am unable to get the parameters of a keyspace, such as
replication_factor, replication_strategy, and strategy_options,
in a form suitable for calling create_keyspace() with them.

Would it be possible to add
get_keyspace_properties(), returning the same data as used in create_keyspace()?

error in test in python 2.5.4 ('with' statement will be reserved in 2.6)

...........:1: Warning: 'with' will become a reserved keyword in Python 2.6
E..........S..S........................S...................
ERROR: test_contextmgr (tests.test_batch_mutation.TestMutator)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/me/Downloads/pycassa/tests/test_batch_mutation.py", line 98, in test_contextmgr
    assert cf.get('3') == ROWS['3']"""
  File "", line 1
    with cf.batch(queue_size=2) as b:
          ^
SyntaxError: invalid syntax

----------------------------------------------------------------------
Ran 70 tests in 45.069s

FAILED (SKIP=3, errors=1)

"from __future__ import with_statement" should fix it

Framed transport is default in 0.7

Changing the default framed_transport arg in connect to True is probably a good idea.

Python 2.5 Multithreading Issues

#
# Test for verifying that pycassa works consistently when reads and writes are done in QUORUM consistency level
#
# NOTE: this was made, because test setup with one node with trunk cassandra and pycassa 1.0.6 
#       failed when python-2.5 interpreter was used. When interpreter was changed to 2.6 everything 
#       started to work as expected. So basically pycassa 1.0.6 has bug with multithreading when it
#       is used with python-2.5 (or my python 2.5 installation is buggy) 
#
# @author Mikael Lepisto
#

import pycassa
import unittest
import datetime
import uuid
from threading import Thread
from multiprocessing import Pool

import logging
log = pycassa.PycassaLogger()
log.set_logger_name('pycassa_library')
log.set_logger_level('warn')
log.get_logger().addHandler(logging.StreamHandler())

cf_name = "ThreadingTest"

DISABLE_MULTIPROCESS_TEST = False
PROCESS_COUNT = 10
MULTIPROCESS_THREAD_COUNT = 5
SINGLE_PROCESS_THREAD_COUNT = 50
RUN_TIME = 5

keyspace = "test_keyspace"
servers = ['localhost']

class TestThread(Thread):

    def __init__(self):
        super(TestThread,self).__init__()
        sys = pycassa.system_manager.SystemManager('localhost')
        connection_pool = pycassa.pool.ConnectionPool(keyspace, server_list=servers, 
                                                      credentials=None, timeout=0.5, use_threadlocal=True, 
                                                      pool_size=5, max_overflow=0, prefill=True, pool_timeout=30, 
                                                      recycle=10000, max_retries=5, listeners=[], logging_name=None, 
                                                      framed_transport=True)

        cf = None
        try:
            cf = pycassa.columnfamily.ColumnFamily(connection_pool, cf_name, autopack_names=False, autopack_values=False)
        except pycassa.cassandra.ttypes.NotFoundException:
            sys.create_column_family(keyspace, cf_name, comparator_type=pycassa.system_manager.UTF8_TYPE)
            cf = pycassa.columnfamily.ColumnFamily(connection_pool, cf_name, autopack_names=False, autopack_values=False)
        self.cf = cf
        self.write_success = 0
        self.write_fail = 0
        self.read_found = 0
        self.read_not_found = 0

    def run(self):
        until = datetime.datetime.now() + datetime.timedelta(seconds=RUN_TIME)
        while datetime.datetime.now() < until:
            new_key = str(uuid.uuid1())
            try:
                self.cf.insert(new_key, {'test_value' : '1'}, 
                               write_consistency_level=pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM)
                self.write_success += 1
            except Exception, e:
                self.write_fail += 1
                raise e

            written_value = None
            try:
                written_value = self.cf.get(new_key, 
                                            read_consistency_level=pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM)
                self.read_found += 1
            except pycassa.cassandra.ttypes.NotFoundException,e:
                self.read_not_found += 1

            if written_value:
                self.cf.remove(new_key, 
                               write_consistency_level=pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM)

def run_multithread(args):
    pid = args['pid']
    thread_count = args['thread_count']

    # start threads
    threads = []
    for i in range(thread_count):
        test = TestThread()
        test.start()
        threads.append(test)

    # wait to complete and collect results
    write_success = 0
    write_fail = 0
    read_found = 0
    read_not_found = 0

    for test in threads:
        test.join()
        write_success += test.write_success
        write_fail += test.write_fail
        read_found += test.read_found
        read_not_found += test.read_not_found

    ret_val = "-------- %s threads: %i ----------\n" % (pid, thread_count)
    ret_val += "\n".join(["%s : %s" % (key,value) for key,value in [('write_success', write_success),
                                                                    ('read_found', read_found),
                                                                    ('write_fail', write_fail),
                                                                    ('read_not_found', read_not_found)]])
    failed = False
    if write_success != read_found:
        ret_val += "\nWriteError: Some results were not found after they were written to db."
        failed = True

    return pid,ret_val,failed


class ThreadingTest(unittest.TestCase):

    def test_with_multiple_threads(self):        
        pid,result,failed = run_multithread(dict(pid="SingleProcess", 
                                                 thread_count=SINGLE_PROCESS_THREAD_COUNT))
        print result
        assert not failed, "There were errors during threading."

    def test_with_multiple_process(self):

        if DISABLE_MULTIPROCESS_TEST:
            return

        p = Pool(PROCESS_COUNT)
        results = p.map(run_multithread, 
                        [dict(thread_count=MULTIPROCESS_THREAD_COUNT, 
                              pid="Process-%i" % i) for i in range(PROCESS_COUNT)])

        failed_processes = 0
        for pid,result,failed in results:
            print result
            if failed: failed_processes += 1

        assert failed_processes == 0,"There was failures in %i processes" % failed_processes

Strange scaling when adding nodes

Hello, pycassa 1.04 vs a cluster of three machines -- the read latency as reported by cfstats becomes much worse when I grow the cluster from one to three. Is there a possibility that read consistency level was set to all? I didn't do it, but can there be some side effect I'm not aware of?

Thanks,
Maxim

Mandatory keyspace per connection breaks system commands

Not a big issue, but if we'd like to add system commands to pycassa, the current model of hardwiring a session to a particular keyspace will cause some issues.

To exemplify: when connecting to a freshly installed Cassandra 0.7, there is no keyspace initially.

cassandra.ttypes issue

I am trying to setup Twissandra. Everything seems to work until I try to create the keyspace for Cassandra. I run 'python manage.py sync_cassandra' and it returns the following error.

ImportError: No module named cassandra.ttypes

Any ideas?

Pycassa expects integers to be stored in a different format from cassandra-cli

If you set a value through the cassandra-cli for a column with IntegerType as the validation class like this:

set CFName['key']['value']=1;

Then reading it back through pycassa results in:

[...application stack...]
File "/Library/Python/2.6/site-packages/pycassa-1.0.4-py2.6.egg/pycassa/columnfamily.py", line 496, in get_indexed_slices
key_slice.columns, include_timestamp))
File "/Library/Python/2.6/site-packages/pycassa-1.0.4-py2.6.egg/pycassa/columnfamily.py", line 179, in _convert_ColumnOrSuperColumns_to_dict_class
    ret[self._unpack_name(col.name)] = self._convert_Column_to_base(col, include_timestamp)
File "/Library/Python/2.6/site-packages/pycassa-1.0.4-py2.6.egg/pycassa/columnfamily.py", line 160, in _convert_Column_to_base
    value = self._unpack_value(column.value, column.name)
File "/Library/Python/2.6/site-packages/pycassa-1.0.4-py2.6.egg/pycassa/columnfamily.py", line 293, in _unpack_value
    return self._unpack(value, self._get_data_type_for_col(col_name))
File "/Library/Python/2.6/site-packages/pycassa-1.0.4-py2.6.egg/pycassa/columnfamily.py", line 341, in _unpack
    return _int_packer.unpack(b)[0]
error: unpack requires a string argument of length 4

The problem is that pycassa is expecting a 4-byte value (for example '\u0000\u0000\u0000\u0001') and is getting the single byte '\u0001' instead.

Cassandra version is 0.7.3 and pycassa 1.0.4
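
The mismatch can be reproduced without Cassandra. This sketch assumes, as the traceback suggests, that pycassa 1.0.4 unpacks the value as a fixed 4-byte big-endian integer, while the stored value is a variable-length big-endian two's-complement integer. `decode_varint` is a hypothetical helper written for Python 3, not part of pycassa.

```python
import struct


def decode_varint(raw):
    """Decode a variable-length big-endian two's-complement integer."""
    return int.from_bytes(raw, byteorder='big', signed=True)


stored = b'\x01'  # the single byte reported in this issue for the value 1

try:
    struct.unpack('>i', stored)  # fixed 4-byte unpack, as in the traceback
    fixed_width_ok = True
except struct.error:
    fixed_width_ok = False

value = decode_varint(stored)
```

A decoder of this shape accepts both the one-byte form and a zero-padded four-byte form, which is why it sidesteps the length check that fails above.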

_find_server always iterates servers in the same order

so the first server will get a disproportionate share of the load.

copying the master list and calling random.shuffle() before looping would fix this.
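
The suggested fix, sketched in isolation: shuffle a copy so that the master list keeps its order while each connection attempt walks the servers in a different sequence. The function name and the host names are hypothetical.

```python
import random


def servers_in_random_order(server_list):
    """Return a shuffled copy, leaving the master list untouched."""
    candidates = list(server_list)
    random.shuffle(candidates)
    return candidates


master = ['cassandra01:9160', 'cassandra02:9160', 'cassandra03:9160']
attempt_order = servers_in_random_order(master)
```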

column_reversed order incorrect!

CfDef(keyspace, cf_name, column_type='Standard', comparator_type='LongType')

client= connect(keyspace)
cf= ColumnFamily(client, cf_name)
for i in range(1,100):
    cf.insert(key=keyname, columns={int(time.time()*1e6): str(i)})

result= cf.get(key=keyname, column_count=3, column_reversed=True, column_start='')

result is not reversing the order

Cassandra 0.7, pycassa 0.4.0

super column get_indexed_slices bug?

Having problems with the following:
expr = create_index_expression('lastName', 'Smith')
clause = create_index_clause([expr])
list(columnFamily.get_indexed_slices(clause, super_column='userData'))
[]

Works fine when not a super column family...

Currently running:
pycassa 1.0.4
python 2.4.3
cassandra 0.7
thrift 0.5.0

`not found TimedOutException` on Debian 5.0

{{{
In [13]: for i in dir(cassandra.ttypes):
   ....:     if i.find('Exception') != -1:
   ....:         print i
   ....:
   ....:
InvalidRequestException
NotFoundException
TApplicationException
TException
UnavailableException
}}}

My patch is here:

http://github.com/shuge/pycassa/commit/90bed587d972ecc31ee671ee267be47f14906044#diff-1

Please fix it, thanks.

Remove should accept timestamps

syntax error in 1.0.7?

ERROR: Failure: SyntaxError (invalid syntax (pool.py, line 418))
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nose-1.0.0-py2.5.egg/nose/loader.py", line 390, in loadTestsFromName
    addr.filename, addr.module)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nose-1.0.0-py2.5.egg/nose/importer.py", line 39, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/nose-1.0.0-py2.5.egg/nose/importer.py", line 86, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/Users/me/Downloads/pycassa/tests/__init__.py", line 1, in <module>
    from pycassa.system_manager import *
  File "/Users/me/Downloads/pycassa/pycassa/__init__.py", line 5, in <module>
    from pycassa.pool import *
  File "/Users/me/Downloads/pycassa/pycassa/pool.py", line 418
    return new_f(self, *args, reset=True, **kwargs)
                                  ^
SyntaxError: invalid syntax

----------------------------------------------------------------------

error in autopacking

using cassandra svn and pycassa 0.4
see:
http://pastebin.com/7pwqQpUP

CFMap get() and others don't document all parameters

The full list of parameters needs to be shown in the docs. Just pointing to CF.get() for all of the info about them is probably all that's needed.

import error pycassa module

I'm getting this error when I try to import pycassa from the Python interpreter:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pycassa/__init__.py", line 4, in <module>
    from pycassa.columnfamily import *
  File "pycassa/columnfamily.py", line 1, in <module>
    import pkg_resources
ImportError: No module named pkg_resources

This happens on both Red Hat Linux and Windows XP.

P.S. pycassa version: pycassa-0.2.0

Make connection pool more robust w.r.t connection problems

I'm running pycassa from master, commit 7821aa4...

The connection pool does not handle network failure as well as it could. Specifically, pulling a node from under the connection pool in a way that results in Broken Pipe or TTransportException (due to TSocket problems) is not handled.

How to reproduce

One way to reproduce this problem is to set up a small cluster that requires QUORUM for writes, point a pool to this cluster, then shut down enough Cassandra instances to cause a write failure. There may be an easier way to do this.

Problem in detail

I've seen two exceptions, one for TTransportException and one for broken pipe. Here they are (there are some variations depending on which requests are performed after the nodes were taken down):

(Note that the line numbers may be slightly off due to debug printing when investigating this problem)

   File "/var/lib/python-support/python2.5/project/storage.py", line 130, in write_single
     cf = pycassa.ColumnFamily(self.client, self.COLUMN_FAMILY_NAME)
   File "/var/lib/python-support/python2.5/pycassa/columnfamily.py", line 116, in __init__
     col_fam = self.client.get_keyspace_description(use_dict_for_col_metadata=True)[self.column_family]
   File "/var/lib/python-support/python2.5/pycassa/pool.py", line 484, in get_keyspace_description
     ks_def = self.describe_keyspace(keyspace)
   File "/var/lib/python-support/python2.5/pycassa/cassandra/Cassandra.py", line 1075, in describe_keyspace
     return self.recv_describe_keyspace()
   File "/var/lib/python-support/python2.5/pycassa/cassandra/Cassandra.py", line 1086, in recv_describe_keyspace
     (fname, mtype, rseqid) = self._iprot.readMessageBegin()
   File "/usr/lib/python2.5/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
     sz = self.readI32()
   File "/usr/lib/python2.5/site-packages/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
     buff = self.trans.readAll(4)
   File "/usr/lib/python2.5/site-packages/thrift/transport/TTransport.py", line 58, in readAll
     chunk = self.read(sz-have)
   File "/usr/lib/python2.5/site-packages/thrift/transport/TTransport.py", line 272, in read
     self.readFrame()
   File "/usr/lib/python2.5/site-packages/thrift/transport/TTransport.py", line 276, in readFrame
     buff = self.__trans.readAll(4)
   File "/usr/lib/python2.5/site-packages/thrift/transport/TTransport.py", line 58, in readAll
     chunk = self.read(sz-have)
   File "/usr/lib/python2.5/site-packages/thrift/transport/TSocket.py", line 108, in read
     raise TTransportException(type=TTransportException.END_OF_FILE, message='TSocket read 0 bytes')
TTransportException: TSocket read 0 bytes

And here's broken pipe:

   File "/var/lib/python-support/python2.5/project/storage.py", line 130, in write_single
     cf = pycassa.ColumnFamily(self.client, self.COLUMN_FAMILY_NAME)
   File "/var/lib/python-support/python2.5/pycassa/columnfamily.py", line 116, in __init__
     col_fam = self.client.get_keyspace_description(use_dict_for_col_metadata=True)[self.column_family]
   File "/var/lib/python-support/python2.5/pycassa/pool.py", line 484, in get_keyspace_description
      ks_def = self.describe_keyspace(keyspace)
   File "/var/lib/python-support/python2.5/pycassa/cassandra/Cassandra.py", line 1074, in describe_keyspace
      self.send_describe_keyspace(keyspace)
   File "/var/lib/python-support/python2.5/pycassa/cassandra/Cassandra.py", line 1083, in send_describe_keyspace
     self._oprot.trans.flush()
   File "/usr/lib/python2.5/site-packages/thrift/transport/TTransport.py", line 293, in flush
     self.__trans.write(buf)
   File "/usr/lib/python2.5/site-packages/thrift/transport/TSocket.py", line 117, in write
     plus = self.handle.send(buff)
   File "/usr/lib/python2.5/site-packages/gevent/socket.py", line 458, in send
     return sock.send(data, flags)
error: (32, 'Broken pipe')

I've looked at the pool and it seems to me there are two versions of this problem:

1. Requests in general have the ConnectionWrapper::retry decorator, which closes a broken connection and replaces it. This is the desirable behaviour for a broken pipe. However, the decorator only handles TimedOutException and UnavailableException, when we may also want to handle, for example, Thrift.TException, socket.error, and IOError.
2. ConnectionWrapper::get_keyspace_description in particular has no retry logic, so if there's a connection problem there, it will fail. On a broken pipe, the socket will be returned to the pool and may cause arbitrary errors when the connection is reused later.
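
The second point can be illustrated independently of pycassa. In this sketch, which only considers transport-level errors, a connection goes back to the pool after a successful request and is closed and dropped when a socket error escapes. `FakeConnection`, `SimplePool`, and `borrowed` are hypothetical names.

```python
import socket
from contextlib import contextmanager


class FakeConnection(object):
    """Hypothetical connection used only to exercise the sketch."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SimplePool(object):
    def __init__(self):
        self.idle = []

    @contextmanager
    def borrowed(self, conn):
        try:
            yield conn
        except (socket.error, IOError):
            # The transport is in an unknown state: close it and do NOT
            # put it back, so nobody reuses a broken socket later.
            conn.close()
            raise
        else:
            self.idle.append(conn)


pool = SimplePool()
good, bad = FakeConnection(), FakeConnection()

with pool.borrowed(good):
    pass  # the request succeeds

failure_propagated = False
try:
    with pool.borrowed(bad):
        raise socket.error(32, 'Broken pipe')
except socket.error:
    failure_propagated = True
```

The error is re-raised on purpose: whether to retry on a fresh connection is a separate decision from keeping the broken one out of the pool.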

A workaround

Here is a small patch that works around the problem. I'm not advocating that this is a proper fix, rather it's meant to highlight the problem experienced. What the workaround does is make retry handle socket errors and arbitrary thrift exceptions. Then there's an ugly hack so we can use _retry on get_keyspace_description.

diff --git a/pycassa/pool.py b/pycassa/pool.py
index f84f649..e36da1d 100644
--- a/pycassa/pool.py
+++ b/pycassa/pool.py
@@ -11,6 +11,7 @@ import weakref, time, threading, random

 import connection
 import queue as pool_queue
+import socket
 from logging.pool_logger import PoolLogger
 from util import as_interface
 from cassandra.ttypes import TimedOutException, UnavailableException
@@ -399,10 +400,14 @@ class ConnectionWrapper(connection.Connection):
         def new_f(self, *args, **kwargs):
             self.operation_count += 1
             try:
-                result = getattr(super(ConnectionWrapper, self), f.__name__)(*args, **kwargs)
+                try:
+                    result = getattr(super(ConnectionWrapper, self), f.__name__)(*args, **kwargs)
+                except AttributeError:
+                    # Hack for get_keyspace_description
+                    result = f(self, *args, **kwargs)
                 self._retry_count = 0 # reset the count after a success
                 return result
-            except (TimedOutException, UnavailableException), exc:
+            except (TimedOutException, UnavailableException, Thrift.TException, socket.error, IOError), exc:
                 self._pool._notify_on_failure(exc, server=self.server,
                                               connection=self)

@@ -424,6 +429,7 @@ class ConnectionWrapper(connection.Connection):
                     self._pool._replace_wrapper()
                 self._replace(self._pool.get())
                 return new_f(self, *args, **kwargs)
+
         new_f.__name__ = f.__name__
         return new_f

@@ -466,7 +472,7 @@ class ConnectionWrapper(connection.Connection):
     def truncate(self, *args, **kwargs):
         pass

-
+    @_retry
     def get_keyspace_description(self, keyspace=None, use_dict_for_col_metadata=False):
         """
         Describes the given keyspace.

Cheers,
Björn

CFMap get_indexed_slices doesn't pack values without instance argument

ColumnFamilyMap.get_indexed_slices() only packs IndexExpression values if an instance argument is passed. It should pack values in the IndexExpressions regardless.

sudo python setup.py install

I get the following error when running the command in the title for pycassa:

byte-compiling /usr/lib/python2.4/site-packages/pycassa/connection.py to connection.pyc
File "/usr/lib/python2.4/site-packages/pycassa/connection.py", line 125
self._logins = logins if logins is not None else {}
^

SyntaxError: invalid syntax

Please help!

connection.py getattr retry smashes the stack

In connection.py,

  def __getattr__(self, attr):
    def _client_call(*args, **kwargs):

On UnavailableException, TimedOutException, _client_call will recursively be retried.

This will eventually smash the stack, and you will get a RuntimeError.

A suggestion is to either raise and let the application handle retries, or have some recursion depth limit, perhaps with a sleep between retries.
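
A sketch of the second suggestion, assuming nothing about connection.py beyond what is described above: retry in a loop with a fixed upper bound and an optional pause, so the stack depth stays constant. `RetryableError` stands in for UnavailableException and TimedOutException, and `call_with_retries` is a hypothetical helper.

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for UnavailableException / TimedOutException."""


def call_with_retries(func, max_retries=3, delay=0.0, sleep=time.sleep):
    """Call func(); on RetryableError retry in a loop, not recursively."""
    attempt = 0
    while True:
        try:
            return func()
        except RetryableError:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and let the application decide
            sleep(delay)


calls = []


def flaky():
    """Fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RetryableError()
    return 'ok'


def always_fail():
    raise RetryableError()


result = call_with_retries(flaky, max_retries=5)

try:
    call_with_retries(always_fail, max_retries=2)
    gave_up = False
except RetryableError:
    gave_up = True
```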

ColumnFamilyMap.remove with SuperColumn Error

It seems the keyword argument should be super_column instead of column:
return self.column_family.remove(instance.key, super_column=instance.super_column)
Also, the keyword argument would be better renamed to columns (with an s) instead of column, to be
consistent with ColumnFamily.

Included cassandra package should probably be removed

Even though it makes sense to embed the cassandra package for convenience, it can easily result in version mismatches.

It would probably be better if it wasn't included at all.

using pkg_require explicitly is hostile to users who install thrift manually

pkg_require generated a DistributionNotFound exception (looks like you expected ImportError?) but ISTM that the right thing here is the simple thing: just go ahead and do the import of the thrift types and let the normal import resolution do its thing.
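
A sketch of the "simple thing": attempt the import directly and only translate a failure into a clearer message. `import_or_explain` is a hypothetical helper, the module names are placeholders, and `importlib.import_module` requires Python 2.7 or later.

```python
import importlib


def import_or_explain(module_name, hint):
    """Import a module by name; on failure, re-raise with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError('%s (%s)' % (exc, hint))


# Works for anything on sys.path, however it was installed.
json_module = import_or_explain('json', 'part of the standard library')

try:
    import_or_explain('no_such_module_for_this_sketch', 'try: pip install it')
    message = None
except ImportError as exc:
    message = str(exc)
```

Because only the normal import machinery is involved, a manually installed package with no distribution metadata is found just like a pip-installed one.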

describe_keyspace(keyspace)

It seems that
describe_keyspace(keyspace)
prints the human-readable description of the keyspace instead of returning it.
Perhaps it could return the data mentioned in
https://github.com/pycassa/pycassa/issues#issue/33

if comparator = 'BytesType' and subcomparator = 'LongType'

create column family CFTEST with column_type = 'Super' and comparator = 'BytesType' and subcomparator = 'LongType' and memtable_throughput = 32;

CFTEST.insert('KEY', {'lowercase': {1: 1, 2: 2, 3: 3}})
then

CFTEST.get('KEY', super_column='lowercase', column_start=1 )
=> A str or unicode column name was expected

CFTEST.get('KEY', super_column='lowercase', column_start='1' )
=> InvalidRequestException(why='Expected 8 or 0 byte long (16)')

There is no NE index clause in Pycassa

Hello,
sometimes we need to query a large table for error codes resulting from production jobs. Often, I want to select non-zero values (and corresponding errors) only. However, there is no 'NE' (non-equal) clause to be created. I understand that this may be due to the nature of secondary index, but then again, maybe it's just an oversight. If you look into that, this will be much appreciated. Thanks!

I can emulate it using existing clauses, but it would be cleaner natively.

Maxim
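One possible stopgap, separate from the clause-based emulation mentioned above, is to filter on the client. The sketch below assumes the rows arrive as an iterable of (key, columns) pairs, which is the shape a range scan such as get_range() yields. The helper name `rows_not_equal` and the sample data are hypothetical. It trades the index lookup for a full scan, so it is only reasonable when scanning the table is acceptable.

```python
def rows_not_equal(rows, column, value):
    """Yield (key, columns) pairs where `column` exists and differs from `value`.

    `rows` is any iterable of (key, columns) pairs; rows that lack the
    column entirely are skipped rather than treated as "not equal".
    """
    for key, columns in rows:
        if column in columns and columns[column] != value:
            yield key, columns


# Hypothetical sample data standing in for the result of a range scan.
sample_rows = [
    ("job1", {"error_code": 0}),
    ("job2", {"error_code": 17, "error": "timeout"}),
    ("job3", {"error_code": 0}),
]
failed_jobs = list(rows_not_equal(sample_rows, "error_code", 0))
```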

Can we have "describe cf" back in System Manager?

Greetings,

I really miss the describe column family functionality that was refactored out of System Manager some time ago. I now run two scripts with different Python paths to work around this, because I don't want to load the pycassa shell just for that. What was the reason for the removal? One would think that the manager already has quite similar functionality.

Thank you,
Maxim

Cassandra Up and Running but NoServerAvailable Exception with Pycassa

pycassa 0.3.0
cassandra 0.6.3

/usr/src# apache-cassandra-0.6.3/bin/cassandra-cli
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> connect localhost/9160
Connected to: "Test Cluster" on localhost/9160
cassandra>

import pycassa
client = pycassa.connect('Keyspace1', timeout=3.5)
cf = pycassa.ColumnFamily(client, 'Standard1')
cf.insert('foo', {'column1': 'val1'})

/usr/lib/python2.5/site-packages/pycassa/columnfamily.pyc in insert(self, key, columns, write_consistency_level)
    330                 cols.append(Mutation(column_or_supercolumn=ColumnOrSuperColumn(column=column)))
    331         self.client.batch_mutate({key: {self.column_family: cols}},
  --> 332                                  self._wcl(write_consistency_level))
    333         return clock.timestamp
    334

/usr/lib/python2.5/site-packages/pycassa/connection.pyc in client_call(*args, **kwargs)
    129         def client_call(*args, **kwargs):
    130             if self._client is None:
--> 131                 self._find_server()
    132             try:
    133                 return getattr(self._client, attr)(*args, **kwargs)

/usr/lib/python2.5/site-packages/pycassa/connection.pyc in _find_server(self)
    167                 continue
    168         self._client = None
--> 169         raise NoServerAvailable()
    170
    171 class ThreadLocalConnection(object):

NoServerAvailable:

Inserting/Retrieving Values for Columns with LongType Comparators Also Doesn't Work as Expected

Following the blog post here: http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes

I created a column family through the CLI:

[default@unknown] create keyspace demo;  
ad80bf08-41fa-11e0-b4ef-e700f669bcfc
[default@unknown] use demo;
Authenticated to keyspace: demo
[default@demo] create column family users with comparator=UTF8Type                       
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS}]; 
be542659-41fa-11e0-b4ef-e700f669bcfc
[default@demo] set users[bsanderson][full_name] = 'Brandon Sanderson';
Value inserted.
[default@demo] set users[bsanderson][birth_date] = 1975;
Value inserted.
[default@demo] get users['bsanderson'];
=> (column=birth_date, value=, timestamp=1298760619049000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1298760612881000)
Returned 2 results.
[default@demo]

Next, I did the following with pycassa:

import pycassa
pool = pycassa.connect('demo', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'users')
cf.insert('odoe', {'full_name': 'John Doe'})
cf.insert('odoe', {'birth_date': 1999})
res = cf.get('odoe')

Running the above gives:

Traceback (most recent call last):
  File "bug.py", line 9, in <module>
    res = cf.get('odoe')
  File "/home/posulliv/repos/pycassa/pycassa/columnfamily.py", line 344, in get
    return self._convert_ColumnOrSuperColumns_to_dict_class(list_col_or_super, include_timestamp)
  File "/home/posulliv/repos/pycassa/pycassa/columnfamily.py", line 149, in _convert_ColumnOrSuperColumns_to_dict_class
    ret[self._unpack_name(col.name)] = self._convert_Column_to_base(col, include_timestamp)
  File "/home/posulliv/repos/pycassa/pycassa/columnfamily.py", line 130, in _convert_Column_to_base
    value = self._unpack_value(column.value, column.name)
  File "/home/posulliv/repos/pycassa/pycassa/columnfamily.py", line 264, in _unpack_value
    return util.unpack(value, self._get_data_type_for_col(col_name))
  File "/home/posulliv/repos/pycassa/pycassa/util.py", line 176, in unpack
    return _long_packer.unpack(byte_array)[0]
struct.error: unpack requires a string argument of length 8

And if I go back to the CLI:

[default@demo] get users['odoe'];
=> (column=birth_date, value=, timestamp=1298760813075848)
=> (column=full_name, value=John Doe, timestamp=1298760813074461)
Returned 2 results.
[default@demo]

However, if I change the insert in python to be:

cf.insert('odoe', {'full_name': 'John Doe', 'birth_date': 1983})

It works fine:

OrderedDict([(u'birth_date', 1983), (u'full_name', u'John Doe')])

And also through the CLI:

[default@demo] get users['jdoe'];
=> (column=birth_date, value=1983, timestamp=1298760786582058)
=> (column=full_name, value=John Doe, timestamp=1298760786582058)
Returned 2 results.
[default@demo]
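The struct.error in the traceback can be reproduced without Cassandra. The sketch below assumes that `_long_packer` is an 8-byte big-endian `struct.Struct` (the format string is an assumption) and that the stored value was not 8 bytes long. The 4-byte text `b"1999"` is a hypothetical stand-in, since the report does not show what bytes were actually written by the single-column insert.

```python
import struct

# Assumption: mirrors the _long_packer named in the traceback.
_long_packer = struct.Struct(">q")


def try_unpack_long(byte_array):
    """Return the decoded integer, or None if the bytes are not a valid long."""
    try:
        return _long_packer.unpack(byte_array)[0]
    except struct.error:
        # Raised whenever the input is not exactly 8 bytes long.
        return None


stored_correctly = _long_packer.pack(1983)  # 8 bytes, decodes cleanly
stored_wrongly = b"1999"                    # hypothetical non-8-byte value
```

If this assumption holds, the value was not packed as a long on the way in when birth_date was inserted on its own.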

pycassa's mapreduce

Is there an example of using pycassa's MapReduce support?
Thank you!

Untitled

create_column_family() seems to have broken and misleading parameter descriptions,
like: "or float key_cache_size"
or:
key_cache_save_in_seconds instead of row_cache_save_period_in_seconds
and:
row_cache_save_in_seconds instead of key_cache_save_period_in_seconds

Auto-packing should also check the CF for a default_validation_class

See: http://issues.apache.org/jira/browse/CASSANDRA-891 (now resolved!)

If there's no validation_class on the column metadata, go up one level and check the CF metadata as well (then failing that, BytesType no-op).
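The proposed lookup order can be sketched as a small function. This is only an illustration of the fallback chain described above: the function name `resolve_validation_class` and the plain-dict shape of `column_metadata` are assumptions, not pycassa's internal representation.

```python
def resolve_validation_class(column_name, column_metadata,
                             default_validation_class=None):
    """Pick the validator for one column.

    Order: per-column validation_class, then the column family's
    default_validation_class, then BytesType (no packing at all).
    `column_metadata` is assumed to map column name -> validator name.
    """
    validator = column_metadata.get(column_name)
    if validator:
        return validator
    if default_validation_class:
        return default_validation_class
    return "BytesType"
```

For example, with metadata `{"birth_date": "LongType"}` and a default of `"UTF8Type"`, `birth_date` resolves to LongType and any other column to UTF8Type.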

Document eventlet and multiprocess

Need to give a rough idea of what works and doesn't work with eventlet and multiprocess in the documentation.

Column Family error

When I execute the following command:
cf = pycassa.ColumnFamily(client, 'Standard2')
I get this error message - http://pastie.org/1107607. The keyspace I have connected to definitely exists. I checked this through cassandra-cli: http://pastie.org/1107610.
I have also tried
client.describe_version()
and the same error occurred.

batch_insert documentation is incorrect for super CFs

"The rows parameter should be of the form {key: {column_name: column_value}} if this is a standard column family or {key: {column_name: column_value}} if this is a super column family."

Presumably, the rows parameter should be a different form for the super CF.

See: http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch_insert
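For comparison, here are the two shapes the documentation presumably meant to describe. The super column family form is an assumption based on how insert() takes nested dicts for super column families, and the names `standard_rows`, `super_rows`, and `nesting_depth` are hypothetical.

```python
# Standard column family: {key: {column_name: column_value}}
standard_rows = {"key1": {"column1": "val1"}}

# Super column family (presumed):
# {key: {super_column_name: {column_name: column_value}}}
super_rows = {"key1": {"super1": {"column1": "val1"}}}


def nesting_depth(value):
    """Count how many dict levels wrap the first leaf value."""
    depth = 0
    while isinstance(value, dict) and value:
        depth += 1
        value = next(iter(value.values()))
    return depth
```

The only difference is one extra level of nesting for the super column name.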