douban / dpark Goto Github PK

View Code? Open in Web Editor NEW

2.7K 2.7K 534.0 2.71 MB

Python clone of Spark, a MapReduce alike framework in Python

License: BSD 3-Clause "New" or "Revised" License

Shell 0.26% Python 92.66% CSS 0.36% JavaScript 3.49% HTML 1.34% C 1.69% Dockerfile 0.19%

bigdata dpark mapreduce python spark stream-processing

dpark's People

Contributors

Stargazers

Watchers

Forkers

ccl0326 congmo yiyinianhua seawavet ywdong seacoastboy kulasama williamyangcn proximamonkey iss mtfelix jinxiaoye1987 ssmithcr kewinwang andyyehoo wangxiaomo jackliu8722 qingfeng joshrosen yangls06 wuhoufeng stefanie924 hivefans cute guoyunsky ryaninhust dlutcat revskill10 dshadowzh hubpeter ipconfigme ilikethiscat liyinhgqw woerwin imcom orangelpai qichenftw auxten todayilearned menghan jiangdapeng muxueqz vincentbdb zhengyuxin jimmylovelife ballacky13 xunzhang hellcoderz florianleibert lzchit mikegoodspeed gosteven mwzkqmuuzkflbxm guibog raymondhe rudyli tclh123 xuyong yinxusen cylinderlee daqing15 errord baeeq olemis npc7 flaneur2020 kenlen arekyao fashtimedotcom banzhuanpig huangxinms flclain jeffreywugz eclipselu qingyuanzhang zz198808 xxling waytai wtzhuque cloudsyk fangang ljc0753 yminer ustbgaofan flyfire ccxuy vicever perryhau is00hcw dyfeng sanlee42 lddhbu ryankung kuguobing aiex lqshixinlei strategist922 kaka1992 mckelvin pombredanne

dpark's Issues

多Python版本与多virtualenv环境下的问题

系统默认的python2.7是2.7.2，
新装了python2.7.5
建立一个新的virtualenv，用python2.7.5
用pwd/venv/bin/python的方式，process没问题，
但是在mesos的环境下发现仍然调用了/usr/bin/env python2.7，修改
venv/lib/python2.7/site-packages/DPark-0.1-py2.7.egg/dpark/bin/executor27.py
的第一行
从 #!/usr/bin/env python2.7
改为
#!pwd/venv/bin/python
就OK

在mesos中执行失败

/tmp/mesos/slaves/201301141050-1654703350-5050-14385-0/frameworks/201301141050-1654703350-5050-14385-0010/executors/default/runs/1/stderr 的内容为：

sh: /opt/python-2.7/lib/python2.7/site-packages/DPark-0.1-py2.7.egg/dpark/executor27.py: Permission denied

chmod -v +x /opt/python-2.7/lib/python2.7/site-packages/DPark-0.1-py2.7.egg/dpark/executor*.py
后解决。

安装出错

rhel7 3.10.0-229.el7.x86_64,
python 2.7.5
运行python setup.py install
出错：
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -c dpark/portable_hash.c -o build/temp.linux-x86_64-2.7/dpark/portable_hash.o
dpark/portable_hash.c:8:22: 致命错误：pyconfig.h：没有那个文件或目录
#include "pyconfig.h"

DeprecationWarning: Call to deprecated function or method decompress (use lz4.block.decompress instead)

运行过程中会输出多行
DeprecationWarning: Call to deprecated function or method decompress (use lz4.block.decompress instead)

请问是什么原因

python xy 安装错误怎么解决？

File "C:\Python27\lib\site-packages\Cython\Utils.py", line 22, in wrapper
res = cache[args] = f(*args)
File "C:\Python27\lib\site-packages\Cython\Utils.py", line 111, in search_incl
de_directories
path = os.path.join(dir, dotted_filename)
File "C:\Python27\lib\ntpath.py", line 85, in join
result_path = result_path + p_path
nicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 1: ordinal
ot in range(128)
uilding 'dpark.portable_hash' extension
reating build\temp.win32-2.7
reating build\temp.win32-2.7\Release
reating build\temp.win32-2.7\Release\dpark
:\Program Files\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe
c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Python27\include -IC:\Python27\PC /Tcd
ark\portable_hash.c /Fobuild\temp.win32-2.7\Release\dpark\portable_hash.obj
ortable_hash.c
park\portable_hash.c(1) : fatal error C1189: #error : Do not use this file, it
is the result of a failed Cython compilation.
rror: command 'C:\Program Files\Common Files\Microsoft\Visual C++ for Pytho
\9.0\VC\Bin\cl.exe' failed with exit status 2

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

在Mesos上跑当计算文件数量稍多时会报下面的错…

2014-12-24 20:02:30,120 [ERROR] [pymesos.process] error while call <function handle at 0x1566ed8> (tried 4 times)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/pymesos-0.0.3-py2.6.egg/pymesos/process.py", line 69, in run_jobs
    func(*args, **kw)
  File "/usr/lib64/python2.6/site-packages/pymesos-0.0.3-py2.6.egg/pymesos/process.py", line 139, in handle
    f(*args)
  File "/usr/lib64/python2.6/site-packages/pymesos-0.0.3-py2.6.egg/pymesos/scheduler.py", line 93, in onResourceOffersMessage
    self.sched.resourceOffers(self, list(offers))
  File "/usr/lib64/python2.6/site-packages/DPark-0.1-py2.6-linux-x86_64.egg/dpark/schedule.py", line 448, in _
    r = f(self, *a, **kw)
  File "/usr/lib64/python2.6/site-packages/DPark-0.1-py2.6-linux-x86_64.egg/dpark/schedule.py", line 685, in resourceOffers
    len(offers), sum(cpus), sum(mems), len(self.activeJobs))
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

新版本的Dpark的nginx webroot应该设置为什么？

以前是/tmp/dpark，
现在启动时会优先/dev/shm/，
那是否要把webroot改成/dev/shm/ ?

error: can't copy 'dpark/portable_hash.c': doesn't exist or not a regular file

看了下，setup.py中有

   ext_modules = [Extension('dpark.portable_hash', ['dpark/portable_hash.pyx'])],

它会找 .portable_hash.c ??? 不过已经指明是找pyx文件呢。。奇怪

MultiProcessScheduler breaks rdd.fold()

Creating a context with 'process' will use the MultiProcessScheduler instead of LocalScheduler. However, this breaks rdd.fold() (and maybe others?). See below for reduced test case:

>>> import dpark
>>> ctx = dpark.DparkContext('process')
>>> rdd = ctx.makeRDD([1, 2, 3, 4])
>>> rdd.fold(0, max)
2013-04-18 10:58:02,786 [INFO] [scheduler] Got a job with 2 tasks
2013-04-18 10:58:02,797 [ERROR] [scheduler] error in task <ResultTask(0) of <ParallelCollection 4>
Traceback (most recent call last):
  File "dpark/schedule.py", line 343, in run_task
    result = task.run(aid)
  File "dpark/task.py", line 52, in run
    return self.func(self.rdd.iterator(self.split))
  File "dpark/rdd.py", line 253, in <lambda>
    self.ctx.runJob(self, lambda x: reduce(f, x, copy(zero))),
TypeError: 'int' object is not callable
2013-04-18 10:58:02,801 [ERROR] [scheduler] error in task <ResultTask(1) of <ParallelCollection 4>
Traceback (most recent call last):
  File "dpark/schedule.py", line 343, in run_task
    result = task.run(aid)
  File "dpark/task.py", line 52, in run
    return self.func(self.rdd.iterator(self.split))
  File "dpark/rdd.py", line 253, in <lambda>
    self.ctx.runJob(self, lambda x: reduce(f, x, copy(zero))),
TypeError: 'int' object is not callable
2013-04-18 10:58:02,804 [INFO] [scheduler] Job finished in 0.0 seconds                    
2013-04-18 10:58:02,889 [ERROR] [scheduler] task <ResultTask(0) of <ParallelCollection 4> failed
2013-04-18 10:58:02,889 [ERROR] [scheduler] <OtherFailure exception:'int' object is not callable> <type 'instance'> exception:'int' object is not callable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dpark/rdd.py", line 254, in fold
    zero)
  File "dpark/context.py", line 202, in runJob
    for it in self.scheduler.runJob(rdd, func, partitions, allowLocal):
  File "dpark/schedule.py", line 330, in runJob
    raise evt.reason
dpark.schedule.OtherFailure: <OtherFailure exception:'int' object is not callable>

And the correct behavior for the LocalScheduler version:

>>> import dpark
>>> ctx = dpark.DparkContext('local')
>>> rdd = ctx.makeRDD([1, 2, 3, 4])
>>> rdd.fold(0, max)
4

mesos slave异常导致Dpark任务失败

是这样的，
有一台节点的mesos-slave进程突然中止，然后Dpark开始报404，然后就卡在那里了……
对于这种情况，不知有什么建议吗？

gunicorn 使用 gevent kernel 无法对file进行解析计算字符

下面的这个是一个上传文件然后算各个字母数量的小模块。我用内核为gevent的gunicorn 启动该服务，请求时出现timeout问题。内核换为sync就木有问题~ 谢谢

@app.route('/upload', methods = ['POST'])
def upload():
    if request.method == 'POST':
        file = request.files['file']
        if file:
            filename = secure_filename(file.filename)
            file.save(os.path.join(UPLOAD_FOLDER, filename))
            f = dpark.textFile(os.path.join(UPLOAD_FOLDER, filename))
            chs = f.flatMap(lambda x: x).filter(lambda x: x in string.ascii_letters).map(lambda x: (x, 1))
            wc = chs.reduceByKey(lambda x, y: x+y).collectAsMap()
            db.LogProc.save(wc)
    return redirect('/')

Exception: exception:<class 'main.SampleHash'> is unhashable by portable_hash

I have a pair RDD that the key is a named tuple. For example,

from collections import namedtuple

Sample = namedtuple('Sample', 
                    ['i'],
                    verbose = False)

When I try to invoke groupByKey on the rdd, dpark responds with

Traceback (most recent call last):
  File "demo.py", line 253, in <module>
    sys.exit(main(sys.argv) or 0)
  File "demo.py", line 247, in main
    oneD(verbose)
  File "demo.py", line 236, in oneD
    resultRDD = inputRDD.flatMap(aFillIn.apply).reduceByKey(combineKeys).reduce(xxx)
  File "/Users/jbw/anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7-macosx-10.5-x86_64.egg/dpark/rdd.py", line 246, in reduce
    return reduce(f, chain(self.ctx.runJob(self, reducePartition)))
  File "/Users/jbw/anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7-macosx-10.5-x86_64.egg/dpark/util.py", line 50, in chain
    for v in it:
  File "/Users/jbw/anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7-macosx-10.5-x86_64.egg/dpark/context.py", line 261, in runJob
    for it in self.scheduler.runJob(rdd, func, partitions, allowLocal):
  File "/Users/jbw/anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7-macosx-10.5-x86_64.egg/dpark/schedule.py", line 341, in runJob
    raise Exception(reason.message)
Exception: exception:<class '__main__.SampleHash'> is unhashable by portable_hash

The method named portable_hash in the file dpark/portable_hash.pyx contains the dispatching of the hashing algorithm based on type:

cpdef int64_t portable_hash(obj) except -1:
    t = type(obj)
    if obj is None:
        return 1315925605
    if t is bytes:
        return string_hash(obj)
    elif t is unicode:
        return unicode_hash(obj)
    elif t is tuple:
        return tuple_hash(obj)
    elif t is int or t is long or t is float:
        return hash(obj)
    else:
        raise TypeError('%s is unhashable by portable_hash' % t)

As I read it, the current implementation has the limitation that all user defined classes will raise TypeError. Is this the intended behavior in dpark because I don't see this limitation described in the Spark documentation? If it is, then at appears that keys in dpark can only be:

None
btyes
unicode
tuple
int
long
float

One possible way to allow user defined classes would be to invoke hash when required, something like:

cpdef int64_t portable_hash(obj) except -1:
    t = type(obj)
    if obj is None:
        return 1315925605
    if t is bytes:
        return string_hash(obj)
    elif t is unicode:
        return unicode_hash(obj)
    elif t is tuple:
        return tuple_hash(obj)
    elif t is int or t is long or t is float:
        return hash(obj)
    else:
        return hash(obj)

Please let me know your thinking on this issue.

dpark在mesos上安装部署

dpark怎样在mesos上安装部署?mesos怎么启动Myexecutor来运行task的?

Translation Docs

This Project looks promising but could someone translate the docs and add a architectural overview.

Memory consumption is over the limit

I have a txt file to process. I use -M options to set the memory limit (I've tried -M 400 param, should be 400 megabytes as far as I understood). But on the test run, the memory consumption jumps up to 3.2GB which is really high. The script works as intended and the results are valid, the only issue here is the memory consumption. I'm afraid on larger files it will start swapping and simply choke.

I use no extra parameters, only -M, so I guess it should work using the default ones.

The memory consumption jumps not smoothly so I don't think it's a memory garbage issue.

Looks like Dpark creates a heap or something like that but how to set a limit for that? As I wrote, I'm afraid that it will eat all the memory. Now my test file is around 1GB, but what if I want to process 20GB?

In general I really like Dpark and don't want to switch to any other map-reduce framework (I've tried most of them, I guess). Please tell me how to predict and control the memory consumption. Thank you very much!

in test_job.py, one letter missing(´･_･`)

in test_job.py: line 54, unitest.main() should be unittest.main()
作为一个两度被豆瓣拒绝了的孩纸,表示终于有时间给豆瓣项目挑错好爽*3~
好吧,这也不是啥错╮(╯﹏╰）╭........

windows7 下单机能运行dpark么

如题，谢谢~~

Why does this exist - what problems does it solve!?

I increasingly find more and more cool GitHub projects on the web that look really interesting, dpark is one of them. And I see that you have re-written spark in Python, which is cool because I am a Python fan, and am also interested in Spark. But I find no mention as to why, and this puzzles me- what problems with original Spark were you trying to solve? Did you find some big bug with the current implementation, or was this maybe instead just a learning excise, something done for fun (A impressive excretes if so!)?

Basically it would be nice to have a paragraph in the README just saying why this projects exists- what limitations it was trying to overcome or performance to gain etc.

Many thanks!

MultiProcessScheduler breaks GlommedRDD

See this test case:

mike@EDEN:~/Development/dpark$ cat t.py 
import dpark
rdd = dpark.makeRDD(range(9), 3).flatMap(lambda i: (i, i)).glom()
print list(map(list, rdd.collect()))
mike@EDEN:~/Development/dpark$ python t.py
[[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5], [6, 6, 7, 7, 8, 8]]
mike@EDEN:~/Development/dpark$ python t.py -m process
2013-04-26 17:24:30,239 [INFO] [scheduler] Got a job with 3 tasks
2013-04-26 17:24:30,249 [INFO] [scheduler] Job finished in 0.0 seconds                    
[[], [], []]

Take a look at Spark's implementation of glom, specifically line 11. They get the array from the iterator and make a new array. The equivalent of that in Python is list(). Take this patch:

diff --git a/dpark/rdd.py b/dpark/rdd.py
index c04a5e9..1efb3f8 100644
--- a/dpark/rdd.py
+++ b/dpark/rdd.py
@@ -511,7 +511,7 @@ class FilteredRDD(MappedRDD):

 class GlommedRDD(DerivedRDD):
     def compute(self, split):
-        yield self.prev.iterator(split)
+        yield list(self.prev.iterator(split))

 class MapPartitionsRDD(MappedRDD):
     def compute(self, split):

Once I apply the patch, here is the result:

mike@EDEN:~/Development/dpark$ python t.py -m process
2013-04-26 17:25:27,891 [INFO] [scheduler] Got a job with 3 tasks
2013-04-26 17:25:27,902 [INFO] [scheduler] Job finished in 0.0 seconds                    
[[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5], [6, 6, 7, 7, 8, 8]]

I believe this is being caused by the pickle-ing that is happening when chained iterables are being passed through the multiprocessing pool. If you have a better solution that doesn't involve pulling everything into memory, I'd be happy to hear it!

Contribution?

Is there any documents that tell people how to contribute?

dpark依赖安装失败

aws：redhat7.1

Installed /usr/lib64/python2.7/site-packages/mesos.interface-0.23.0-py2.7.egg
error: Installed distribution mesos.interface 0.23.0 conflicts with requirement mesos.interface==0.22.0

不知道是不是pymesos依赖的问题：
“install_requires=['mesos.interface>=0.22.0,<0.22.1.2'],”

Couldn't download the mesos.interface

Couldn't find a setup script in /tmp/easy_install-QfQ_64/mesos.interface-1.0.1.linux-x86_64.tar.gz

-m process偶尔执行完会卡在那里不退出

#!/usr/bin/env python
import dpark

nums = dpark.parallelize(range(100), 4)
print nums.count()

n=0; 
while true; do
 echo $n ;
 n=`expr $n + 1 `;
  ../venv/bin/python test_hang.py -p 10   ;
 done

具体表现为运行完print nums.count()之后卡在那里不退出
一般运行30次会出现1次，频率不固定，有时多有时少

环境:
OS: RHEL 6.2 x86_64
Python: 2.7.5
Github上最新的Dpark

我这边用旧版本(212e088)的是正常的，
使用mesos的也是正常的，
pyzmq等依赖模块也是同一版本(我是把旧的virtualenv中的site-packages/dpark cp过来测试的，所以可以保证除了dpark外都是一样的)

需要频繁的重复用到某个大的rdd进行计算,有什么好方法吗？

如：
经过很多的中间计算得出一个较大的rdd,后面的运算需要重复的用到这个rdd,有什么好的建议吗？
用cache和snapshot都会出错
目前我自己是先把rdd用saveAsTextFile写文件，运算用到时再导入

建议增加hdfs支持

通过textFile创建rdd时，如何给每一行文本添加一个行号？

通过textFile创建rdd，文件中的每一行是一个document，现在需要进行分词、转vector等，怎么样给每一行自动添加一个行号，跟其在text file中的对应行号一致

做 reduce 的时候出现了 Negative size passed to PyString_FromStringAndSize

Python2.7 64位机器上
在对读入的数据 A 做 reduce 的时候抛出了 Negative size passed to PyString_FromStringAndSize 这样的异常错误，而将数据 A 切分为 A1，A2 两份分别处理就没有这个问题。溢出的时候内存大约使用31G 多，不知道是不是数据量大了导致溢出。不知道是否和 portable_hash.c 里面有关？
当时没有打印堆栈信息，稍后准备重现下试试。

dpark运行出错

import sys
sys.path.append('../')
from dpark import DparkContext
dpark = DparkContext()
file = dpark.textFile("./tmp/words.txt")
words = file.flatMap(lambda x:x.split()).map(lambda x:(x,1))
wc = words.reduceByKey(lambda x,y:x+y).collectAsMap()

print wc

Traceback (most recent call last):
File "wc_1.py", line 7, in
wc = words.reduceByKey(lambda x,y:x+y).collectAsMap()
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\rdd.py", line 502, in collectAsMap
for v in self.ctx.runJob(self, lambda x:list(x)):
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\context.py", line 264, in runJob
for it in self.scheduler.runJob(rdd, func, partitions, allowLocal):
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\schedule.py", line 333, in runJob
[l[-1] for l in stage.outputLocs])
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\shuffle.py", line 353, in registerMapOutputs
self.client.call(SetValueMessage('shuffle:%s' % shuffleId, locs))
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\tracker.py", line 112, in call
sock.connect(self.addr)
File "zmq/backend/cython/socket.pyx", line 514, in zmq.backend.cython.socket.S
ocket.connect (zmq\backend\cython\socket.c:5376)
zmq.error.ZMQError: Invalid argument
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "F:\install\acond\lib\atexit.py", line 24, in _run_exitfuncs
func(_targs, *_kargs)
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\context.py", line 281, in stop
env.stop()
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\env.py", line 92, in stop
self.trackerServer.stop()
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\tracker.py", line 49, in stop
sock.connect(self.addr)
File "zmq/backend/cython/socket.pyx", line 514, in zmq.backend.cython.socket.S
ocket.connect (zmq\backend\cython\socket.c:5376)
ZMQError: Invalid argument
Error in sys.exitfunc:
Traceback (most recent call last):
File "F:\install\acond\lib\atexit.py", line 24, in _run_exitfuncs
func(_targs, *_kargs)
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\context.py", line 281, in stop
env.stop()
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\env.py", line 92, in stop
self.trackerServer.stop()
File "F:\install\acond\lib\site-packages\dpark-0.2.2-py2.7-win-amd64.egg\dpark
\tracker.py", line 49, in stop
sock.connect(self.addr)
File "zmq/backend/cython/socket.pyx", line 514, in zmq.backend.cython.socket.S
ocket.connect (zmq\backend\cython\socket.c:5376)
zmq.error.ZMQError: Invalid argument

Translate documents in English

There are already some documents for DPark in English, https://github.com/jackfengji/test_pro/issues.
English documents are needed to popularize this project to English speakers.

I could translate some of documents as a volunteer.

pymesos 0.1.6 VS mesosphere 1.1.0, 取不到 slave id

环境: pymesos 0.1.6, mesosphere 1.1.0 , dpark 0.3.5
文件: executor.py
函数: reply_status
问题: mesos_pb2.TaskStatus()返回的status对象的slave_id属性是空

def reply_status(driver, task_id, state, data=None):
status = mesos_pb2.TaskStatus()
status.task_id.MergeFrom(task_id)
status.state = state
status.timestamp = time.time()
if data is not None:
status.data = data
driver.sendStatusUpdate(status)

支持py3吗？

你好，请问支持py3吗？

from dpark import DparkContext ERROR

results is below:

raceback (most recent call last):
File "test_dpark.py", line 20, in
from dpark import DparkContext
File "/usr/lib64/python2.6/site-packages/DPark-0.3.5-py2.6-linux-x86_64.egg/dpark/init.py", line 1, in
from context import DparkContext, parser as optParser
File "/usr/lib64/python2.6/site-packages/DPark-0.3.5-py2.6-linux-x86_64.egg/dpark/context.py", line 10, in
from dpark.schedule import LocalScheduler, MultiProcessScheduler, MesosScheduler
File "/usr/lib64/python2.6/site-packages/DPark-0.3.5-py2.6-linux-x86_64.egg/dpark/schedule.py", line 15, in
import pymesos as mesos
File "/usr/lib64/python2.6/site-packages/pymesos-0.1.6-py2.6.egg/pymesos/init.py", line 1, in
from .scheduler import MesosSchedulerDriver
File "/usr/lib64/python2.6/site-packages/pymesos-0.1.6-py2.6.egg/pymesos/scheduler.py", line 18, in
from .process import UPID, Process, async
File "/usr/lib64/python2.6/site-packages/pymesos-0.1.6-py2.6.egg/pymesos/process.py", line 141
args = {f.name: getattr(msg, f.name) for f in msg.DESCRIPTOR.fields}
^
SyntaxError: invalid syntax

在mesos 上使用docker 的方式运行 dpark ，有没有部署的文档可以参考？

dpark multiprocess 下stage 是不是不能并行

看了点源码, 发现可运行的stage之间是不是不能并行, 只支持某stage下tasks的并行. 求问是真的不能并行还是我忽略了些什么...

安装报错

$ sudo python setup.py install
running install
running bdist_egg
running egg_info
writing requirements to DPark.egg-info/requires.txt
writing DPark.egg-info/PKG-INFO
writing top-level names to DPark.egg-info/top_level.txt
writing dependency_links to DPark.egg-info/dependency_links.txt
reading manifest file 'DPark.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'dpark/porable_hash.pyx'
writing manifest file 'DPark.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
skipping 'dpark/portable_hash.c' Cython extension (up-to-date)
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/decorator.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/hotcounter.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/schedule.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/__init__.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/util.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/rdd.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/cache.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/tracker.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/bagel.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/executor.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/task.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/bitindex.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/table.py -> build/bdist.linux-x86_64/egg/dpark
creating build/bdist.linux-x86_64/egg/dpark/moosefs
copying build/lib.linux-x86_64-2.7/dpark/moosefs/__init__.py -> build/bdist.linux-x86_64/egg/dpark/moosefs
copying build/lib.linux-x86_64-2.7/dpark/moosefs/consts.py -> build/bdist.linux-x86_64/egg/dpark/moosefs
copying build/lib.linux-x86_64-2.7/dpark/moosefs/cs.py -> build/bdist.linux-x86_64/egg/dpark/moosefs
copying build/lib.linux-x86_64-2.7/dpark/moosefs/utils.py -> build/bdist.linux-x86_64/egg/dpark/moosefs
copying build/lib.linux-x86_64-2.7/dpark/moosefs/master.py -> build/bdist.linux-x86_64/egg/dpark/moosefs
copying build/lib.linux-x86_64-2.7/dpark/env.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/broadcast.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/serialize.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/tabular.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/mutable_dict.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/context.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/conf.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/accumulator.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/dstream.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/hyperloglog.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/portable_hash.so -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/job.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/dependency.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/shuffle.py -> build/bdist.linux-x86_64/egg/dpark
copying build/lib.linux-x86_64-2.7/dpark/portable_hash.pyx -> build/bdist.linux-x86_64/egg/dpark
byte-compiling build/bdist.linux-x86_64/egg/dpark/decorator.py to decorator.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/hotcounter.py to hotcounter.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/schedule.py to schedule.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/util.py to util.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/rdd.py to rdd.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/cache.py to cache.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/tracker.py to tracker.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/bagel.py to bagel.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/executor.py to executor.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/task.py to task.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/bitindex.py to bitindex.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/table.py to table.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/moosefs/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/moosefs/consts.py to consts.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/moosefs/cs.py to cs.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/moosefs/utils.py to utils.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/moosefs/master.py to master.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/env.py to env.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/broadcast.py to broadcast.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/serialize.py to serialize.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/tabular.py to tabular.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/mutable_dict.py to mutable_dict.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/context.py to context.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/conf.py to conf.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/accumulator.py to accumulator.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/dstream.py to dstream.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/hyperloglog.py to hyperloglog.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/job.py to job.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/dependency.py to dependency.pyc
byte-compiling build/bdist.linux-x86_64/egg/dpark/shuffle.py to shuffle.pyc
creating stub loader for dpark/portable_hash.so
byte-compiling build/bdist.linux-x86_64/egg/dpark/portable_hash.py to portable_hash.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/dquery -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/executor.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/scheduler.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/drun -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/mrun -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/dgrep -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/dquery to 755
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/executor.py to 755
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/scheduler.py to 755
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/drun to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/mrun to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/dgrep to 755
copying DPark.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying DPark.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying DPark.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying DPark.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying DPark.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying DPark.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
creating 'dist/DPark-0.3.2-py2.7-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing DPark-0.3.2-py2.7-linux-x86_64.egg
removing '/usr/lib64/python2.7/site-packages/DPark-0.3.2-py2.7-linux-x86_64.egg' (and everything under it)
creating /usr/lib64/python2.7/site-packages/DPark-0.3.2-py2.7-linux-x86_64.egg
Extracting DPark-0.3.2-py2.7-linux-x86_64.egg to /usr/lib64/python2.7/site-packages
DPark 0.3.2 is already the active version in easy-install.pth
Installing dquery script to /usr/bin
Installing executor.py script to /usr/bin
Installing scheduler.py script to /usr/bin
Installing drun script to /usr/bin
Installing mrun script to /usr/bin
Installing dgrep script to /usr/bin

Installed /usr/lib64/python2.7/site-packages/DPark-0.3.2-py2.7-linux-x86_64.egg
Processing dependencies for DPark==0.3.2
Searching for psutil
Reading https://pypi.python.org/simple/psutil/
^[[B^[[B^[Best match: psutil 4.3.0
Downloading https://pypi.python.org/packages/22/a8/6ab3f0b3b74a36104785808ec874d24203c6a511ffd2732dd215cf32d689/psutil-4.3.0.tar.gz#md5=ca97cf5f09c07b075a12a68b9d44a67d
Processing psutil-4.3.0.tar.gz
Writing /tmp/easy_install-T5wfHJ/psutil-4.3.0/setup.cfg
Running psutil-4.3.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-T5wfHJ/psutil-4.3.0/egg-dist-tmp-_tuYcX
Traceback (most recent call last):
  File "setup.py", line 53, in <module>
    'examples/dgrep',
  File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "build/bdist.linux-x86_64/egg/setuptools/command/install.py", line 67, in run
  File "build/bdist.linux-x86_64/egg/setuptools/command/install.py", line 117, in do_egg_install
  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 380, in run

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 610, in easy_install

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 661, in install_item

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 709, in process_distribution

  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 836, in resolve
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1081, in best_match
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1093, in obtain
  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 629, in easy_install

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 659, in install_item

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 842, in install_eggs

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 1070, in build_and_install

  File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 1056, in run_setup

  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 240, in run_setup
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 193, in setup_context
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 152, in save_modules
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 126, in __exit__
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
  File "build/bdist.linux-x86_64/egg/setuptools/sandbox.py", line 110, in dump
MemoryError

question about the testcases of countByValueAndWindow

In the test codes about countByValueAndWindow, I found it's different from the Spark's test cases.

The format and meaning of these two output (Dpark streaming and Spark streaming) are different. Although I think Dpark accords with the documentation:
When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.

Could someone please explain this difference a little bit?

Please also refer to this related question on stackoverflow.

Thanks!

使用gzip/bzip2时会timeout

使用未压缩文件运行正常。
在使用gz或bz2压缩时，
其它task运行要比未经压缩的文件快很多，异常的快。
然后会有一些任务一直timeout
[WARNING] [job ] re-submit task 118 for timeout 30.1, try 2

补充：我看到gzip rdd中的注释了，用pigz -i可以正常运行，但bzip2有什么需要注意的吗？
另外，请问豆瓣常用哪种压缩格式呢？

pymesos

Hello, have you considered releasing dpark.pymesos as an standalone package?

I have used it with great success and now that mesos 0.20.0 is out it is possible to install mesos interfaces from mesos.interface to avoid copy-pasting more code.

I like pymesos doesn't require extra packages compared to https://github.com/wickman/pesos project.

thanks

运行 demo时候，会在text search example 卡住

在我本上运行 demo时候，会在text search example 卡住， ctr+c后的 exception 为：
^Clogging
Traceback (most recent call last):
File "demo.py", line 16, in
print 'logging', log.count()
File "/home/xzl/hg/spark/dpark/dpark/rdd.py", line 368, in count
return sum(self.ctx.runJob(self, lambda x: sum(1 for i in x)))
File "/home/xzl/hg/spark/dpark/dpark/context.py", line 263, in runJob
for it in self.scheduler.runJob(rdd, func, partitions, allowLocal):
File "/home/xzl/hg/spark/dpark/dpark/schedule.py", line 292, in runJob
submitStage(finalStage)
File "/home/xzl/hg/spark/dpark/dpark/schedule.py", line 254, in submitStage
submitMissingTasks(stage)
File "/home/xzl/hg/spark/dpark/dpark/schedule.py", line 290, in submitMissingTasks
self.submitTasks(tasks)
File "/home/xzl/hg/spark/dpark/dpark/schedule.py", line 397, in submitTasks
_, reason, result, update = run_task(task, self.nextAttempId())
File "/home/xzl/hg/spark/dpark/dpark/schedule.py", line 376, in run_task
result = task.run(aid)
File "/home/xzl/hg/spark/dpark/dpark/task.py", line 52, in run
return self.func(self.rdd.iterator(self.split))
File "/home/xzl/hg/spark/dpark/dpark/rdd.py", line 368, in
return sum(self.ctx.runJob(self, lambda x: sum(1 for i in x)))
File "/home/xzl/hg/spark/dpark/dpark/rdd.py", line 368, in
return sum(self.ctx.runJob(self, lambda x: sum(1 for i in x)))
File "/home/xzl/hg/spark/dpark/dpark/util.py", line 111, in _
for r in result:
File "/home/xzl/hg/spark/dpark/dpark/cache.py", line 205, in getOrCompute
cachedVal = self.cache.get(key)
File "/home/xzl/hg/spark/dpark/dpark/cache.py", line 57, in get
locs = self.tracker.getCacheUri(rdd_id, index)
File "/home/xzl/hg/spark/dpark/dpark/cache.py", line 195, in getCacheUri
return self.client.call(GetValueMessage('cache:%s-%s' % (rdd_id, index)))
File "/home/xzl/hg/spark/dpark/dpark/tracker.py", line 116, in call
return sock.recv_pyobj()
File "/home/xzl/.virtualenvs/dpark/local/lib/python2.7/site-packages/zmq/sugar/socket.py", line 436, in recv_pyobj
s = self.recv(flags)
File "zmq/backend/cython/socket.pyx", line 674, in zmq.backend.cython.socket.Socket.recv (zmq/backend/cython/socket.c:6971)
File "zmq/backend/cython/socket.pyx", line 708, in zmq.backend.cython.socket.Socket.recv (zmq/backend/cython/socket.c:6763)
File "zmq/backend/cython/socket.pyx", line 145, in zmq.backend.cython.socket._recv_copy (zmq/backend/cython/socket.c:1931)
File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:7222)
KeyboardInterrupt
而且得按两次 ctr+c
在台式机上没啥问题。。。
是zmq的问题么，看traceback 好像是sock.recv_pyobj()没有收到 trackerServer的消息挂起了。。。

cache()无效？

是这样的，我的大致流程是：
map1 = map1.reduceByKey2.cache()
map2.reduceBykey2(map1)
map2.foreach
map1.foreach

在这个过程中，我发现加不加cache()好像没什么用？
rdd.py里cache函数也只是return self，但还有一行被注释掉的
'self.shouldCache = True'

是我用的方式不对吗？还是我对cache()的理解有误？

如何在dpark中使用mongodb?

代码在 https://gist.github.com/5062200

运行后报错
TypeError: 'Database' object is not callable. If you meant to call the 'getnewargs' method on a 'Connection' object it is failing because no such method exists.

暂时只想到加个代理层(比如nginx + lua做个http接口)来使用mongodb，不知有没有更好的办法？

reduceByKey计算结果有误

需求：统计每UID及UID各WORD中的pv, uv
代码在：https://gist.github.com/muxueqz/4748764
期望输出：
[(('uid', 'word'), (set(['0']), 1)), ('123', [set(['1', '3', '2']), 5]), (('123', 'home'), [set(['1', '2']), 3]),
(('123', 'work'), [set(['3', '2']), 2]), ('uid', (set(['0']), 1))]
实际输出：
[(('uid', 'word'), (set(['0']), 1)), ('123', [set(['1', '3', '2']), 5]), (('123', 'home'), [set(['1', '2']), 3]),
(('123', 'work'), [set(['1', '3', '2']), 2]), ('uid', (set(['0']), 1))]

重点在 (('123', 'work') 的部分
不知是我用法不对还是遇到bug?

安装失败

ImportError: Building module dpark.portable_hash failed: ["CompileError: command 'gcc' failed with exit status 1\n"]

在os x上安装完dpark，import dpark时报出以上错误，何解？

broadcast用法有变？

import dpark
bcast_test = dpark.broadcast(1)
bcast_2 = dpark.broadcast(2)

以前的版本是可以的，现在的版本会报：

Traceback (most recent call last):
  File "bcast_test.py", line 3, in <module>
    bcast_2 = dpark.broadcast(2)
TypeError: 'module' object is not callable

改为

import dpark
dpark_bcast = dpark.DparkContext()
bcast_test = dpark_bcast.broadcast(1)
bcast_2 = dpark_bcast.broadcast(2)

就OK

DPark failed to submitTasks when running on mesos

I set up mesos cluster on Amazon ec2 using mesos EC2-Scripts:

Then I run
python27 demo.py -m mesos://[email protected]:5050 -p 2

The program was halting there, it stuck at submitTasks() in schedule.py. Press Ctrl-C:
ec2-user@ip-10-31-194-149 examples]$ python27 demo.py -m mesos://[email protected]:5050 -p 2 2013-05-20 12:30:43,786 [INFO] [scheduler] Got a job with 4 tasks ^CTraceback (most recent call last): File "demo.py", line 10, in <module> print nums.count() File "/home/ec2-user/dpark/dpark/rdd.py", line 271, in count return sum(self.ctx.runJob(self, lambda x: ilen(x))) File "/home/ec2-user/dpark/dpark/context.py", line 204, in runJob for it in self.scheduler.runJob(rdd, func, partitions, allowLocal): File "/home/ec2-user/dpark/dpark/schedule.py", line 269, in runJob submitStage(finalStage) File "/home/ec2-user/dpark/dpark/schedule.py", line 231, in submitStage submitMissingTasks(stage) File "/home/ec2-user/dpark/dpark/schedule.py", line 267, in submitMissingTasks self.submitTasks(tasks) File "/home/ec2-user/dpark/dpark/schedule.py", line 436, in _ r = f(self, *a, **kw) File "/usr/lib64/python2.7/threading.py", line 154, in __exit__ self.release() File "/usr/lib64/python2.7/threading.py", line 142, in release raise RuntimeError("cannot release un-acquired lock") RuntimeError: cannot release un-acquired lock

In Mesos logs:
Log file created at: 2013/05/20 11:40:49 Running on machine: ip-10-31-194-149.ec2.internal Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg I0520 11:40:49.170337 2099 logging.cpp:70] Logging to /mnt/mesos-logs I0520 11:40:49.172806 2099 main.cpp:95] Build: 2011-12-03 06:24:10 by root I0520 11:40:49.172871 2099 main.cpp:96] Starting Mesos master I0520 11:40:49.176777 2099 webui.cpp:81] Starting master web server on port 8080 I0520 11:40:49.176911 2101 master.cpp:264] Master started at mesos://[email protected]:5050 I0520 11:40:49.177106 2104 webui.cpp:47] Master web server thread started I0520 11:40:49.177109 2101 master.cpp:279] Master ID: 201305201140-0 I0520 11:40:49.177775 2101 master.cpp:462] Elected as master! I0520 11:40:49.191300 2104 webui.cpp:59] Loading webui/master/webui.py I0520 11:40:54.348682 2101 master.cpp:814] Attempting to register slave 201305201140-0-0 at [email protected]:33513 I0520 11:40:54.349149 2101 master.cpp:1057] Master now considering a slave at ip-10-28-0-219.ec2.internal:33513 as active I0520 11:40:54.349210 2101 master.cpp:1588] Adding slave 201305201140-0-0 at ip-10-28-0-219.ec2.internal with cpus=2; mem=677 I0520 11:40:54.349393 2101 simple_allocator.cpp:71] Added slave 201305201140-0-0 with cpus=2; mem=677 W0520 12:00:52.365500 2101 protobuf.hpp:260] Initialization errors: framework.executor

Question:

Does DPark require a specific mesos version?
Is there any relavant documentation for setting up DPark+Mesos?

不支持mesos 0.2

已测试
会出现 “job lost” 错误

能不能加入环境说明

能不能加入环境说明，在win64，python3.5安装失败

安装报错

Installed /Library/Python/2.7/site-packages/mesos.interface-0.22.1.2-py2.7.egg
error: mesos.interface 0.22.1.2 is installed but mesos.interface==0.22.0 is required by set(['pymesos'])

是需要设置pymesos 的环境变量吗？请问怎么设置？

Off-by-one error in k-means example

The closestCenter function in the kmeans.py example may contain an off-by-one error:

for i in range(1, len(centers)):

skips over the first center, because the centers are stored in zero-based lists.

install的时候报错

Installed /Library/Python/2.7/site-packages/DPark-0.1-py2.7-macosx-10.8-intel.egg
Processing dependencies for DPark==0.1
error: Installed distribution protobuf 3.0.0-alpha-1 conflicts with requirement protobuf>=2.5.0,<3

protobuf有版本限制，找了一下应该是文件要求安装protobuf3.0以上版本，但是系统要求只能是2.5.0到3之间的版本。不知道怎么修改这个限制。谢谢