
Comments (8)

scratlzj avatar scratlzj commented on May 21, 2024

I also tried using the tee command to save the terminal output to a TXT file, but got the following result:

python weiboSpider.py |tee -a weibozanshuo.txt
Traceback (most recent call last):
File "weiboSpider.py", line 42, in get_username
print(u"用户名: " + self.username)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Traceback (most recent call last):
File "weiboSpider.py", line 64, in get_user_info
print(u"微博数: " + str(self.weibo_num))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
进度: 0%| | 0/1146 [00:00<?, ?it/s]Traceback (most recent call last):
File "weiboSpider.py", line 104, in get_original_weibo
sys.stdout.encoding, "ignore").decode(
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 197, in get_weibo_place
print(u"微博位置: " + weibo_place)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Traceback (most recent call last):
File "weiboSpider.py", line 207, in get_publish_time
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 240, in get_publish_tool
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None

Traceback (most recent call last):
File "weiboSpider.py", line 289, in get_weibo_info
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 352, in write_txt
f.write(result.encode(sys.stdout.encoding))
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 381, in main
print(u"用户名: " + wb.username)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
('Error: ', UnicodeEncodeError('ascii', u'\u7528\u6237\u540d: \u54b1\u8bf4', 0, 3, 'ordinal not in range(128)'))
('Error: ', UnicodeEncodeError('ascii', u'\u5fae\u535a\u6570: 11301', 0, 3, 'ordinal not in range(128)'))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
None
('Error: ', UnicodeEncodeError('ascii', u'\u5fae\u535a\u4f4d\u7f6e: \u65e0', 0, 4, 'ordinal not in range(128)'))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', UnicodeEncodeError('ascii', u'\u4fe1\u606f\u6293\u53d6\u5b8c\u6bd5', 0, 6, 'ordinal not in range(128)'))
('Error: ', UnicodeEncodeError('ascii', u'\u7528\u6237\u540d: \u54b1\u8bf4', 0, 3, 'ordinal not in range(128)'))

My system is Ubuntu 18.04, and the system language is English.

from weibospider.

scratlzj avatar scratlzj commented on May 21, 2024

terminal:
Screenshot from 2019-04-22 07-32-51

System:
Screenshot from 2019-04-22 07-34-06

dataabc avatar dataabc commented on May 21, 2024

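It looks like the Weibo publishing tool is None, which raises an error before the file is written, so the posts were not saved. If you don't need the "publishing tool" field, you can also remove the following from write_txt: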

+ u"发布工具: " + self.publish_tool[i - 1] + "\n\n"

Could you provide the Weibo ID so I can test it? Thanks.
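As an alternative to deleting that line, a missing field could be replaced with a placeholder before the strings are joined. The following is only a minimal sketch: `text_or_placeholder` and `format_publish_tool` are hypothetical helpers, not part of weiboSpider.py, and the placeholder `无` is an assumption based on what the crawler prints elsewhere in this thread for an empty field.

```python
# -*- coding: utf-8 -*-

# Assumed placeholder; the crawler output in this thread shows "无" for empty fields.
PLACEHOLDER = u"无"


def text_or_placeholder(value, placeholder=PLACEHOLDER):
    """Return value unchanged, or the placeholder when value is None."""
    return placeholder if value is None else value


def format_publish_tool(publish_tool):
    """Hypothetical helper: build the "publishing tool" line without failing on None."""
    return u"发布工具: " + text_or_placeholder(publish_tool) + u"\n\n"
```

With this approach the concatenation no longer fails when a field comes back as None, and the output file still records that the value was missing.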

scratlzj avatar scratlzj commented on May 21, 2024

@dataabc Thanks for your reply. The Weibo ID to crawl is 1711243680.

After commenting out the statement you mentioned, the output is as follows:

$python weiboSpider.py |tee -a weibozanshuo1.txt
Traceback (most recent call last):
File "weiboSpider.py", line 42, in get_username
print(u"用户名: " + self.username)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Traceback (most recent call last):
File "weiboSpider.py", line 64, in get_user_info
print(u"微博数: " + str(self.weibo_num))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
进度: 0%| | 0/1147 [00:00<?, ?it/s]Traceback (most recent call last):
File "weiboSpider.py", line 104, in get_original_weibo
sys.stdout.encoding, "ignore").decode(
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 197, in get_weibo_place
print(u"微博位置: " + weibo_place)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Traceback (most recent call last):
File "weiboSpider.py", line 207, in get_publish_time
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 240, in get_publish_tool
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None

Traceback (most recent call last):
File "weiboSpider.py", line 289, in get_weibo_info
sys.stdout.encoding, "ignore").decode(sys.stdout.encoding)
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 352, in write_txt
f.write(result.encode(sys.stdout.encoding))
TypeError: encode() argument 1 must be string, not None
Traceback (most recent call last):
File "weiboSpider.py", line 381, in main
print(u"用户名: " + wb.username)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
('Error: ', UnicodeEncodeError('ascii', u'\u7528\u6237\u540d: \u54b1\u8bf4', 0, 3, 'ordinal not in range(128)'))
('Error: ', UnicodeEncodeError('ascii', u'\u5fae\u535a\u6570: 11310', 0, 3, 'ordinal not in range(128)'))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
None
('Error: ', UnicodeEncodeError('ascii', u'\u5fae\u535a\u4f4d\u7f6e: \u65e0', 0, 4, 'ordinal not in range(128)'))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', TypeError('encode() argument 1 must be string, not None',))
('Error: ', UnicodeEncodeError('ascii', u'\u4fe1\u606f\u6293\u53d6\u5b8c\u6bd5', 0, 6, 'ordinal not in range(128)'))
('Error: ', UnicodeEncodeError('ascii', u'\u7528\u6237\u540d: \u54b1\u8bf4', 0, 3, 'ordinal not in range(128)'))

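From what I found online, str and unicode values cannot be concatenated, and the unicode() function should be used instead.
But I'm a beginner and don't know exactly what to change.

I'm now trying to record the crawler's output with the script command.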

scratlzj avatar scratlzj commented on May 21, 2024

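It seems to be related to my computer's settings. Below is the beginning of the .txt file produced by script. The crawled posts display correctly, but some of the terminal input from before the crawl started appears garbled in the .txt file.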


weibospider�[00m$ python app.py�������������sudo gedit /etc/default/grub &��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K��[K�������python wei�bosp�i��[K��[K��[KSpider.py
用户名: 咱说
微博数: 11310
关注数: 236
粉丝数: 730601

进度: 0%| | 0/1147 [00:00<?, ?it/s]多年前给《笛卡尔的错误》写过一篇书评,也是本人迄今为止未能自我超越的一篇文章,据说影响了不少人。最近发现原文章的微博链接在手机上无法打开,今日稍作补充修订之后重新发表于此。重读这篇书评的时候,我想到两点,第一,最近自己在思考“从高端科普回归基础科普”,其实这篇书评可以视作,以基础科普的写法介绍了一个相当高端的科学观点。它除了篇幅很长之外,阅读门槛其实不高,很多基本概念我都做了解释,只要能顺着文章的逻辑流读下来,一定会有所收获。第二,这篇书评已经不止是一篇书评,它事实上是我把那一段时间所思考的学术问题,借着这本书的启发,进行了一次吐故纳新的整合。除去介绍这本书的核心议题,它还包含着来自其他学者的研究和观点,以及我个人的思考。不过,无论我这篇书评写得多么好,它仍然不能代替读者的思考,更不能代替读者的阅读。循着它的导引去读英文原著吧,去读十年前毛彩凤老师翻译的中文版吧。开卷有益。 心理在哪里  
微博位置: 无
微博发布时间: 2018-03-23 21:10
微博发布工具: 微博 weibo.com
点赞数: 623
转发数: 657
评论数: 138

转发理由:你才是个笑话,你现在不搜索资料,猜一下**有多少残障人士?或者你告诉我“没有那么多”是多少?你看不到不代表他们不存在,你看到的少不意味着他们人数少。//@echoedinthewell:其实有尊重和遇到的时候给予方便就可以了,推广是一个笑话,再说,也没有那么多不方便的人  
原始用户: 咱说
转发内容: 第一次知道国内的公交车有这设计,可见有关部门对这个功能的宣传是何其少,使用这个功能的残障人士何其少,以至于形同虚设。  原图 
微博位置: 无
微博发布时间: 2019-04-22 14:50
微博发布工具: 无
点赞数: 34
转发数: 20
评论数: 24

Fortunately, the exported posts are not garbled.

dataabc avatar dataabc commented on May 21, 2024

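I think I have found the cause of cannot concatenate 'str' and 'NoneType' objects. I just tested it, and the error first appeared after crawling more than 100 pages. I then ran several more tests and got many None values, even with fewer than 100 pages crawled, and in one case the very first post was None. I suspect that because the crawl was fast and covered a large number of posts, Weibo restricted the account, so much of the information that should have been crawled came back as None, which caused the error above when the fields were combined.
My suggestion is to slow the crawl down, for example by sleeping for a while every few pages. In the get_weibo_info method, the loop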

for page in tqdm(range(1, page_num + 1), desc=u"进度"):

can be used to control the speed. Each iteration crawls one page, so you can add a check such as

from time import sleep
......
for page in tqdm(range(1, page_num + 1), desc=u"进度"):
    if page % 5 == 0:  # every fifth page
        sleep(3)       # pause for 3 seconds
......

This pauses for 3 seconds every 5 pages. You can test for yourself how many pages to crawl between pauses, or refer to #8.
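The same idea can be written as a self-contained sketch with the interval and pause length as parameters, which makes it easier to try different values. The names `crawl_pages`, `fetch_page`, `pause_every`, `pause_seconds` and `sleep_fn` are illustrative assumptions and do not exist in weiboSpider.py. This version pauses after a page is fetched rather than before.

```python
from time import sleep


def crawl_pages(page_num, fetch_page, pause_every=5, pause_seconds=3, sleep_fn=sleep):
    """Call fetch_page(page) for pages 1..page_num, pausing every `pause_every` pages.

    fetch_page is a stand-in for whatever crawls a single page.
    sleep_fn is injectable so the pacing can be checked without actually waiting.
    """
    for page in range(1, page_num + 1):
        fetch_page(page)
        # Pause after every `pause_every`-th page, but not after the last page.
        if page % pause_every == 0 and page < page_num:
            sleep_fn(pause_seconds)
```

For example, `crawl_pages(12, fetch_page)` would fetch pages 1 to 12 and pause twice, after pages 5 and 10.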

scratlzj avatar scratlzj commented on May 21, 2024

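Thanks for the reply. However, I don't think my account has been restricted by Weibo, because as long as I don't send the output to a txt file and only crawl in the terminal, no error occurs. So I used script to record all of the crawl output from the terminal and save it as a txt file, and successfully crawled more than ten thousand posts.

This error seems to be a problem on my side. I'll keep trying on my own, so the issue can be closed.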
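One possible explanation for errors that appear only when the output is piped: under Python 2, `sys.stdout.encoding` is `None` when stdout is not a terminal, which matches the `encode() argument 1 must be string, not None` lines in the tracebacks above. A minimal sketch of a fallback follows. It assumes UTF-8 is an acceptable default, and `output_encoding` and `to_bytes` are hypothetical helpers, not functions from weiboSpider.py.

```python
import sys


def output_encoding(stream=None, default="utf-8"):
    """Return the stream's encoding, or `default` when it is missing or None."""
    if stream is None:
        stream = sys.stdout
    return getattr(stream, "encoding", None) or default


def to_bytes(text, stream=None):
    """Encode text for the given stream, dropping characters it cannot represent."""
    return text.encode(output_encoding(stream), "ignore")
```

Calls such as `result.encode(sys.stdout.encoding)` could then go through `to_bytes(result)` instead. Setting the environment variable `PYTHONIOENCODING=utf-8` before running the script is another option that needs no code change.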

scratlzj avatar scratlzj commented on May 21, 2024

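I also tried crawling a hundred-odd posts from each of several other users. The problem above did not occur, and the .txt file was written successfully. It looks like an isolated case, so there is no need to worry.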
