jumper2014 / lianjia-beike-spider

A spider for lianjia.com and ke.com that collects housing-price data (residential communities, second-hand listings, rentals, and new builds) for 21 major Chinese cities including Beijing, Shanghai, Guangzhou, and Shenzhen. Stable, reliable, and fast. Supports CSV, MySQL, MongoDB, Excel, and JSON storage, works on Python 2 and 3, renders charts, and is richly commented. Please star the repo if you find it useful. For learning and reference only; do not use it commercially, at your own risk.

Python 97.50% TSQL 2.50%
lianjia spider crawler beike house

lianjia-beike-spider's Introduction

lianjia.com and ke.com spider

  • Crawls several kinds of housing-price data from lianjia.com and ke.com: residential communities (xiaoqu), listed second-hand homes, rentals, and new builds.
  • If you find it useful, please star the repo!
  • Supports 21 major Chinese cities, including Beijing, Shanghai, Guangzhou, and Shenzhen; works on both Python 2 and Python 3; page-based crawling keeps it stable and reliable; rich code comments make the code easy to understand and extend.
  • Field glossary: city, district (区县), area (板块), xiaoqu (小区, residential community), ershou (二手房, second-hand home), zufang (租房, rental), loupan (新房, new build).
  • Each area is stored as one CSV file, which can serve as raw data for further processing and analysis.
  • Chart rendering is supported.
  • If the page structure of lianjia.com or ke.com changes, please report it; I will do my best to keep the spider up to date.
  • This code is for learning and exchange only; do not use it commercially, at your own risk.

Install dependencies

  • pip install -r requirements.txt
  • Before running, add the project root to the PYTHONPATH environment variable.
  • Before running, pick the target site via the SPIDER_NAME variable in lib/spider/base_spider.py (see the sketch below).
  • To clean intermediate data, run python tool/clean.py.
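
A minimal sketch of the site switch mentioned above; only the SPIDER_NAME variable itself comes from this README, the constant names and values are assumptions:

# lib/spider/base_spider.py (illustrative sketch; exact names/values may differ)
LIANJIA = "lianjia"  # crawl lianjia.com
BEIKE = "ke"         # crawl ke.com

# Set this before running xiaoqu.py / ershou.py / zufang.py / loupan.py.
SPIDER_NAME = LIANJIA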

Quick Q&A

  • Q: How do I slow the crawl down to avoid an IP ban? A: See RANDOM_DELAY in base_spider.py (a throttling sketch follows this list).
  • Q: How do I reduce the number of concurrent crawler threads? A: See thread_pool_size in base_spider.py.
  • Q: Why can't I use xiaoqu_to_chart.py? A: That script currently supports macOS only.
  • Q: Is there another feedback channel? A: QQ group 635276285.
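
A hedged sketch of how the two throttling knobs named above might be used; the default values and the helper function are illustrative, not the project's actual code:

# Illustrative throttling knobs (names from the Q&A above; values assumed)
import random
import time

RANDOM_DELAY = True    # sleep a random interval between page requests
thread_pool_size = 2   # issue reports below suggest <= 2 threads avoids CAPTCHAs

def polite_sleep(max_seconds=30):
    # Sleep 0..max_seconds seconds when RANDOM_DELAY is enabled.
    if RANDOM_DELAY:
        time.sleep(random.randint(0, max_seconds))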

Xiaoqu (community) price data

  • Record format: crawl date, district, area, xiaoqu name, average listing price, number of listings.
  • Sample row: 20180221,浦东,川沙,恒纬家苑,32176元/m2,3套在售二手房
  • The data can be stored in MySQL or MongoDB for further analysis, such as sorting or computing average prices per district and area.
  • The MySQL schema can be created by importing tool/lianjia_xiaoqu.sql.
  • MySQL columns: city, date, district, area, xiaoqu, average listing price, listing count.
  • MySQL sample row: 上海 20180331 徐汇 衡山路 永嘉路621号 333333 0
  • MongoDB sample document: { "_id" : ObjectId("5ac0309332e3885598b3b751"), "city" : "上海", "district" : "黄浦", "area" : "五里桥", "date" : "20180331", "price" : 81805, "sale" : 11, "xiaoqu" : "桥一小区" }
  • Excel sample row: 上海 20180331 徐汇 衡山路 永嘉路621号 333333 0
  • Run python xiaoqu.py, enter a city code at the prompt, and press Enter to start collecting data into CSV files.
  • Run python xiaoqu.py city to start collecting automatically, where city is one of:
hz: Hangzhou, sz: Shenzhen, dl: Dalian, fs: Foshan
xm: Xiamen, dg: Dongguan, gz: Guangzhou, bj: Beijing
cd: Chengdu, sy: Shenyang, jn: Jinan, sh: Shanghai
tj: Tianjin, qd: Qingdao, cs: Changsha, su: Suzhou
cq: Chongqing, wh: Wuhan, hf: Hefei, yt: Yantai
nj: Nanjing
  • Set the database variable in xiaoqu_to_db.py to choose the final storage: MySQL, MongoDB, Excel, or JSON.
  • python xiaoqu_to_db.py imports today's collected CSV data into the database, following the prompts (by default it exports a single merged CSV file).
  • python xiaoqu_to_chart.py renders that merged CSV file as charts (a CSV-loading sketch follows this list).
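
A minimal Python 3 sketch of loading one day's per-area CSV files, assuming the directory layout from the "Result storage" section below with lianjia as the site; the field names are mine:

# Load one day's xiaoqu CSVs into a list of dicts (paths and fields illustrative)
import csv
import glob

FIELDS = ["date", "district", "area", "xiaoqu", "price", "on_sale"]

rows = []
for path in glob.glob("data/lianjia/xiaoqu/sh/20180221/*.csv"):
    with open(path, encoding="utf-8") as f:
        for record in csv.reader(f):
            rows.append(dict(zip(FIELDS, record)))

print(len(rows), "rows loaded")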

Listed second-hand home data

  • Collects listed second-hand home price data from lianjia.com, in the following format:
  • 20180405,浦东,万祥镇,祥安菊苑 3室2厅 258万,258万,祥安菊苑 | 3室2厅 | 126.58平米 | 南 | 毛坯
  • Run python ershou.py, enter a city code at the prompt, and press Enter to start collecting data into CSV files.
  • Run python ershou.py city to start collecting automatically (a row-parsing sketch follows).
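
A minimal sketch of splitting one such row into named fields; the field names are mine, and real rows may deviate from this shape:

# Split one ershou CSV row into named fields (names illustrative)
row = "20180405,浦东,万祥镇,祥安菊苑 3室2厅 258万,258万,祥安菊苑 | 3室2厅 | 126.58平米 | 南 | 毛坯"

date, district, area, title, price, desc = row.split(",", 5)
name, layout, size, facing, fitment = [p.strip() for p in desc.split("|")]
print(name, layout, size)  # -> 祥安菊苑 3室2厅 126.58平米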

Rental data

  • Collects listed rental data from lianjia.com, in the following format:
  • 20180407,浦东,御桥,仁和都市花园  ,3室2厅,100平米,8000
  • Run python zufang.py, enter a city code at the prompt, and press Enter to start collecting data into CSV files.
  • Run python zufang.py city to start collecting automatically.

New-build data

  • Collects new-build (loupan) data from lianjia.com, in the following format:
  • 20180407,上海星河湾,76000,1672万
  • Run python loupan.py, enter a city code at the prompt, and press Enter to start collecting data into CSV files.
  • Run python loupan.py city to start collecting automatically.

Result storage

  • A data directory is created under the project root to hold result files.
  • Xiaoqu price data is stored under data/site/xiaoqu/city/date.
  • Second-hand home data is stored under data/site/ershou/city/date.
  • Rental data is stored under data/site/zufang/city/date.
  • New-build data is stored under data/site/loupan/city/date.
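
A minimal Python 3 sketch of that layout in code; the helper name and the example values are mine:

# Build and create a result directory per the layout above
import os

def result_dir(site, kind, city, date):
    # data/<site>/<kind>/<city>/<date>, e.g. data/lianjia/ershou/sh/20180405
    return os.path.join("data", site, kind, city, date)

os.makedirs(result_dir("lianjia", "xiaoqu", "sh", "20180221"), exist_ok=True)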

Performance

  • Crawled 27k xiaoqu rows across 207 areas of Shanghai in about 300 seconds, roughly 90 rows per second.
Total crawl 207 areas.
Total cost 294.048109055 second to crawl 27256 data items.
  • Crawled 75k listed second-hand rows across 215 areas of Shanghai in about 1000 seconds, roughly 75 rows per second.
Total crawl 215 areas.
Total cost 1028.3090899 second to crawl 75448 data items.
  • Crawled 33k rental rows across 215 areas of Shanghai in about 300 seconds, roughly 110 rows per second (per the log below, not the originally claimed 150).
Total crawl 215 areas.
Total cost 299.7534770965576 second to crawl 32735 data items.
  • Crawled 400 Shanghai new-build listings in about 30 seconds.
Total crawl 400 loupan.
Total cost 29.757128953933716 second

Changelog

  • 2019/06/21 Removed webbrowser from requirements.txt (it is a standard-library module, not a pip package).
  • 2018/11/05 Added tool/download_ershou_image.py to download second-hand listing thumbnails.
  • 2018/11/01 Added thumbnail URLs to second-hand listing data.
  • 2018/10/28 Reworked xiaoqu_to_db.py to run automatically with command-line arguments.
  • 2018/10/25 Extracted the main crawling code into spider classes.
  • 2018/10/22 Refactored file names, directories, and code.
  • 2018/10/20 Added cleanup of intermediate files; can now crawl ke.com xiaoqu, new-build, second-hand, and rental data.
  • 2018/10/19 Added ke.com xiaoqu data crawling.
  • 2018/10/15 Added a Spider class and improved exception handling; no functional changes.
  • 2018/10/14 Allowed specifying the city on the command line instead of only interactively, to support automated crawling.
  • 2018/10/11 Added basic logging.
  • 2018/10/09 Added a chart of district average-price rankings.
  • 2018/10/07 Export xiaoqu prices to JSON and CSV files; added a chart of the most expensive xiaoqu.
  • 2018/10/05 Added a Referer header; added transparent-proxy fetching (unused).
  • 2018/06/01 Added User-Agent support.
  • 2018/04/07 Collect basic new-build price data.
  • 2018/04/07 Collect rental listing data.
  • 2018/04/05 Collect listed second-hand home data.
  • 2018/04/02 Import collected CSV data into Excel.
  • 2018/04/01 Support both Python 2 and Python 3.
  • 2018/04/01 Import collected CSV data into MongoDB.
  • 2018/03/31 Import collected CSV data into MySQL.
  • 2018/03/27 Bug fix: areas with only one page of xiaoqu data were not crawled correctly.
  • 2018/03/27 Added 5 cities; xiaoqu crawling now covers 21 cities.
  • 2018/03/10 Automatically fetch each city's district list; xiaoqu crawling now covers 16 cities.
  • 2018/03/06 Support collecting Beijing xiaoqu data.
  • 2018/02/21 Adapted to a Lianjia front-end update; replaced the third-party requests library with the built-in urllib2 to improve performance and reduce dependencies.
  • 2018/02/01 Support collecting Shanghai xiaoqu data.


lianjia-beike-spider's Issues

Crawl fails

Crawling ke.com raises requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='nj.ke.com', port=443): Max retries exceeded with url:

The output columns issue

I ran ershou.py; a data sample is below. The xiaoqu name is missing, and the last four columns carry no real meaning. I'm not sure whether this is by design or whether the site has been updated so the scraped content changed.

20200428 宝山 月浦 电梯房,双南户型,功能间全明,不沿街 180万 低楼层(共18层) 1995年建 2室1厅 67.2平米 https://ke-image.ljcdn.com/110000-inspection/pc1_CQEejQcfJ.jpg!m_fill w_280 h_210 f_jpg?from=ke.com
20200428 宝山 月浦 满五不唯一,全明格局,看房方便,诚心出售。必看好房 268万 中楼层(共24层) 2011年建 2室2厅 83.32平米 https://ke-image.ljcdn.com/110000-inspection/pc1_ItVNT4rKm.jpg!m_fill w_280 h_210 f_jpg?from=ke.com
20200428 宝山 月浦 月浦六七九村 一房一厅 非顶楼 精装修 112万诚售必看好房 112万 高楼层(共6层) 1994年建 1室1厅 36.94平米 https://ke-image.ljcdn.com/110000-inspection/pc1_AgGtusWON.jpg!m_fill w_280 h_210 f_jpg?from=ke.com

Suggestion: add a crawl delay

As a considerate crawler author, I suggest adding a random-delay option to the config file. If the target server buckles and bans crawlers, everyone loses. Also consider writing scraped data straight to the database instead of going through CSV.

xiaoqu_to_db.py cannot be run automatically from the command line

C:\Users\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymysql\cursors.py:170:`

Warning: (1366, "Incorrect string value: '\xD6\xD0\xB9\xFA\xB1\xEA...' for column 'VARIABLE_VALUE' at row 519")
result = self._query(query)
Which city data do you want to save ?
bj: 北京, cd: 成都, cq: 重庆, cs: 长沙
dg: 东莞, dl: 大连, fs: 佛山, gz: 广州
hz: 杭州, hf: 合肥, jn: 济南, nj: 南京
qd: 青岛, sh: 上海, sz: 深圳, su: 苏州
sy: 沈阳, tj: 天津, wh: 武汉, xm: 厦门
yt: 烟台,

Scripts such as xiaoqu.py can be run directly with a command-line argument, but this database-import script cannot, which makes automated collection inconvenient.

Nice project; two small issues

  1. Some xiaoqu names contain English commas, so splitting on commas when writing the CSV adds an extra column for those rows; either switch to tab-separated output or replace the commas in the written content (see the sketch below).
  2. On ke.com (city sites of the form xx.ke.com), some cities have no xiaoqu data at all; how can one tell whether a city has xiaoqu data?
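
One standard fix for item 1, as a sketch: Python's csv module quotes fields that contain the delimiter, so names with embedded commas survive a round trip without switching to tabs (the file name and sample row are illustrative):

# csv.writer quotes fields containing commas, so they stay a single field
import csv

with open("xiaoqu.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerow(
        ["20180221", "浦东", "川沙", "小区A,B座", "32176元/m2"])

with open("xiaoqu.csv", newline="", encoding="utf-8") as f:
    print(next(csv.reader(f)))  # the comma-bearing name is still one field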

Urgent: keep hitting the anti-crawler CAPTCHA; what can I do?

How did you manage 70k rows in 300 s?
With thread_pool_size = 5 and RANDOM_DELAY = 30, ershou.py hit the CAPTCHA after only 100-odd rows. In my tests only thread_pool_size <= 2 works.
Note: I modified the code slightly to match the current Lianjia second-hand pages.

list index out of range

Hi, running zufang.py and choosing Beijing produces the following error output:

... (runs normally up to this point)

http://bj.lianjia.com/zufang/tianningsi1/
http://bj.lianjia.com/zufang/xuanwumen12/
Warning: only find one page for dongdan
list index out of range
http://bj.lianjia.com/zufang/dongdan/pg1
python(19966,0x700011453000) malloc: *** error for object 0x7ffd472deff8: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
[1] 19966 abort ~/anaconda/bin/python zufang.py

Please change the naming logic for generated xiaoqu files

When writing the scraped xiaoqu data to CSV or the database, the output name should carry the date rather than overwrite the previous file.
For example, by default xiaoqu_to_db.py writes xiaoqu.csv, so an automated daily run overwrites the previous day's CSV (see the sketch below).
When writing to the database, consider naming the table after the city instead of xiaoqu, which would cut down the data-classification work.
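
A minimal sketch of the suggested naming; the names are illustrative, not the project's actual output:

# Stamp exports with the date and city so daily runs don't overwrite each other
import datetime

today = datetime.date.today().strftime("%Y%m%d")
csv_name = "xiaoqu-{}.csv".format(today)  # e.g. xiaoqu-20181028.csv
table_name = "xiaoqu_{}".format("sh")     # per-city table, e.g. xiaoqu_sh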

Running the xiaoqu and ershou scripts raises this error.

http://bj.lianjia.com/xiaoqu/dingfuzhuang/
raise ReadTimeout(e, request=request)
ReadTimeout: HTTPSConnectionPool(host='bj.lianjia.com', port=443): Read timed out. (read timeout=10)
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/threadpool.py", line 158, in run
result = request.callable(*request.args, **request.kwds)
File "xiaoqu.py", line 34, in collect_xiaoqu_data
xqs = get_xiaoqu_info(city_name, area_name)
File "/Users/kaiyingwu/Downloads/lianjia-spider-master/lib/url/xiaoqu.py", line 101, in get_xiaoqu_info
response = requests.get(page, timeout=10, headers=headers)
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/sessions.py", line 617, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/sessions.py", line 177, in resolve_redirects
**adapter_kwargs
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/sessions.py", line 628, in send
r.content
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/models.py", line 755, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/Library/Python/2.7/site-packages/requests-2.11.1-py2.7.egg/requests/models.py", line 683, in generate
raise ConnectionError(e)

About Python 2 support

All the print statements in the code use Python 3 syntax, so it cannot actually run on Python 2. In my view, the version check added in get_city() could simply be removed.

xiaoqu_to_chart.py fails: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data

On macOS, when running xiaoqu_to_chart.py to generate the HTML charts, xiaoqu.html is generated correctly, but generating district.html fails with the error below:

Traceback (most recent call last):
  File "./xiaoqu_to_chart.py", line 63, in <module>
    bar.render(path="district.html")
  File "/usr/local/lib/python2.7/site-packages/pyecharts/base.py", line 146, in render
    **kwargs
  File "/usr/local/lib/python2.7/site-packages/pyecharts/engine.py", line 220, in render_chart_to_file
    html = tpl.render(**kwargs)
  File "/usr/local/lib/python2.7/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/usr/local/lib/python2.7/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/site-packages/pyecharts/templates/simple_chart.html", line 10, in top-level template code
    {{ echarts_js_content(chart) }}
  File "/usr/local/lib/python2.7/site-packages/pyecharts/engine.py", line 129, in echarts_js_content
    return Markup(EMBED_SCRIPT_FORMATTER.format(generate_js_content(*charts)))
  File "/usr/local/lib/python2.7/site-packages/pyecharts/engine.py", line 101, in generate_js_content
    javascript_snippet = TRANSLATOR.translate(chart.options)
  File "/usr/local/lib/python2.7/site-packages/pyecharts_javascripthon/api.py", line 127, in translate
    option_snippet = json.dumps(options, indent=4, cls=self.json_encoder)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 251, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 209, in encode
    chunks = list(chunks)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 443, in _iterencode
    for chunk in _iterencode(o, _current_indent_level):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 443, in _iterencode
    for chunk in _iterencode(o, _current_indent_level):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 431, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 313, in _iterencode_list
    yield buf + _encoder(value)

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data

Connection timeout

Running ershou.py prints:
HTTPSConnectionPool(host='yt.lianjia.com', port=443): Max retries exceeded with url: /xiaoqu/fushan (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x10a4308d0>, 'Connection to yt.lianjia.com timed out. (connect timeout=10)')) fushan: Area list: None Traceback (most recent call last): File "ershou.py", line 135, in <module> areas.extend(areas_of_district) TypeError: 'NoneType' object is not iterable

Is it being blocked for requesting too frequently?

Adding new cities

Is there a way to add new cities within the framework, for example Shijiazhuang or Harbin, which also have large populations? Which parts of the code would need changes?

Errors start when writing files

Hi, the code runs fine at first, but seems to start failing once it begins writing files:

  1. Chinese output becomes mojibake;

  2. I don't really understand multithreading; what does the error below mean?
    Thanks.

huyuanzhen   File "C:\ProgramData\Anaconda2\envs\LJ27\lib\site-packages\threadpool.py", line 158, in run
 淇濆瓨鏂囦欢璺緞:淇濆瓨鏂囦欢璺緞:  d:\python\sp\lj20180225/data/lianjia/sh/20180225/sanlin.txtd:\python\sp\lj20180225/data/lianjia/sh/20180225/shuyuanzhen.txt

   result = request.callable(*request.args, **request.kwds)
  File "xiaoqu.py", line 26, in collect_xiaoqu_data
    print "寮€濮嬬埇鍙栨澘鍧?", area_name, "淇濆瓨鏂囦欢璺緞:", csv_file
Traceback (most recent call last):
  File "xiaoqu.py", line 94, in <module>
    pool.poll()
  File "C:\ProgramData\Anaconda2\envs\LJ27\lib\site-packages\threadpool.py", line 315, in poll
    request.exc_callback(request, result)
  File "C:\ProgramData\Anaconda2\envs\LJ27\lib\site-packages\threadpool.py", line 78, in _handle_thread_exception
    traceback.print_exception(*exc_info)
  File "C:\ProgramData\Anaconda2\envs\LJ27\lib\traceback.py", line 125, in print_exception
    print_tb(tb, limit, file)
  File "C:\ProgramData\Anaconda2\envs\LJ27\lib\traceback.py", line 70, in print_tb
    if line: _print(file, '    ' + line.strip())
  File "C:\ProgramData\Anaconda2\envs\LJ27\lib\traceback.py", line 13, in _print
    file.write(str+terminator)
IOError: [Errno 0] Error

Is Lianjia detecting the crawler? The run keeps getting killed

Districts: ['dongcheng', 'xicheng', 'chaoyang', 'haidian', 'fengtai', 'shijingshan', 'tongzhou', 'changping', 'daxing', 'yizhuangkaifaqu', 'shunyi', 'fangshan', 'mentougou', 'pinggu', 'huairou', 'miyun', 'yanqing', 'yanjiao', 'xianghe']
dongcheng: Area list: []
xicheng: Area list: ['baizhifang1', 'caihuying', 'changchunjie', 'chongwenmen', 'chegongzhuang1', 'dianmen', 'deshengmen', 'fuchengmen', 'guanganmen', 'guanyuan', 'jinrongjie', 'liupukang', 'madian1', 'maliandao1', 'muxidi1', 'niujie', 'taoranting1', 'taipingqiao1', 'tianningsi1', 'xisi1', 'xuanwumen12', 'xizhimen1', 'xinjiekou2', 'xidan', 'yuetan', 'youanmennei11']
chaoyang: Area list: ['andingmen', 'anzhen1', 'aolinpikegongyuan11', 'beiyuan2', 'beigongda', 'baiziwan', 'chengshousi1', 'changying', 'chaoyangmenwai1', 'cbd', 'chaoqing', 'chaoyanggongyuan', 'dongzhimen', 'dongba', 'dawanglu', 'dongdaqiao', 'dashanzi', 'dougezhuang', 'dingfuzhuang', 'fangzhuang1', 'fatou', 'guangqumen', 'gongti', 'gaobeidian', 'guozhan1', 'ganluyuan', 'guanzhuang', 'hepingli', 'huanlegu', 'huixinxijie', 'hongmiao', 'huaweiqiao', 'jianxiangqiao1', 'jiuxianqiao', 'jinsong', 'jianguomenwai', 'lishuiqiao1', 'madian1', 'nongzhanguan', 'nanshatan1', 'panjiayuan1', 'sanyuanqiao', 'shaoyaoju', 'shifoying', 'shilibao', 'shoudoujichang1', 'shuangjing', 'shilihe', 'shibalidian1', 'shuangqiao', 'sanlitun', 'sihui', 'tongzhoubeiyuan', 'tuanjiehu', 'taiyanggong', 'tianshuiyuan', 'wangjing', 'xibahe', 'yayuncun', 'yayuncunxiaoying', 'yansha1', 'zhongyangbieshuqu1', 'zhaoyangqita']
haidian: Area list: ['aolinpikegongyuan11', 'anningzhuang1', 'baishiqiao1', 'beitaipingzhuang', 'changpingqita1', 'changwa', 'dinghuisi', 'erlizhuang', 'gongzhufen', 'ganjiakou', 'haidianqita1', 'haidianbeibuxinqu1', 'junbo1', 'liuliqiao1', 'mudanyuan', 'madian1', 'malianwa', 'qinghe11', 'suzhouqiao', 'shangdi1', 'shijicheng', 'sijiqing', 'shuangyushu', 'tiancun1', 'wudaokou', 'weigongcun', 'wukesong1', 'wanliu', 'wanshoulu1', 'xishan21', 'xisanqi1', 'xibeiwang', 'xueyuanlu1', 'xiaoxitian1', 'xizhimen1', 'xinjiekou2', 'xierqi1', 'yangzhuang1', 'yuquanlu11', 'yuanmingyuan', 'yiheyuan', 'zhichunlu', 'zaojunmiao', 'zhongguancun', 'zizhuqiao']
fengtai: Area list: ['beidadi', 'beijingnanzhan1', 'chengshousi1', 'caoqiao', 'caihuying', 'dahongmen', 'fengtaiqita1', 'fangzhuang1', 'guanganmen', 'heyi', 'huaxiang', 'jiugong1', 'jiaomen', 'kejiyuanqu', 'kandanqiao', 'lize', 'liujiayao', 'lugouqiao1', 'liuliqiao1', 'muxiyuan1', 'majiabao', 'maliandao1', 'puhuangyu', 'qingta1', 'qilizhuang', 'songjiazhuang', 'shilihe', 'taipingqiao1', 'wulidian', 'xihongmen', 'xiluoyuan', 'xingong', 'yuegezhuang', 'yuquanying', 'youanmenwai', 'yangqiao1', 'zhaogongkou']
shijingshan: Area list: ['bajiao1', 'chengzi', 'gucheng', 'laoshan1', 'lugu1', 'pingguoyuan1', 'shijingshanqita1', 'yangzhuang1', 'yuquanlu11']
tongzhou: Area list: ['beiguan', 'daxingqita11', 'guoyuan1', 'jiukeshu12', 'luyuan', 'liyuan', 'linheli', 'majuqiao1', 'qiaozhuang', 'shoudoujichang1', 'tongzhoubeiyuan', 'tongzhouqita11', 'wuyihuayuan', 'xinhuadajie', 'yizhuang1', 'yuqiao']
HTTPSConnectionPool(host='bj.lianjia.com', port=443): Read timed out. (read timeout=10)
changping: Area list: None
Traceback (most recent call last):
File "ershou.py", line 134, in
areas.extend(areas_of_district)
TypeError: 'NoneType' object is not iterable

No data is scraped

[nonroot@fbox lianjia-beike-spider]$ python ershou.py nj
Today date is: 20181125
Target site is lianjia.com
City is: nj
OK, start to crawl 南京
City: nj
Districts: []
('Area:', [])
('District and areas:', {})
Total crawl 0 areas.
Total cost 5.6205971241 second to crawl 0 data items.

Other cities return nothing either. I also could not install webbrowser; could that be related?

pip install -r requirements.txt fails with a flood of errors

Collecting pillow (from pyecharts-snapshot->-r requirements.txt (line 12))
Using cached https://files.pythonhosted.org/packages/40/50/406ea88c6d3c4fdffd45f2cf7528628586e1651e5c6f95f0193870832175/Pillow-6.2.0-cp35-cp35m-win_amd64.whl
Collecting pyppeteer>=0.0.25 (from pyecharts-snapshot->-r requirements.txt (line 12))
Using cached https://files.pythonhosted.org/packages/b0/16/a5e8d617994cac605f972523bb25f12e3ff9c30baee29b4a9c50467229d9/pyppeteer-0.0.25.tar.gz
ERROR: Command errored out with exit status 1:
command: 'c:\users\sunpeng3\appdata\local\programs\python\python35\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\sunpeng3\AppData\Local\Temp\pip-install-5xao5og2\pyppeteer\setup.py'"'"'; file='"'"'C:\Users\sunpeng3\AppData\Local\Temp\pip-install-5xao5og2\pyppeteer\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
cwd: C:\Users\sunpeng3\AppData\Local\Temp\pip-install-5xao5og2\pyppeteer
Complete output (32 lines):
Requirement already satisfied: py-backwards in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (0.7)
Requirement already satisfied: colorama in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from py-backwards) (0.4.1)
Requirement already satisfied: py-backwards-astunparse in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from py-backwards) (1.5.0.post3)
Requirement already satisfied: typed-ast in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from py-backwards) (1.4.0)
Requirement already satisfied: autopep8 in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from py-backwards) (1.4.4)
Requirement already satisfied: wheel<1.0,>=0.23.0 in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from py-backwards-astunparse->py-backwards) (0.33.6)
Requirement already satisfied: six<2.0,>=1.6.1 in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from py-backwards-astunparse->py-backwards) (1.12.0)
Requirement already satisfied: pycodestyle>=2.4.0 in c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages (from autopep8->py-backwards) (2.5.0)
Traceback (most recent call last):
File "C:\Users\sunpeng3\AppData\Local\Temp\pip-install-5xao5og2\pyppeteer\setup.py", line 19, in
from py_backwards.compiler import compile_files
File "c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages\py_backwards\compiler.py", line 8, in
from .files import get_input_output_paths, InputOutput
File "c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages\py_backwards\files.py", line 9, in
from .exceptions import InvalidInputOutput, InputDoesntExists
File "c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages\py_backwards\exceptions.py", line 1, in
from typing import Type, TYPE_CHECKING
ImportError: cannot import name 'Type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\sunpeng3\AppData\Local\Temp\pip-install-5xao5og2\pyppeteer\setup.py", line 25, in <module>
    from py_backwards.compiler import compile_files
  File "c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages\py_backwards\compiler.py", line 8, in <module>
    from .files import get_input_output_paths, InputOutput
  File "c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages\py_backwards\files.py", line 9, in <module>
    from .exceptions import InvalidInputOutput, InputDoesntExists
  File "c:\users\sunpeng3\appdata\local\programs\python\python35\lib\site-packages\py_backwards\exceptions.py", line 1, in <module>
    from typing import Type, TYPE_CHECKING
ImportError: cannot import name 'Type'
----------------------------------------

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Extracting room layout and floor area

Because the data are irregular, extracting fields by index when crawling ke.com often yields bad values; I suggest regular expressions, e.g.:

import re

# layout like "3室2厅1卫": digits alternating with CJK characters
pattern_layout = re.compile(r'[0-9]{1,2}[\u4e00-\u9fa5][0-9]{1,2}[\u4e00-\u9fa5][0-9]{1,2}[\u4e00-\u9fa5]')
pattern_size = re.compile(r'([0-9]{1,3})㎡')

# desc2 is the page element holding the listing description (e.g. a bs4 tag)
descs = desc2.text.strip().replace("\n", "").replace(" ", "").replace("/", "")
m_layout = pattern_layout.search(descs)
m_size = pattern_size.search(descs)
layout = m_layout.group() if m_layout else ""  # search() may return None
size = m_size.group() if m_size else ""
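
A quick check of the patterns above on a made-up description string (the sample text is illustrative):

sample = "3室2厅1卫89㎡南北"
print(pattern_layout.search(sample).group())  # -> 3室2厅1卫
print(pattern_size.search(sample).group(1))   # -> 89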

Is there historical data?

Is any historical data available?
I'd like to know whether prices have risen or fallen.
The market seems to be heading down.

A browser-related problem

Hi, thanks for building and sharing this.
Running pip install -r requirements.txt gives the following errors:
Could not find a version that satisfies the requirement webbrowser (from -r requirements.txt (line 13)) (from versions: none)
ERROR: No matching distribution found for webbrowser (from -r requirements.txt (line 13))
Is this related to my browser version? Following advice found online, I upgraded pip to the latest version, but it did not help. How should I fix this?
