henryhaohao / Wenshu_Spider
:rainbow: Wenshu_Spider - crawls case data from the ** Judgements Online site with the Scrapy framework (latest version: 2019-1-9)
Home Page: http://wenshu.court.gov.cn/
License: MIT License
(base) E:\2018CoutDocu\1HenryhaohaoWenshu_Spider\Wenshu_Spider-master\Wenshu_Project>scrapy crawl Wenshu
Traceback (most recent call last):
  File "C:\Anaconda3\Scripts\scrapy-script.py", line 10, in <module>
    sys.exit(execute())
  File "C:\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "C:\Anaconda3\lib\site-packages\scrapy\crawler.py", line 252, in __init__
    log_scrapy_info(self.settings)
  File "C:\Anaconda3\lib\site-packages\scrapy\utils\log.py", line 149, in log_scrapy_info
    for name, version in scrapy_components_versions()
  File "C:\Anaconda3\lib\site-packages\scrapy\utils\versions.py", line 35, in scrapy_components_versions
    ("pyOpenSSL", _get_openssl_version()),
  File "C:\Anaconda3\lib\site-packages\scrapy\utils\versions.py", line 43, in _get_openssl_version
    import OpenSSL
  File "C:\Anaconda3\lib\site-packages\OpenSSL\__init__.py", line 8, in <module>
    from OpenSSL import rand, crypto, SSL
  File "C:\Anaconda3\lib\site-packages\OpenSSL\rand.py", line 12, in <module>
    from OpenSSL._util import (
  File "C:\Anaconda3\lib\site-packages\OpenSSL\_util.py", line 6, in <module>
    from cryptography.hazmat.bindings.openssl.binding import Binding
  File "C:\Anaconda3\lib\site-packages\cryptography\hazmat\bindings\openssl\binding.py", line 12, in <module>
    from cryptography.hazmat.bindings._openssl import ffi, lib
ImportError: DLL load failed: The operating system cannot run %1.
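A quick way to narrow down which binary module fails to load. The usual suspect for this error is a mismatched cryptography/pyOpenSSL pair inside the Anaconda environment, and reinstalling both with pip often fixes it, though that is an assumption, not something confirmed in this thread. The helper below is hypothetical:

```python
import importlib

def check_crypto_stack(mods=("cryptography", "OpenSSL")):
    """Try importing each binary-backed module and report which one breaks."""
    status = {}
    for mod in mods:
        try:
            importlib.import_module(mod)
            status[mod] = "OK"
        except ImportError as exc:
            # On Windows this captures messages like "DLL load failed: ..."
            status[mod] = "FAILED: %s" % exc
    return status
```

Running `check_crypto_stack()` in the same environment Scrapy uses shows whether cryptography itself fails, or only pyOpenSSL on top of it.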
There are several errors I don't quite understand. Could you help me take a look?
1.
Traceback (most recent call last):
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/wenshu_monitor/Wenshu/spiders/wenshu.py", line 118, in get_docid
    result = eval(json.loads(html))
  File "<string>", line 1
    ["RunEval":"w61aSW7Cg0AQfAvClg/Cg8KIw7IBw6TCk8KfwpDDowhZFnZiDjHDkcKYwpwsw789QCwCw4wCwoTCrRlKQmUxa3V3dQ8gby/DkcOpfAtFw7TClcOsw54SEV0/XsOfRcO8wrnCvxzDhT4+wp3CmcOneDwAwpDChhc4AcKwDsKFw5hgw4hKw5IVVQnCsQXDgMOvw7AsAMORwoPDhxBcw7gTwo7CgcOBwoDDlUcAwrrCgyvDoUXDmA9AaBA4AMKiAj/DgTjCmATCgBBswrYGADkBwqApAAA6wrPDnC4AVXB9w7/CsMObwoTDscO1wpbCiMOvMMKJw4XDhj/DsEPCkF7CjHnDt0c2wqzCuMOuD8KXw7/DjsOkKQTCgcOHwpzCvMODw6XDlTXCs8KOw6bCnsO8dsO7w7fCl2vDjsKyw6Z0BHfCsh4ew7DDtMKnwrZhdcK4w7PCnMKgw5/CrMOGWznDlsOIwqnCvMKawrrCvcOYKmctwrtIwqLCoMK0w4HCsHZBwrLCnUdLwrfCjcK0W8KOw6gywqzDs8OYw79Nw6gxwqvDr8OUQcOmD8K3SFUecsKmw6YXwqslw71wOw/DqcKNUsKuwpjDtMOdTcOeM0VEYVfCpcKiwqXCrUV5NcOWHcKSNk3CtXrCjy3CmkrDryhUJ8OdR2PCpsOqwqwcwpnDhMO0ZmscUG7DlcKdwpdTwolAOsKIw5U8Ryk4w7XCnU3DjwTDj8OHwq7CpcOJZSbCuTXDrz0nwrHCgQnDkD5dF3DDtsOTYGEsCFQMb2nCvcKqwq7CqsOrbMO9ccKrw5paXsKQw7d4RSPDpDzDsMORw6teLyLCg8K8wqzDukBBSxbCusOUwpVywrvDvTfCqMK+dlp0wpzCkMKMNjVmw6ZEwpYrwocpwo8nPhtQw6TCu0Z6w4zCj0ZrdMOpw6wcwpfDoMKfZsKiFsK9K8KOwrswwrnCisOnMsOXw78B",]
    ^
SyntaxError: invalid syntax
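The SyntaxError happens because eval() is fed a payload that is not valid Python: the server occasionally answers with a malformed array like ["RunEval":"...",] instead of a proper JSON list. A minimal defensive sketch (parse_listcontent is a hypothetical helper, assuming the normal response is a double-encoded JSON array, as in the samples elsewhere in this thread):

```python
import json

def parse_listcontent(html):
    """Decode a ListContent response without eval(); return None if malformed."""
    text = json.loads(html)  # outer decode: the array, still as a string
    try:
        return json.loads(text)  # inner decode: the actual list of dicts
    except json.JSONDecodeError:
        return None  # anti-crawler / malformed payload: caller should retry
```

Returning None instead of raising lets the spider re-queue the request rather than crash the callback.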
2.
Traceback (most recent call last):
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/wenshu_monitor/Wenshu/spiders/wenshu.py", line 123, in get_docid
    docid = self.js_2.call('getdocid', runeval, casewenshuid)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_abstract_runtime_context.py", line 37, in call
    return self._call(name, *args)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 92, in _call
    return self._eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 78, in _eval
    return self._exec_(code)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_abstract_runtime_context.py", line 18, in exec_
    return self._exec_(source)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 88, in _exec_
    return self._extract_result(output)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 167, in _extract_result
    raise ProgramError(value)
execjs._exceptions.ProgramError: Error: Malformed UTF-8 data
3.
Traceback (most recent call last):
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/wenshu_monitor/Wenshu/spiders/wenshu.py", line 150, in get_detail
    content_1 = json.loads(re.search(r'JSON.stringify\((.*?)\);', html).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
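The AttributeError means re.search found no JSON.stringify(...) call in the page; the site sometimes serves a page without it, for example when it blocks the crawler. A guarded sketch (extract_stringify is a hypothetical helper; note the escaped \( and \) in the pattern):

```python
import json
import re

STRINGIFY_RE = re.compile(r'JSON\.stringify\((.*?)\);')

def extract_stringify(html):
    """Return the parsed JSON.stringify(...) argument, or None when absent."""
    m = STRINGIFY_RE.search(html)
    if m is None:
        return None  # page has no JSON.stringify(...): treat as a failed fetch
    return json.loads(m.group(1))
```

The None return lets get_detail re-issue the request instead of dying on .group(1).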
Also, the result counts under my filter conditions (the number of matching documents, for example) are wrong; they look like randomly generated numbers.
Could you explain? Thanks!
As the title says: could you share a copy of the crawled data? I don't know Python, and after downloading the source I couldn't get it to crawl anything, but I would really like a copy of the data!
2018-12-17 10:11:06 [scrapy.core.engine] INFO: Spider opened
2018-12-17 10:11:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-17 10:11:08 [scrapy.core.scraper] ERROR: Spider error processing <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/ValiCode/GetCode)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/koy/git/wenshu/Wenshu_Spider/Wenshu_Project/Wenshu/spiders/wenshu.py", line 66, in get_content
    result = eval(json.loads(html))
  File "<string>", line 1, in <module>
NameError: name 'remind' is not defined
2018-12-17 10:11:08 [scrapy.core.scraper] ERROR: Spider error processing <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/ValiCode/GetCode)
2018-12-17 10:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
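The NameError: name 'remind' is not defined comes from the eval(json.loads(html)) pattern: when the site blocks a request it returns JavaScript such as remind("...") instead of data, and eval() then tries to call a Python function named remind. A detection sketch (is_blocked is a hypothetical helper):

```python
import json

def is_blocked(html):
    """True when the decoded ListContent body is the anti-crawler remind(...) script."""
    text = json.loads(html)
    return text.lstrip().startswith("remind")
```

When is_blocked(html) is true, re-issuing the request (ideally through a different proxy) is more useful than trying to parse the body.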
"[{\"RunEval\":\"w63Cm8OdbsKCQBDChcKfBcOjw4USwprCvgDDscOKR8Oow6XChsKYBm3DtcKiw5Igwr0ywr57w4FSKsKwAsKWRXbDpUvDiDHCsD9zw6bDjMOsLGzDonzCu1tvDmHCvMO7TBYvScK8w5vCvz/Cv8OFw5HDh3LDuxovwqPDtUZ4wo4nA8OAaHhCBGAaGcOyCMKOTGTCuVLClcKILcKAw64oCxA9FCPCuEgJwpBOw7gKEBXCsg0gwqfCkBc5AMOiwoPDuABqwqMJegJIDsKQMMOowoRzCMKDw4PDhMKQCAAkAsOew6A1QMKeAcKABnB9f8K1wpjChcORw77CkMOEX2ESw4UzfyVXQXoJw6EdT1nCt8OiOsKeXMO5M1LCphAEwp5ww44Nwq4sw4/CmTXCtMK3wpxvw6d/w786wpIie8Kcw7bCkE7DliIDwpnDvlQMwpbCuzvDucKAw6vDhirCvMKVfRs5XcOOwqZ+XsKYOsKzwq5LVMKjwqDDtMKhYcOuwoJkN0Uvbltpd8OscUvCt8Kbw7vDvm9Awo9RfcKHahnDnycXwobDoMKow4/CqsOmw6l0w45UXcKLwqrCqsOSw5vDtSHDpFQ4USrCj8KDwpnCu8ODw6zCg8Kmw6rCgHEYwpfDhMKVwp1sw7BAw53DlcKOw4JYw77CjnBHw7vCv3LDi8OSwrIbR8KLwpFCYMKQw60iw7llw6sLwpvChcKiGMK7C8KPTsOFIFfDmsO5wphGw5ZcUsKGM8KDw57Co8OTwrPChmNOVFQ+w647HV3DmMKqwrrDojDCo8Omwr9UfQ1df1nDq3UcY8KXwr5Qw5bDhcKHY07DrcKhacKsZMO5woPDrsOHw499w4nCscKHGsOEw4fCtl3CrGkhw5crR8OTOn7CmF3Dh1bDnsK2Sl1kwpbDlFMPR1JpWnVudA84wqbCmA5vG8KLwrErXMO/Gw==\",\"Count\":\"1\"},{\"裁判要旨段原文\":\"本院认为,原告方国生在被告**人寿保险股份有限公司唐山市路北支公司投保了《保险合同》,系双方真实意思表示,合法有效,双方均应按约定及法律规定履行相应的权利义务。第三人王燕军提交的书证作为授权委托书应当准确的载明代理的具体事项、权限、时间等内容,该书证未写明时间\",\"不公开理由\":\"\",\"案件类型\":\"2\",\"裁判日期\":\"2017-03-28\",\"案件名称\":\"方国生与**人寿保险股份有限公司唐山市路北支公司保险纠纷一审民事判决书\",\"文书ID\":\"FcOOwrsRBDEIBcOBwpTDhB/DjCcEw7nCh3R7w544U8OVwpIcacKMw60Dwr7CmyzDuWXDicOsS8KBXms2w6cVwqtpPcKlAlR2wpPCmsOfHcKIOXTDojrCoyVfwrPCuMKTw5fCnDxCS1IRc8K1woPCucO2wpHDsBDDtywrwpgzw6tew7fCrcOYw4fCpsOew7owGUFtEgllw4rDr8Kyw4ARwr49eGfDlQM2wrQiwonCjzTDhSHDiMORwoB7w4Eqw45fwrPDuyDDqXXDvMKYw78A\",\"审判程序\":\"一审\",\"案号\":\"(2016)冀0203民初1093号\",\"法院名称\":\"唐山市路北区人民法院\"}]"
For example, the response above cannot be parsed.
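The sample above is double-encoded: it is a JSON string whose contents are themselves a JSON array, so two json.loads passes (no eval needed) recover the records. A sketch with a shortened, hypothetical payload of the same shape:

```python
import json

# Shortened stand-in for the response above: a JSON string wrapping a JSON array.
raw = ('"[{\\"RunEval\\":\\"abc\\",\\"Count\\":\\"1\\"},'
       '{\\"案号\\":\\"(2016)冀0203民初1093号\\",\\"审判程序\\":\\"一审\\"}]"')

records = json.loads(json.loads(raw))  # first pass -> str, second pass -> list
count = records[0]["Count"]            # the result count, as a string
case_no = records[1]["案号"]           # the docket number
```

Fields with Chinese keys (案号, 审判程序, ...) are ordinary dict keys after decoding.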
Thanks a lot! I tested it and it works, haha. Impressive. I'll open-source my own work someday too.
I set everything up as instructed, but after entering the command nothing happens.
Since the site currently returns so few results per query, my idea was to filter by location plus date. But it seems impossible to crawl by a single date range: a condition such as param:"案件类型:民事案件,中级法院:北京市第二中级人民法院,裁判日期:2018-11-13 TO 2018-11-20" returns no results, while dropping the date and using param:"案件类型:民事案件,中级法院:北京市第二中级人民法院" works.
I am quite puzzled, and there does not seem to be any other way to filter.
Thanks.
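For reference, the param value is just a comma-separated list of 条件:值 pairs, so it can be assembled like this (the keys and values below are the ones quoted above; whether the server accepts the date-range clause is exactly the open question):

```python
# Build the search `param` string from (condition, value) pairs.
conds = [
    ("案件类型", "民事案件"),
    ("中级法院", "北京市第二中级人民法院"),
    ("裁判日期", "2018-11-13 TO 2018-11-20"),  # the clause that returns nothing
]
param = ",".join("%s:%s" % (k, v) for k, v in conds)
```

Dropping the last pair from conds reproduces the variant that does return results.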
import time

import pymongo
from scrapy.conf import settings
from twisted.internet import defer, reactor

class MyspiderPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        docname = settings['MONGODB_DOCNAME']
        self.client = pymongo.MongoClient(host=host, port=port)
        db = self.client[dbname]
        self.post = db[docname]

    def close_spider(self, spider):
        self.client.close()

    # The part below is the key point
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        out = defer.Deferred()
        reactor.callInThread(self._insert, item, out, spider)
        yield out
        defer.returnValue(item)

    def _insert(self, item, out, spider):
        time.sleep(10)
        try:
            data = dict(item)
            self.post.insert(data)
            reactor.callFromThread(out.callback, item)
        except BaseException:
            # Same index means duplicate data; catch and swallow the error
            spider.logger.debug('duplicate key error collection')
            reactor.callFromThread(out.callback, item)
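For the pipeline above to run at all, it must be enabled in settings.py together with the MongoDB settings it reads; the module path and values below are guesses based on this project's layout, not confirmed names:

```python
# settings.py (module path and values are assumptions, for illustration)
ITEM_PIPELINES = {
    'Wenshu.pipelines.MyspiderPipeline': 300,
}

MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'wenshu'
MONGODB_DOCNAME = 'cases'
```

The number 300 is the pipeline's priority; lower numbers run earlier.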
The line docid = self.js_2.call('getdocid', runeval, casewenshuid) never returns a result; execution stops there every time.
I'm running it with PyCharm + Conda. After starting, it keeps showing a 202 response:
<202 http://wenshu.court.gov.cn/list/list/?sorttype=1>
Re-requesting ************
2019-01-02 15:08:21 [scrapy.extensions.logstats] INFO: Crawled 1293 pages (at 0 pages/min), scraped 585 items (at 0 items/min)
...
2019-01-02 15:25:21 [scrapy.extensions.logstats] INFO: Crawled 1293 pages (at 0 pages/min), scraped 585 items (at 0 items/min)
When fetching the docid it keeps raising "execjs._exceptions.ProgramError: TypeError: 'key' is null or not an object".