
FunpySpiderSearchEngine: Introduction

Word2vec-powered personalized search: Scrapy 2.3.0 (crawling) + ElasticSearch 7.9.1 (data storage and external RESTful API) + Django 3.1.1 (search site)

Build Status MIT Licence

This repository contains the crawler side, which writes scraped data into ElasticSearch. To run the complete search engine you also need the Django site project: https://github.com/mtianyan/mtianyanSearch

Available features:

  1. Crawl Zhihu questions and answers and store them in ElasticSearch
  2. Full-text search (used together with the site project), with search terms highlighted in red
  3. Real-time counters of items crawled from the three sites, and a Top-5 hot-search list, both backed by Redis
  4. word2vec-adjusted ElasticSearch scoring (function_score with script_score): for example, if you have previously searched for "Apple", documents containing word2vec-related keywords such as 苹果 (apple) or 乔布斯 (Jobs) will be ranked higher

For the full word2vec model training process, see the README of the Word2VecModel project. For how word2vec is used to influence ElasticSearch scoring, see the relevant code in mtianyanSearch.
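The idea of "related keywords" above boils down to cosine similarity between word vectors. The sketch below illustrates it with a hand-made toy vector table standing in for the trained gensim model (all words and vectors here are invented for illustration; in the real project the vectors come from the Word2VecModel repo):

```python
import math

# Toy stand-in for the trained word2vec model: a few hand-picked 3-d vectors.
# In the real project these would be loaded from the trained gensim model.
TOY_VECTORS = {
    "apple":  [0.9, 0.1, 0.0],
    "苹果":   [0.85, 0.15, 0.05],
    "乔布斯": [0.7, 0.3, 0.1],
    "redis":  [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def related_keywords(word, top_n=2):
    """Return the top_n words most similar to `word` (excluding itself)."""
    base = TOY_VECTORS[word]
    scored = [(other, cosine(base, vec))
              for other, vec in TOY_VECTORS.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:top_n]]
```

With these toy vectors, `related_keywords("apple")` returns `["苹果", "乔布斯"]`, which is exactly the keyword list that gets passed into the ElasticSearch scoring script as `params.title_keyword`.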

Core scoring script (Painless):

    double final_score = _score;
    int count = 0;
    int total = params.title_keyword.size();
    while (count < total) {
        String upper_score_title = params.title_keyword[count];
        if (doc['title_keyword'].value.contains(upper_score_title)) {
            final_score = final_score + _score;
        }
        count++;
    }
    return final_score;

Each related keyword the title contains adds another copy of the base _score (i.e. every match effectively doubles that document's contribution).
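A sketch of how this Painless script might be wired into a function_score query body. Only the `title_keyword` field and the script source come from the project; the `title` match field and the query arguments are placeholders for illustration:

```python
# The Painless script from the README, as a single-line source string.
SCRIPT_SOURCE = (
    "double final_score=_score;"
    "int count=0;"
    "int total = params.title_keyword.size();"
    "while(count < total) {"
    " String upper_score_title = params.title_keyword[count];"
    " if(doc['title_keyword'].value.contains(upper_score_title)){"
    "  final_score = final_score+_score;"
    " }"
    " count++;"
    "}"
    "return final_score;"
)

def build_scored_query(query_text, related_keywords):
    """Build a function_score query whose script_score boosts documents
    whose title contains any of the word2vec-related keywords."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": query_text}},
                "script_score": {
                    "script": {
                        "lang": "painless",
                        "source": SCRIPT_SOURCE,
                        "params": {"title_keyword": related_keywords},
                    }
                },
            }
        }
    }
```

The resulting dict can be posted to the index's `_search` endpoint (e.g. via elasticsearch-py's `es.search(index=..., body=...)`).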

Project demo screenshots:

How to get started?

  1. Install ElasticSearch 7.9.1 (optionally configure ElasticSearch-head)
  2. Install and configure the elasticsearch-analysis-ik plugin
  3. Install Redis
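After step 2 it is worth verifying that the ik analyzer is actually available (a missing plugin is the cause of the "analyzer [ik_max_word] not found" issue reported below). A minimal sketch that builds the request for ElasticSearch's standard _analyze API; sending it against your ES host (e.g. with the `requests` library) should return ik-segmented tokens:

```python
def ik_analyze_request(text, analyzer="ik_max_word"):
    """Return (method, path, body) for an _analyze call that exercises the
    ik plugin. If the plugin is missing, ES answers with a 400 error."""
    return "GET", "/_analyze", {"analyzer": analyzer, "text": text}
```

For example, `ik_analyze_request("个性化搜索")` yields the path `/_analyze` and the body `{"analyzer": "ik_max_word", "text": "个性化搜索"}`.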

Run locally

git clone https://github.com/mtianyan/FunpySpiderSearchEngine
# Edit the settings in config_template, then rename it to config.py
# Run sites/zhihu/es_zhihu.py

cd FunpySpiderSearchEngine
pip install -r requirements.txt
scrapy crawl zhihu
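The comment above refers to filling in config_template before renaming it to config.py. The actual variable names are defined by config_template in the repo; the sketch below is only an illustrative guess at what such a file typically holds for this stack (ES for storage, Redis for counters):

```python
# Hypothetical config.py sketch -- variable names are assumptions for
# illustration, not the actual contents of config_template.
ES_HOST = "127.0.0.1"
ES_PORT = 9200

REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```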

Run with Docker

docker network create search-spider
git clone https://github.com/mtianyan/mtianyanSearch.git
cd mtianyanSearch
docker-compose up -d
git clone https://github.com/mtianyan/FunpySpiderSearchEngine
cd FunpySpiderSearchEngine
docker-compose up -d

Open 127.0.0.1:8080 in your browser.

Sponsor

If this project's code helps you, treat me to a pack of spicy latiao!


Contributors

dependabot[bot], mtianyan


Issues

No module named 'FunpySpiderSearch.spiders'

This error occurred after installing scrapyd-client and running the command scrapyd-deploy default -p FunpySpiderSearch. Do I need to change a path somewhere?

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapyd/runner.py", line 40, in <module>
    main()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapyd/runner.py", line 37, in main
    execute()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/crawler.py", line 249, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/crawler.py", line 336, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FunpySpiderSearch.spiders'

Local run of scrapy crawl zhihu fails with an error

2021-03-24 18:26:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.zhihu.com/app/> from <GET https://www.zhihu.com/app?auto_download=true&utm_source=zhihu&utm_campaign=guest_feed&utm_content=guide>
2021-03-24 18:26:53 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://www.zhihu.com?utm_source=zhihu&utm_campaign=guest_feed&utm_content=guide> (referer: https://www.zhizhu.com)
2021-03-24 18:26:53 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.zhihu.com?utm_source=zhihu&utm_campaign=guest_feed&utm_content=guide>: HTTP status code is not handled or not allowed
2021-03-24 18:26:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/signin?next=http%3A%2F%2Fwww.zhihu.com%2Fquestion%2F28227721> (referer: https://www.zhizhu.com)
2021-03-24 18:26:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.zhihu.com/signin?next=http%3A%2F%2Fwww.zhihu.com%2Fquestion%2F28227721> (referer: https://www.zhizhu.com)
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
yield next(it)
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/utils/python.py", line 347, in __next__
return next(self.data)
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/utils/python.py", line 347, in __next__
return next(self.data)
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/python3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/home/search/FunpySpiderSearchEngine/mtianyanSpider/spiders/zhihu.py", line 113, in parse_question
item_loader.add_value("zhihu_id", question_id)
File "/usr/local/python3/lib/python3.7/site-packages/itemloaders/__init__.py", line 190, in add_value
self._add_value(field_name, value)
File "/usr/local/python3/lib/python3.7/site-packages/itemloaders/__init__.py", line 208, in _add_value
processed_value = self._process_input_value(field_name, value)
File "/usr/local/python3/lib/python3.7/site-packages/itemloaders/__init__.py", line 312, in _process_input_value
proc = self.get_input_processor(field_name)
File "/usr/local/python3/lib/python3.7/site-packages/itemloaders/__init__.py", line 293, in get_input_processor
self.default_input_processor
File "/usr/local/python3/lib/python3.7/site-packages/itemloaders/__init__.py", line 308, in _get_item_field_attr
field_meta = ItemAdapter(self.item).get_field_meta(field_name)
File "/usr/local/python3/lib/python3.7/site-packages/itemadapter/adapter.py", line 89, in get_field_meta
return MappingProxyType(self.item.fields[field_name])
KeyError: 'zhihu_id'

Cannot start ES: models.es_jobbole.py

I tried to run the ES pipeline, but found that ES had not created the index first, so I ran es_jobbole.py under the models directory:

python es_jobbole.py

which produced the following error:

=====

PUT http://192.168.1.129:9200/jobbole [status:400 request:0.007s]
Traceback (most recent call last):
File "es_jobbole.py", line 48, in <module>
ArticleType.init()
File "/usr/local/lib/python3.6/dist-packages/elasticsearch_dsl/document.py", line 147, in init
cls._doc_type.init(index, using)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch_dsl/document.py", line 94, in init
self.mapping.save(index or self.index, using=using or self.using)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch_dsl/mapping.py", line 79, in save
return index.save()
File "/usr/local/lib/python3.6/dist-packages/elasticsearch_dsl/index.py", line 195, in save
return self.create()
File "/usr/local/lib/python3.6/dist-packages/elasticsearch_dsl/index.py", line 179, in create
self.connection.indices.create(index=self._name, body=self.to_dict(), **kwargs)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch/client/indices.py", line 107, in create
params=params, body=body)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch/transport.py", line 318, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/http_urllib3.py", line 128, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/base.py", line 124, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'mapper_parsing_exception', 'analyzer [ik_max_word] not found for field [title]')

Search fails after deploying and running locally

Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\django\core\handlers\exception.py", line 47, in inner
response = get_response(request)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\django\core\handlers\base.py", line 179, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\django\views\generic\base.py", line 70, in view
return self.dispatch(request, *args, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\django\views\generic\base.py", line 98, in dispatch
return handler(request, *args, **kwargs)
File "F:\pythonbishe\searchspider\mtianyanSearch-master\search\views.py", line 102, in get
suggestions_question = s_question.execute()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch_dsl\search.py", line 698, in execute
**self._params
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch\client\utils.py", line 152, in wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch\client\__init__.py", line 1617, in search
body=body,
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch\transport.py", line 392, in perform_request
raise e
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch\transport.py", line 365, in perform_request
timeout=timeout,
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 269, in perform_request
self._raise_error(response.status, raw_data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\elasticsearch\connection\base.py", line 301, in _raise_error
status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'no mapping found for field [suggest]')

Cannot log in

Zhihu login initially failed with a missing grant_type error. After downgrading Chrome I could indeed log in, but after executing:
return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict, headers=self.headers)]
all_urls still does not contain the post-login question URLs; the pages fetched are still the pre-login ones.
