csdn_searchengine's Introduction

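CSDN crawler + search engine

1. Project overview

This project crawls CSDN blogs, uses Whoosh to build an inverted index and rank results, and uses Django as the backend for a small CSDN search engine. It also supports keyword highlighting and related searches.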
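To illustrate the highlighting feature, here is a minimal standard-library sketch that wraps each query term found in a result snippet in an HTML tag. It is an assumption about how highlighting could work, not the project's code: the function name `highlight` and the `<em>` tag are hypothetical, and Whoosh has highlighting support of its own that the project may use instead.

```python
import html
import re


def highlight(text, query, tag="em"):
    """Wrap each case-insensitive occurrence of a query term in <tag>.

    Hypothetical helper for illustration; text outside matches is HTML-escaped.
    """
    terms = [t for t in query.split() if t]
    if not terms:
        return html.escape(text)
    # Try longer terms first so a longer term wins over its own prefix.
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)),
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[last:m.start()]))
        parts.append("<{0}>{1}</{0}>".format(tag, html.escape(m.group(0))))
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(highlight("Django search with Whoosh", "whoosh django"))
```

The snippet is escaped before the tags are added so that blog text containing `<` or `&` cannot break the results page.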

2. Demo

Home page

Results page

See 展示.pdf for the full demo.

3. Environment

Python 3.6 + Django 2.1 + several Python libraries

4. Configuration

(1) Django settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'csdnBlogs',
        'USER': 'root',
        'PASSWORD': 'password',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    }
}

(2) Crawler: \search_engine\csdn_crawler.py

from selenium import webdriver

# Pass the absolute path to chromedriver, or add chromedriver to the system PATH
driver = webdriver.Chrome(executable_path='/home/chromedriver')

Download chromedriver and set its absolute path in the call above.

username = driver.find_element_by_id('username')
password = driver.find_element_by_id('password')
username.send_keys("username")  # enter your CSDN account name
password.send_keys("password")  # enter your CSDN password

(3) Database connection settings

# Database settings
host = ''
user = ''
password = ''
dbname = ''  # database name
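As an alternative to hardcoding credentials, the four values above could be read from environment variables. This is only a sketch: the `CSDN_DB_*` variable names and the `load_db_settings` helper are hypothetical, and the defaults are borrowed from the Django `DATABASES` block shown earlier.

```python
import os


def load_db_settings(env=os.environ):
    """Build the four database settings from environment variables.

    The CSDN_DB_* names are assumptions made for this example.
    """
    return {
        "host": env.get("CSDN_DB_HOST", "127.0.0.1"),
        "user": env.get("CSDN_DB_USER", "root"),
        "password": env.get("CSDN_DB_PASSWORD", ""),
        "dbname": env.get("CSDN_DB_NAME", "csdnBlogs"),
    }


settings = load_db_settings()
host = settings["host"]
user = settings["user"]
password = settings["password"]
dbname = settings["dbname"]  # database name
```

Code that already reads `host`, `user`, `password`, and `dbname` from the settings module would keep working unchanged with this layout.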

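5. Files

csdn_crawler.py: crawler module, including the simulated login. Saves the crawled data to the database.

searcher.py: builds the inverted index and ranking with Whoosh. This is the search engine itself.

DBsettings.py: database configuration file.

views.py: backend request handling.

word2vec.py: reads the text and trains a model to power related searches. The results are not yet satisfactory. The model must be saved in /data/word2Vec/model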
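For readers unfamiliar with the idea behind searcher.py, the following standard-library sketch shows what an inverted index with ranking does. It is not Whoosh's API and not the project's code: the function names and sample documents are made up, TF-IDF stands in for Whoosh's default BM25F scoring, and the ASCII-only tokenizer is an assumption (real CSDN posts would need a Chinese word segmenter).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased ASCII word tokens only; a simplification for this example.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return doc ids ranked by summed TF-IDF over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up sample documents standing in for crawled blog posts.
docs = {
    1: "Django tutorial: build a blog with Django",
    2: "Whoosh full-text search in Python",
    3: "Python crawler with Selenium",
}
index = build_index(docs)
print(search(index, len(docs), "python search"))
```

Looking a term up in the index touches only the documents that contain it, which is why the index is built once after crawling and then reused for every query.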

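6. Notes

(1) Workflow: crawl the data first, then build the index and ranking, and finally build the frontend and backend.

(2) This is a beginner project. Feedback and corrections are welcome!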
