Search-Engine

Crawler-based search engine that contains the main modules of a search engine (crawler, indexer, and ranker)
with features like image search, voice recognition and personalized search written in java

Database Design

The database is created using MySQL with IndexerDbAdapter.java
The database of the search engine server has the following tables

table name
tb1_urls
tb2_words
tb3_links
tb4_images
tb5_queries
tb6_user_urls

The first table is used to store information about each URL, and has the following schema

column	type
_id	int primary key auto_increment
url	varchar(256)
content	meduimtext
title	varchar(512)
carwled_at	datetime
indexed	tinyint
page_rank	double
date_score	double
geographic_score	double

The second table is used to store the keywords in each URL with word score
The score of the word is the (sum (each HTML-tag score * number of occurrences of this tag) ) / number of total words
The table has the following schema

column	type
_id	int primary key auto_increment
word	varchar(100)
url	varchar(256) foreign key
score	double

The third table is used to store links between URLs to use it in PageRank.java
It has the following schema

column	type
_id	int primary key auto_increment
src_url	varchar(256) foreign key
dst_url	varchar(256) foreign key

The fourth table is used to store image links in each URL, and has the following schema

column	type
_id	int primary key auto_increment
url	varchar(256) foreign key
image	varchar(512) foreign key

The fifth table is used to store users' queries that is used to make suggestions when a user enter some text in search bar

column	type
_id	int primary key auto_increment
query	text unique

The sixth table is used to store the URLs that the user has clicked on and its frequency,
to favor its base-link in Personalized search feature (assuming one user)

column	type
_id	int primary key auto_increment
url	varchar(256) unique
freq	int

System Design

Web Crawler

It has some URLs in its seed, it does the following steps

add the URLs from the seed in the queue with other un-crawled URLs fetched from the database
fetch URL from the queue
parse its content and get other URLs in it
add the source and destination URLs in tb1_urls
add the destination URL to the queue
add the link between source and destination URLs in tb3_links

Indexer

It does the following steps

fetch un-indexed URL from the database
parse the document and put its content in tb1_urls to be used in phrase search, and store other URL-related information
remove stopping words form the document, apply stemming on them and calculate their score, then store them in tb2_words
store image links in tb4_images

Query Processor

It does the following steps

check if the query has double quotes that surrounds it, if so, it does phrase search, otherwise it does normal search
the words in the query are split, the stopping words are removed, and stemming is applied
the database is queried to get the results and rank the URLs according to sum(score * IDF) where score is the word score calculated in the ranker, and IDF is the inverse document frequency
The content of the returned URLs is searched for any word that match the original query, and the content is clipped to start with that word, limited by 200 character
The result of the 10 returned URLs is sorted by clicking frequency of base-URL

5l1v3r1 / search-engine-1 Goto Github PK

search-engine-1's Introduction

Search-Engine

Database Design

System Design

Web Crawler

Indexer

Query Processor

search-engine-1's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent