🔰 Search Engine

A search engine application built with a crawler, document ranking, and page ranking.

⚡ Usage

  1. Make sure Python 3.6+ and MySQL are installed on the computer/server
  2. Open the .env file and set its configuration correctly (database access, crawler configuration, etc.)
  3. Install the required Python libraries by running pip install -r requirements.txt
  4. Run the program using the commands below
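
Steps 1 and 2 can be checked before anything is started. The sketch below is only an illustration: it parses .env-style text and reports which settings are missing or empty. The key names in REQUIRED_KEYS are hypothetical, since the actual keys of the project's .env file are not listed here.

```python
import sys


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings, required):
    """Return the required keys that are absent or empty, in order."""
    return [key for key in required if not settings.get(key)]


def python_is_supported(version_info=sys.version_info):
    """The README asks for Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


# Hypothetical key names; replace them with the keys used in the real .env.
REQUIRED_KEYS = ["DB_HOST", "DB_USERNAME", "DB_PASSWORD", "DB_NAME"]

example = "# database access\nDB_HOST=localhost\nDB_USERNAME=root\nDB_PASSWORD=\n"
print(missing_keys(parse_env(example), REQUIRED_KEYS))
```

To check a real file, read it with open(".env", encoding="utf-8") and pass the text to parse_env.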

📦 Commands

General

  • python run_crawl.py to run the crawler
  • python run_page_rank.py to run PageRank
  • python run_tf_idf.py to run TF-IDF
  • python run_api.py to run the REST API
  • python run_search_engine_console.py to run the console-based search engine
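
The first three commands can be chained from one script. The sketch below assumes that crawling should finish before PageRank and TF-IDF run, and that each script exits on its own; neither assumption is confirmed by this README. The helper names are illustrative.

```python
import subprocess
import sys

# Assumed order: crawl first, then compute the two rankings.
PIPELINE_SCRIPTS = ["run_crawl.py", "run_page_rank.py", "run_tf_idf.py"]


def build_pipeline_commands(python=sys.executable, scripts=PIPELINE_SCRIPTS):
    """Return one argument list per script, ready for subprocess.run."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling run_pipeline(build_pipeline_commands()) from the repository root would run the three scripts with the current interpreter.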

Background Services

  • Use crawl.service in the services folder to run the crawler, PageRank, and TF-IDF in the background with systemd on a server

📓 Documentation Files

🔧 API Documentation

[GET] Get Similarity Overall Ranking
  • URL: /api/v1.0/overall_ranking/similarity?keyword=barcelona&sort=similarity&start=0&length=10

  • Params Detail: sort accepts similarity, tfidf, or pagerank. start and length are optional.

  • Method: GET

  • Response:

{
  "data": [
    {
      "id_page": 5,
      "pagerank_score": 0.05263157894736842,
      "similarity_score": 0.49220540497550275,
      "tfidf_total": 0.06595637671355718,
      "url": "https://20.detik.com/live"
    },
    {
      "id_page": 1,
      "pagerank_score": 0.05263157894736842,
      "similarity_score": 0.45263157894736844,
      "tfidf_total": 0.0,
      "url": "https://detik.com"
    }
  ],
  "message": "Sukses",
  "ok": true
}
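
A client for this endpoint could build the query string as sketched below, using only the standard library. The base URL http://localhost:5000 is an assumption (Flask's default development address), and the function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumption: default Flask dev server
SORT_OPTIONS = ("similarity", "tfidf", "pagerank")


def overall_ranking_url(keyword, sort="similarity", start=None, length=None,
                        base_url=BASE_URL):
    """Build the overall ranking URL; start and length are left out if None."""
    if sort not in SORT_OPTIONS:
        raise ValueError("sort must be one of %s" % ", ".join(SORT_OPTIONS))
    params = {"keyword": keyword, "sort": sort}
    if start is not None:
        params["start"] = start
    if length is not None:
        params["length"] = length
    return "%s/api/v1.0/overall_ranking/similarity?%s" % (base_url, urlencode(params))


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON body (requires a running API)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(overall_ranking_url("barcelona", sort="similarity", start=0, length=10))
```

With the API running, fetch_json(overall_ranking_url("barcelona")) would return a dictionary shaped like the response above.
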
[GET] Get TF-IDF Document Ranking
  • URL: /api/v1.0/document_ranking/tf_idf?keyword=barcelona&start=0&length=10

  • Params Detail: start and length are optional.

  • Method: GET

  • Response:

{
  "data": [
    {
      "id_tfidf": 18,
      "keyword": "klub barcelona",
      "tfidf_total": 0.9,
      "url": "https://detik.com/barcelona"
    },
    {
      "id_tfidf": 19,
      "keyword": "klub barcelona",
      "tfidf_total": 0.8,
      "url": "https://www.detik.com/?tagfrom=klub"
    }
  ],
  "message": "Sukses",
  "ok": true
}
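
Every response in this document wraps its result in the same envelope (data, message, ok). The sketch below unwraps that envelope and lists URLs by descending score. It assumes that a failed request returns ok set to false with an explanatory message, which this README does not show.

```python
def unwrap(payload):
    """Return the data list, or raise if the envelope reports a failure."""
    if not payload.get("ok"):
        # Assumption: failures carry ok=false and a human-readable message.
        raise RuntimeError(payload.get("message", "request failed"))
    return payload.get("data", [])


def top_urls(payload, score_field="tfidf_total", limit=10):
    """Return up to `limit` URLs ordered by the given score, highest first."""
    rows = sorted(unwrap(payload), key=lambda row: row[score_field], reverse=True)
    return [row["url"] for row in rows[:limit]]


sample = {
    "data": [
        {"id_tfidf": 19, "tfidf_total": 0.8, "url": "https://www.detik.com/?tagfrom=klub"},
        {"id_tfidf": 18, "tfidf_total": 0.9, "url": "https://detik.com/barcelona"},
    ],
    "message": "Sukses",
    "ok": True,
}
print(top_urls(sample))
```
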
[GET] Get Page-Rank Page Ranking
  • URL: /api/v1.0/page_ranking/page_rank?start=0&length=10

  • Params Detail: start and length are optional.

  • Method: GET

  • Response:

{
  "data": [
    {
      "id_pagerank": 7,
      "pagerank_score": 0.006093279237620995,
      "url": "https://news.detik.com"
    },
    {
      "id_pagerank": 15,
      "pagerank_score": 0.005689670500678926,
      "url": "https://news.detik.com/x"
    }
  ],
  "message": "Sukses",
  "ok": true
}
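
The start and length parameters suggest offset-based paging. The sketch below walks through all rows one page at a time, assuming start is a zero-based row offset and length is the maximum number of rows returned; both readings are inferred from the example URL rather than stated by the API. The base URL is also an assumption.

```python
from urllib.parse import urlencode


def page_rank_url(start, length, base_url="http://localhost:5000"):
    """Build the page rank URL for one page of results."""
    query = urlencode({"start": start, "length": length})
    return "%s/api/v1.0/page_ranking/page_rank?%s" % (base_url, query)


def iter_rows(fetch_page, page_size=10):
    """Yield rows page by page until a page comes back shorter than page_size.

    fetch_page(start, length) must return the list of rows for that window.
    """
    start = 0
    while True:
        rows = fetch_page(start, page_size)
        yield from rows
        if len(rows) < page_size:
            break
        start += page_size
```

In practice fetch_page would request page_rank_url(start, length) and return the data list from the JSON response.
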
[GET] Run Crawling
  • URL: /api/v1.0/crawling/crawl?duration=10

  • Method: GET

  • Response:

{
  "message": "Sukses",
  "ok": true
}
[GET] Get Crawled Pages
  • URL: /api/v1.0/crawling/pages?start=0&length=10

  • Params Detail: start and length are optional.

  • Method: GET

  • Response:

{
  "data": [
    {
      "content_text": "",
      "crawl_id": 1,
      "created_at": "2022-09-29 19:39:13",
      "description": "Indeks berita terkini dan terbaru hari ini dari peristiwa, kecelakaan, kriminal, hukum, berita unik, Politik, dan liputan khusus di Indonesia dan Internasional",
      "duration_crawl": "0:00:03",
      "hot_url": 0,
      "html5": 1,
      "id_information": 1,
      "keywords": "berita hari ini, berita terkini, berita terbaru, info berita, peristiwa, kecelakaan, kriminal, hukum, berita unik, Politik, liputan khusus, Indonesia, Internasional",
      "model_crawl": "BFS crawling",
      "size_bytes": 252595,
      "title": "detikcom - Informasi Berita Terkini dan Terbaru Hari Ini",
      "url": "https://detik.com"
    },
    {
      "content_text": "",
      "crawl_id": 1,
      "created_at": "2022-09-29 19:39:16",
      "description": "Indeks berita terkini dan terbaru hari ini dari peristiwa, kecelakaan, kriminal, hukum, berita unik, Politik, dan liputan khusus di Indonesia dan Internasional",
      "duration_crawl": "0:00:02",
      "hot_url": 0,
      "html5": 1,
      "id_information": 2,
      "keywords": "berita hari ini, berita terkini, berita terbaru, info berita, peristiwa, kecelakaan, kriminal, hukum, berita unik, Politik, liputan khusus, Indonesia, Internasional",
      "model_crawl": "BFS crawling",
      "size_bytes": 252607,
      "title": "detikcom - Informasi Berita Terkini dan Terbaru Hari Ini",
      "url": "https://www.detik.com/?tagfrom=framebar"
    }
  ],
  "message": "Sukses",
  "ok": true
}
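
The created_at and duration_crawl fields arrive as strings. A client that wants to compare or sum them could convert them as sketched below. The formats are inferred from the sample response only; durations of a day or more may be formatted differently and are not handled here.

```python
from datetime import datetime, timedelta


def parse_created_at(value):
    """Parse a timestamp such as '2022-09-29 19:39:13'."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def parse_duration(value):
    """Parse an H:MM:SS string such as '0:00:03' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def total_crawl_time(pages):
    """Sum duration_crawl over a list of page dictionaries."""
    return sum((parse_duration(page["duration_crawl"]) for page in pages), timedelta())
```
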
[POST] Get Page Information
  • URL: /api/v1.0/crawling/page_information

  • Method: POST

  • Request Payload:

{
  "id_pages": [1]
}
  • Response:
{
  "data": [
    {
      "content_text": "",
      "crawl_id": 2,
      "created_at": "2022-10-06 06:47:17",
      "description": "Indeks berita terkini dan terbaru hari ini dari peristiwa, kecelakaan, kriminal, hukum, berita unik, Politik, dan liputan khusus di Indonesia dan Internasional",
      "duration_crawl": "0:00:00",
      "hot_url": 0,
      "html5": 1,
      "id_page": 1,
      "keywords": "berita hari ini, berita terkini, berita terbaru, info berita, peristiwa, kecelakaan, kriminal, hukum, berita unik, Politik, liputan khusus, Indonesia, Internasional",
      "model_crawl": "BFS crawling",
      "size_bytes": 244796,
      "title": "detikcom - Informasi Berita Terkini dan Terbaru Hari Ini",
      "url": "https://detik.com"
    }
  ],
  "message": "Sukses",
  "ok": true
}
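
The POST endpoints take a JSON payload. The sketch below prepares such a request with the standard library. It assumes the API expects a Content-Type of application/json and listens on http://localhost:5000; neither detail is confirmed by this README.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5000"  # assumption: default Flask dev server


def json_post_request(path, payload, base_url=BASE_URL):
    """Build a POST request whose body is the JSON-encoded payload."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response (requires a running API)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = json_post_request("/api/v1.0/crawling/page_information", {"id_pages": [1]})
print(request.get_method(), request.full_url)
```
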
[POST] Start Insert Crawled Pages
  • URL: /api/v1.0/crawling/start_insert

  • Method: POST

  • Request Payload:

{
  "start_urls": "https://www.indosport.com https://detik.com https://www.curiouscuisiniere.com",
  "keyword": "",
  "duration_crawl": 28800
}
  • Response:
{
  "data": {
    "id_crawling": 6
  },
  "message": "Sukses",
  "ok": true
}
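
In the payload above, start_urls is a single string with the URLs separated by spaces. The helper below is a small illustration of building that payload from a list. It assumes duration_crawl is given in seconds, which the value 28800 (8 hours) suggests but the README does not state.

```python
def start_insert_payload(start_urls, duration_crawl, keyword=""):
    """Build the start_insert payload from a list of seed URLs."""
    for url in start_urls:
        # A URL containing whitespace would be split into several seeds.
        if not url or any(char.isspace() for char in url):
            raise ValueError("invalid start URL: %r" % (url,))
    return {
        "start_urls": " ".join(start_urls),
        "keyword": keyword,
        "duration_crawl": duration_crawl,  # assumed to be seconds
    }


payload = start_insert_payload(
    ["https://www.indosport.com", "https://detik.com"], duration_crawl=28800
)
print(payload["start_urls"])
```
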
[POST] Insert Crawled Page
  • URL: /api/v1.0/crawling/insert_page

  • Method: POST

  • Request Payload:

{
  "page_information": {
    "crawl_id": 3,
    "url": "https://www.indosport.com",
    "html5": 0,
    "title": "INDOSPORT - Berita Olahraga Terkini dan Sepak Bola Indonesia",
    "description": "INDOSPORT.com – Portal Berita Olahraga dan Sepakbola. Menyajikan berita bola terkini, hasil pertandingan, prediksi dan jadwal pertandingan, Liga 1, Liga Inggris, Liga Spanyol, Liga Italia, Liga Champions.",
    "keywords": "Jadwal Pertandingan, Hasil Pertandingan, Klasemen, Prediksi Pertandingan, Liga 1, Liga Inggris, Sepakbola, Liga Champions, Liga Spanyol, Liga Italia, Badminton, Bulutangkis, Link Live Streaming, MotoGP, Berita Sepakbola, Piala Dunia, Tempat Olahraga, Olahraga, Berita Bola, Esport, Basketball.",
    "content_text": "Jumat,19 Agustus 2022 21:05 WIB 3 Bintang Murah dengan Statistik Lebih Mentereng dari Casemiro yang Bisa Dilirik Man United Jumat,19 Agustus 2022 19:32 WIB 4 Kali Dipecat Termasuk saat Latih Timnas Indonesia,Mampukah Luis Milla Bawa Persib Berprestasi? Jumat,19 Agustus 2022 18:42 WIB Resmi Latih Persib,Ini 3 Prestasi Mentereng Luis 13:45 WIB Potret Kemenangan Dramatis PSM Makassar Atas RANS Nusantara di Liga 1 Liga Indonesia |  Minggu,24 Juli 2022 21:13 WIB Kemegahan dan Fasilitas Mewah Stadion JIS di Hari Launching       Tentang Indosport Redaksi Karir Pedoman Media Siber SOP Perlindungan Wartawan Iklan & Kerjasama RSS Copyright © 2012 - 2022 INDOSPORT. All rights reserved",
    "hot_url": 0,
    "size_bytes": 121345,
    "model_crawl": "BFS Crawling",
    "duration_crawl": 28800
  },
  "page_forms": [
    {
      "url": "https://www.indosport.com",
      "form": "<form action='https://www.indosport.com/search' method='get'></form>"
    },
    {
      "url": "https://www.indosport.com",
      "form": "<form action='https://www.indosport.com/searchv2' method='post'></form>"
    }
  ],
  "page_images": [
    {
      "url": "https://www.indosport.com",
      "image": "<img alt='' height='1' src='https://certify.alexametrics.com/atrk.gif?account=/HVtm1akKd607i' style='display:none' width='1'/>"
    },
    {
      "url": "https://www.indosport.com",
      "image": "<img alt='' height='1' src='https://sb.scorecardresearch.com/blabla.jpeg' style='display:none' width='1'/>"
    }
  ],
  "page_linking": [
    {
      "crawl_id": 3,
      "url": "https://www.indosport.com",
      "outgoing_link": "https://www.indosport.com/sepakbola"
    },
    {
      "crawl_id": 1,
      "url": "https://www.indosport.com",
      "outgoing_link": "https://www.indosport.com/liga-spanyol"
    }
  ],
  "page_list": [
    {
      "url": "https://www.indosport.com",
      "list": "<li class='bc_home'><a href='https://www.indosport.com'><i class='sprite sprite-mobile sprite-icon_home icon-sidebar'></i></li>"
    },
    {
      "url": "https://www.indosport.com",
      "list": "<li class='bc_home'><a href='https://www.indosport.com'><i class='sprite sprite-mobile sprite-icon_home icon-sidebar'></i></li>"
    }
  ],
  "page_scripts": [
    {
      "url": "https://www.indosport.com",
      "script": "<script type='text/javascript'>window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;</script>"
    },
    {
      "url": "https://www.indosport.com",
      "script": "<script type='text/javascript'>window.ga=window.bc||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;</script>"
    }
  ],
  "page_styles": [
    {
      "url": "https://www.indosport.com",
      "style": "<style>.bn_skin{z-index: 2 !important;}</style>"
    },
    {
      "url": "https://www.indosport.com",
      "style": "<style>.bn_skin{z-index: 115 !important;}</style>"
    }
  ],
  "page_tables": [
    {
      "url": "https://www.indosport.com",
      "table_str": "<table class='table'><thead><tr><th class='waktu'>Waktu</th><th class='pertandingan'>Pertandingan</th><th class='tv'>Live TV</th></tr></thead><tbody></tr></tbody></table>"
    },
    {
      "url": "https://www.indosport.com",
      "table_str": "<table class='table'><thead><tr><th class='waktu'>Waktu</th><th class='pertandingan'>Pertandingan</th><th class='tv'>Live TV</th></tr></thead><tbody></tr></tbody></table>"
    }
  ]
}
  • Response:
{
  "message": "Sukses",
  "ok": true
}
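
The page_linking array pairs a page URL with each of its outgoing links. The sketch below shows one way an external crawler could produce those entries from raw HTML using the standard library's html.parser. It is not the crawler shipped in this repository, and the class and function names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute, de-duplicated href values from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        if absolute not in self.links:
            self.links.append(absolute)


def page_linking_entries(crawl_id, url, html):
    """Return page_linking entries in the shape used by insert_page."""
    collector = LinkCollector(url)
    collector.feed(html)
    return [
        {"crawl_id": crawl_id, "url": url, "outgoing_link": link}
        for link in collector.links
    ]


html = '<a href="/sepakbola">Sepakbola</a> <a href="/liga-spanyol">Liga Spanyol</a>'
for entry in page_linking_entries(3, "https://www.indosport.com", html):
    print(entry["outgoing_link"])
```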

📁 Directory & File Structure

.
├── docs                                          # Documentation files such as diagrams, product backlog, etc.
├── html                                          # Class and function documentation generated with the pdoc3 library
├── services                                      # Background service configurations used with systemd/systemctl
├── src                                           # Search engine source code
│   ├── api                                       # REST API code
│   │   ├── app.py                                # Runs Flask and combines the routes
│   │   ├── crawling.py                           # API routes and functions for crawling
│   │   ├── document_ranking.py                   # API routes and functions for document ranking
│   │   ├── overall_ranking.py                    # API routes and functions for overall ranking
│   │   └── page_ranking.py                       # API routes and functions for page ranking
│   │
│   ├── crawling                                  # Crawling code
│   │   ├── methods                               # Crawling methods
│   │   │   ├── breadth_first_search.py           # Crawling functions for the BFS method
│   │   │   └── modified_similarity_based.py      # Crawling functions for the MSB method
│   │   ├── crawl.py                              # Runs crawling by combining the available methods
│   │   ├── page_content.py                       # Functions that connect to the database and HTML pages
│   │   └── util.py                               # Helper functions for crawling
│   │
│   ├── database                                  # Database code
│   │   └── database.py                           # Database operations such as connections, queries, etc.
│   │
│   ├── document_ranking                          # Document ranking code
│   │   └── tf_idf.py                             # TF-IDF implementation
│   │
│   ├── overall_ranking                           # Overall ranking code
│   │   └── similarity.py                         # Similarity score implementation
│   │
│   └── page_ranking                              # Page ranking code
│       └── page_rank.py                          # Google PageRank implementation
│
├── .env                                          # Database credentials and crawler configuration
├── run_api.py                                    # Main script to run the REST API
├── run_crawl.py                                  # Main script to run crawling
├── run_page_rank.py                              # Main script to run PageRank
├── run_search_engine_console.py                  # Main script to run the console search engine
├── run_tf_idf.py                                 # Main script to run TF-IDF
└── requirements.txt                              # List of required libraries

Contributors

lazuardyk, zaidanprtm
