
maradmin-search's Introduction

MARADMIN WEB CRAWLER TO PROVIDE A BETTER SEARCH FUNCTION

The steps below document how to set the project up from scratch. Further below is how to just clone it and run it. The source website is https://www.marines.mil/News/Messages/MARADMINS/

Methodology of building the application

  1. Scrapy
  2. Django API backend
  3. Celery worker to automate scraping and inserting new MARADMINs into the database
  4. React frontend

VIRTUAL ENV

  1. Use Bash, either on Windows or natively on Unix machines
  2. python -m venv mybash
  3. Activate the virtual environment in Bash.
    1. On Unix or WSL: source mybash/bin/activate
    2. In Git Bash with a Windows build of Python: source mybash/Scripts/activate
  4. python -m pip install --upgrade pip
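
The environment steps above can also be scripted with Python's standard-library venv module. The sketch below is only an illustration: the helper names create_env and activate_script are made up for this example, and the folder name mybash is taken from step 2. It also shows why the activate path differs between platforms.

```python
import os
import venv
from pathlib import Path


def activate_script(env_dir, os_name=os.name):
    """Return the path of the Bash activate script inside a virtual environment."""
    # Windows builds of Python place scripts in "Scripts";
    # POSIX systems (including WSL) place them in "bin".
    subdir = "Scripts" if os_name == "nt" else "bin"
    return Path(env_dir) / subdir / "activate"


def create_env(env_dir="mybash"):
    """Programmatic equivalent of `python -m venv mybash`."""
    venv.create(env_dir, with_pip=True)
    return activate_script(env_dir)


# Example (creates the folder, so it is left commented out here):
# print("source", create_env().as_posix())
```

After sourcing the printed script, python -m pip install --upgrade pip runs inside the new environment as in step 4.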

These are the packages installed. They will all roll up into a requirements.txt later.

  1. pip install Scrapy

SCRAPY

  1. Create a new project scrapy startproject maradmin_scrapy_project
  2. Add CSV export to the settings
  3. Create a spider called maradminspider under the spiders folder
  4. Trial 1: it scrapes only the basic information, not the body; the initial scrape covers 50 pages
  5. Crawl with the command scrapy crawl maradminspider from the root directory of maradmin_scrapy_project
  6. The scraped data is inserted straight into the database via a Django model
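
For step 2, one way to enable CSV export is Scrapy's FEEDS setting. The fragment below is a sketch of what could go into maradmin_scrapy_project/settings.py, not the project's actual configuration: the output file name and the field names are placeholders, since the README does not list them.

```python
# maradmin_scrapy_project/settings.py (fragment)

# Write every scraped item to a CSV file.
# FEEDS is available in Scrapy 2.1 and later; older releases
# used the FEED_FORMAT and FEED_URI settings instead.
FEEDS = {
    "maradmins.csv": {  # hypothetical output file name
        "format": "csv",
        "encoding": "utf8",
    },
}

# Column order in the CSV. These field names are assumptions;
# use the fields the spider actually yields.
FEED_EXPORT_FIELDS = ["number", "title", "date", "url"]
```

The same export can be requested for a single run with scrapy crawl maradminspider -o maradmins.csv.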

DJANGO

  1. pip install django
  2. django-admin startproject backend (django-admin.exe on Windows)
  3. Update settings.py
  4. python manage.py migrate
  5. python manage.py runserver - to test if it runs
  6. python manage.py startapp search_api
  7. Add search_api to INSTALLED_APPS in settings.py
  8. Create the models based on the scraped data
  9. python manage.py makemigrations search_api
  10. python manage.py migrate search_api
  11. Set up the admin, for testing purposes only
    1. python manage.py createsuperuser
  12. Set up URL routing from the project to the app
  13. Create a bulk insert manager and run it with a Django management command
  14. python manage.py maradmin_uploader
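
The idea behind step 13, inserting many records at once while skipping those already stored, can be illustrated without Django. The sketch below uses the standard-library sqlite3 module rather than the project's manager; the table layout and the sample rows are assumptions made for the example. In Django the comparable call is Model.objects.bulk_create(objs, ignore_conflicts=True).

```python
import sqlite3

# Made-up rows standing in for scraped MARADMIN metadata.
ROWS = [
    ("001/24", "EXAMPLE TITLE ONE", "2024-01-02"),
    ("002/24", "EXAMPLE TITLE TWO", "2024-01-03"),
    ("001/24", "EXAMPLE TITLE ONE", "2024-01-02"),  # duplicate on purpose
]


def bulk_insert(conn, rows):
    """Insert all rows in one batch and return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS maradmin ("
        "number TEXT PRIMARY KEY, title TEXT NOT NULL, published TEXT NOT NULL)"
    )
    before = conn.execute("SELECT COUNT(*) FROM maradmin").fetchone()[0]
    with conn:  # one transaction for the whole batch
        # OR IGNORE skips rows whose primary key already exists.
        conn.executemany("INSERT OR IGNORE INTO maradmin VALUES (?, ?, ?)", rows)
    after = conn.execute("SELECT COUNT(*) FROM maradmin").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
inserted = bulk_insert(conn, ROWS)
print(f"{inserted} new rows")
```

Running the uploader twice is therefore safe: the second run finds nothing new to insert.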

Django Rest Framework

  1. pip install djangorestframework
  2. pip install django-filter
  3. pip install markdown
  4. Add rest_framework to INSTALLED_APPS in settings.py
  5. Create serializer.py
  6. Create a view to display the serialized objects
  7. Update the URLs to expose the view through a SimpleRouter

Pagination and SearchFilter

  1. Update settings and view
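
To make the effect of pagination and search filtering concrete, here is a plain-Python sketch of the behaviour, not Django Rest Framework code. It imitates two things: SearchFilter's default case-insensitive partial matching, and the count/next/previous/results shape of a PageNumberPagination response. In this sketch next and previous are page numbers, whereas DRF returns URLs. The records are invented.

```python
import math

# Invented records standing in for rows served by the API.
RECORDS = [
    {"number": "001/24", "title": "ANNUAL TRAINING UPDATE"},
    {"number": "002/24", "title": "UNIFORM BOARD RESULTS"},
    {"number": "003/24", "title": "TRAINING SCHEDULE CHANGE"},
    {"number": "004/24", "title": "ANNUAL LEAVE POLICY"},
    {"number": "005/24", "title": "TRAINING SAFETY BRIEF"},
]


def search(records, query, fields):
    """Keep records in which every search term appears in at least one field."""
    terms = query.lower().split()
    return [
        r for r in records
        if all(any(t in str(r[f]).lower() for f in fields) for t in terms)
    ]


def paginate(items, page=1, page_size=2):
    """Return one page of items plus navigation metadata."""
    pages = max(1, math.ceil(len(items) / page_size))
    if not 1 <= page <= pages:
        raise ValueError(f"page {page} is outside 1..{pages}")
    start = (page - 1) * page_size
    return {
        "count": len(items),
        "next": page + 1 if page < pages else None,
        "previous": page - 1 if page > 1 else None,
        "results": items[start:start + page_size],
    }


hits = search(RECORDS, "training", ["number", "title"])
first_page = paginate(hits, page=1, page_size=2)
```

Here first_page holds the first two of three matching records and points to page 2 as the next page.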

Integrated Scrapy into Django

  1. pip install scrapy-djangoitem
  2. Move maradmin_scrapy_project to the same level as search_api
  3. Update management/commands/maradmin_uploader.py
  4. Update the maradmin_scrapy_project files (add apps.py, then adjust items, pipelines, and settings) to integrate the Django models
  5. Run scrapy crawl maradminspider inside maradmin_scrapy_project to scrape and save into the Django database
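
For Scrapy to import the Django models, the Django project root has to be on the import path and DJANGO_SETTINGS_MODULE has to be set before django.setup() is called. The helper below sketches that wiring with the standard library only. The function name is hypothetical, and the folder layout (the Scrapy project sitting next to search_api, settings in backend.settings) is inferred from the steps above.

```python
import os
import sys
from pathlib import Path


def configure_django_env(scrapy_settings_file, settings_module="backend.settings"):
    """Make the Django project importable from the Scrapy settings module."""
    # Assumed layout:
    #   <root>/manage.py
    #   <root>/search_api/
    #   <root>/maradmin_scrapy_project/maradmin_scrapy_project/settings.py
    project_root = Path(scrapy_settings_file).resolve().parents[2]
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))
    # setdefault keeps any value already exported in the shell.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return project_root


# In the Scrapy settings.py this would be followed by:
#   configure_django_env(__file__)
#   import django
#   django.setup()
```

Once Django is set up this way, the items and pipelines can use the search_api models directly.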

Testing

  1. pip install coverage
  2. Make a tests folder
  3. Add test_models and test_view
  4. coverage run manage.py test
  5. coverage html
  6. coverage report

Celery Worker - IW (PAUSED) - the workaround is to set up a cron job later to automate it. The Twisted error wins for now...

  1. This automates the scraping and uploading into the database so that it stays up to date
  2. pip install celery for the worker and pip install redis for the broker
  3. Create celery.py inside backend/backend
  4. Add the Celery settings at the bottom of settings.py
  5. Add celery to backend/backend/__init__.py to ensure it is loaded every time Django starts up
  6. Create tasks.py with a simple task inside the search_api directory
  7. pip install crochet to handle Twisted errors. See the Stack Overflow reference below on ReactorNotRestartable
  8. Pause the Celery worker
  9. Install redis-server
    1. sudo apt-get install redis-server
    2. sudo service redis-server restart
  10. Run redis server on separate terminal
    1. redis-server
  11. Run celery worker on separate terminal
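    1. celery -A backend worker -l info (Celery 5 and later require -A before the subcommand; Celery 4 also accepted celery worker -A backend -l info)
    2. To test the scheduler: celery -A backend beat -l info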
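
ReactorNotRestartable appears because a Twisted reactor cannot be started twice in one process, which is what happens when a long-lived worker runs a crawl more than once. A common alternative to crochet, and what the cron workaround effectively does, is to launch every crawl in a fresh process. The sketch below shows that approach; the function names are hypothetical and the spider name comes from the Scrapy section.

```python
import subprocess


def crawl_command(spider="maradminspider"):
    """Build the same command that is typed by hand in the Scrapy section."""
    return ["scrapy", "crawl", spider]


def run_crawl(project_dir, spider="maradminspider", runner=subprocess.run):
    """Run the crawl in its own process and return the exit code.

    A new process gets a new Twisted reactor, so repeated calls
    never try to restart one that has already stopped.
    """
    completed = runner(crawl_command(spider), cwd=project_dir, check=False)
    return completed.returncode


# Example (requires Scrapy and the project folder, so it is commented out):
# exit_code = run_crawl("maradmin_scrapy_project")
```

A scheduled job or a Celery task could call run_crawl on whatever interval keeps the database current.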

React

  1. yarn create react-app frontend --template typescript
  2. yarn add antd
  3. yarn start

Redis

  1. redis-server
  2. redis-cli
  3. pip install django-redis
  4. redis-cli monitor
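
django-redis is a cache backend, so installing it is normally followed by a CACHES entry in settings.py. The fragment below follows the form shown in the django-redis documentation; whether this project uses it exactly this way is an assumption, as are the host, port, and database number.

```python
# backend/backend/settings.py (fragment)

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        # Assumed local Redis instance, database 1.
        "LOCATION": "redis://127.0.0.1:6379/1",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}
```

With the server running, redis-cli monitor shows the cache commands as Django issues them.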

References

  1. Scrapy
  2. Django Girls
  3. Django Rest Framework
  4. Scraping with Scrapy and Django Integration
  5. Django Celery Scrapy Error ReactorNotRestartable
  6. Carbalert
  7. React TypeScript Cheatsheet
  8. Yarn
  9. Redis

Contributors

phansiri, lit-dds, dependabot[bot]
