Motors Crawlers

Documentation for setting up this project and the folder structure of the crawling mechanism

  1. Setup steps:
    • Install all dependencies: run npm install
    • Install the AntiCaptcha extension in Chromium; guidance can be found at antcpt.com
    • Set the path to the installed extension's manifest.json in the Chrome.js class
    • Find config.js in the js folder of the installed extension and change the account key value to your API key, so Chromium can load it on start
    • Set the database parameters for your MySQL server in ./Models/index.js
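The last step can be sketched roughly as follows. This is a hypothetical illustration of the MySQL settings configured in ./Models/index.js; the actual file may structure them differently (for example, through an ORM), and all values shown are placeholders:

```javascript
// Hypothetical sketch of the MySQL connection parameters set in ./Models/index.js.
// All values are placeholders; replace them with your own server's settings.
const dbConfig = {
  host: 'localhost',     // your MySQL host
  user: 'root',          // your MySQL user
  password: 'secret',    // your MySQL password
  database: 'crawlers',  // the database used by the crawlers
};

module.exports = dbConfig;
```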

Folder structure of project

Directory tree of the project:

├── common
├── Models
├── Scrapers
│   └── mobile-de
├── Server
│   ├── DB
│   └── Engine
└── Workers

All scraping logic is contained in three folders: Scrapers, Server and Workers.

Scrapers folder:

Scrapers
└── mobile-de
    ├── detail.js
    ├── listing.js
    └── mobilede.js

This folder contains a subfolder for each website you want to scrape.
NOTE: When you create a new crawler, its folder MUST have the same name as the crawler record in the database; this is what allows crawlers to be run dynamically from only two workers.
NOTE: A new crawler's folder MUST contain three files: detail.js for the detail crawler, listing.js for the listing crawler, and a file named after the spider that holds the super class (e.g. mobilede.js).
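The three-file layout described above can be sketched like this. The class and method names below are purely illustrative (they are not the project's actual API); the sketch only shows how listing.js and detail.js would extend the spider super class:

```javascript
// Hypothetical sketch of a new crawler folder, e.g. Scrapers/example-site/.
// All names here are illustrative, not the project's actual code.

// examplesite.js — the "spider" super class shared by both crawlers
class ExampleSiteSpider {
  constructor(engine) {
    this.engine = engine;                 // e.g. 'Request' or 'Chrome'
    this.baseUrl = 'https://example.com'; // placeholder site root
  }
}

// listing.js — collects detail-page URLs from listing pages
class ListingCrawler extends ExampleSiteSpider {
  extractLinks(html) {
    // naive href extraction, for illustration only
    return [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1]);
  }
}

// detail.js — parses a single detail page into a record
class DetailCrawler extends ExampleSiteSpider {
  parse(html) {
    // placeholder parse: a real crawler would extract structured fields
    return { url: this.baseUrl, rawLength: html.length };
  }
}

module.exports = { ExampleSiteSpider, ListingCrawler, DetailCrawler };
```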

Server folder:

Server
├── DB
│   └── DB.js
└── Engine
    ├── Chrome.js
    ├── Engine.js
    ├── Request.js
    └── UserAgents.js

This folder is in charge of database communication and request-sending logic.
DB.js contains the queries used for database communication.
Engine.js contains the Engine class, which controls how requests are sent: either as a simple HTTP request or through Chromium.
Request.js contains the Request class for sending simple HTTP requests.
Chrome.js drives a Chromium browser for scraping JavaScript-heavy websites.
UserAgents.js contains the 10 most common User-Agent strings.
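A module like UserAgents.js might look roughly like the sketch below: a small pool of User-Agent strings plus a helper that picks one at random per request. The exact strings and the helper are assumptions, not the project's actual contents:

```javascript
// Hypothetical sketch of UserAgents.js: a pool of common User-Agent strings
// and a helper to rotate through them. Strings shown are examples only.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  // ...the real file would list 10 in total
];

// Pick a random User-Agent for the next request
function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

module.exports = { USER_AGENTS, randomUserAgent };
```

Rotating User-Agents this way makes repeated requests look less uniform to the target site.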

Workers folder:

Workers
├── detail.worker.js
└── listing.worker.js

This folder holds two workers, one for each type of crawler.
detail.worker.js controls the detail crawlers.
listing.worker.js controls the listing crawlers.

NOTE: Uncomment the commented-out lines of code once you have properly configured the .env file and want to use environment variables. Likewise, uncomment the lines containing the Slack API implementation once you have set up a Slack webhook in the .env file.

Run a crawler with:

node run.js <SPIDER> <TYPE> <ENGINE>

Example: node run.js mobile-de detail Chrome
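The three positional arguments could be validated along these lines. This is a hedged sketch of what run.js's argument handling might look like, not its actual implementation:

```javascript
// Hypothetical sketch of run.js argument parsing: <SPIDER> <TYPE> <ENGINE>.
// The real run.js may handle arguments differently.
function parseArgs(argv) {
  // process.argv is ['node', 'run.js', spider, type, engine]
  const [spider, type, engine] = argv.slice(2);
  if (!spider || !type || !engine) {
    throw new Error('Usage: node run.js <SPIDER> <TYPE> <ENGINE>');
  }
  return { spider, type, engine };
}

module.exports = { parseArgs };
```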

Contributors

obrad13, stefan-jevtic
