cmlabs-backend-crawler-freelance-test's Introduction

Go Web Crawler API

This is a simple API written in Go that crawls websites and saves the results as HTML files. It uses chromedp, a Go library that drives a headless Chrome browser, to render JavaScript-heavy websites such as Single-Page Applications (SPA), Server-Side Rendered (SSR) websites, and Progressive Web Apps (PWA).

Features

  1. Crawling Single or Multiple Websites: The API allows you to crawl a single website or multiple websites simultaneously by providing the URLs as a comma-separated list.
  2. Custom User-Agent: You can specify a custom User-Agent header in the request to mimic different browsers or devices.

Installation

  1. Clone the repository to your local machine:
git clone https://github.com/PutraFajarF/cmlabs-backend-crawler-freelance-test.git

cd cmlabs-backend-crawler-freelance-test
  2. Install the dependencies:
go get github.com/gorilla/mux
go get github.com/chromedp/chromedp
go get github.com/chromedp/cdproto/network

Usage

  1. Start the server:
go run .
  2. Perform a crawling request using your web browser or a tool like Postman:
GET http://localhost:8080/crawl?url=https://example.com

Replace https://example.com with the URL you want to crawl. You can also specify multiple URLs as a comma-separated list.

Optional parameters:

  • user_agent: Set a custom User-Agent header (Default: Chrome on Windows).
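
The query parameters described above could be read on the server roughly as follows. This is a minimal sketch, not the project's actual handler code: the function name parseCrawlQuery, the error message, and the defaultUserAgent value are assumptions made for illustration (the README only states that the default mimics Chrome on Windows).

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// defaultUserAgent is a placeholder for the API's default (Chrome on Windows).
// The exact string the server uses is an assumption.
const defaultUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

// parseCrawlQuery extracts the target URLs and the User-Agent from a raw
// query string such as "url=https://a.example,https://b.example".
func parseCrawlQuery(rawQuery string) ([]string, string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, "", err
	}

	// Split the comma-separated list and drop empty entries.
	var targets []string
	for _, part := range strings.Split(values.Get("url"), ",") {
		if part = strings.TrimSpace(part); part != "" {
			targets = append(targets, part)
		}
	}
	if len(targets) == 0 {
		return nil, "", errors.New("missing url parameter")
	}

	// Fall back to the default when user_agent is not supplied.
	userAgent := values.Get("user_agent")
	if userAgent == "" {
		userAgent = defaultUserAgent
	}
	return targets, userAgent, nil
}

func main() {
	targets, ua, err := parseCrawlQuery("url=https://cmlabs.co,https://sequence.day")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(targets)
	fmt.Println(ua)
}
```

Note that url.ParseQuery in Go 1.17 and later reports an error for raw semicolons in a query string, so a User-Agent value containing ";" should be percent-encoded by the client.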

Example

To crawl a single website:

GET http://localhost:8080/crawl?url=https://cmlabs.co

To crawl multiple websites:

GET http://localhost:8080/crawl?url=https://cmlabs.co,https://sequence.day

Using a User-Agent of Google Chrome on Windows:

GET http://localhost:8080/crawl?url=https://example.com&user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36

Using a User-Agent of Apple Safari on macOS:

GET http://localhost:8080/crawl?url=https://example.com&user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36

Using a User-Agent of Mozilla Firefox on Windows:

GET http://localhost:8080/crawl?url=https://example.com&user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0

Using a User-Agent of Apple iPhone:

GET http://localhost:8080/crawl?url=https://example.com&user_agent=Mozilla/5.0 (iPhone; CPU iPhone OS 14_7 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1
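
The User-Agent strings in the examples above contain spaces, semicolons, and parentheses. Browsers and Postman usually encode these automatically, but a programmatic client should percent-encode them itself. The sketch below shows one way to build such a request URL in Go. The helper name buildCrawlURL is hypothetical, and it assumes the server decodes the query string with Go's standard parsing before splitting the url value on commas.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildCrawlURL returns a request URL for the crawl endpoint with its query
// parameters percent-encoded. An empty userAgent omits that parameter.
func buildCrawlURL(base string, targets []string, userAgent string) string {
	query := url.Values{}
	query.Set("url", strings.Join(targets, ","))
	if userAgent != "" {
		query.Set("user_agent", userAgent)
	}
	return base + "?" + query.Encode()
}

func main() {
	ua := "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"
	fmt.Println(buildCrawlURL("http://localhost:8080/crawl", []string{"https://example.com"}, ua))
}
```

The resulting URL can be passed directly to http.Get or pasted into a browser.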

Output

After a crawling request completes, the API writes one HTML file for each URL crawled. The files are named with the format output_.html and are saved in the same directory as the API.

The API will respond with the message "Crawling finished. Results are saved in HTML files." once the crawling process is completed.
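
The request and response flow described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: newCrawlHandler and crawlFunc are hypothetical names, the crawler is replaced by a stub so that no browser is launched, and the error status codes are choices made for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// crawlFunc stands in for the real crawler (chromedp in this project).
type crawlFunc func(target, userAgent string) error

// newCrawlHandler returns a handler that crawls every URL in the "url"
// parameter and then replies with the completion message.
func newCrawlHandler(crawl crawlFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		raw := r.URL.Query().Get("url")
		if raw == "" {
			http.Error(w, "missing url parameter", http.StatusBadRequest)
			return
		}
		userAgent := r.URL.Query().Get("user_agent")
		for _, target := range strings.Split(raw, ",") {
			if err := crawl(strings.TrimSpace(target), userAgent); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
		}
		fmt.Fprint(w, "Crawling finished. Results are saved in HTML files.")
	}
}

func main() {
	// Stub crawler: prints the target instead of rendering the page.
	stub := func(target, userAgent string) error {
		fmt.Println("would crawl:", target)
		return nil
	}

	// Exercise the handler in memory, without opening a network port.
	req := httptest.NewRequest(http.MethodGet, "/crawl?url=https://cmlabs.co,https://sequence.day", nil)
	rec := httptest.NewRecorder()
	newCrawlHandler(stub).ServeHTTP(rec, req)

	fmt.Println(rec.Code, rec.Body.String())
}
```

Passing the crawler in as a function keeps the HTTP layer testable without a Chrome installation.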

Note

  • This API is designed to handle SPA, SSR, and PWA websites.
  • Make sure to use a valid URL for crawling.
  • The default User-Agent header mimics Chrome on Windows. You can change it by providing the user_agent parameter in the request.
