Topic: web-crawler

Something interesting about web-crawler

👇 Here are 887 public repositories matching this topic...

abaykan / crawlbox

web-crawler,An easy way to brute-force web directories.

User: abaykan

admin-finder crawler python web-crawler wordlist

adithya-s-k / omniparse

web-crawler,Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

User: adithya-s-k

Home Page: https://docs.cognitivelab.in

ingestion-api ocr omniparser parse-server parser-library vision-transformer web-crawler whisper-api

algebra-fun / wereadscan

web-crawler,A crawler that scans books purchased on WeRead (微信读书) and downloads them as local PDFs

User: algebra-fun

Home Page: https://algebra-fun.github.io/WeReadScan/

selenium weread web-crawler book-downloader

antchfx / antch

web-crawler,Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Organization: antchfx

crawler crawling framework golang scraping web-crawler web-spider

apache / incubator-stormcrawler

web-crawler,A scalable, mature and versatile web crawler based on Apache Storm

Organization: apache

Home Page: https://stormcrawler.apache.org/

web-crawler apache-storm distributed java crawler stormcrawler

apache / nutch

web-crawler,Apache Nutch is an extensible and scalable web crawler

Organization: apache

Home Page: https://nutch.apache.org/

java nutch web-crawler crawling hadoop apache

apify / crawlee

web-crawler,Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Organization: apify

Home Page: https://crawlee.dev

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

apify / crawlee-python

web-crawler,Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Organization: apify

Home Page: https://crawlee.dev/python/

apify automation beautifulsoup crawler crawling headless headless-chrome pip playwright python

brendonboshell / supercrawler

web-crawler,A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

User: brendonboshell

crawler distributed-crawler robot sitemap web-crawler

brianmadden / krawler

web-crawler,A web crawling framework written in Kotlin

User: brianmadden

webcrawler kotlin framework crawler4j link-checker web-crawler web-crawling

brucedone / awesome-crawler

web-crawler,A collection of awesome web crawlers and spiders in different languages

User: brucedone

web-crawler crawler web-scraper spider node-crawler scraper awesome

commoncrawl / news-crawl

web-crawler,News crawling with StormCrawler - stores content as WARC

Organization: commoncrawl

crawler news warc web-crawler apache-storm common-crawl commoncrawl storm-crawler

crawlab-team / crawlab

web-crawler,Distributed web crawler admin platform for spider management, regardless of language or framework.

Organization: crawlab-team

Home Page: https://www.crawlab.cn

webcrawler scrapy crawlab spiders-management go scrapyd-ui spider crawler webspider web-crawler

crawlab-team / crawlab-lite

web-crawler,Lite version of Crawlab, a lightweight crawler management platform.

Organization: crawlab-team

crawlab scrapy crawler scrapy-ui web-crawler spider crawling-tasks platform crawler-management scrapyd-ui

crawler-commons / crawler-commons

web-crawler,A set of reusable Java components that implement functionality common to any web crawler

Organization: crawler-commons

web-crawler java sitemaps robots-txt open-source library

crwlrsoft / crawler

web-crawler,Library for Rapid (Web) Crawler and Scraper Development

Organization: crwlrsoft

Home Page: https://www.crwlr.software/packages/crawler

crawling php scraper scraping scraping-websites web-crawler web-crawling web-scraping hacktoberfest crawler

duyet / awesome-web-scraper

web-crawler,A collection of awesome web scrapers and crawlers.

User: duyet

web-crawler web-scraper slimerjs phantomjs goutte awesome awesome-list php storage scrapy

dwarfthief / raspagem-de-dados-para-iniciantes

web-crawler,Data scraping for beginners using Scrapy and other basic libraries

User: dwarfthief

datascraping estudo hacktoberfest jupyter-notebook opensource python raspagem-de-dados scrapy spyder web-crawler webcrawling

elliotxx / zhihu-crawler-people

web-crawler,A simple distributed crawler for Zhihu, with data analysis

User: elliotxx

python crawler spider web-spider web-crawler python-crawler

gildas-lormeau / single-file-cli

web-crawler,CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

User: gildas-lormeau

cli nodejs single-file web-archiving web-scraper web-scraping archiving scraping-websites crawler web-crawler

havanagrawal / goodreadsscraper

web-crawler,Scrape data from Goodreads using Scrapy and Selenium

User: havanagrawal

python python3 scrapy scrapy-spider goodreads goodreads-data selenium data-mining web-crawler scraping

hecate2 / ignareo-isml-auto-voter

web-crawler,Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)

User: hecate2

web-spider concurrency http isml distributed gevent asyncio sukasuka sukamoka ignareo tiat chtholly high-performance python web-crawler microservice

hominee / dyer

web-crawler,Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.

User: hominee

Home Page: https://hominee.github.io/dyer/

crawler web-crawler web-scraping web-framework rust spider rust-programming-language

hyunwoongko / kochat

web-crawler,Open-source Korean chatbot framework

User: hyunwoongko

chatbot deeplearning deep-learning korean korean-chatbot sentence-classification sequance-tagging web-crawler

infinilabs / crawler

web-crawler,🕷️ An easy-to-use spider written in Golang (previously named GOPA).

Organization: infinilabs

crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider

lewisdonovan / google-news-scraper

web-crawler,Lightweight scraper for Google News

User: lewisdonovan

google-news google-news-scraper news news-scraper news-articles web-scraper crawler web-crawler news-crawler google-crawler

lucasxlu / lagoujob

web-crawler,Data Analysis & Mining for lagou.com

User: lucasxlu

Home Page: https://www.zhihu.com/question/36132174/answer/94392659

lagou data-analysis web-crawler machine-learning data-mining python3 nlp

marginaliasearch / marginaliasearch

web-crawler,Internet search engine for text-oriented websites. Indexing the small, old and weird web.

Organization: marginaliasearch

Home Page: https://search.marginalia.nu/

search-engine no-cloud small-web internet-search indexer language-processing web-crawler alt-search no-ai-used self-hostable

maxvalue / terpene-profile-parser-for-cannabis-strains

web-crawler,Parser and database to index the terpene profile of different strains of Cannabis from online databases

User: maxvalue

Home Page: https://maxvalue.github.io/Terpene-Profile-Parser-for-Cannabis-Strains/

cannabis data-science web-crawler-python web-crawler web-crawling python-3 terpenes plants biological-data-analysis biological-data

mazzzystar / proxy

web-crawler,A simple tool for fetching usable proxies from several websites.

User: mazzzystar

proxypool proxy-list proxies web-crawler

mendableai / firecrawl

web-crawler,🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Organization: mendableai

Home Page: https://firecrawl.dev

ai ai-scraping crawler data html-to-markdown llm markdown rag scraper scraping web-crawler

microfisher / strong-web-crawler

web-crawler,An advanced web crawler built on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page DOM structure.

User: microfisher

web-crawler crawler phantomjs sellenium

norconex / crawlers

web-crawler,Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Organization: norconex

Home Page: https://opensource.norconex.com/crawlers

search-engine web-crawler java collector-http flexible crawler crawlers filesystem-crawler collector-fs

platonai / pulsarrpa

web-crawler,Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.

User: platonai

web-crawler web-mining data-science web-sql crawler scraper scraping web-scraping data-mining rpa

postmodern / spidr

web-crawler,A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

User: postmodern

spider ruby spider-links crawler web scraper web-scraping web-spider web-crawler web-scraper