Crawler tests

I need to build a crawler, in PHP or in another language.

This repo is a quick test of whether it can be done effectively in PHP or whether I need to switch languages. Below are the notes from my short R&D.

Initial path

A few URLs

I found the following URLs useful during my initial search.

Possible options

Tools

phpcrawl in PHP, but just for ideas. Not a good fit due to its GPL license. (I do not consider the GPL a free license.)
Scrapy in Python
Scrape in Python
Apache Nutch in Java
Confluence Heritrix in Java
Web Sphinx in Java

PHP Options

Over the last few years I have mainly worked in PHP, so PHP is my preferred tool because it will not involve a big learning curve. I am open to Python and also have professional Java experience from the past, but PHP will be my preference if it can provide decent performance. I believe PHP can do this at large scale with PHP 7, several layers of caching, multiple servers, and a message queue.

Decision 1: Try PHP first. Again, this is just an initial experiment, so I'd like to give PHP the first chance.

Looking at available PHP options

phpcrawl is not an option due to its restrictive license (GPL). However, I'd like to read its code to see how it does things at a lower level. Maybe I will get a few ideas from there.

Since there is no other crawler option in PHP, I will probably need to write my own crawler. Another reason: if the idea works out, the crawler will carry a lot of responsibility in the future. It will be the heart and soul of my application, so I do not want to be restricted by any third-party tool. I also want to learn how a crawler works.

Decision 2: Try a custom crawler. Look at other open source solutions, but at least attempt to make my own crawler. Maybe it could become a new open source crawler, or at least I'll learn something new :)

Preference: A new open source project. (Get ideas from other open source projects.)

Open to: If a custom crawler takes a lot of time, I am open to using another open source project in PHP, Python, or Java.

PHP Simple Test Scriptable Browser seems like another piece to look at. It is not actually a crawler, but if I need to build a crawler from scratch, it could make reading web pages easy.
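
To make the "crawler from scratch" idea more concrete, here is a rough sketch of the core loop such a crawler would probably need: a queue of URLs to visit, a set of URLs already seen, and a link extractor. It is written in Python with only the standard library, purely to show the shape; a PHP version would have the same structure. The names `LinkExtractor`, `extract_links`, `crawl`, and the `fetch` callable are all hypothetical, and the `example.test` pages are made-up in-memory data, so nothing here touches the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute links found in html, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_url, fetch, max_pages=50):
    """Breadth-first crawl restricted to the seed URL's host.

    fetch is any callable that takes a URL and returns HTML text,
    or None when the page cannot be read (an assumption of this sketch).
    Returns the URLs in the order they were fetched.
    """
    host = urlparse(seed_url).netloc
    frontier = deque([seed_url])   # URLs waiting to be fetched
    seen = {seed_url}              # URLs already queued, to avoid loops
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Made-up in-memory "site" standing in for real HTTP requests.
pages = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="http://other.test/">ext</a>',
    "http://example.test/b": '<a href="/a">A</a>',
}
print(crawl("http://example.test/", pages.get))
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are actually downloaded, so the same loop could sit behind a scriptable browser, plain HTTP calls, or a worker reading from a message queue. A real crawler would also need politeness delays, robots.txt handling, and error handling, none of which are shown here.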

PHP1

The first experiment in PHP is described in php1.md.

kapilsharma / crawlertest
