

IPProxy

Chinese version

A simple tool for crawling proxy IPs.

Requirements

  • Python 2.7
  • Virtualenv (optional)
  • Pip (optional)

You can use virtualenv to create a new Python virtual environment and pip to install the dependencies, but any other tool works as well.

Usage

Set up the environment

Create a new virtualenv for this project by running the following in a shell:

$ virtualenv ~/virtualenvs/ipproxy
$ source ~/virtualenvs/ipproxy/bin/activate 
(ipproxy)$ pip install -r requirements.txt 

Crawl candidate proxy IPs

Then crawl candidate proxy IPs from a set of predefined websites:

(ipproxy)$ python crawl.py 

This takes a while, about as long as a cup of coffee (possibly a little longer, depending on your network). The results are written to the data directory:

all.csv
china.csv
foreign.csv
high_anonymous.csv
low_anonymous.csv
non_anonymous.csv

Every CSV file consists of four columns: ip, port, anonymous, and info. For example:

ip,port,anonymous,info
110.73.0.125,8123,3,**-广西-防城港
207.226.142.113,3128,3,**-香港
......

The values in the anonymous column mean:

  • 0: unknown
  • 1: none
  • 2: low
  • 3: high
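As a minimal sketch of how these files could be consumed, the snippet below parses rows in the documented four-column format and attaches a readable label for the anonymous code. It is a standalone illustration written for Python 3, not part of this project; the names `ANONYMITY_LABELS` and `read_proxies` are hypothetical.

```python
import csv
import io

# Meaning of the "anonymous" column, as listed above.
ANONYMITY_LABELS = {0: "unknown", 1: "none", 2: "low", 3: "high"}


def read_proxies(fileobj):
    """Yield one dict per CSV row, with port and anonymous converted to int."""
    for row in csv.DictReader(fileobj):
        level = int(row["anonymous"])
        yield {
            "ip": row["ip"],
            "port": int(row["port"]),
            "anonymous": level,
            # Codes outside 0..3 are treated as unknown (an assumption).
            "anonymity_label": ANONYMITY_LABELS.get(level, "unknown"),
            "info": row["info"],
        }


# Sample rows copied from the example above.
sample = io.StringIO(
    "ip,port,anonymous,info\n"
    "110.73.0.125,8123,3,**-广西-防城港\n"
    "207.226.142.113,3128,3,**-香港\n"
)
proxies = list(read_proxies(sample))
for p in proxies:
    print("{ip}:{port} ({anonymity_label})".format(**p))
```

To read a real file instead of the in-memory sample, pass an open file object, for example `open("data/all.csv", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption about how the files are written.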

Check available proxy IPs

(ipproxy)$ python check.py --help
usage: check.py [-h] [--target TARGET] [--timeout TIMEOUT] [--worker WORKER]
                [--thread THREAD] [--loglevel LOGLEVEL]
                input

positional arguments:
  input                the input proxy ip list, in csv format(supprot gz)

optional arguments:
  -h, --help           show this help message and exit
  --target TARGET      target uri to validate proxy ip, default:
                       http://www.baidu.com
  --timeout TIMEOUT    timeout of validating each ip, default: 15s
  --worker WORKER      run with multi workers, default: CPU cores
  --thread THREAD      run with multi thread in each worker, default: 100
  --loglevel LOGLEVEL  set log level, e.g. debug, info, warn, error; default:
                       info

To use one of the CSV files above as input, just run:

(ipproxy)$ python check.py data/high_anonymous.csv

You can also specify additional arguments:

(ipproxy)$ python check.py input.csv --target http://www.google.com.hk --timeout 10 --worker 4 --thread 200 --loglevel debug
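The --target and --timeout options suggest that each proxy is validated by fetching the target URI through it within a time limit. The following is a rough sketch of that idea using only the Python 3 standard library. It is an assumption about the general technique, not the code in check.py; `proxy_url` and `check_proxy` are hypothetical names, and the defaults simply mirror the help text.

```python
import time
import urllib.request
from http.client import HTTPException


def proxy_url(ip, port):
    """Format an ip/port pair as an HTTP proxy URL."""
    return "http://{}:{}".format(ip, port)


def check_proxy(ip, port, target="http://www.baidu.com", timeout=15):
    """Return the seconds taken to fetch target via the proxy, or None on failure."""
    # Only plain-http targets are routed through the proxy in this sketch.
    handler = urllib.request.ProxyHandler({"http": proxy_url(ip, port)})
    opener = urllib.request.build_opener(handler)
    start = time.time()
    try:
        with opener.open(target, timeout=timeout) as response:
            response.read()
    except (OSError, HTTPException):
        # URLError, socket timeouts and connection errors are OSError subclasses.
        return None
    return time.time() - start
```

Calling `check_proxy("110.73.0.125", 8123, timeout=10)` would return the elapsed time if the proxy answers in time and `None` otherwise. A real checker would run many such calls concurrently, as the --worker and --thread options indicate.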

The output (data/proxyip.csv) is similar to the input, with one additional column, speed (the smaller the better):

ip,port,anonymous,info,speed
110.84.128.143,3128,1,**-福建-福州,0.10766482353210449
58.247.125.205,10032,3,**-上海-上海,0.5216059684753418
......
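Because a smaller speed value is better, picking the best proxies amounts to filtering and sorting this file. Here is a small sketch under that reading, written for Python 3 and not taken from the project; the function name `fastest` and its parameters are hypothetical.

```python
import csv
import io


def fastest(fileobj, limit=10, min_anonymous=0):
    """Return up to `limit` rows with anonymous >= min_anonymous, fastest first."""
    rows = [
        row for row in csv.DictReader(fileobj)
        if int(row["anonymous"]) >= min_anonymous
    ]
    # Smaller speed values are better, so sort in ascending order.
    rows.sort(key=lambda row: float(row["speed"]))
    return rows[:limit]


# Sample rows copied from the output example above.
sample = io.StringIO(
    "ip,port,anonymous,info,speed\n"
    "110.84.128.143,3128,1,**-福建-福州,0.10766482353210449\n"
    "58.247.125.205,10032,3,**-上海-上海,0.5216059684753418\n"
)
best = fastest(sample, limit=1, min_anonymous=3)
print(best[0]["ip"], best[0]["port"], best[0]["speed"])
```

With `min_anonymous=3` only high-anonymity proxies are kept, so the faster but non-anonymous first row is skipped.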

Example

Take a look at example.py.

Data Source

License

Just enjoy it.
