scraper_cli

Use Perl's Web::Scraper from a simple command line to extract data from web pages.

Similar to python scrape.

example

Read from: a URL, an HTML file, a URL string on stdin, or HTML content on stdin.

scraper_cli.pl -i http://www.google.com -p "//a" -r "TEXT"
scraper_cli.pl -i google.html -p "//a" -r "TEXT"
echo "http://www.google.com" | scraper_cli.pl -p "//a" -r "TEXT"
cat "google.html" | scraper_cli.pl -p "//a" -r "TEXT"

Extract data: a single item such as TEXT, HTML, or @attr is written as plain text; multiple items are written as JSON text.

scraper_cli.pl -i http://www.google.com -p "//a" -r "TEXT"
scraper_cli.pl -i http://www.google.com -p "//a" -r "HTML"
scraper_cli.pl -i http://www.google.com -p "//a" -r "@href"
scraper_cli.pl -i http://www.google.com -p "//a" -r " id=> '@href', text => 'TEXT' "
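The two output modes above can be tried offline against a small local file. This is a minimal sketch, not part of the original project: `sample.html` and its contents are hypothetical, and it assumes `scraper_cli.pl` is executable and on your PATH. No expected output is shown, since the exact formatting depends on the script.

```shell
# sample.html is a hypothetical file created only for this sketch
cat > sample.html <<'EOF'
<html><body>
<a href="https://example.com/a">First</a>
<a href="https://example.com/b">Second</a>
</body></html>
EOF

if command -v scraper_cli.pl >/dev/null 2>&1; then
  # single item (@href): plain text output
  scraper_cli.pl -i sample.html -p "//a" -r "@href"
  # several items (id and text): JSON output
  scraper_cli.pl -i sample.html -p "//a" -r " id => '@href', text => 'TEXT' "
else
  # assumption: the script may not be installed on this machine
  echo "scraper_cli.pl not found on PATH; skipping" >&2
fi
```

Using a local file keeps the run independent of network access and of changes to the remote page.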

args

-i URL or input HTML file; otherwise, read from stdin
-p XPath expression
-c charset
-r data to return, for example HTML / TEXT / @href
-f process or process_first
-o write extracted data to this file; otherwise, write to stdout
-h help

install

cpan App::cpanminus
cpanm Encode::Locale JSON LWP::UserAgent Web::Scraper
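After installing, it can help to confirm that each module loads. The following is a small sketch under the assumption that `perl` is on your PATH; the function name `check_modules` is made up for this example and is not part of scraper_cli.

```shell
# check_modules is a hypothetical helper, not part of scraper_cli
check_modules() {
  for m in Encode::Locale JSON LWP::UserAgent Web::Scraper; do
    # perl -MModule -e 1 exits non-zero when the module cannot be loaded
    if perl -M"$m" -e 1 2>/dev/null; then
      echo "ok      $m"
    else
      echo "missing $m"
    fi
  done
}

check_modules
```

Any module reported as missing can be installed again with the `cpanm` command above.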
