Words Scraper

Selenium-based web scraper for generating password lists.

Installation

# Download Firefox webdriver from https://github.com/mozilla/geckodriver/releases
$ tar xzf geckodriver-v{VERSION-HERE}.tar.gz
$ sudo mv geckodriver /usr/local/bin # Make sure it is in your PATH
$ geckodriver --version # Make sure webdriver is properly installed
$ git clone https://github.com/dariusztytko/words-scraper
$ sudo pip3 install -r words-scraper/requirements.txt

Use cases

Scraping words from the target's pages

$ python3 words-scraper.py -o words.txt https://www.example.com https://blog.example.com

The generated word list can be used for an online brute-force attack or for cracking password hashes:

$ hashcat -m 0 hashes.txt words.txt
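hashcat mode 0 is plain, unsalted MD5, so the command above amounts to hashing every candidate word and comparing it with the target hashes. A minimal Python sketch of that idea follows; `crack_md5` is a hypothetical helper for illustration only, not part of words-scraper or hashcat, and it is far slower than hashcat.

```python
import hashlib


def crack_md5(hashes, words):
    """Return {hash: word} for every target MD5 hash matched by a candidate word."""
    # Normalize the targets: strip whitespace and compare in lower-case hex.
    targets = {h.strip().lower() for h in hashes if h.strip()}
    found = {}
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        if digest in targets:
            found[digest] = word
    return found


if __name__ == "__main__":
    # Demo data only: the "target" hash is computed from a known word.
    demo_hashes = [hashlib.md5(b"example").hexdigest()]
    print(crack_md5(demo_hashes, ["sample", "example"]))
```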

Use the --depth option to also scrape words from linked pages. The optional --show-gui switch shows the browser window, which is useful for tracking progress and taking a quick look at the page:

$ python3 words-scraper.py -o words.txt --depth 1 --show-gui https://www.example.com
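To illustrate what a scraping depth means, here is a sketch of a depth-limited, breadth-first crawl. It is an assumption about the general technique, not the scraper's actual code: `crawl` and `get_links` are hypothetical names, and the real tool may order or filter links differently. Depth 0 visits only the given URLs, depth 1 also visits the pages they link to.

```python
from collections import deque


def crawl(start_urls, get_links, depth=0):
    """Return URLs in visiting order, following links up to `depth` hops."""
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    order = []
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level >= depth:
            continue  # Depth limit reached: do not follow links from this page.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return order


if __name__ == "__main__":
    # Hypothetical link graph standing in for real pages.
    links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    print(crawl(["a"], lambda u: links.get(u, []), depth=1))
```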

The generated word list can be expanded with the words-converter.py script, which adds variants with special characters and accents removed. For example, the Polish word źdźbło! is transformed into the following words:

  • źdźbło!
  • zdzblo!
  • źdźbło
  • zdzblo

$ cat words.txt | python3 words-converter.py | sort -u > words2.txt
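The conversion described above can be sketched in Python as follows. This is not the source of words-converter.py; the function names are hypothetical, and the explicit mapping for ł is an assumption made because that letter has no Unicode decomposition and would otherwise survive accent stripping.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping.
# Assumption: only the Polish "ł"/"Ł" is handled in this sketch.
_EXTRA = str.maketrans({"\u0142": "l", "\u0141": "L"})


def strip_accents(word):
    """Replace accented letters with their base letters."""
    decomposed = unicodedata.normalize("NFKD", word.translate(_EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_special(word):
    """Drop every character that is not a letter or a digit."""
    return "".join(ch for ch in word if ch.isalnum())


def variants(word):
    """Return the word plus its accent-free and special-char-free forms."""
    word = unicodedata.normalize("NFC", word)
    candidates = (
        word,
        strip_accents(word),
        strip_special(word),
        strip_special(strip_accents(word)),
    )
    result = []
    for candidate in candidates:
        if candidate and candidate not in result:
            result.append(candidate)
    return result


if __name__ == "__main__":
    for variant in variants("\u017ad\u017ab\u0142o!"):  # źdźbło!
        print(variant)
```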

Scraping words from the target's Twitter

A Twitter page loads dynamically while scrolling. Use the --max-scrolls option to scroll the page and scrape the words loaded along the way:

$ python3 words-scraper.py -o words.txt --max-scrolls 300 --show-gui https://twitter.com/example.com
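A scroll loop of this kind could look like the sketch below. It assumes a Selenium WebDriver instance is passed in as `driver` and uses only its `execute_script` method; `scroll_page` is a hypothetical name, and stopping early when the page height stops growing is a design choice of this sketch, not something the README states about the scraper.

```python
import time


def scroll_page(driver, max_scrolls=0, delay=1.0):
    """Scroll to the bottom up to `max_scrolls` times; return scrolls performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    done = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        done += 1
        time.sleep(delay)  # Give the page time to load more content.
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # Nothing new was loaded, so further scrolling is pointless.
        last_height = new_height
    return done
```

On slow connections the early stop may trigger before new content arrives, so a longer delay (compare --page-scroll-delay) would be needed in that case.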

Scraping via Socks proxy

$ ssh -D 1080 -Nf {USER-HERE}@{IP-HERE} >/dev/null 2>&1
$ python3 words-scraper.py -o words.txt --socks-proxy 127.0.0.1:1080 https://www.example.com
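For reference, routing Firefox through a SOCKS proxy comes down to a handful of browser preferences. The sketch below builds them from a host:port string; `socks_proxy_prefs` is a hypothetical helper, and the choice of SOCKS version 5 with remote DNS is an assumption, not necessarily what the scraper configures.

```python
def socks_proxy_prefs(proxy):
    """Turn 'host:port' into Firefox preferences for a manual SOCKS proxy."""
    host, sep, port = proxy.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected HOST:PORT, got %r" % proxy)
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": int(port),
        "network.proxy.socks_version": 5,  # Assumption: SOCKS5
        "network.proxy.socks_remote_dns": True,  # Resolve DNS via the proxy
    }


if __name__ == "__main__":
    for name, value in socks_proxy_prefs("127.0.0.1:1080").items():
        print(name, "=", value)
```

With Selenium, each pair could be applied to a Firefox `Options` object through `options.set_preference(name, value)` before starting the browser.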

Usage

usage: words-scraper.py [-h] [--depth DEPTH] [--max-scrolls MAX_SCROLLS]
                         [--min-word-length MIN_WORD_LENGTH]
                         [--page-load-delay PAGE_LOAD_DELAY]
                         [--page-scroll-delay PAGE_SCROLL_DELAY] [--show-gui]
                         [--socks-proxy SOCKS_PROXY] -o OUTPUT_FILE
                         url [url ...]

Words scraper (version: 1.0)

positional arguments:
  url                   URL to scrape

optional arguments:
  -h, --help            show this help message and exit
  --depth DEPTH         scraping depth, default: 0
  --max-scrolls MAX_SCROLLS
                        maximum number of the page scrolls, default: 0
  --min-word-length MIN_WORD_LENGTH
                        default: 3
  --page-load-delay PAGE_LOAD_DELAY
                        page loading delay, default: 3.0
  --page-scroll-delay PAGE_SCROLL_DELAY
                        page scrolling delay, default: 1.0
  --show-gui            show browser GUI
  --socks-proxy SOCKS_PROXY
                        socks proxy e.g. 127.0.0.1:1080
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        save words to file
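
The effect of --min-word-length can be pictured with the sketch below. It assumes words are split on whitespace and that punctuation is kept (consistent with the źdźbło! example above); the scraper's real tokenization is not documented here, and `extract_words` is a hypothetical name.

```python
def extract_words(text, min_word_length=3):
    """Return unique whitespace-separated words of at least the given length."""
    seen = set()
    words = []
    for token in text.split():
        if len(token) >= min_word_length and token not in seen:
            seen.add(token)
            words.append(token)
    return words


if __name__ == "__main__":
    print(extract_words("Hi to the team, the end"))
```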

Known bugs

  • Native browser dialog boxes (e.g. file download) freeze scraper

Changes

Please see the CHANGELOG
