demogorgonz / easyweb-scrape

This project forked from thecount12/easyweb-scrape

Web Scraping for Email and other contact information

License: GNU General Public License v2.0

Python 100.00%

easyweb-scrape

Requires BeautifulSoup

Web scraping is a messy science. Most people steer clear of it for the following reasons:

Websites constantly change.
Every website is essentially different.
No standards. It would be nice if there were a standard XML format for contact information.
Degradation of the DOM.
Flash-driven websites.
Contact info embedded in images or other media.
Extracting a contact name (a person) together with an email address is nearly impossible:

Names usually don't directly follow the email tag.
They are often not on the same line, or even within a few lines of it.
Names are hard to tell apart from ordinary words: are dolphin, razor, and Peter words or names?

This script is designed to scrape a list of URLs:

http://foo.com
http://bar.com 
http://hop.com
etc...

Builds an array of all URLs from a domain.
Stays within the domain name.
Searches for email addresses.
Creates result.dat, a CSV file with lines of the form: URL,Email

To open it as CSV, just rename the file to result.csv.
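
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it uses the standard library's html.parser in place of BeautifulSoup so that it runs on its own, and the names LinkCollector, same_domain_links, find_emails and write_results, as well as the email pattern, are assumptions made for the example.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Deliberately simple pattern (an assumption); it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(base_url, html):
    """Return absolute links found in html that stay on base_url's host."""
    host = urlparse(base_url).netloc
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # mailto: and off-site links have a different (or empty) host.
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def find_emails(html):
    """Return the unique email-like strings in html, in order of appearance."""
    return list(dict.fromkeys(EMAIL_RE.findall(html)))


def write_results(rows, out):
    """Write (url, email) pairs as URL,Email lines to a file-like object."""
    csv.writer(out).writerows(rows)
```

In use, one would fetch each starting URL, call same_domain_links to build the list of pages to visit, run find_emails on every page, and pass the collected (url, email) pairs to write_results with result.dat opened for writing.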

Creates error.dat, listing the domains it could not connect to.
The script is simple to follow and modify.
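
As a rough illustration of the error.dat behaviour, the sketch below records every URL that cannot be fetched. It is an assumption about how this could be done, not the script's own code: the function names fetch and scrape_all and the fetcher parameter are hypothetical, and the fetcher is injectable only so the logic can be tried without a network connection.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its text; raises OSError on connection failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, error_path="error.dat", fetcher=fetch):
    """Fetch each URL; return {url: html} and log unreachable URLs to error_path."""
    pages = {}
    with open(error_path, "w", encoding="utf-8") as errors:
        for url in urls:
            try:
                pages[url] = fetcher(url)
            except (OSError, ValueError):
                # URLError and socket timeouts are OSError subclasses;
                # ValueError covers malformed URLs.
                errors.write(url + "\n")
    return pages
```

Catching only OSError and ValueError keeps programming errors visible while still letting the loop continue past hosts that are down.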

Todo

Perhaps write errors during the second loop as well, to a separate file listing the links that could not be scraped. That list might get large.
Split the code into two methods and clean it up.
Explore additional Unicode features for extracting email addresses that do not use the @ ("at") character.
What about searching for spaced-out addresses such as name @ foo .com?
DNS lookup for contact info (last resort).
Speed up the search by only following links that contain /About, /Contact, or /Info. Not all sites follow this convention.
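
Two of these ideas can be sketched in a few lines. The code below is a hypothetical starting point, not part of the script: deobfuscate, find_emails_loose, likely_contact_links and CONTACT_HINTS are invented names, and the patterns only cover bracketed (at)/[dot] forms and stray spaces around @ and the final dot, so they will produce some false positives on ordinary prose.

```python
import re
from urllib.parse import urlparse

# Deliberately simple pattern (an assumption); it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Path fragments that often mark a contact page (assumed list).
CONTACT_HINTS = ("about", "contact", "info")


def deobfuscate(text):
    """Rewrite common obfuscations such as 'name (at) foo [dot] com'."""
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)
    # Remove spaces around a literal @, as in "name @ foo".
    text = re.sub(r"\s*@\s*", "@", text)
    # Join "foo .com" only when an alphabetic suffix follows the dot.
    text = re.sub(r"(?<=\w)\s+\.(?=[A-Za-z]{2,}\b)", ".", text)
    return text


def find_emails_loose(text):
    """Find email-like strings after undoing simple obfuscation."""
    return EMAIL_RE.findall(deobfuscate(text))


def likely_contact_links(links):
    """Keep links whose path mentions about, contact, or info (any case)."""
    return [
        link
        for link in links
        if any(hint in urlparse(link).path.lower() for hint in CONTACT_HINTS)
    ]
```

Because not every site follows the /About, /Contact, /Info convention, a filter like likely_contact_links is probably best used to decide which pages to visit first rather than which pages to skip.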
