Web Scraping for Email and other contact information
Requires BeautifulSoup
Web scraping is a messy science. Most people steer clear of it for the following reasons:
- Websites constantly change.
- Every website is essentially different.
- No standards. Wouldn't it be nice if there were a standard XML format for contact information?
- Degradation of DOM.
- Flash-driven websites.
- Contact info embedded in images or other media.
- Extracting a contact name (a person) together with an email is nearly impossible:
  - Names most often don't follow an email tag.
  - They are often not on the same line, or even within the same few lines.
  - Names are hard to distinguish from ordinary words: are "dolphin", "razor", and "Peter" words or names?
This script is designed to scrape a list of URLs:
http://foo.com
http://bar.com
http://hop.com
etc...
- Builds an array of all URLs from a domain.
- Stays within the domain name.
- Searches for email addresses.
- Creates result.dat, a CSV file consisting of: URL,Email
  - Just rename the file to result.csv.
- Creates error.dat, listing the domains it could not connect to.
- The script is extremely simple to follow and modify.
Todo
- Perhaps add an error file in the second loop: write another file listing the links it could not scrape. The list might get large.
- Create two methods and clean up code.
- Explore additional Unicode features for extracting email addresses that do not use the @ (at) character.
- What about searching for patterns such as:
  name @ foo .com
- DNS lookup for contact info (last resort)
- Speed up the search by only following links that contain /About, /Contact, or /Info. Not all sites follow this convention.
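
Two of the ideas above could be prototyped along these lines. This is only a sketch under stated assumptions: find_spaced_emails and looks_like_contact_page are hypothetical names, the pattern only tolerates spaces around "@" and before a dot, and the path hints are the three mentioned in the list.

```python
import re
from urllib.parse import urlparse

# Assumption: spaces may appear around "@" and before each dot,
# as in "name @ foo .com". Other obfuscations are not handled.
SPACED_EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)\s*@\s*([A-Za-z0-9-]+(?:\s*\.[A-Za-z0-9-]+)+)"
)

# Path fragments that hint at a contact page (case-insensitive).
CONTACT_HINTS = ("about", "contact", "info")


def find_spaced_emails(text):
    """Return addresses with the tolerated whitespace removed."""
    return [
        local + "@" + re.sub(r"\s+", "", domain)
        for local, domain in SPACED_EMAIL_RE.findall(text)
    ]


def looks_like_contact_page(url):
    """True if the URL path contains one of the contact hints."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in CONTACT_HINTS)
```

Since not all sites follow the /About, /Contact, /Info convention, a filter like looks_like_contact_page is probably best used to decide which links to visit first rather than which to skip entirely.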