Crawl domain.com.au every day using Scrapy + MongoDB + schedule.
- It is based on the Scrapy framework.
- It uses MongoDB as the database.
- It uses the Python schedule library to follow a simple daily routine.
- It is configured to be friendly to the web server, sending at most 1 request per second (see the sketch below).
- It is written in Python 3.x.
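That throttling is the kind of thing Scrapy exposes through its project settings. Below is a minimal sketch of what the relevant part of settings.py likely looks like; the setting names are standard Scrapy options, but the exact values used in this repo are an assumption.

    # domain/settings.py (sketch) -- throttle to roughly 1 request per second.
    BOT_NAME = "domain"

    DOWNLOAD_DELAY = 1                    # wait ~1 second between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 1    # never hit the server in parallel
    ROBOTSTXT_OBEY = True                 # respect robots.txt

    # Optional: back off automatically if the server slows down.
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0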
To get started, you need:
- A running mongod service on your local machine (127.0.0.1).
- The username and password in /domain/MongoCache.py changed to match your own setup; see the sketch after this list.
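MongoCache.py is where the crawler talks to MongoDB via pymongo. The sketch below shows the kind of connection setup you would be editing; the class layout, database and collection names, and credentials are placeholders, not the repo's actual code.

    # domain/MongoCache.py (sketch) -- credentials and names are placeholders.
    from pymongo import MongoClient

    USERNAME = "your_username"   # <-- change to your own MongoDB user
    PASSWORD = "your_password"   # <-- change to your own password

    class MongoCache:
        """Thin wrapper around the local MongoDB that stores crawled listings."""

        def __init__(self, host="127.0.0.1", port=27017):
            self.client = MongoClient(host, port,
                                      username=USERNAME, password=PASSWORD)
            self.db = self.client["domain"]      # database name is an assumption
            self.listings = self.db["listings"]  # one document per house listing

        def upsert(self, listing_id, doc):
            # Insert a new listing or refresh an existing one by its id.
            self.listings.update_one({"_id": listing_id},
                                     {"$set": doc}, upsert=True)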
For a one-time crawl, just run:

    python3 runner.py
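runner.py presumably just kicks off the spider once; below is a sketch of the usual Scrapy entry point for that, where the spider class name and its module path are assumptions.

    # runner.py (sketch) -- one-shot crawl through Scrapy's Python API.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # The spider's module path and class name are assumptions.
    from domain.spiders.domain_spider import DomainSpider

    def run():
        process = CrawlerProcess(get_project_settings())  # loads settings.py
        process.crawl(DomainSpider)
        process.start()  # blocks until the crawl finishes

    if __name__ == "__main__":
        run()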
For scheduled crawling, open scheduler.py, scroll to the bottom, and change the parameters in the call to timing_crawl:
    if __name__ == "__main__":
        timing_crawl(hour, min)
Then run:

    python3 scheduler.py
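Under the hood, timing_crawl presumably wires the daily routine through the schedule library mentioned above. Here is a sketch of what that typically looks like; running the crawl in a child process is a deliberate hedge, since Scrapy's Twisted reactor cannot be restarted within one Python process, and the polling interval is an assumption.

    # scheduler.py (sketch) -- fire the crawl once a day via `schedule`.
    import subprocess
    import time

    import schedule

    def crawl_job():
        # Run the one-shot crawler in a child process so every run
        # gets a fresh Twisted reactor.
        subprocess.run(["python3", "runner.py"], check=False)

    def timing_crawl(hour, minute):
        # schedule.every().day.at() expects an "HH:MM" string.
        schedule.every().day.at(f"{hour:02d}:{minute:02d}").do(crawl_job)
        while True:
            schedule.run_pending()
            time.sleep(30)  # poll the schedule twice a minute

    if __name__ == "__main__":
        timing_crawl(3, 0)  # e.g. crawl every day at 03:00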
The first crawl takes around 6 hours, since there are on average 15,000+ houses for sale on domain.com.au. Subsequent crawls take only about 1 hour, since there is no need to re-crawl every detail page; the sketch below shows how such a skip check can work.
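That speed-up comes from skipping detail pages that are already stored. Below is a hedged sketch of what the check could look like inside the spider's listing-page callback; the CSS selector, URL parsing, and method names are all assumptions, and MongoCache is the wrapper sketched earlier.

    # Sketch of incremental crawling: only follow unseen detail pages.
    import scrapy

    from domain.MongoCache import MongoCache  # the wrapper sketched above

    class DomainSpider(scrapy.Spider):
        name = "domain"
        start_urls = ["https://www.domain.com.au/sale/sydney-region-nsw/"]
        cache = MongoCache()  # shared MongoDB handle

        def parse(self, response):
            # Selector and id extraction are assumptions about the page markup.
            for href in response.css("a.listing-link::attr(href)").getall():
                listing_id = href.rstrip("/").split("-")[-1]
                if self.cache.listings.find_one({"_id": listing_id}) is None:
                    # Unseen listing: fetch and parse the full detail page.
                    yield response.follow(href, callback=self.parse_detail)
                # Seen listing: skip the detail request entirely.

        def parse_detail(self, response):
            yield {"url": response.url,
                   "title": response.css("title::text").get()}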
- This project provides real estate market information on Sydney that is helpful for house seekers.
- It is a good data source for data science study.
- A summary analysis based on this work can be viewed here.
- For more detailed analysis results, please contact me by email ([email protected]).