public-amazon-crawler's Introduction

Amazon Crawler

A relatively simple amazon.com crawler written in Python. It has the following features:

  • supports hundreds of simultaneous requests, depending on the machine's limits
  • supports using proxy servers
  • supports scaling to multiple machines, orchestrating the crawl and keeping them in sync
  • can be paused and restarted without losing its place
  • logs progress and warning conditions to a file for later analysis

It was used to pull over 1MM products and their images from Amazon in a few hours. Read more.

Getting it Set Up

After you get a copy of this codebase pulled down locally (either downloaded as a zip or git cloned), you'll need to install the python dependencies:

pip install -r requirements.txt

Then you'll need to go into the settings.py file and update a number of values (a sketch follows this list):

  • Database Name, Host and User - Connection information for storing products in a postgres database
  • Redis Host, Port and Database - Connection information for storing the URL queue in redis
  • Proxy List as well as User, Password and Port - Connection information for your list of proxy servers
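
A minimal sketch of what those values might look like (the postgres names database, host and user, plus max_threads, appear in the project's code; the redis and proxy variable names below are assumptions, and all values are placeholders):

    # settings.py -- placeholder values; redis/proxy variable names are assumptions
    database = "crawler"        # postgres database name
    host = "localhost"          # postgres host
    user = "postgres"           # postgres user

    redis_host = "localhost"    # redis connection info (names assumed)
    redis_port = 6379
    redis_db = 0

    proxy_list = ["11.22.33.44", "55.66.77.88"]   # your proxy server IPs (placeholders)
    proxy_user = "username"
    proxy_password = "password"
    proxy_port = 8080

    max_threads = 100           # number of crawler threads to spin up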

Once you've updated all of your connection information, you'll need to run the following at the command line to setup the postgres table that will store the product records:

python models.py

The fields that are stored for each product are the following (a rough schema sketch follows this list):

  • title
  • product_url (URL for the detail page)
  • listing_url (URL of the subcategory listing page we found this product on)
  • price
  • primary_img (the URL to the full-size primary product image)
  • crawl_time (the timestamp of when the crawl began)
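
A rough sketch of what models.py sets up, assuming a single products table (the connection call matches the one in the project's models.py; the table name and column types are assumptions):

    # rough sketch of what models.py does: create the table that stores product records
    import psycopg2
    import settings

    conn = psycopg2.connect(database=settings.database, host=settings.host, user=settings.user)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id          serial PRIMARY KEY,
            title       text,
            product_url text,
            listing_url text,
            price       text,
            primary_img text,
            crawl_time  timestamp
        )
    """)
    conn.commit()
    conn.close()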

How it Works

You begin the crawler for the first time by running:

python crawler.py start

This runs a function that looks at all of the category URLs stored in the start-urls.txt file, and then explodes those out into hundreds of subcategory URLs it finds on the category pages. Each of these subcategory URLs is placed in the redis queue that holds the frontier listing URLs to be crawled.
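
Conceptually, the seeding step looks something like the sketch below (simplified, not the project's actual code; enqueue_url lives in helpers.py, while make_request is an assumed name for the page-fetching helper):

    # simplified sketch of the seeding step in crawler.py
    from helpers import enqueue_url, make_request   # make_request is an assumed helper name

    def begin_crawl():
        with open("start-urls.txt") as f:
            for line in f:
                category_url = line.strip()
                if not category_url:
                    continue
                page, _ = make_request(category_url)        # fetch the category page
                for link in page.findAll("a", href=True):   # subcategory links (selector simplified)
                    enqueue_url(link["href"])               # push onto the redis frontier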

Then the program spins up the number of threads defined in settings.max_threads and each one of those threads pops a listing URL from the queue, makes a request to it and then stores the (usually) 10-12 products it finds on the listing page. It also looks for the "next page" URL and puts that in the queue.
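
A rough sketch of that worker loop (settings.max_threads and enqueue_url come from the project; every other helper name below is a placeholder, not the project's actual function):

    # rough sketch of the worker threads, not the actual implementation
    import threading

    import settings
    from helpers import enqueue_url, dequeue_url, make_request   # dequeue_url/make_request assumed
    from extractors import extract_products, find_next_page      # assumed module and names
    from models import save_product                               # assumed helper

    def worker():
        while True:
            url = dequeue_url()                     # pop a listing URL off the redis queue
            if not url:
                continue                            # empty queue: the real code logs a warning and retries
            page, _ = make_request(url)
            for product in extract_products(page):  # the ~10-12 products on the listing page
                save_product(product)               # write a row to the postgres table
            next_page = find_next_page(page)        # look for the "next page" link
            if next_page:
                enqueue_url(next_page)

    threads = [threading.Thread(target=worker) for _ in range(settings.max_threads)]
    for t in threads:
        t.start()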

Restarting the crawler

If you're restarting the crawler and don't want it to go back to the beginning, you can simply run it with:

python crawler.py

This will skip the step of populating the URL queue with subcategory links, and assumes that there are already URLs stored in redis from a previous instance of the crawler.

This is convenient for making updates to the crawler or parsing logic that only affect a few pages, without going back to the beginning and redoing all of your previous crawling work.
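
A simplified sketch of how crawler.py decides whether to seed, based on the presence of the start argument:

    # simplified sketch of the entry point in crawler.py
    import sys

    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "start":
            begin_crawl()   # seed the redis queue with subcategory URLs from start-urls.txt
        # without "start", seeding is skipped and the workers resume
        # from whatever URLs are already sitting in redis
        start_workers()     # assumed name for the function that spins up the threads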

Piping Output to a Logfile

If you'd like to redirect the logging output into a logfile for later analysis, run the crawler with:

python crawler.py [start] > /var/log/crawler.log

Known Limitations

Amazon uses many different styles of markup depending on the category and product type. This crawler focused mostly on the "Music, Movies & Games" category as well as the "Sports & Outdoors" category.

The extractors for finding product listings and their details will likely need to be changed to crawl different categories, or as the site's markup changes over time.
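
For illustration, an extractor that would need per-category tweaking might look like this minimal sketch (the selector is an example only, not the crawler's actual markup logic):

    # illustrative only -- Amazon's markup varies by category and changes over time
    def extract_title(page):
        tag = page.find("span", {"id": "productTitle"})   # example selector
        return tag.string.strip() if tag and tag.string else None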

public-amazon-crawler's People

Contributors

dependabot[bot], hartleybrody

public-amazon-crawler's Issues

syntax error

Hey,

I'm getting a syntax error at "print "{}: {}".format(datetime.now(), msg)" in helpers.py.
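
The project is written for Python 2, where print is a statement; under Python 3 that line raises a SyntaxError. A hedged fix for that line in helpers.py would be the function form:

    # Python 3 compatible version of the logging line
    print("{}: {}".format(datetime.now(), msg))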

What should the proxy list look like?

I bought some proxies from Proxybonanza.

But when I start crawler.py with
python crawler.py start

it shows
2018-03-20 19:50:44.586334: Seeding the URL frontier with subcategory URLs

but then nothing else happens, and no further output appears.

BeautifulSoup 3.2.1

Collecting BeautifulSoup==3.2.1
  Downloading BeautifulSoup-3.2.1.tar.gz (31 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"'; __file__='"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-1xhij7ia/BeautifulSoup/pip-egg-info                
         cwd: /tmp/pip-install-1xhij7ia/BeautifulSoup/                                                             
    Complete output (6 lines):                                                                                     
    Traceback (most recent call last):                                                                             
      File "<string>", line 1, in <module>                                                                         
      File "/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py", line 22                                             
        print "Unit tests have failed!"                                                                            
              ^                                                                                                    
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?            
    ----------------------------------------                                                                       
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Is it possible to build it in PHP?

I would like to know whether it is possible to build this awesome tool in PHP, and how I could start. I would like to create a product importer for PrestaShop and Magento.

code license

Hi,
I searched the repository, but I could not find a LICENSE file or any note about the code's license.

What is the license of public-amazon-crawler?

Thanks

WARNING 503 from Amazon and WARNING No URLs found in the queue

Most of the time I now get 503 warnings.

Why is that? Do I need more proxies or something? Is there a recommended amount? Why am I getting so many 503s now? It worked fine for many hours before.

I also frequently get
WARNING: No URLs found in the queue. Retrying...
but I don't know why.

Proxybonanza settings

Did you use shared proxies or exclusive proxies? And how many?
And did you experience any failures?
Thanks!

Amazon returns blank pages.

Amazon returns blank pages. This has started happening recently. I've tried different proxies, a single thread, and different headers. However, the same request built through cURL returns the page as it should. What might be the reason? (see attached pic)

postgreSQL

root@4-8-CPU-Optimized:/crawler# python models.py
Traceback (most recent call last):
  File "models.py", line 5, in <module>
    conn = psycopg2.connect(database=settings.database, host=settings.host, user=settings.user)
  File "/usr/local/lib/python2.7/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied

What version of PostgreSQL do I need to use?
In this case I tried 10 and 9.5.
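
This is not a PostgreSQL version problem: fe_sendauth means the server is asking for a password that the connect call never supplies. A hedged adjustment, assuming you add a matching password value to settings.py:

    # models.py -- pass a password to psycopg2 (settings.password is an assumed new setting)
    conn = psycopg2.connect(database=settings.database, host=settings.host,
                            user=settings.user, password=settings.password)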

RuntimeError: maximum recursion depth exceeded

I tried to test this script and found the following issues:
a) I first got an encoding error, which I resolved by changing line 48 in helpers.py to
page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
return BeautifulSoup(page_text), page_text
b) The script did not find any subcategories, which probably causes the recursion depth error.
python error.txt
log2.txt

Requests fail, crawler not working.

$ python crawler.py start
2017-10-16 13:01:04.988490: Seeding the URL frontier with subcategory URLs
2017-10-16 13:02:05.917383: WARNING: Request for https://www.amazon.in/s/ref=s9_acss_bw_cg_DressLBD_1a1_w?rh=i%3Aapparel%2Cn%3A1571271031%2Cn%3A%211571272031%2Cn%3A1953602031%2Cn%3A11400137031%2Cn%3A1968445031%2Cp_36%3A-79900%2Cp_98%3A10440597031 failed, trying again.

Does it not support py3? SyntaxError: Missing parentheses in call to 'print'

-- Win7 64bit / Python 3.4 32bit --

D:\study\mysite\public-amazon-crawler
λ virtualenv amazon
Using base prefix 'c:\python34'
New python executable in D:\study\mysite\public-amazon-crawler\amazon\Scripts\python.exe
Installing setuptools, pip, wheel...done.

D:\study\mysite\public-amazon-crawler
λ amazon\Scripts\activate

(amazon) D:\study\mysite\public-amazon-crawler
λ pip install -r requirements.txt
Collecting BeautifulSoup==3.2.1 (from -r requirements.txt (line 1))
Downloading BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\setup.py", line 22
print "Unit tests have failed!"
^
SyntaxError: Missing parentheses in call to 'print'

Command "python setup.py egg_info" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\

Download image

Hello, I was wondering if it's also possible to add a line to include product image scraping. Your response is much appreciated.

Thanks
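
A hedged sketch of how the stored primary_img URL could be fetched with requests (the filename handling is illustrative only):

    import requests

    def download_image(primary_img_url):
        # primary_img_url is the full-size image URL the crawler stores per product
        resp = requests.get(primary_img_url, timeout=30)
        resp.raise_for_status()
        filename = primary_img_url.split("/")[-1] or "image.jpg"
        with open(filename, "wb") as f:
            f.write(resp.content)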

Amazon Warehouse Deals

Is it possible to crawl only Amazon Warehouse Deals articles? I don't mean warehouse products in general, but only the Warehouse Deals articles that get 30% off at checkout. This would be amazing!

Can't get anything running

I'm constantly bombarded with connection issues; have you tested this recently?

redis.exceptions.ConnectionError: Error 61 connecting to 127.0.0.1:6379. Connection refused.

Amazon link structure changed (?)

Looks like Amazon has changed the structure of its links a bit, as the crawler fails to run:

Traceback (most recent call last):
  File "crawler.py", line 98, in <module>
    begin_crawl()  # put a bunch of subcategory URLs into the queue
  File "crawler.py", line 42, in begin_crawl
    enqueue_url(link)
  File "/tmp/public-amazon-crawler/helpers.py", line 104, in enqueue_url
    url = format_url(u)
  File "/tmp/public-amazon-crawler/helpers.py", line 67, in format_url
    k, v = piece.split("=")
ValueError: too many values to unpack
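
The ValueError comes from query-string values that themselves contain an '=' character, so the unpack gets more than two pieces. A hedged fix in helpers.format_url is to split only on the first '=':

    # helpers.py, format_url(): split on the first '=' only
    k, v = piece.split("=", 1)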
