public-amazon-crawler's Introduction

Amazon Crawler

A relatively simple amazon.com crawler written in Python. It has the following features:

  • supports hundreds of simultaneous requests, depending on the machine's limits
  • supports using proxy servers
  • supports scaling to multiple machines, orchestrating the crawl and keeping them in sync
  • can be paused and restarted without losing its place
  • logs progress and warning conditions to a file for later analysis

It was used to pull over 1MM products and their images from Amazon in a few hours. Read more.

Getting it Set Up

After you get a copy of this codebase pulled down locally (either downloaded as a zip or git cloned), you'll need to install the python dependencies:

pip install -r requirements.txt

Then you'll need to go into the settings.py file and update a number of values (a sketch follows this list):

  • Database Name, Host and User - Connection information for storing products in a postgres database
  • Redis Host, Port and Database - Connection information for storing the URL queue in redis
  • Proxy List as well as User, Password and Port - Connection information for your list of proxy servers
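
A minimal sketch of what those values might look like (the postgres names database, host and user, plus max_threads, appear in the project's code; the redis and proxy variable names below are assumptions, and all values are placeholders):

    # settings.py -- placeholder values; redis/proxy variable names are assumptions
    database = "crawler"        # postgres database name
    host = "localhost"          # postgres host
    user = "postgres"           # postgres user

    redis_host = "localhost"    # redis connection info (names assumed)
    redis_port = 6379
    redis_db = 0

    proxy_list = ["11.22.33.44", "55.66.77.88"]   # your proxy server IPs (placeholders)
    proxy_user = "username"
    proxy_password = "password"
    proxy_port = 8080

    max_threads = 100           # number of crawler threads to spin up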

Once you've updated all of your connection information, you'll need to run the following at the command line to setup the postgres table that will store the product records:

python models.py

The fields that are stored for each product are the following (a rough schema sketch follows this list):

  • title
  • product_url (URL for the detail page)
  • listing_url (URL of the subcategory listing page we found this product on)
  • price
  • primary_img (the URL to the full-size primary product image)
  • crawl_time (the timestamp of when the crawl began)
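
A rough sketch of what models.py sets up, assuming a single products table (the connection call matches the one in the project's models.py; the table name and column types are assumptions):

    # rough sketch of what models.py does: create the table that stores product records
    import psycopg2
    import settings

    conn = psycopg2.connect(database=settings.database, host=settings.host, user=settings.user)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id          serial PRIMARY KEY,
            title       text,
            product_url text,
            listing_url text,
            price       text,
            primary_img text,
            crawl_time  timestamp
        )
    """)
    conn.commit()
    conn.close()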

How it Works

You begin the crawler for the first time by running:

python crawler.py start

This runs a function that looks at all of the category URLs stored in the start-urls.txt file, and then explodes those out into hundreds of subcategory URLs it finds on the category pages. Each of these subcategory URLs is placed in the redis queue that holds the frontier listing URLs to be crawled.
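
Conceptually, the seeding step looks something like the sketch below (simplified, not the project's actual code; enqueue_url lives in helpers.py, while make_request is an assumed name for the page-fetching helper):

    # simplified sketch of the seeding step in crawler.py
    from helpers import enqueue_url, make_request   # make_request is an assumed helper name

    def begin_crawl():
        with open("start-urls.txt") as f:
            for line in f:
                category_url = line.strip()
                if not category_url:
                    continue
                page, _ = make_request(category_url)        # fetch the category page
                for link in page.findAll("a", href=True):   # subcategory links (selector simplified)
                    enqueue_url(link["href"])               # push onto the redis frontier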

Then the program spins up the number of threads defined in settings.max_threads and each one of those threads pops a listing URL from the queue, makes a request to it and then stores the (usually) 10-12 products it finds on the listing page. It also looks for the "next page" URL and puts that in the queue.
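
A rough sketch of that worker loop (settings.max_threads and enqueue_url come from the project; every other helper name below is a placeholder, not the project's actual function):

    # rough sketch of the worker threads, not the actual implementation
    import threading

    import settings
    from helpers import enqueue_url, dequeue_url, make_request   # dequeue_url/make_request assumed
    from extractors import extract_products, find_next_page      # assumed module and names
    from models import save_product                               # assumed helper

    def worker():
        while True:
            url = dequeue_url()                     # pop a listing URL off the redis queue
            if not url:
                continue                            # empty queue: the real code logs a warning and retries
            page, _ = make_request(url)
            for product in extract_products(page):  # the ~10-12 products on the listing page
                save_product(product)               # write a row to the postgres table
            next_page = find_next_page(page)        # look for the "next page" link
            if next_page:
                enqueue_url(next_page)

    threads = [threading.Thread(target=worker) for _ in range(settings.max_threads)]
    for t in threads:
        t.start()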

Restarting the crawler

If you're restarting the crawler and don't want it to go back to the beginning, you can simply run it with:

python crawler.py

This will skip the step of populating the URL queue with subcategory links, and assumes that there are already URLs stored in redis from a previous instance of the crawler.

This is convenient for making updates to the crawler or parsing logic that only affect a few pages, without going back to the beginning and redoing all of your previous crawling work.
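
A simplified sketch of how crawler.py decides whether to seed, based on the presence of the start argument:

    # simplified sketch of the entry point in crawler.py
    import sys

    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "start":
            begin_crawl()   # seed the redis queue with subcategory URLs from start-urls.txt
        # without "start", seeding is skipped and the workers resume
        # from whatever URLs are already sitting in redis
        start_workers()     # assumed name for the function that spins up the threads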

Piping Output to a Logfile

If you'd like to redirect the logging output into a logfile for later analysis, run the crawler with:

python crawler.py [start] > /var/log/crawler.log

Known Limitations

Amazon uses many different styles of markup depending on the category and product type. This crawler focused mostly on the "Music, Movies & Games" category as well as the "Sports & Outdoors" category.

The extractors for finding product listings and their details will likely need to be changed to crawl different categories, or as the site's markup changes over time.
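
For illustration, an extractor that would need per-category tweaking might look like this minimal sketch (the selector is an example only, not the crawler's actual markup logic):

    # illustrative only -- Amazon's markup varies by category and changes over time
    def extract_title(page):
        tag = page.find("span", {"id": "productTitle"})   # example selector
        return tag.string.strip() if tag and tag.string else None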

public-amazon-crawler's People

Contributors

dependabot[bot], hartleybrody

public-amazon-crawler's Issues

syntax error

Hey,

I'm getting a syntax error at "print "{}: {}".format(datetime.now(), msg)" in helpers.py.
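
The project is written for Python 2, where print is a statement; under Python 3 that line raises a SyntaxError. A hedged fix for that line in helpers.py would be the function form:

    # Python 3 compatible version of the logging line
    print("{}: {}".format(datetime.now(), msg))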

What should the proxy list look like?

I bought some proxies from Proxybonanza.

But when I start crawler.py with
python crawler.py start

it shows
2018-03-20 19:50:44.586334: Seeding the URL frontier with subcategory URLs

but then nothing else happens, and no further output appears.

BeautifulSoup 3.2.1

Collecting BeautifulSoup==3.2.1
  Downloading BeautifulSoup-3.2.1.tar.gz (31 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"'; __file__='"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-1xhij7ia/BeautifulSoup/pip-egg-info                
         cwd: /tmp/pip-install-1xhij7ia/BeautifulSoup/                                                             
    Complete output (6 lines):                                                                                     
    Traceback (most recent call last):                                                                             
      File "<string>", line 1, in <module>                                                                         
      File "/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py", line 22                                             
        print "Unit tests have failed!"                                                                            
              ^                                                                                                    
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?            
    ----------------------------------------                                                                       
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Is it possible to build it in PHP?

I would like to know whether it is possible to build this awesome tool in PHP, and how I could start. I would like to create a product importer for PrestaShop and Magento.

code license

Hi,
I searched the repository, but I could not find a LICENSE file or any note about the code's license.

What is the license of public-amazon-crawler?

Thanks

WARNING 503 from Amazon and WARNING No URLs found in the queue

Most of the time I now get 503 warnings.

Why is that? Do I need more proxies or something? Is there a recommended amount? Why am I getting so many 503s now? It worked fine for many hours before.

I also frequently get
WARNING: No URLs found in the queue. Retrying...
but I don't know why.

Proxybonanza settings

Did you use shared proxies or exclusive proxies? And how many?
And did you experience any failures?
Thanks!

Amazon returns blank pages.

Amazon returns blank pages. This has started happening recently. I've tried different proxies, a single thread, and different headers. However, the same request built through cURL returns the page as it should. What might be the reason? (see attached pic)

postgreSQL

root@4-8-CPU-Optimized:/crawler# python models.py
Traceback (most recent call last):
  File "models.py", line 5, in <module>
    conn = psycopg2.connect(database=settings.database, host=settings.host, user=settings.user)
  File "/usr/local/lib/python2.7/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied

What version of PostgreSQL do I need to use?
In this case I tried 10 and 9.5.
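
This is not a PostgreSQL version problem: fe_sendauth means the server is asking for a password that the connect call never supplies. A hedged adjustment, assuming you add a matching password value to settings.py:

    # models.py -- pass a password to psycopg2 (settings.password is an assumed new setting)
    conn = psycopg2.connect(database=settings.database, host=settings.host,
                            user=settings.user, password=settings.password)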

RuntimeError: maximum recursion depth exceeded

I tried to test this script and found the following issues:
a) I first got an encoding error, which I resolved by changing line 48 in helpers.py to
page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
return BeautifulSoup(page_text), page_text
b) The script did not find any subcategories, which probably causes the recursion depth error.
python error.txt
log2.txt

Requests fail, crawler not working.

$ python crawler.py start
2017-10-16 13:01:04.988490: Seeding the URL frontier with subcategory URLs
2017-10-16 13:02:05.917383: WARNING: Request for https://www.amazon.in/s/ref=s9_acss_bw_cg_DressLBD_1a1_w?rh=i%3Aapparel%2Cn%3A1571271031%2Cn%3A%211571272031%2Cn%3A1953602031%2Cn%3A11400137031%2Cn%3A1968445031%2Cp_36%3A-79900%2Cp_98%3A10440597031 failed, trying again.

Does it not support py3? SyntaxError: Missing parentheses in call to 'print'

-- Win7 64bit / Python 3.4 32bit --

D:\study\mysite\public-amazon-crawler
λ virtualenv amazon
Using base prefix 'c:\python34'
New python executable in D:\study\mysite\public-amazon-crawler\amazon\Scripts\python.exe
Installing setuptools, pip, wheel...done.

D:\study\mysite\public-amazon-crawler
λ amazon\Scripts\activate

(amazon) D:\study\mysite\public-amazon-crawler
λ pip install -r requirements.txt
Collecting BeautifulSoup==3.2.1 (from -r requirements.txt (line 1))
Downloading BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\setup.py", line 22
print "Unit tests have failed!"
^
SyntaxError: Missing parentheses in call to 'print'

Command "python setup.py egg_info" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\

Download image

Hello, I was wondering if it's also possible to add a line to include product image scraping. Your response is much appreciated.

Thanks
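
A hedged sketch of how the stored primary_img URL could be fetched with requests (the filename handling is illustrative only):

    import requests

    def download_image(primary_img_url):
        # primary_img_url is the full-size image URL the crawler stores per product
        resp = requests.get(primary_img_url, timeout=30)
        resp.raise_for_status()
        filename = primary_img_url.split("/")[-1] or "image.jpg"
        with open(filename, "wb") as f:
            f.write(resp.content)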

Amazon Warehouse Deals

Is it possible to crawl only Amazon Warehouse Deals articles? I don't mean warehouse products in general, but only the Warehouse Deals articles that get 30% off at checkout. This would be amazing!

Can't get anything running

I'm constantly bombarded with connection issues; have you tested this recently?

redis.exceptions.ConnectionError: Error 61 connecting to 127.0.0.1:6379. Connection refused.

Amazon link structure changed (?)

Looks like Amazon has changed the structure of its links a bit, as the crawler fails to run:

Traceback (most recent call last):
  File "crawler.py", line 98, in <module>
    begin_crawl()  # put a bunch of subcategory URLs into the queue
  File "crawler.py", line 42, in begin_crawl
    enqueue_url(link)
  File "/tmp/public-amazon-crawler/helpers.py", line 104, in enqueue_url
    url = format_url(u)
  File "/tmp/public-amazon-crawler/helpers.py", line 67, in format_url
    k, v = piece.split("=")
ValueError: too many values to unpack
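
The ValueError comes from query-string values that themselves contain an '=' character, so the unpack gets more than two pieces. A hedged fix in helpers.format_url is to split only on the first '=':

    # helpers.py, format_url(): split on the first '=' only
    k, v = piece.split("=", 1)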
