
public-amazon-crawler's Issues

Amazon Warehouse Deals

Is it possible to crawl only Amazon Warehouse Deals listings? I don't mean warehouse products in general, but only the Warehouse Deals listings that get an extra 30% off at checkout. This would be amazing!
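
A possible starting point, assuming nothing about the project's actual item schema: Warehouse Deals listings are the ones sold by "Amazon Warehouse", so one could post-filter crawled items on a seller field. The item structure below is purely hypothetical, and the extra checkout discount is not exposed in listings, so this only narrows results to Warehouse stock.

    # Hypothetical post-filter; the crawler's real item structure may differ.
    # Warehouse Deals listings are the ones sold by "Amazon Warehouse".
    def is_warehouse_deal(item):
        return "Amazon Warehouse" in (item.get("seller") or "")

    items = [{"title": "Example", "seller": "Amazon Warehouse"},
             {"title": "Other", "seller": "SomeSeller"}]
    print([i["title"] for i in items if is_warehouse_deal(i)])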

BeautifulSoup 3.2.1

Collecting BeautifulSoup==3.2.1
  Downloading BeautifulSoup-3.2.1.tar.gz (31 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"'; __file__='"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-1xhij7ia/BeautifulSoup/pip-egg-info                
         cwd: /tmp/pip-install-1xhij7ia/BeautifulSoup/                                                             
    Complete output (6 lines):                                                                                     
    Traceback (most recent call last):                                                                             
      File "<string>", line 1, in <module>                                                                         
      File "/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py", line 22                                             
        print "Unit tests have failed!"                                                                            
              ^                                                                                                    
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?            
    ----------------------------------------                                                                       
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
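
BeautifulSoup 3.2.1 is Python 2-only, which is why its setup.py dies with that SyntaxError under Python 3. If the rest of the project is being run under Python 3 anyway (the pinned requirements target Python 2), one workaround, offered here as an assumption rather than the maintainer's fix, is to depend on beautifulsoup4 instead and adjust the import:

    # requirements.txt: replace "BeautifulSoup==3.2.1" with "beautifulsoup4".
    # Then, wherever the crawler imports the old package:
    from bs4 import BeautifulSoup  # instead of: from BeautifulSoup import BeautifulSoup

    # bs4 takes an explicit parser; html.parser ships with the standard library.
    soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")
    print(soup.p.text)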

code license

Hi,
I searched the repository, but I couldn't find a LICENSE file or any note about the license of the code.

What is the license of public-amazon-crawler?

Thanks

Can't get anything running

I'm constantly hitting connection issues. Have you tested this recently?

redis.exceptions.ConnectionError: Error 61 connecting to 127.0.0.1:6379. Connection refused.
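
Error 61 is "connection refused" on macOS: nothing is listening on 127.0.0.1:6379, so the Redis server itself isn't running (or is on a different host/port than the crawler expects). After starting redis-server, a quick connectivity check with redis-py, assuming the default settings, looks like this:

    import redis

    # Assumes a local Redis on the default port; adjust host/port if the
    # project's settings point elsewhere.
    r = redis.StrictRedis(host="127.0.0.1", port=6379)
    print(r.ping())  # True once redis-server is up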

Does it not support Python 3? SyntaxError: Missing parentheses in call to 'print'

-- Win7 64bit / Python 3.4 32bit --

D:\study\mysite\public-amazon-crawler
λ virtualenv amazon
Using base prefix 'c:\python34'
New python executable in D:\study\mysite\public-amazon-crawler\amazon\Scripts\python.exe
Installing setuptools, pip, wheel...done.

D:\study\mysite\public-amazon-crawler
λ amazon\Scripts\activate

(amazon) D:\study\mysite\public-amazon-crawler
λ pip install -r requirements.txt
Collecting BeautifulSoup==3.2.1 (from -r requirements.txt (line 1))
Downloading BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\setup.py", line 22
    print "Unit tests have failed!"
          ^
SyntaxError: Missing parentheses in call to 'print'

Command "python setup.py egg_info" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\

syntax error

Hey,

I'm getting a SyntaxError at print "{}: {}".format(datetime.now(), msg) in helpers.py.
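
That is Python 2 print-statement syntax, so it raises a SyntaxError under Python 3. If the goal is to run helpers.py under Python 3, the call needs the function form; a minimal sketch, with the surrounding function name chosen hypothetically:

    from datetime import datetime

    def log(msg):
        # Python 3-compatible form of the original print statement
        print("{}: {}".format(datetime.now(), msg))

    log("crawler started")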

Proxybonanza settings

Did you use shared proxies or exclusive proxies? And how many?
And did you experience any failures?
Thanks!

Requests fail, crawler not working.

$ python crawler.py start
2017-10-16 13:01:04.988490: Seeding the URL frontier with subcategory URLs
2017-10-16 13:02:05.917383: WARNING: Request for https://www.amazon.in/s/ref=s9_acss_bw_cg_DressLBD_1a1_w?rh=i%3Aapparel%2Cn%3A1571271031%2Cn%3A%211571272031%2Cn%3A1953602031%2Cn%3A11400137031%2Cn%3A1968445031%2Cp_36%3A-79900%2Cp_98%3A10440597031 failed, trying again.

Amazon link structure changed (?)

Looks like Amazon has changed the structure of its links a bit, as the crawler fails to run:

Traceback (most recent call last):
  File "crawler.py", line 98, in <module>
    begin_crawl()  # put a bunch of subcategory URLs into the queue
  File "crawler.py", line 42, in begin_crawl
    enqueue_url(link)
  File "/tmp/public-amazon-crawler/helpers.py", line 104, in enqueue_url
    url = format_url(u)
  File "/tmp/public-amazon-crawler/helpers.py", line 67, in format_url
    k, v = piece.split("=")
ValueError: too many values to unpack
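
The traceback means a query-string piece contained more than one "=", so k, v = piece.split("=") produced three or more values. Limiting the split to the first "=" tolerates that; a small illustrative helper, not the project's actual code:

    def split_param(piece):
        # Split only on the first "=", so values that themselves contain "="
        # no longer raise "too many values to unpack".
        k, v = piece.split("=", 1)
        return k, v

    print(split_param("rh=i:apparel,p_36=-79900"))  # ('rh', 'i:apparel,p_36=-79900')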

Amazon returns blank pages.

Amazon returns blank pages. This has been happening lately. I've tried different proxies, a single thread, and different headers. However, the same request built with curl returns the page as it should. What could be the reason?
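
If the identical request through curl returns the full page, the usual suspect is the header set that actually goes out with the scripted request. One hedged way to compare is to replay the fetch with a browser-like header set and look at the status code and body size; the values below are illustrative, not the project's settings:

    import requests

    # Illustrative header set and URL; substitute the request that comes back blank.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    r = requests.get("https://www.amazon.com/", headers=headers)
    print(r.status_code, len(r.text))  # a blocked or blank page is usually tiny, or a 503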

WARNING 503 from Amazon and WARNING No URLs found in the queue

Most of the time I now get 503 warnings...

Why is that? Do I need more proxies or something? Is there a recommended number? Why am I getting so many 503s now? It worked fine for many hours before.

And much of the time I get
WARNING: No URLs found in the queue. Retrying...
but I don't know why.
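
A 503 from Amazon is almost always rate limiting rather than a broken proxy list, so backing off between retries (and rotating proxies) usually helps more than sending additional requests. A hedged back-off sketch, independent of the project's actual retry logic:

    import random
    import time

    def fetch_with_backoff(fetch, url, max_tries=5):
        # fetch(url) is whatever function performs the HTTP request and
        # returns a response object; this wrapper only adds the back-off.
        for attempt in range(max_tries):
            response = fetch(url)
            if response is not None and response.status_code != 503:
                return response
            # wait longer after each 503, with jitter, before retrying
            time.sleep((2 ** attempt) + random.random())
        return None

The "No URLs found in the queue" warning likely just means the workers drained the frontier faster than new pages were enqueued (or every request failed), so it tends to disappear once the 503s do.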

PostgreSQL

root@4-8-CPU-Optimized:/crawler# python models.py
Traceback (most recent call last):
  File "models.py", line 5, in <module>
    conn = psycopg2.connect(database=settings.database, host=settings.host, user=settings.user)
  File "/usr/local/lib/python2.7/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied

What version of PostgreSQL do I need to use? I tried 10 and 9.5.
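
The fe_sendauth error is about authentication, not the PostgreSQL version; 9.5 and 10 should both work. The server is asking for a password that the connect call never supplies, so either pass one (and add it to settings) or switch local access to trust/peer auth in pg_hba.conf. A hedged sketch with illustrative values:

    import psycopg2

    # Illustrative values only; mirror whatever settings.py defines.
    conn = psycopg2.connect(
        database="amazon_crawler",
        host="127.0.0.1",
        user="postgres",
        password="your-password-here",
    )
    print(conn.status)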

Download image

Hello, I was wondering whether it's also possible to add a line to include product image scraping. Your response is much appreciated.

Thanks
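
One hedged way to add this: pull the main image URL out of the product page and save the bytes. The "landingImage" id is what Amazon product pages commonly use for the main image, but treat the selector as an assumption rather than something the crawler already relies on.

    import requests
    from bs4 import BeautifulSoup

    def download_image(page_html, dest_path):
        # "landingImage" is the id Amazon commonly uses for the main product
        # image; adjust if the page layout differs.
        soup = BeautifulSoup(page_html, "html.parser")
        img = soup.find("img", id="landingImage")
        if img is None or not img.get("src"):
            return False
        with open(dest_path, "wb") as f:
            f.write(requests.get(img["src"]).content)
        return True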

What should the proxy list look like?

I bought some proxies from Proxybonanza.

But when I start crawler.py with
python crawler.py start

it only shows
2018-03-20 19:50:44.586334: Seeding the URL frontier with subcategory URLs

and then nothing else happens; no further output appears.
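
For reference, requests expects proxies as a dict, and a Proxybonanza-style entry is host:port (IP-authorised) or user:pass@host:port (credential-authorised). The list format below is illustrative, not necessarily what the project's settings.py expects:

    import random
    import requests

    # Illustrative proxy list; 192.0.2.x is a documentation-only address range.
    proxies = [
        "http://192.0.2.10:60099",
        "http://user:pass@192.0.2.11:60099",
    ]

    def get(url):
        proxy = random.choice(proxies)
        return requests.get(url, proxies={"http": proxy, "https": proxy})

If seeding hangs right after the "Seeding the URL frontier" line, it is also worth checking that Redis is reachable and that the first category request isn't silently failing through the proxy.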

RuntimeError: maximum recursion depth exceeded

I tried to test this script.
I found the following issues:
a) I first got an encoding error, which I resolved by changing line 48 in helpers.py to
page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
return BeautifulSoup(page_text), page_text
b) The script did not find any subcategory, and that probably results in the recursion depth error.

Is it possible to build it in PHP?

I would like to know whether it's possible to build this awesome tool in PHP, and how I could start. I would like to create a product importer for PrestaShop and Magento.
