public-amazon-crawler's Issues
Amazon Warehouse Deals
Is it possible to crawl only Amazon Warehouse Deals items? I don't mean warehouse products in general, but only the Warehouse Deals items that get an extra 30% off at checkout. This would be amazing!
image_dir unused
Which settings are recommended for proxies?
Hi, I tried some proxies, because making this many requests gets expensive over time.
I tried a site like http://pubproxy.com/#premium, but which settings does this crawler need?
I already have Proxybonanza working fine with SOCKS5, but no matter what I try with pubproxy it doesn't work...
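For reference, this is roughly the shape that `requests` expects for a proxy configuration. All of the host and credential values below are placeholders, not real Proxybonanza or pubproxy data; the helper name is my own:

```python
# Sketch of a requests-style proxy configuration (placeholder values only).
def build_proxies(host, port, user=None, password=None, scheme="http"):
    """Build the proxies dict that requests.get(..., proxies=...) expects."""
    auth = "{}:{}@".format(user, password) if user and password else ""
    url = "{}://{}{}:{}".format(scheme, auth, host, port)
    # requests routes both plain and TLS traffic through the same proxy URL
    return {"http": url, "https": url}

proxies = build_proxies("proxy.example.com", 8080, "user", "secret")
# import requests
# requests.get("https://www.amazon.com/", proxies=proxies, timeout=10)
```

For SOCKS5 proxies, pass `scheme="socks5"` and install the extra dependency with `pip install requests[socks]`; plain `requests` only speaks HTTP proxies out of the box.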
BeautifulSoup 3.2.1
Collecting BeautifulSoup==3.2.1
Downloading BeautifulSoup-3.2.1.tar.gz (31 kB)
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"'; __file__='"'"'/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-1xhij7ia/BeautifulSoup/pip-egg-info
cwd: /tmp/pip-install-1xhij7ia/BeautifulSoup/
Complete output (6 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-1xhij7ia/BeautifulSoup/setup.py", line 22
print "Unit tests have failed!"
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
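The root cause is that BeautifulSoup 3.x is Python 2 only, so its `setup.py` cannot even be parsed by Python 3. On Python 3 the usual fix is to install the successor package (`pip install beautifulsoup4`) and adjust the import, roughly like this:

```python
# BeautifulSoup 3.x runs only on Python 2.  On Python 3, install the
# successor package instead:  pip install beautifulsoup4
from bs4 import BeautifulSoup  # note: the package is bs4, not BeautifulSoup

soup = BeautifulSoup("<a href='/dp/B000EXAMPLE'>item</a>", "html.parser")
link = soup.a["href"]
```

The crawler itself would also need its `BeautifulSoup(...)` call sites updated to the bs4 API (in particular, passing an explicit parser name as above).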
code license
Hi,
I searched the repository, but I couldn't find a LICENSE file or any note about the license of the code.
What is the license of public-amazon-crawler?
Thanks
Can't get anything running
I'm constantly hitting connection errors. Have you tested this recently?
redis.exceptions.ConnectionError: Error 61 connecting to 127.0.0.1:6379. Connection refused.
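Error 61 is `ECONNREFUSED`: nothing is listening on 127.0.0.1:6379, which means no Redis server is running. Starting one (e.g. `redis-server` locally, or via your OS service manager) should clear this. A quick stdlib-only check, as a sketch:

```python
import socket

def redis_reachable(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ECONNREFUSED, timeout, unreachable host, ...
        return False

reachable = redis_reachable()  # stays False until a redis-server is running
```

This only tests TCP reachability, not that the listener actually speaks the Redis protocol, but it distinguishes "server not started" from genuine crawler bugs.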
Does it not support Python 3? SyntaxError: Missing parentheses in call to 'print'
-- Win7 64bit / Python 3.4 32bit --
D:\study\mysite\public-amazon-crawler
λ virtualenv amazon
Using base prefix 'c:\python34'
New python executable in D:\study\mysite\public-amazon-crawler\amazon\Scripts\python.exe
Installing setuptools, pip, wheel...done.
D:\study\mysite\public-amazon-crawler
λ amazon\Scripts\activate
(amazon) D:\study\mysite\public-amazon-crawler
λ pip install -r requirements.txt
Collecting BeautifulSoup==3.2.1 (from -r requirements.txt (line 1))
Downloading BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\setup.py", line 22
print "Unit tests have failed!"
^
SyntaxError: Missing parentheses in call to 'print'
Command "python setup.py egg_info" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-c3q7qqty\BeautifulSoup\
syntax error
Hey,
I'm getting a syntax error at print "{}: {}".format(datetime.now(), msg) in helpers.py.
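That line is Python 2 syntax (`print` as a statement); under Python 3, `print` is a function and the parentheses are mandatory. A minimal sketch of the Python 3 form, wrapped as a helper so the formatted line is also returned:

```python
from datetime import datetime

def log(msg):
    """Python 3 replacement for the Python 2 statement:
       print "{}: {}".format(datetime.now(), msg)"""
    line = "{}: {}".format(datetime.now(), msg)
    print(line)  # print is a function in Python 3
    return line

log("crawler started")
```

Alternatively, run the project under Python 2.7, which is what its syntax targets.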
Proxybonanza settings
Did you use shared proxies or exclusive proxies, and how many?
And did you experience any failures?
Thanks!
Requests fail, crawler not working.
$ python crawler.py start
2017-10-16 13:01:04.988490: Seeding the URL frontier with subcategory URLs
2017-10-16 13:02:05.917383: WARNING: Request for https://www.amazon.in/s/ref=s9_acss_bw_cg_DressLBD_1a1_w?rh=i%3Aapparel%2Cn%3A1571271031%2Cn%3A%211571272031%2Cn%3A1953602031%2Cn%3A11400137031%2Cn%3A1968445031%2Cp_36%3A-79900%2Cp_98%3A10440597031 failed, trying again.
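The log shows the crawler already retries failed requests once. When a site throttles aggressively, a generic retry-with-exponential-backoff wrapper tends to fare better than immediate retries. A hedged sketch (the `fetch` callable and all parameter names here are my own, not part of this crawler's API):

```python
import random
import time

def fetch_with_retry(fetch, url, attempts=5, base_delay=2.0):
    """Call fetch(url) until it returns a non-None result, sleeping with
    exponential backoff plus jitter between attempts; None if all fail."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        # 2s, 4s, 8s, ... plus jitter so retries don't synchronize
        time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    return None
```

The backoff spreads retries out, which both reduces pressure on the target and makes it less obvious that the requests come from one client.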
Amazon link structure changed (?)
Looks like Amazon has changed the structure of its links a bit, as the crawler fails to run:
Traceback (most recent call last):
File "crawler.py", line 98, in <module>
begin_crawl() # put a bunch of subcategory URLs into the queue
File "crawler.py", line 42, in begin_crawl
enqueue_url(link)
File "/tmp/public-amazon-crawler/helpers.py", line 104, in enqueue_url
url = format_url(u)
File "/tmp/public-amazon-crawler/helpers.py", line 67, in format_url
k, v = piece.split("=")
ValueError: too many values to unpack
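The traceback points at `format_url` in helpers.py splitting each query-string piece on `"="`. Amazon URLs like the ones above can carry values that themselves contain `=`, which makes `k, v = piece.split("=")` yield more than two parts. A tolerant version of that loop, as a sketch (the function name is mine; the stdlib's `urllib.parse.parse_qsl` is another option):

```python
def parse_query_params(query):
    """Tolerant version of the k, v = piece.split("=") loop from the
    traceback: values may contain '=' and pieces may lack one entirely."""
    params = {}
    for piece in query.split("&"):
        if "=" not in piece:
            continue  # bare flag with no value
        k, v = piece.split("=", 1)  # maxsplit=1 keeps extra '=' in the value
        params[k] = v
    return params
```

The key change is `maxsplit=1`, which guarantees exactly two parts whenever at least one `=` is present.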
Amazon returns blank pages.
WARNING 503 from Amazon and WARNING No URLs found in the queue
Most of the time I now get 503 WARNINGs...
Why is that? Do I need more proxies or something?
Is there a recommended amount? Why am I getting so many 503s now, when a few hours ago it worked fine?
And I also often get
WARNING: No URLs found in the queue. Retrying...
but I don't know why.
PostgreSQL
root@4-8-CPU-Optimized:/crawler# python models.py
Traceback (most recent call last):
File "models.py", line 5, in <module>
conn = psycopg2.connect(database=settings.database, host=settings.host, user=settings.user)
File "/usr/local/lib/python2.7/dist-packages/psycopg2/__init__.py", line 130, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: fe_sendauth: no password supplied
What version of PostgreSQL do I need to use?
I tried both 10 and 9.5.
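The PostgreSQL version is not the problem here: `fe_sendauth: no password supplied` means the server demanded password authentication but `psycopg2.connect` was called without one. Adding a password to the settings and passing it through should fix it on either 9.5 or 10. A sketch, with hypothetical setting values:

```python
# Hypothetical settings values; models.py would pass password=... through
# to psycopg2.connect alongside the existing database/host/user fields.
database = "amazon_crawler"
host = "127.0.0.1"
user = "crawler"
password = "secret"  # the missing piece behind fe_sendauth

# import psycopg2
# conn = psycopg2.connect(database=database, host=host,
#                         user=user, password=password)
```

Alternatively, switching the relevant line in `pg_hba.conf` to `trust` or `peer` authentication removes the password requirement for local connections.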
Download image
Hello, I was wondering if it's possible to add a line to include product image scraping. Your response is much appreciated.
Thanks
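As another issue above notes, the settings already define an `image_dir` that goes unused, so a download helper would slot in naturally. A stdlib-only sketch (both function names here are hypothetical, not part of the crawler):

```python
import os
from urllib.request import Request, urlopen

def image_filename(url):
    """Derive a local filename from an image URL (hypothetical helper)."""
    name = url.rsplit("/", 1)[-1].split("?")[0]
    return name or "image.jpg"

def download_image(url, image_dir="images"):
    """Fetch an image and save it under image_dir; return the local path."""
    os.makedirs(image_dir, exist_ok=True)
    path = os.path.join(image_dir, image_filename(url))
    # a browser-like User-Agent avoids some trivial bot blocking
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

The product image URL itself would still have to be extracted from the listing page markup, which is the fragile part whenever Amazon changes its HTML.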
What should the proxy list look like?
I bought some proxies from Proxybonanza.
But when I start the crawler with
python crawler.py start
it only shows
2018-03-20 19:50:44.586334: Seeding the URL frontier with subcategory URLs
and then nothing else happens... no further output at all...
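I haven't verified the exact shape this crawler's settings expect, but a common convention is a plain list of proxy URLs that get handed to `requests` one at a time. A hypothetical fragment, using reserved documentation IPs as placeholders:

```python
# Hypothetical settings.py fragment.  Format assumes the crawler picks one
# entry and hands it to requests as {"http": p, "https": p}; check your own
# helpers.py to confirm the exact shape it expects.
proxies = [
    "http://username:password@192.0.2.10:8080",
    "http://username:password@192.0.2.11:8080",
]
```

If every proxy in the list is dead or blocked, the symptom can be exactly what's described above: the frontier is seeded, then every request silently times out with no further output.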
RuntimeError: maximum recursion depth exceeded
I tried to test this script and found the following issues:
a) I first got an encoding error, which I resolved by changing line 48 in helpers.py to:
page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
return BeautifulSoup(page_text), page_text
b) The script did not find any subcategories, which probably causes the recursion-depth error.
Attachments: python error.txt, log2.txt
Is it possible to build it in PHP?
I would like to know if it's possible to build this awesome tool in PHP, and how I could start. I would like to create a product importer for PrestaShop and Magento.