
ecoron / serpscrap


SEO Python scraper to extract data from major search engine result pages. Extract data like URL, title, snippet, rich snippet and the result type from search results for given keywords. Detect ads or make automated screenshots. You can also fetch the text content of URLs found in the search results, or of URLs you provide yourself. It's useful for SEO and business-related research tasks.

Home Page: https://github.com/ecoron/SerpScrap

License: MIT License

Languages: Python 98.77%, Shell 0.98%, Dockerfile 0.25%
Topics: research, scraper, scraping, screenshot, search, seo
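
A minimal usage sketch, based on the bundled examples (the keyword and config values are placeholders):

import pprint
import serpscrap

keywords = ['example keyword']

config = serpscrap.Config()
config.set('scrape_urls', False)        # only parse the SERPs, don't fetch the result pages
config.set('num_pages_for_keyword', 1)  # one result page per keyword

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()

for result in results:
    pprint.pprint(result)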

serpscrap's People

Contributors

bernardosrulzon, tinduong94


serpscrap's Issues

Standard example example_simple.py returns only YouTube SERP insertions

Output of the standard example file example_simple.py. Only YouTube insertions have been scraped. I also noticed that the CSS selectors on Google's pages have changed.

{'query': 'bienen',
'query_num_results_page': 4,
'query_num_results_total': '0',
'query_page_number': 1,
'screenshot': '/tmp/screenshots/2020-10-15/google_bienen-p1.png',
'serp_domain': 'www1.wdr.de',
'serp_rank': 1,
'serp_rating': None,
'serp_sitelinks': None,
'serp_snippet': '14:24#bienenlive — Basiswissen BienenWDR - 2019-12-05',
'serp_title': '14:24#bienenlive — Basiswissen BienenWDR - 2019-12-05',
'serp_type': 'videos',
'serp_url': 'https://www1.wdr.de/mediathek/video/sendungen/planet-schule/video-bienenlive--basiswissen-bienen-100.html',
'serp_visible_link': None}

{'query': 'bienen',
'query_num_results_page': 4,
'query_num_results_total': '0',
'query_page_number': 1,
'screenshot': '/tmp/screenshots/2020-10-15/google_bienen-p1.png',
'serp_domain': 'www.youtube.com',
'serp_rank': 2,
'serp_rating': None,
'serp_sitelinks': None,
'serp_snippet': 'PeržiūraPeržiūra44:39Rettung für unsere Bienen? Ein Forscher '
'macht Hoffnung für ...NDR DokuYouTube - 2019-04-15',
'serp_title': 'PeržiūraPeržiūra44:39Rettung für unsere Bienen? Ein Forscher '
'macht Hoffnung für ...NDR DokuYouTube - 2019-04-15',
'serp_type': 'videos',
'serp_url': 'https://www.youtube.com/watch?v=1f0sjiGtWIQ',
'serp_visible_link': None}

{'query': 'bienen',
'query_num_results_page': 4,
'query_num_results_total': '0',
'query_page_number': 1,
'screenshot': '/tmp/screenshots/2020-10-15/google_bienen-p1.png',
'serp_domain': 'www.youtube.com',
'serp_rank': 3,
'serp_rating': None,
'serp_sitelinks': None,
'serp_snippet': 'PeržiūraPeržiūra9:27Die wunderbare Organisation der Bienen | '
'BRBayerischer RundfunkYouTube - 2018-03-22',
'serp_title': 'PeržiūraPeržiūra9:27Die wunderbare Organisation der Bienen | '
'BRBayerischer RundfunkYouTube - 2018-03-22',
'serp_type': 'videos',
'serp_url': 'https://www.youtube.com/watch?v=DH0uywA5CrU',
'serp_visible_link': None}

{'query': 'bienen',
'query_num_results_page': 4,
'query_num_results_total': '0',
'query_page_number': 1,
'screenshot': '/tmp/screenshots/2020-10-15/google_bienen-p1.png',
'serp_domain': 'www.youtube.com',
'serp_rank': 4,
'serp_rating': None,
'serp_sitelinks': None,
'serp_snippet': 'PeržiūraPeržiūra4:48Bienen: Leben im Bienenstock - Biologie '
'| Duden LearnattackDuden LearnattackYouTube - 2019-04-07',
'serp_title': 'PeržiūraPeržiūra4:48Bienen: Leben im Bienenstock - Biologie | '
'Duden LearnattackDuden LearnattackYouTube - 2019-04-07',
'serp_type': 'videos',
'serp_url': 'https://www.youtube.com/watch?v=eVcNIgnvXhE',
'serp_visible_link': None}

Error at first run on Windows

Windows 10, 64bit

scrape.py is a copy-paste of this: https://github.com/ecoron/SerpScrap/blob/master/examples/example_simple.py


2018-01-06 07:46:11,421 - root - INFO - preparing phantomjs
2018-01-06 07:46:11,421 - root - INFO - detecting phantomjs
2018-01-06 07:46:11,424 - root - INFO - downloading phantomjs
Traceback (most recent call last):
  File ".\scrape.py", line 11, in <module>
    scrap.init(config=config.get(), keywords=keywords)
  File "C:\Users\Csaba Okrona\Devel\Own\serp-scraping\venv\lib\site-packages\serpscrap\serpscrap.py", line 84, in init
    firstrun.download()
  File "C:\Users\Csaba Okrona\Devel\Own\serp-scraping\venv\lib\site-packages\serpscrap\phantom_install.py", line 69, in download
    urllib.request.urlretrieve(base_url + file_name, '/tmp/' + file_name)
  File "c:\users\csaba okrona\appdata\local\programs\python\python36-32\Lib\urllib\request.py", line 258, in urlretrieve
    tfp = open(filename, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/phantomjs-2.1.1-windows.zip'
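
The traceback shows phantom_install.py downloading the driver into a hard-coded '/tmp/' directory, which does not exist on Windows. A sketch of a cross-platform variant (illustrative only, not the project's code; the download URL is a placeholder):

import os
import tempfile
import urllib.request

base_url = 'https://example.org/phantomjs/'   # placeholder, not the real mirror
file_name = 'phantomjs-2.1.1-windows.zip'

# tempfile.gettempdir() resolves to %TEMP% on Windows and /tmp on Linux/macOS
target = os.path.join(tempfile.gettempdir(), file_name)
urllib.request.urlretrieve(base_url + file_name, target)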

No scraping Ads

Hi,

It works well with shopping ads, but there are no search ads in the extracted data.

scrapcore.scraper.selenium - ERROR - Message: session not created: This version of ChromeDriver only supports Chrome version 76

Hey, this looks like an awesome Python project, but I'm not able to run the demo code provided due to this error:

scrapcore.scraper.selenium - ERROR - Message: session not created: This version of ChromeDriver only supports Chrome version 76

I'm on a Mac, and Chrome is up to v83.0.4103.116 now. I see the SerpScrap changelog shows support for ChromeDriver versions >= 76.*. Any ideas on how to fix this?

0.13.0: updated dependencies: chromedriver >= 76.0.3809.68 to use actual driver, sqlalchemy>=1.3.7 to solve security issues and other minor update changes
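
A possible workaround sketch (an assumption, not official guidance): install a ChromeDriver build that matches the locally installed Chrome major version and point SerpScrap at it through the 'executable_path' config key, which may take precedence over the bundled driver:

import serpscrap

config = serpscrap.Config()
# assumption: 'executable_path' is honoured for the driver binary; the path below is only an example
config.set('executable_path', '/usr/local/bin/chromedriver')

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['example'])
results = scrap.run()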

Uncaught exceptions

Using 0.9.1 with lots of proxies.
The log is full of unhandled Exceptions.

1)

Exception in thread [google]SelScrape:
Traceback (most recent call last):
  File "/home/test/anaconda2/envs/python3/lib/python3.4/threading.py", line 911, in _bootstrap_inner
    self.run()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/scrapcore/scraper/selenium.py", line 710, in run
    if not self._get_webdriver():
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/scrapcore/scraper/selenium.py", line 230, in _get_webdriver
    return self._get_PhantomJS()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/scrapcore/scraper/selenium.py", line 344, in _get_PhantomJS
    desired_capabilities=dcap
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 58, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 140, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 229, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 295, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 464, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 526, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 464, in open
    response = self._open(req, data)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 482, in _open
    '_open', req)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 442, in _call_chain
    result = func(*args)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 1211, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 1186, in do_open
    r = h.getresponse()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 1227, in getresponse
    response.begin()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 386, in begin
    version, status, reason = self._read_status()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 348, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/test/anaconda2/envs/python3/lib/python3.4/socket.py", line 378, in readinto
    return self._sock.recv_into(b)
**ConnectionResetError: [Errno 104] Connection reset by peer**

2)

Exception in thread [google]SelScrape:
Traceback (most recent call last):
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 1183, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 1137, in request
    self._send_request(method, url, body, headers)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 1182, in _send_request
    self.endheaders(body)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 1133, in endheaders
    self._send_output(message_body)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 963, in _send_output
    self.send(msg)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 898, in send
    self.connect()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/http/client.py", line 871, in connect
    self.timeout, self.source_address)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/socket.py", line 516, in create_connection
    raise err
  File "/home/test/anaconda2/envs/python3/lib/python3.4/socket.py", line 507, in create_connection
    sock.connect(sa)
**ConnectionRefusedError: [Errno 111] Connection refused**

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/test/anaconda2/envs/python3/lib/python3.4/threading.py", line 911, in _bootstrap_inner
    self.run()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/scrapcore/scraper/selenium.py", line 710, in run
    if not self._get_webdriver():
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/scrapcore/scraper/selenium.py", line 230, in _get_webdriver
    return self._get_PhantomJS()
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/scrapcore/scraper/selenium.py", line 344, in _get_PhantomJS
    desired_capabilities=dcap
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 58, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 140, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 229, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 295, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 464, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 526, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 464, in open
    response = self._open(req, data)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 482, in _open
    '_open', req)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 442, in _call_chain
    result = func(*args)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 1211, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/test/anaconda2/envs/python3/lib/python3.4/urllib/request.py", line 1185, in do_open
    raise URLError(err)
**urllib.error.URLError: <urlopen error [Errno 111] Connection refused>**
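
These connection errors come from proxies that never answer or drop the connection. A pre-filter will not catch everything (resets can still happen mid-session), but a quick TCP check before handing the file to SerpScrap removes proxies that refuse connections outright. Sketch only; the file names are placeholders:

import socket

def proxy_alive(host, port, timeout=5):
    # True if a plain TCP connection to host:port succeeds within the timeout
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

with open('proxies.txt') as src, open('proxies_checked.txt', 'w') as dst:
    for line in src:
        parts = line.split()                  # format: "<proto> host:port [user:pass]"
        if len(parts) < 2:
            continue
        host, port = parts[1].split(':')
        if proxy_alive(host, port):
            dst.write(line)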

Issue with proxies

Hello, I put 3 proxies in a txt file (socks4 xx.xx.xx.xx:80 username:password) and what I get as a result is:
2018-11-23 17:30:57,179 - scrapcore.scraper.selenium - WARNING - 'NoneType' object has no attribute 'group'
2018-11-23 17:31:04,509 - scrapcore.scraper.selenium - WARNING - 'NoneType' object has no attribute 'group'
2018-11-23 17:31:14,844 - scrapcore.scraper.selenium - WARNING - 'NoneType' object has no attribute 'group'

I can't find anything related to this error.
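
The warning comes from a regex over each proxy line returning no match (None has no .group attribute). Assuming the expected line format is "<protocol> <host>:<port> [<user>:<pass>]" as in the example above (this is an assumption, not the parser's actual pattern), a quick pre-check of the file could look like this:

import re

# assumed format, for validation only
LINE_RE = re.compile(
    r'^(?P<proto>socks4|socks5|http)\s+'
    r'(?P<host>[\d.]+):(?P<port>\d+)'
    r'(?:\s+(?P<user>[^:\s]+):(?P<pwd>\S+))?\s*$'
)

with open('proxies.txt') as fh:               # placeholder path
    for lineno, line in enumerate(fh, 1):
        if line.strip() and not LINE_RE.match(line):
            print('line %d does not match the expected pattern: %s' % (lineno, line.rstrip()))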

Only returning 'shopping' serp_types

I've been using this package for months, but this week it started returning only results with the serp_type value of "shopping". Here is the code I'm running:

import serpscrap

config = serpscrap.Config()
config.set('scrape_urls', False)
config.set('do_caching', False)
config.set('num_pages_for_keyword', 1)
config.set('clean_cache_after', 0)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords="best dog food")
results = scrap.run()

Results:

{'query_num_results_total': '0', 'query_num_results_page': 3, 'query_page_number': 1, 'query': 'best dog food', 'serp_rank': 1, 'serp_type': 'shopping', 'serp_url': 'https://wildearth.com/products/clean-protein-dog-food?variant=31491525673003', 'serp_rating': None, 'serp_title': 'Wild Earth - Healthy Protein Dog Food, Clean High-Protein Formula, 18lbs', 'serp_domain': 'wildearth.com', 'serp_visible_link': None, 'serp_snippet': 'Wild Earth - Healthy Protein Dog Food, Clean High-Protein Formula, 18lbsWild Earth - Healthy Protein Dog Food, Clean High-Protein Formula, 18lbs$70.00Wild EarthWild Earth(5)Wild Earth - Healthy Protein Dog Food, Clean High-Protein Formula, 18lbs$70.00Wild Earth(5)', 'serp_sitelinks': None, 'screenshot': '/tmp/screenshots/2020-10-04/google_best dog food-p1.png'}

{'query_num_results_total': '0', 'query_num_results_page': 3, 'query_page_number': 1, 'query': 'best dog food', 'serp_rank': 2, 'serp_type': 'shopping', 'serp_url': 'https://www.petco.com/shop/en/petcostore/product/orijen-original-dry-dog-food-25-lbs-2992171', 'serp_rating': None, 'serp_title': 'ORIJEN Original Dry Dog Food, 25 lbs.', 'serp_domain': 'www.petco.com', 'serp_visible_link': None, 'serp_snippet': 'CURBSIDE PICKUPPick up todayORIJEN Original Dry Dog Food, 25 lbs.ORIJEN Original Dry Dog Food, 25 lbs.$90.99Petco.comPetco.com(1k+)CURBSIDE PICKUPPick up todayORIJEN Original Dry Dog Food, 25 lbs.$90.99Petco.com(1k+)', 'serp_sitelinks': None, 'screenshot': '/tmp/screenshots/2020-10-04/google_best dog food-p1.png'}

{'query_num_results_total': '0', 'query_num_results_page': 3, 'query_page_number': 1, 'query': 'best dog food', 'serp_rank': 3, 'serp_type': 'shopping', 'serp_url': 'https://www.chewy.com/blue-buffalo-life-protection-formula/dp/170938?utm_source=google-product&utm_medium=organic&utm_campaign=f&utm_content=Blue%20Buffalo&utm_term=%7Bkeyword%7D', 'serp_rating': None, 'serp_title': 'Blue Buffalo Life Protection Formula Adult Chicken & Brown Rice Recipe Dry Dog Food, 3-lb bag', 'serp_domain': 'www.chewy.com', 'serp_visible_link': None, 'serp_snippet': 'Blue Buffalo Life Protection Formula Adult Chicken & Brown Rice Recipe Dry Dog Food, 3-lb bagBlue Buffalo Life Protection Formula Adult Chicken & Brown Rice Recipe Dry Dog Food, 3-lb bag$9.98Chewy.comChewy.comSpecial offerCode copied to clipboardAutoship and Save 5%Chewy.comNo code requiredDiscount is automatically applied at checkout.\xa0CONTINUE TO STORECANCELBlue Buffalo Life Protection Formula Adult Chicken & Brown Rice Recipe Dry Dog Food, 3-lb bag$9.98Chewy.comSpecial offerCode copied to clipboardAutoship and Save 5%Chewy.comNo code requiredDiscount is automatically applied at checkout.\xa0CONTINUE TO STORECANCELFor most items:365-day return policy', 'serp_sitelinks': None, 'screenshot': '/tmp/screenshots/2020-10-04/google_best dog food-p1.png'}

mobile SERP

Hi @ecoron, thanks a lot for your script. I'm wondering if I can scrape the SERP as a mobile device. I've looked at user_agent.py and swapped the desktop user agent for a mobile one. I guess that's not a good idea, judging by the error below:

2018-07-24 10:25:30,725 - scrapcore.scraper.selenium - ERROR - Skip it, no such element - SeleniumSearchError
Exception in thread [google]SelScrape:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapcore/scraper/selenium.py", line 600, in wait_until_serp_loaded
    str(self.page_number)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 
Screenshot: available via screen


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapcore/scraper/selenium.py", line 606, in wait_until_serp_loaded
    self.webdriver.find_element_by_css_selector(selector).text
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 589, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 955, in find_element
    'value': value})['value']
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 237, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with css selector '#navcnt td.cur'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"105","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:58099","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"css selector\", \"value\": \"#navcnt td.cur\", \"sessionId\": \"d1d12330-8f2b-11e8-8bc9-69459efa1ced\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/d1d12330-8f2b-11e8-8bc9-69459efa1ced/element"}}
Screenshot: available via screen


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/scrapcore/scraper/selenium.py", line 761, in run
    self.search()
  File "/usr/local/lib/python3.6/dist-packages/scrapcore/scraper/selenium.py", line 701, in search
    self.wait_until_serp_loaded()
  File "/usr/local/lib/python3.6/dist-packages/scrapcore/scraper/selenium.py", line 610, in wait_until_serp_loaded
    raise SeleniumSearchError('Stop Scraping, seems we are blocked')
scrapcore.scraping.SeleniumSearchError: Stop Scraping, seems we are blocked

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/serpscrap/csv_writer.py", line 11, in write
    w = csv.DictWriter(f, my_dict[0].keys(), dialect='excel')
IndexError: list index out of range
None
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/serpscrap/csv_writer.py", line 11, in write
    w = csv.DictWriter(f, my_dict[0].keys(), dialect='excel')
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/serpscraper1.py", line 21, in <module>
    results = scrap.as_csv('/tmp/output')
  File "/usr/local/lib/python3.6/dist-packages/serpscrap/serpscrap.py", line 134, in as_csv
    writer.write(file_path + '.csv', self.results)
  File "/usr/local/lib/python3.6/dist-packages/serpscrap/csv_writer.py", line 17, in write
    raise Exception
Exception

Is there any other way to scrape the mobile SERP as well?
Any suggestions would be really appreciated.
C

Proxy table unique constraint errors

I'm getting lots of the below errors.

Using the latest commit from the 0.9.1 branch (which is already working much better than 0.9.0).
On Linux Mint 18.2.

2017-09-12 07:46:44,173 - root - INFO - preparing phantomjs
2017-09-12 07:46:44,175 - root - INFO - detecting phantomjs
2017-09-12 07:46:44,178 - root - INFO - using phantomjs/phantomjs-2.1.1-linux-x86_64/bin/phantomjs
2017-09-12 07:46:45,185 - root - INFO - 0 cache files found in /tmp/.serpscrap/
2017-09-12 07:46:45,186 - root - INFO - 0/785 objects have been read from the cache.
        785 remain to get scraped.
2017-09-12 07:46:45,201 - root - INFO - 
                Going to scrape 785 keywords with 99
                proxies by using 1 threads.

So I'm using rather a lot of proxies, and I get a lot of errors where the DB updates an existing proxy row and seems to try to change its IP to one that already exists in the table, which violates the UNIQUE constraint on the IP column.

I've checked, and I have NO duplicate IPs and ports in the proxy file.

In the example below it's trying to update row 101 (which currently has IP address 196.196.232.5) to IP 162.253.131.178, but that IP already exists in the table.

sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (sqlite3.IntegrityError) UNIQUE constraint failed: proxy.ip, proxy.port [SQL: 'UPDATE proxy SET ip=?, online=?, status=?, checked_at=?, city=?, region=?, country=?, loc=?, org=?, postal=? WHERE proxy.id = ?'] [parameters: ('162.253.131.178', 1, 'Proxy is working.', '2017-09-12 08:45:03.818061', 'Toronto', 'Ontario', 'CA', '43.6230,-79.3936', 'AS32489 Amanah Tech Inc.', 'm5j 2n1', 101)]

serpscrap.SerpScrap() returns None for some keywords

Hi.
Does anybody have an idea why I get data for some keywords but not for others?

For example, 'dog food':

import serpscrap

keywords = ['dog food']

config = serpscrap.Config()
config.set('scrape_urls', True)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
scrap.as_csv('/tmp/output')
2019-09-22 11:55:14,988 - root - INFO - 
                Going to scrape 2 keywords with 1
                proxies by using 1 threads.
2019-09-22 11:55:14,990 - scrapcore.scraping - INFO - 
        [+] SelScrape[localhost][search-type:normal][https://www.google.com/search?] using search engine "google".
        Num keywords=1, num pages for keyword=[1]
        
2019-09-22 11:55:24,286 - scrapcore.scraper.selenium - INFO - https://www.google.com/search?
2019-09-22 11:55:55,364 - scrapcore.scraping - INFO - 
            [google]SelScrape localhost - Keyword: "dog food" with [1, 2] pages,
            slept 22 seconds before scraping. 1/1 already scraped
            
2019-09-22 11:55:56,767 - scrapcore.scraper.selenium - INFO - Requesting the next page
2/2 keywords processed.
2019-09-22 11:56:01,961 - root - INFO - Scraping URL: https://www.mypetneedsthat.com/best-dry-dog-foods-guide/
2019-09-22 11:56:02,681 - root - INFO - Scraping URL: https://www.businessinsider.com/best-dog-food
2019-09-22 11:56:02,686 - root - INFO - Scraping URL: https://www.akc.org/expert-advice/nutrition/best-dog-food-choosing-whats-right-for-your-dog/
2019-09-22 11:56:02,689 - root - INFO - Scraping URL: https://www.amazon.com/Best-Sellers-Pet-Supplies-Dry-Dog-Food/zgbs/pet-supplies/2975360011
2019-09-22 11:56:02,690 - root - INFO - Scraping URL: https://www.chewy.com/b/food-332
2019-09-22 11:56:26,122 - root - INFO - Scraping URL: https://www.petco.com/shop/en/petcostore/category/dog/dog-food
2019-09-22 11:56:26,123 - root - INFO - Scraping URL: https://www.petflow.com/dog/food
2019-09-22 11:56:26,843 - root - INFO - Scraping URL: https://www.dogfoodadvisor.com/
2019-09-22 11:56:27,735 - root - INFO - Scraping URL: https://www.petsmart.com/dog/food/dry-food/
2019-09-22 11:56:27,737 - root - INFO - Scraping URL: https://www.petsmart.com/dog/food/
2019-09-22 11:56:27,738 - root - INFO - Scraping URL: https://www.purina.com/dogs/dog-food
2019-09-22 11:56:28,635 - root - INFO - Scraping URL: https://www.youtube.com/watch?v=fBABfWqSN2I
2019-09-22 11:56:31,757 - root - INFO - Scraping URL: https://www.youtube.com/watch?v=7P85BMCCboI
2019-09-22 11:56:36,807 - root - INFO - Scraping URL: https://www.youtube.com/watch?v=az0ktsWYydw
2019-09-22 11:56:39,645 - root - INFO - Scraping URL: https://www.youtube.com/watch?v=njJ99wPByy4
2019-09-22 11:56:42,571 - root - INFO - Scraping URL: https://nypost.com/video/homeless-man-and-his-dog-reuniting-is-pure-joy/
2019-09-22 11:56:45,156 - root - INFO - Scraping URL: /aclk?sa=l&ai=DChcSEwjRyYG5h-TkAhUM1WQKHSiFASYYABAAGgJwag&sig=AOD64_2IRYpCakgEzR3BK1oqeuLCVa3mjA&adurl=&rct=j&q=
2019-09-22 11:56:45,157 - root - INFO - Scraping URL: https://www.purina.com/dogs/dog-food
2019-09-22 11:56:45,867 - root - INFO - Scraping URL: https://en.wikipedia.org/wiki/Dog_food
2019-09-22 11:56:45,872 - root - INFO - Scraping URL: https://www.hillspet.com/dog-food
2019-09-22 11:56:45,876 - root - INFO - Scraping URL: https://www.smithsfoodanddrug.com/pl/dog-food/11103
2019-09-22 11:57:10,321 - root - INFO - Scraping URL: https://www.canidae.com/dog-food/
2019-09-22 11:57:10,325 - root - INFO - Scraping URL: https://www.petcarerx.com/dog/food-nutrition
2019-09-22 11:57:11,222 - root - INFO - Scraping URL: https://www.businessinsider.com/best-dog-food
2019-09-22 11:57:11,223 - root - INFO - Scraping URL: https://www.tractorsupply.com/tsc/catalog/dog-food
2019-09-22 11:57:12,249 - root - INFO - Scraping URL: https://www.thehonestkitchen.com/dog-food
2019-09-22 11:57:12,253 - root - INFO - Scraping URL: https://www.boxed.com/products/category/418/dog-food
2019-09-22 11:57:13,171 - root - INFO - Scraping URL: https://lifesabundance.com/category/dogfood.aspx
2019-09-22 11:57:13,174 - root - INFO - Scraping URL: //www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwj5_NHFh-TkAhWTr-wKHSgSDVMYABAAGgJwag&ohost=www.google.com&cid=CAASEuRoai4G0R8MNbToVnZKzozmNA&sig=AOD64_10tA_ESFCwAHTPgPUTDsInBgYwEQ&adurl=&rct=j&q=
2019-09-22 11:57:13,178 - root - INFO - Scraping URL: https://freshpet.com/why-freshpet/
2019-09-22 11:57:13,901 - root - INFO - Scraping URL: https://pet-food.thecomparizone.com/?var1=82002114870&var2=381760664839&var4&var5=b&var7=1234567890&utm_source=google&utm_medium=cpc
None
Traceback (most recent call last):
  File "C:\Users\rot\Anaconda3\lib\site-packages\serpscrap\csv_writer.py", line 14, in write
    w.writerow(row)
  File "C:\Users\rot\Anaconda3\lib\csv.py", line 155, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "C:\Users\rot\Anaconda3\lib\csv.py", line 151, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'url', 'encoding', 'meta_robots', 'meta_title', 'text_raw', 'last_modified', 'status'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\Anaconda3\lib\site-packages\serpscrap\csv_writer.py in write(self, file_name, my_dict)
     13                 for row in my_dict[0:]:
---> 14                     w.writerow(row)
     15         except Exception:

~\Anaconda3\lib\csv.py in writerow(self, rowdict)
    154     def writerow(self, rowdict):
--> 155         return self.writer.writerow(self._dict_to_list(rowdict))
    156 

~\Anaconda3\lib\csv.py in _dict_to_list(self, rowdict)
    150                 raise ValueError("dict contains fields not in fieldnames: "
--> 151                                  + ", ".join([repr(x) for x in wrong_fields]))
    152         return (rowdict.get(key, self.restval) for key in self.fieldnames)

ValueError: dict contains fields not in fieldnames: 'url', 'encoding', 'meta_robots', 'meta_title', 'text_raw', 'last_modified', 'status'

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-16-3f66e8511348> in <module>
      8 scrap = serpscrap.SerpScrap()
      9 scrap.init(config=config.get(), keywords=keywords)
---> 10 scrap.as_csv('/tmp/output')

~\Anaconda3\lib\site-packages\serpscrap\serpscrap.py in as_csv(self, file_path)
    146         writer = CsvWriter()
    147         self.results = self.run()
--> 148         writer.write(file_path + '.csv', self.results)
    149 
    150     def scrap_serps(self):

~\Anaconda3\lib\site-packages\serpscrap\csv_writer.py in write(self, file_name, my_dict)
     15         except Exception:
     16             print(traceback.print_exc())
---> 17             raise Exception

Exception: 

Many thanks !!
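
The ValueError happens because csv_writer.py builds the CSV header from the first result's keys only, while with scrape_urls enabled the result list appears to mix SERP rows and url-content rows (url, text_raw, meta_title, ...) with different keys. A workaround sketch (not part of SerpScrap) that builds the header from the union of all keys:

import csv
import serpscrap

keywords = ['dog food']
config = serpscrap.Config()
config.set('scrape_urls', True)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()

# header = union of all keys, so rows with extra fields no longer raise ValueError
fieldnames = sorted({key for row in results if row for key in row})

with open('/tmp/output.csv', 'w', newline='', encoding='utf-8') as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames, dialect='excel')
    writer.writeheader()
    for row in results:
        if row:
            writer.writerow(row)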

Add Top Stories box

Hi,

What about adding the Top Stories box? Is there any easy way to parse them?

Thanks!

100 results from google search

Hello,

First, thank you for this nice software :-).

I think I have found two issues. The first one is that I don't get 100 results from a Google search.

I have set the following config options:

config.set('num_pages_for_keyword', 1)
config.set('num_results_per_page', 100)

But I get 10 results back instead of 100.
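
For reference, a complete, runnable version of the snippet described above (the keyword is a placeholder):

import pprint
import serpscrap

keywords = ['example keyword']

config = serpscrap.Config()
config.set('num_pages_for_keyword', 1)
config.set('num_results_per_page', 100)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()

print(len(results))   # expected around 100, but only about 10 come back

for result in results:
    pprint.pprint(result)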

Is this a known issue, or am I missing something?

I will post the other issue I have found separately.

kind regards,

Toby.

keyerror: Using your own configuration

Hello,

Here is the second issue I have found.

Regarding http://serpscrap.readthedocs.io/en/latest/configuration.html, I can use my own configuration. This is not working for me. I get the error:

KeyError: 'executable_path'

I have worked around this by adding an executable path, but then I get other KeyErrors.

For the moment I have set all options so that there are no more KeyErrors. But now I get:

scrapcore.scraper.selenium - ERROR - Message: Service /home/ubuntu/workspace/wp-content/themes/SerpScrap/phantomjs/phantomjs-2.1.1-linux-x86_64/bin/phantomjs unexpectedly exited. Status code was: 0

Exception in thread [google]SelScrape

So what can I do?
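
A sketch of the pattern that avoids the KeyError (assuming a custom configuration replaces the defaults and therefore has to carry every key the scraper reads, including the driver path; the values below are only examples):

import serpscrap

config = serpscrap.Config()
config.apply({
    'search_engines': ['google'],
    'num_pages_for_keyword': 1,
    # path to the phantomjs / chromedriver binary the scraper should use
    'executable_path': 'phantomjs/phantomjs-2.1.1-linux-x86_64/bin/phantomjs',
})

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['example'])
results = scrap.run()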

regards,

Toby.

TimeoutException waiting for search input field

I get an exception but there is little to go on. This is the output of the simple example script (on Ubuntu 16, AWS):

2017-12-10 18:58:17,609 - root - INFO - preparing phantomjs
2017-12-10 18:58:17,609 - root - INFO - detecting phantomjs
2017-12-10 18:58:17,612 - root - INFO - using phantomjs/phantomjs-2.1.1-linux-x86_64/bin/phantomjs
2017-12-10 18:58:17,636 - root - INFO - 0 cache files found in /tmp/.serpscrap/
2017-12-10 18:58:17,636 - root - INFO - 0/2 objects have been read from the cache.
2 remain to get scraped.
2017-12-10 18:58:17,644 - root - INFO -
Going to scrape 2 keywords with 1
proxies by using 1 threads.
2017-12-10 18:58:17,686 - scrapcore.scraping - INFO -
[+] SelScrape[localhost][search-type:normal][https://www.google.com/search?] using search engine "google".
Num keywords=1, num pages for keyword=[1]

2017-12-10 18:58:17,686 - scrapcore.scraper.selenium - INFO - useragent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
2017-12-10 18:58:24,639 - scrapcore.scraper.selenium - ERROR - [google]SelScrape: TimeoutException waiting for search input field: Message:
Screenshot: available via screen

num_results_per_page is not working as expected

Hi,

First of all thank you for making such a great library.

I am trying to play around with it using Python 3.

Below is my sample code, but I am not able to limit the number of results.

import pprint
import serpscrap

keywords = ['Virendra Sahewag']

config = serpscrap.Config()
config.set('scrape_urls', False)
config.set('num_pages_for_keyword', 2)
config.set('num_results_per_page', 3)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()

for result in results:
    pprint.pprint(result)

I tried changing num_pages_for_keyword and num_results_per_page, but in most cases it always gives me 30 results.

Any help would be appreciated.

Thanks!
Jayesh

Element not visible

Hi, I'm using your Python module to find keywords from a database (I have some information about companies without their URLs, roughly 2k), but after some time I get a selenium.common.exceptions.ElementNotVisibleException: Message: element not visible error. I'm using Chrome for scraping. This leaves me with many chromedriver processes that never end. Has Google blocked me, or is something else at fault?

I also found two small bugs that are already present in GoogleScraper:

  • The Selenium gecko driver process never ends; there is an error on quit
  • Google Chrome in Selenium mode doesn't start as root (it needs the --no-sandbox option; see the sketch below)
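
An illustration of the second point in plain Selenium (not SerpScrap's own API): Chrome started by root needs the --no-sandbox switch, passed via ChromeOptions.

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')   # required when the browser is launched as root
options.add_argument('--headless')     # optional, for servers without a display

driver = webdriver.Chrome(options=options)
driver.get('https://www.google.com/')
driver.quit()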

Standard scraper example returns only YouTube results in Lithuanian search

I used the standard example to scrape related-keyword data and added my configuration. The scraper returns and writes to the CSV file only search results for YouTube insertions. The same data has been written to serpscrap.db - only search results for videos.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pprint
import serpscrap


def scrape_to_csv(config, keywords):
    scrap = serpscrap.SerpScrap()
    scrap.init(config=config.get(), keywords=keywords)
    return scrap.as_csv('/Users/User/Desktop/stellar.csv')


def get_related(config, keywords, related):
    scrap = serpscrap.SerpScrap()
    scrap.init(config=config.get(), keywords=keywords)
    scrap.run()
    results = scrap.get_related()
    for keyword in results:
        if keyword['keyword'] not in related:
            related.append(keyword['keyword'])
    return related

config = serpscrap.Config()
config_new = {
   'cachedir': '/Users/User/Desktop/.serpscrap/',
   'clean_cache_after': 24,
   'database_name': '/Users/User/Desktop/serpscrap',
   'do_caching': True,
   'num_pages_for_keyword': 3,
   'scrape_urls': True,
   'search_engines': ['google'],
   'google_search_url': 'https://www.google.lt/search?',
   'executable_path': '/Users/User/bin/chrome_driver/chromedriver.exe',
}

config.apply(config_new)
config.set('scrape_urls', False)

keywords = ['stellar']

related = keywords
related = get_related(config, keywords, related)

scrape_to_csv(config, related)

pprint.pprint('********************')
pprint.pprint(related)

Similar Text_Raw field for all search results

Hello!

Thank you for a great utility!
I'm facing a problem: I need the whole text from the search results, and the snippet is too small for my purpose, so I thought text_raw was the field for that. But for every result it is filled with the same text, and I can't understand why.
SerpScrap.ipynb.zip

I attached my notebook with results.

Thank you in advance for your answer

Not able to install package

Hi,

I am not able to install the package in my virtualenv. Python version 3.9.5 on Windows 10.

Below is the error when I try to install the package:

(myenv) PS C:\Users\ashis\OneDrive\Documents\Web Apps\WebUtilities> pip install SerpScrap --upgrade
Collecting SerpScrap
Using cached SerpScrap-0.13.0-py3-none-any.whl (45 kB)
Requirement already satisfied: PySocks==1.7.0 in c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages (from SerpScrap) (1.7.0)
Requirement already satisfied: sqlalchemy==1.3.7 in c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages (from SerpScrap) (1.3.7)
Collecting lxml==4.3.2
Using cached lxml-4.3.2.tar.gz (4.4 MB)
Collecting html2text==2019.8.11
Using cached html2text-2019.8.11-py2.py3-none-any.whl (31 kB)
Collecting chardet==3.0.4
Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting cssselect==1.1.0
Using cached cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: selenium==3.141.0 in c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages (from SerpScrap) (3.141.0)
Collecting beautifulsoup4==4.8.0
Using cached beautifulsoup4-4.8.0-py3-none-any.whl (97 kB)
Requirement already satisfied: soupsieve>=1.2 in c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages (from beautifulsoup4==4.8.0->SerpScrap) (2.2.1)
Requirement already satisfied: urllib3 in c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages (from selenium==3.141.0->SerpScrap) (1.26.4)
Building wheels for collected packages: lxml
Building wheel for lxml (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\setup.py'"'"'; file='"'"'C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\ashis\AppData\Local\Temp\pip-wheel-yv26i7jz'
cwd: C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3
Complete output (97 lines):
Building lxml version 4.3.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.9
creating build\lib.win-amd64-3.9\lxml
copying src\lxml\builder.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\cssselect.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\doctestcompare.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\ElementInclude.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\pyclasslookup.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\sax.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\usedoctest.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml_elementpath.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml_init_.py -> build\lib.win-amd64-3.9\lxml
creating build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes_init_.py -> build\lib.win-amd64-3.9\lxml\includes
creating build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\builder.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\clean.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\defs.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\diff.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\ElementSoup.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\formfill.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\html5parser.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\soupparser.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\usedoctest.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html_diffcommand.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html_html5builder.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html_setmixin.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html_init_.py -> build\lib.win-amd64-3.9\lxml\html
creating build\lib.win-amd64-3.9\lxml\isoschematron
copying src\lxml\isoschematron_init_.py -> build\lib.win-amd64-3.9\lxml\isoschematron
copying src\lxml\etree.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\etree_api.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\lxml.etree.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\lxml.etree_api.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\includes\c14n.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\config.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\dtdvalid.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\etreepublic.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\htmlparser.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\relaxng.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\schematron.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\tree.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\uri.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xinclude.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xmlerror.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xmlparser.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xmlschema.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xpath.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xslt.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes_init_.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\etree_defs.h -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\lxml-version.h -> build\lib.win-amd64-3.9\lxml\includes
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources\rng
copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\rng
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\src
creating build\temp.win-amd64-3.9\Release\src\lxml
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DCYTHON_CLINE_IN_TRACEBACK=0 -Isrc -Isrc\lxml\includes -Ic:\users\ashis\onedrive\documents\web apps\webutilities\myenv\include -Ic:\python39\include -Ic:\python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /Tcsrc\lxml\etree.c /Fobuild\temp.win-amd64-3.9\Release\src\lxml\etree.obj -w
cl : Command line warning D9025 : overriding '/W3' with '/w'
etree.c
C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\src\lxml\includes/etree_defs.h(14): fatal error C1083: Cannot open include file: 'libxml/xmlversion.h': No such file or directory
Compile failed: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
creating Users
creating Users\ashis
creating Users\ashis\AppData
creating Users\ashis\AppData\Local
creating Users\ashis\AppData\Local\Temp
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I/usr/include/libxml2 -IC:\Program Files (x86)\Microsoft Visual
Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /TcC:\Users\ashis\AppData\Local\Temp\xmlXPathInitodmy1ily.c /FoUsers\ashis\AppData\Local\Temp\xmlXPathInitodmy1ily.obj
xmlXPathInitodmy1ily.c
C:\Users\ashis\AppData\Local\Temp\xmlXPathInitodmy1ily.c(1): fatal error C1083: Cannot open include file: 'libxml/xpath.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2


Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?



ERROR: Failed building wheel for lxml
Running setup.py clean for lxml
Failed to build lxml
Installing collected packages: lxml, html2text, cssselect, chardet, beautifulsoup4, SerpScrap
Attempting uninstall: lxml
Found existing installation: lxml 4.6.3
Uninstalling lxml-4.6.3:
Successfully uninstalled lxml-4.6.3
Running setup.py install for lxml ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\setup.py'"'"'; file='"'"'C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\ashis\AppData\Local\Temp\pip-record-wgus5gxx\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\include\site\python3.9\lxml'
cwd: C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3
Complete output (92 lines):
Building lxml version 4.3.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.9
creating build\lib.win-amd64-3.9\lxml
copying src\lxml\builder.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\cssselect.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\doctestcompare.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\ElementInclude.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\pyclasslookup.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\sax.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\usedoctest.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\_elementpath.py -> build\lib.win-amd64-3.9\lxml
copying src\lxml\__init__.py -> build\lib.win-amd64-3.9\lxml
creating build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\__init__.py -> build\lib.win-amd64-3.9\lxml\includes
creating build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\builder.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\clean.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\defs.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\diff.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\ElementSoup.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\formfill.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\html5parser.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\soupparser.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\usedoctest.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\_diffcommand.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\_html5builder.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\_setmixin.py -> build\lib.win-amd64-3.9\lxml\html
copying src\lxml\html\__init__.py -> build\lib.win-amd64-3.9\lxml\html
creating build\lib.win-amd64-3.9\lxml\isoschematron
copying src\lxml\isoschematron\__init__.py -> build\lib.win-amd64-3.9\lxml\isoschematron
copying src\lxml\etree.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\etree_api.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\lxml.etree.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\lxml.etree_api.h -> build\lib.win-amd64-3.9\lxml
copying src\lxml\includes\c14n.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\config.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\dtdvalid.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\etreepublic.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\htmlparser.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\relaxng.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\schematron.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\tree.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\uri.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xinclude.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xmlerror.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xmlparser.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xmlschema.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xpath.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\xslt.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\__init__.pxd -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\etree_defs.h -> build\lib.win-amd64-3.9\lxml\includes
copying src\lxml\includes\lxml-version.h -> build\lib.win-amd64-3.9\lxml\includes
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources\rng
copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\rng
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl
creating build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win-amd64-3.9\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\src
creating build\temp.win-amd64-3.9\Release\src\lxml
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DCYTHON_CLINE_IN_TRACEBACK=0 -Isrc -Isrc\lxml\includes -Ic:\users\ashis\onedrive\documents\web apps\webutilities\myenv\include -Ic:\python39\include -Ic:\python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include 

-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /Tcsrc\lxml\etree.c /Fobuild\temp.win-amd64-3.9\Release\src\lxml\etree.obj -w
cl : Command line warning D9025 : overriding '/W3' with '/w'
etree.c
C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\src\lxml\includes/etree_defs.h(14): fatal error C1083: Cannot open include file: 'libxml/xmlversion.h': No such file
or directory
Compile failed: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I/usr/include/libxml2 -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10nrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /TcC:\Users\ashis\AppData\Local\Temp\xmlXPathInitpf2ecvgu.c /FoUsers\ashis\AppData\Local\Temp\xmlXPathInitpf2ecvgu.obj
xmlXPathInitpf2ecvgu.c
C:\Users\ashis\AppData\Local\Temp\xmlXPathInitpf2ecvgu.c(1): fatal error C1083: Cannot open include file: 'libxml/xpath.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
*********************************************************************************
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
*********************************************************************************
Rolling back uninstall of lxml
Moving to c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages\lxml-4.6.3.dist-info
Moving to c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\lib\site-packages\lxml
from C:\Users\ashis\OneDrive\Documents\Web Apps\WebUtilities\myenv\Lib\site-packages~xml
\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\setup.py'"'"'; file='"'"'C:\Users\ashis\AppData\Local\Temp\pip-install-_3mt2e1a\lxml_e3a717266aba478f81e552139f4895c3\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\ashis\AppData\Local\Temp\pip-record-wgus5gxx\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\ashis\onedrive\documents\web apps\webutilities\myenv\include\site\python3.9\lxml' Check the logs for full command output.

Http proxies with username:password

How do I use http proxies with the same pattern as socks4 proxies? I have http xx.xx.xx.xx:80 username:password, but I'm getting:

[google]SelScrape: TimeoutException waiting for search input field: Message:
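
A possible configuration sketch (not verified against the current release): SerpScrap is expected to read proxies from a plain-text file, one proxy per line in the pattern shown above; the 'proxy_file' and 'use_own_ip' key names below are assumptions based on the project's documented options.

import serpscrap

# sketch only: 'proxy_file' is assumed to be the config key for the proxy list;
# each line of proxy.txt follows the pattern above, e.g.
#   http xx.xx.xx.xx:80 username:password
config = serpscrap.Config()
config.set('proxy_file', 'proxy.txt')
config.set('use_own_ip', False)  # assumption: do not also scrape over the local IP

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['example keyword'])
results = scrap.scrap_serps()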

Extracting to CSV starts from rank_2

Hi,
first off, I love SerpScrap and appreciate the work you did. The only issue I've stumbled upon is that when using as_csv() the results start from serp_rank 2, whereas if I print the results everything is there. I'm not sure where the problem is.
Thanks!
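
One way to narrow this down is to dump the raw result dicts yourself and compare them with the CSV; the sketch below uses only the standard library, and the output path is just an example.

import json
import serpscrap

scrap = serpscrap.SerpScrap()
scrap.init(keywords=['example keyword'])
results = scrap.scrap_serps()

# dump the raw result dicts; if serp_rank 1 is present here but missing from
# the file written by as_csv(), the problem is in the CSV export step
with open('/tmp/serps_dump.json', 'w', encoding='utf-8') as handle:
    json.dump(results, handle, indent=2, default=str)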

SQL Error while running your first example

While running your first example:
import serpscrap

keywords = ['one', 'two']
scrap = serpscrap.SerpScrap()
scrap.init(keywords=keywords)
result = scrap.scrap_serps()
I get this error:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) disk I/O error [SQL: 'SELECT scraper_search.id AS scraper_search_id, scraper_search.keyword_file AS scraper_search_keyword_file, scraper_search.number_search_engines_used AS scraper_search_number_search_engines_used, scraper_search.used_search_engines AS scraper_search_used_search_engines, scraper_search.number_proxies_used AS scraper_search_number_proxies_used, scraper_search.number_search_queries AS scraper_search_number_search_queries, scraper_search.started_searching AS scraper_search_started_searching \nFROM scraper_search \nWHERE scraper_search.id = ?'] [parameters: (6,)]

How can I fix it? Should I install a SQL database and configure it first?
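
The disk I/O error usually points at the location of the SQLite file rather than at a missing SQL server, so no separate database needs to be installed. A sketch, assuming the config exposes a 'database_name' key for the SQLite file, of pointing it at a writable directory:

import serpscrap

# sketch only: 'database_name' is assumed to be the config key for the SQLite file;
# the directory it points to must exist and be writable by the current user
config = serpscrap.Config()
config.set('database_name', '/tmp/serpscrap_results')  # hypothetical path

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['one', 'two'])
result = scrap.scrap_serps()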

Issue with SerpScrap package on Linux box

Hi SerpScrap Team,
I have installed the SerpScrap package on my Linux system and it's throwing the error below, while it works fine on a Windows system.

Please let me know if I am missing any required dependency.

File "/home/ec2-user/Project_name/data_spider.py", line 25, in getData
scrap.init(config=config.get(), keywords=keywords)
File "/usr/local/lib64/python3.7/site-packages/serpscrap/serpscrap.py", line 98, in init
firstrun.download()
File "/usr/local/lib64/python3.7/site-packages/serpscrap/chrome_install.py", line 55, in download
os.chmod('install_chrome.sh', 755 | stat.S_IEXEC)
FileNotFoundError: [Errno 2] No such file or directory: 'install_chrome.sh'

Thanks & Regards,
Bhupendra
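
The FileNotFoundError comes from the relative 'install_chrome.sh' path used during the automatic Chrome download. One possible workaround sketch, assuming Chrome and chromedriver are installed manually (e.g. via the system package manager) and assuming the config exposes an 'executable_path' key for the driver binary:

import serpscrap

# rough sketch: skip the automatic install by pointing at an existing chromedriver;
# 'executable_path' is an assumed config key and the path is hypothetical
config = serpscrap.Config()
config.set('sel_browser', 'chrome')
config.set('executable_path', '/usr/bin/chromedriver')

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['example keyword'])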

Proxies do not work

Hi. I have tried to send the requests through some proxies listed in a proxy.txt file.
But it doesn't work: many Selenium instances open in the browser and everything collapses.
Selenium windows also open even though headless mode is set to True.
Then, I get this error:
scrapcore.scraper.selenium - WARNING - 'NoneType' object has no attribute 'group'
Furthermore, the proxy file appears not to work even if only one http proxy without user:pass is configured.

TimeoutException?

I followed the install instructions and ran example_image.py:

(basic) ➜ examples git:(master) ✗ python example_image.py
2017-10-30 17:42:14,901 - root - INFO - preparing phantomjs
2017-10-30 17:42:14,901 - root - INFO - detecting phantomjs
2017-10-30 17:42:14,904 - root - INFO - using phantomjs/phantomjs-2.1.1-linux-x86_64/bin/phantomjs
2017-10-30 17:42:14,920 - root - INFO - 0 cache files found in /tmp/.serpscrap/
2017-10-30 17:42:14,920 - root - INFO - 0/2 objects have been read from the cache.
2 remain to get scraped.
2017-10-30 17:42:14,928 - root - INFO -
Going to scrape 2 keywords with 1
proxies by using 1 threads.
2017-10-30 17:42:14,964 - scrapcore.scraping - INFO -
[+] SelScrape[localhost][search-type:image][https://www.google.com/search?] using search engine "google".
Num keywords=1, num pages for keyword=[1]

2017-10-30 17:42:14,964 - scrapcore.scraper.selenium - INFO - useragent: Mozilla/5.0 (Windows NT 10.0, WOW64, rv:51.0) Gecko/20100101 Firefox/51.0
2017-10-30 17:42:21,734 - scrapcore.scraper.selenium - ERROR - [google]SelScrape: TimeoutException waiting for search input field: Message:
Screenshot: available via screen
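
The log shows the phantomjs backend in use. A sketch that sometimes helps with this timeout is switching the Selenium backend to headless Chrome; 'sel_browser' and 'chrome_headless' are the assumed config keys, and this is not guaranteed to fix the missing search input field.

import serpscrap

# sketch only: switch from phantomjs to headless Chrome;
# the image-search options from example_image.py would still apply on top of this
config = serpscrap.Config()
config.set('sel_browser', 'chrome')
config.set('chrome_headless', True)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['example keyword'])
results = scrap.run()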

Issue with the number of results

Hey, could you help me with my issue?
I'm trying to scrape a large number of results from Google, but I hit a limit of XXX results with the message:

"In order to show you the most relevant results, we have omitted some entries very similar to the XXX already displayed.
If you like, you can repeat the search with the omitted results included."

I'm trying to get up to 1000 results.
Any advice? Clues?
Thanks in advance.
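
Paging deeper is controlled by the page and result-count options; the key names below ('num_results_per_page', 'num_pages_for_keyword') are taken from the project's documented defaults and should be double-checked. Note that Google itself stops serving entries once it considers the remainder duplicates, so 1000 unique results is not always reachable.

import serpscrap

# sketch only: request more results per page and more pages per keyword;
# the config key names are assumptions
config = serpscrap.Config()
config.set('num_results_per_page', 100)
config.set('num_pages_for_keyword', 10)   # 10 pages x 100 results ~ 1000 urls

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=['example keyword'])
results = scrap.scrap_serps()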

csv_writer.py csvWriter cannot handle ? in urls

Response
Traceback (most recent call last):
File "miniconda3/lib/python3.7/site-packages/serpscrap/csv_writer.py", line 14, in write
w.writerow(row)
File "miniconda3/lib/python3.7/csv.py", line 155, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "miniconda3/lib/python3.7/csv.py", line 151, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'encoding', 'text_raw', 'url', 'status', 'meta_robots', 'meta_title', 'last_modified'
None
Traceback (most recent call last):
File "miniconda3/lib/python3.7/site-packages/serpscrap/csv_writer.py", line 14, in write
w.writerow(row)
File "miniconda3/lib/python3.7/csv.py", line 155, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "miniconda3/lib/python3.7/csv.py", line 151, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'encoding', 'text_raw', 'url', 'status', 'meta_robots', 'meta_title', 'last_modified'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "example_serp_urls.py", line 12, in
scrap.as_csv('/tmp/outputurls')
File "miniconda3/lib/python3.7/site-packages/serpscrap/serpscrap.py", line 148, in as_csv
writer.write(file_path + '.csv', self.results)
File "/miniconda3/lib/python3.7/site-packages/serpscrap/csv_writer.py", line 17, in write
raise Exception
Exception
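
The traceback suggests the CSV header does not contain the extra fields produced by url scraping ('encoding', 'text_raw', ...), presumably because it is taken from the first result row only. Until that is fixed upstream, a workaround sketch using only the standard library is to build the header from the union of keys across all rows:

import csv

# workaround sketch: derive fieldnames from every result dict so rows with
# extra fields no longer raise ValueError in DictWriter
def write_results_csv(path, results):
    fieldnames = sorted({key for row in results for key in row})
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval='')
        writer.writeheader()
        for row in results:
            writer.writerow(row)

# usage, with results as returned by scrap.run() or scrap.scrap_serps():
# write_results_csv('/tmp/outputurls.csv', results)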
