elvisyjlin / media-scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
License: MIT License
Hi,
I cloned the repo, and the code blocks or modules for Twitter scraping don't seem to exist. Is it disabled?
Also, is there a reason why, if I delete all files from a folder (or the entire folder), let's say download\twitter\kheshig, and try to re-crawl the account, it says some files already exist and skips them? This could be a useful feature if you don't want the crawler to re-download pictures/videos you deleted, so you don't have to keep deleting things every time you use the crawler, but is there a way to delete the stored record of downloaded files? Only 44 of the files were downloaded out of 100 (the amount the crawler should have registered).
python -m mediascraper.twitter kheshig
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Logging in as "Kheshig"...
Crawling...
90 media are found.
Downloading...
49%|██████████████████████████▉ | 44/90 [00:04<00:04, 9.50it/s]The file download/twitter\kheshig\DfWRZtAX0AE6aos.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRaD3XUAU7DVc.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRahRXkAAOAEN.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRa4PWsAEjGpO.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRWnWXkAIbYv9.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRXY_WAAIOoe6.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRR0cXcAErec4.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRSd_XkAANXsq.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRS55X4AEGjTs.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQY6sXUAI6lNx.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQWRtXUAIZUwt.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQUF3W4AAzG3S.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQPoYXkAA6lQw.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQNU-XcAApfXR.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQIyOXUAASqVr.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQG06X4AANiOp.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQEspXcAIx_bC.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQCf6X4AAlqcF.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQA-SWAAAh8ED.jpg exists. Skip it.
The file download/twitter\kheshig\DfWP8ocWAAELICF.jpg exists. Skip it.
The file download/twitter\kheshig\DfWP5HlWsAAwcU5.jpg exists. Skip it.
The file download/twitter\kheshig\DfWP1yyWAAACjNB.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMr6MWAAUgcB_.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMk8nW4AAtnSc.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMZ0bW0AAI1pX.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMaOvW0AQnPPs.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMazDX0AUMO8p.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMbOMX0AAYlZb.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMScQXcAolR1z.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMOBIW4AANRLS.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMObHW0AELSx-.jpg exists. Skip it.
The file download/twitter\kheshig\DfWLTwNX4AAYMFn.jpg exists. Skip it.
The file download/twitter\kheshig\DfWJ_nIWkAIW7Ok.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEoQ8WkAE6s0l.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEJd1XkAE6Edp.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEKmeW0AIpDNs.jpg exists. Skip it.
The file download/twitter\kheshig\DfWELsWWAAA7xcn.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEBQMW0AEQVUW.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEBy8XUAIbLEx.jpg exists. Skip it.
The file download/twitter\kheshig\DfWECRkWsAAee42.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEDTPXUAAGRiw.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD8NMWsAAbwwa.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD8u2XUAEFpVb.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD9JMW0AElLi_.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD9kBX4AAcS-e.jpg exists. Skip it.
100%|███████████████████████████████████████████████████████| 90/90 [00:04<00:00, 19.04it/s]
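For what it's worth, the skip messages above are consistent with a plain existence check before each download. Since the folder was emptied beforehand, the 44 files written earlier in the same run may be what triggers the later skips, i.e. the 90 crawled media entries could contain duplicate URLs rather than stale state. A minimal sketch of such a guard (a hypothetical helper, not media-scraper's actual code):

```python
import os

def should_download(path):
    """Return True when no file exists at `path` yet.

    Sketch of the skip-if-exists guard suggested by the log above;
    not taken from media-scraper itself.
    """
    if os.path.exists(path):
        print('The file {} exists. Skip it.'.format(path))
        return False
    return True
```

Under this behavior, deleting the folder does reset the check, so repeated skips within a single run would point to duplicate media entries rather than a persistent record of past downloads.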
I have been using a .bat file that lists multiple Twitter accounts to run the program. The problem is that if a user deactivates their account, the program breaks and stops. This shouldn't be the case: the program should skip that account, continue crawling the others, and at the end report which accounts no longer exist.
Error produced when you run a .bat file containing a deactivated account:
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Logging in as "Kheshig"...
Crawling...
Traceback (most recent call last):
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascraper\twitter.py", line 18, in <module>
tasks = scraper.scrape(username)
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascrapers.py", line 384, in scrape
done = self.scrollToBottom()
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascrapers.py", line 87, in scrollToBottom
last_height, new_height = self._driver.execute_script("return document.body.scrollHeight"), 0
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 627, in execute_script
'args': converted_args})['value']
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 312, in execute
self.error_handler.check_response(response)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 237, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: {"errorMessage":"Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src https://abs.twimg.com https://ssl.google-analytics.com https://ajax.googleapis.com http://www.google-analytics.com about:\".\n","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"112","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:50385","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"script\": \"return document.body.scrollHeight\", \"args\": [], \"sessionId\": \"afc32870-757b-11e8-9639-3b1da092f52a\"}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/afc32870-757b-11e8-9639-3b1da092f52a/execute"}}
Screenshot: available via screen.
Message: no such element: Unable to locate element: {"method":"tag name","selector":"pre"}
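One way to get the requested keep-going behavior is to wrap each account in a try/except so that one failure (e.g. a deactivated account) doesn't abort the whole batch, then summarize the failures at the end. A sketch around the existing scraper.scrape() call, under the assumption that a deactivated account surfaces as an exception like the one above:

```python
def scrape_all(scraper, usernames):
    """Scrape every account, skipping any that raise, and report
    the failures afterwards (a sketch, not current media-scraper code)."""
    failed = []
    for name in usernames:
        try:
            scraper.scrape(name)
        except Exception as exc:
            print('Skipping "{}": {}'.format(name, exc))
            failed.append(name)
    if failed:
        print('Accounts that no longer exist or failed:', ', '.join(failed))
    return failed
```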
Hello, you should update the URLs below. These no longer work:
GFYCAT_MP4 = 'https://giant.gfycat.com/{}.mp4'
GFYCAT_WEBM = 'https://giant.gfycat.com/{}.webm'
PS C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper> python m-scraper.py rq instagram aeso_barrosmelo
Namespace(credential_file=None, early_stop=False, keywords=['aeso_barrosmelo'], save_path=None)
Instagramer Task: aeso_barrosmelo
Traceback (most recent call last):
File "m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper\m_scraper\rq\downloader.py", line 82, in run
self.crawl(keyword, args.early_stop)
File "C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper\m_scraper\rq\instagramer.py", line 46, in crawl
tasks, end_cursor, has_next, length, user_id, rhx_gis, csrf_token = get_first_page(username)
File "C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper\m_scraper\rq\utils\instagram.py", line 52, in get_first_page
rhx_gis = shared_data['rhx_gis']
KeyError: 'rhx_gis'
PS C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper>
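The KeyError happens because Instagram stopped including `rhx_gis` in its shared-data JSON (it was removed around 2019), so the direct indexing in instagram.py fails. A defensive lookup would at least give a clearer message (a sketch of a partial fix only, since the request-signing logic that consumed `rhx_gis` is obsolete as well):

```python
def get_rhx_gis(shared_data):
    """Fetch rhx_gis if Instagram still serves it, else return None
    with a clear warning (sketch; the downstream request signing
    would also need updating)."""
    value = shared_data.get('rhx_gis')
    if value is None:
        print("Warning: 'rhx_gis' is no longer present in Instagram's "
              "shared data; the scraping logic needs updating.")
    return value
```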
Tried using login credentials for pixiv scraper:
python m-scraper.py rq pixiv USERID -c credentials.json
I tried a couple of times with different USERIDs and got this error:
Namespace(keywords=[], credential_file='credentials.json', save_path=None, early_stop=False)
Logging in pixiv account...
Traceback (most recent call last):
File "D:\media-scraper-master\media-scraper-master\m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "D:\media-scraper-master\media-scraper-master\m_scraper\rq\downloader.py", line 72, in run
self.login(username, password)
File "D:\media-scraper-master\media-scraper-master\m_scraper\rq\pixiver.py", line 27, in login
post_key = soup.select('input[name==post_key]')[0]['value']
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\element.py", line 1973, in select
results = soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\__init__.py", line 144, in select
return compile(select, namespaces, flags, **kwargs).select(tag, limit)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\__init__.py", line 67, in compile
return cp._cached_css_compile(pattern, ns, cs, flags)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 222, in _cached_css_compile
).process_selectors(),
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 1159, in process_selectors
return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 985, in parse_selectors
key, m = next(iselector)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 1152, in selector_iter
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 5
line 1:
input[name==post_key]
^
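The selector `input[name==post_key]` in pixiver.py is malformed CSS: an attribute selector takes a single `=`, which is exactly what soupsieve rejects above. A minimal reproduction of the corrected selector, using BeautifulSoup with a stand-in form since pixiv's real login page isn't available here:

```python
from bs4 import BeautifulSoup

# Stand-in for pixiv's login form; the real page supplies post_key.
html = '<form><input name="post_key" value="abc123"></form>'
soup = BeautifulSoup(html, 'html.parser')

# 'input[name==post_key]' raises SelectorSyntaxError as shown above;
# a CSS attribute selector uses one '=', optionally with quotes:
post_key = soup.select('input[name=post_key]')[0]['value']
```

Whether pixiv's current login page still exposes a `post_key` input at all is a separate question; the syntax fix only removes the soupsieve error.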
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
It happens when I'm using this:
python3 -m mediascraper.general [WEB PAGE 1] [WEB PAGE 2] ...
When running the following command:
python3 -m mediascraper.twitter nerdcity
I get the following error:
Starting PhantomJS web driver...
./webdriver/phantomjsdriver_2.1.1_linux64/phantomjs
/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Traceback (most recent call last):
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/User/Desktop/git/media-scraper/mediascraper/twitter.py", line 18, in <module>
tasks = scraper.scrape(username)
File "/home/User/Desktop/git/media-scraper/mediascrapers.py", line 379, in scrape
self._connect('{}/{}/media'.format(self.base_url, username))
File "/home/User/Desktop/git/media-scraper/mediascrapers.py", line 51, in _connect
self._driver.get(url)
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 319, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 374, in execute
return self._request(command_info[0], url, body=data)
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
resp = http.request(method, url, body=body, headers=headers)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/request.py", line 72, in request
**urlopen_kw)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/poolmanager.py", line 324, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/User/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
The program doesn't seem to work, and I'm not sure what the cause is...
C:\Users\dolap_000\Desktop\media-scraper-master>python -m mediascraper.twitter [3347813921]
Starting PhantomJS web driver...
C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Logging in as "Dolapofreeman"...
Traceback (most recent call last):
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascraper\twitter.py", line 14, in <module>
scraper.login('credentials.json')
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascrapers.py", line 410, in login
username.send_keys(credentials['username'])
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 479, in send_keys
'value': keys_to_typing(value)})
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 312, in execute
self.error_handler.check_response(response)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 237, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidElementStateException: Message: {"errorMessage":"Element is not currently interactable and may not be manipulated","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"182","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:62282","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{"text": "Dolapofreeman", "value": ["D", "o", "l", "a", "p", "o", "f", "r", "e", "e", "m", "a", "n"], "id": ":wdc:1523809525391", "sessionId": "96a881f0-40c9-11e8-a71a-c10ae27e62ae"}","url":"/value","urlParsed":{"anchor":"","query":"","file":"value","directory":"/","path":"/value","relative":"/value","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/value","queryKey":{},"chunks":["value"]},"urlOriginal":"/session/96a881f0-40c9-11e8-a71a-c10ae27e62ae/element/:wdc:1523809525391/value"}}
Screenshot: available via screen
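`InvalidElementStateException` usually means the username field wasn't ready for input yet (still hidden or not yet rendered when send_keys ran). One generic workaround is to retry for a short window instead of failing on the first attempt; a sketch that works with any object exposing send_keys(), not tied to media-scraper's actual login code:

```python
import time

def send_keys_when_ready(element, text, timeout=10.0, pause=0.5):
    """Keep retrying send_keys until the element accepts input or the
    timeout expires (a sketch; Selenium's explicit waits are the more
    idiomatic tool when a real WebDriver is available)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            element.send_keys(text)
            return True
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(pause)
```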
The Twitter media scraper does not download any videos.
selenium.common.exceptions.WebDriverException: Message: Service ./webdriver/phantomjsdriver_2.1.1_mac64/phantomjs unexpectedly exited. Status code was: -9
macOS 10.14.5
python 3.7 (macports)
C:\Users\Freeman\Desktop\media-scraper-master>python -m mediascraper.twitter [783214]
Starting PhantomJS web driver...
C:\Users\Freeman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Traceback (most recent call last):
File "C:\Users\Freeman\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Freeman\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Freeman\Desktop\media-scraper-master\mediascraper\twitter.py", line 18, in <module>
scraper.username(username)
AttributeError: 'TwitterScraper' object has no attribute 'username'
Hi there,
Is it possible to add a feature that scrapes all users of TikTok and then downloads every video of every user into a folder named after that user? For example, suppose A to Z are all the members of TikTok. The script would first scrape all members by username, create a folder A (or whatever the username is), then check all available videos and download them into that folder, and so on, downloading every video from every user of TikTok into folders named by username.
I hope you understand what feature I'm requesting.
Videos should be downloaded in the maximum available good quality.
Thank you so much for any help.
Hello,
Is it possible to add this feature: scrape all media from twitter.com/USER/likes?
Thank you.
python3 -m mediascraper.twitter
Doesn't work.
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/mediascraper/twitter.py", line 12, in <module>
scraper = mediascrapers.TwitterScraper(scroll_pause = 1.0, mode='normal', debug=False)
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/mediascrapers.py", line 378, in __init__
super().__init__(**kwargs)
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/mediascrapers.py", line 38, in __init__
self._driver = seleniumdriver.get('PhantomJS')
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/util/seleniumdriver.py", line 36, in get
driver = webdriver.PhantomJS(executable_path=source, service_log_path=join(path, 'phantomjs.log'), service_args=["--remote-debugger-port=9000", "--web-security=false"])
AttributeError: module 'selenium.webdriver' has no attribute 'PhantomJS'
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/ /Downloads/media-scraper-master/mediascraper/twitter.py", line 14, in <module>
scraper.login('credentials.json')
File "/Users/ /Downloads/media-scraper-master/mediascrapers.py", line 444, in login
username = [u for u in usernames if u.get_attribute('class') == 'js-username-field email-input js-initial-focus'][0]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Received the following error: "AttributeError: 'MediaScraper' object has no attribute 'connect'" after running the following code listed in the README.
import mediascraper
scraper = mediascrapers.MediaScraper()
scraper.connect(URL)
scraper.scrape(path=SAVE_PATH)
I noticed that the scrape method should actually take the URL argument, and that the connect method belongs to the Scraper class, not the MediaScraper class. I'm unsure how to fix it myself, but I hope this helps.
hi
Windows 7
Python 3.7.6
pip 21.3.1
It does not work on Reddit, Twitter, Instagram, or TikTok.
After cloning the repository and running it from cmd, I wanted to try it on Instagram, TikTok, Twitter, and Reddit. This is the problem I see on Instagram; I will paste the output of each problem below.
C:\Users\pc\Desktop\media-scraper>python -m mediascraper.instagram instagram
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Crawling...
Traceback (most recent call last):
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\pc\Desktop\media-scraper\mediascraper\instagram.py", line 16, in <module>
tasks = scraper.scrape(username)
File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 262, in scrape
tasks += (task[0], username, task[1])
IndexError: list index out of range
C:\Users\pc\Desktop\media-scraper>python m-scraper.py rq instagram instagram
Namespace(credential_file=None, early_stop=False, keywords=['instagram'], save_path=None)
Instagramer Task: instagram
Traceback (most recent call last):
File "m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\downloader.py", line 82, in run
self.crawl(keyword, args.early_stop)
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\instagramer.py", line 46, in crawl
tasks, end_cursor, has_next, length, user_id, rhx_gis, csrf_token = get_first_page(username)
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\utils\instagram.py", line 52, in get_first_page
rhx_gis = shared_data['rhx_gis']
KeyError: 'rhx_gis'
C:\Users\pc\Desktop\media-scraper>python m-scraper.py rq tiktok tiktok
Namespace(credential_file=None, early_stop=False, keywords=['tiktok'], save_path=None)
{'statusCode': 10000, 'verifyConfig': {'code': 10000, 'type': 'verify', 'subtype': 'slide', 'fp': 'verify_dd9489ac31d2e6b50a4f9ed75b5240f2', 'region': 'va', 'detail': 'vEyCkJEKBnSe-zq257GFQJrLW03-aOs8awmNop3PD5IGQA4kjoDDIU6NDQKG7BnEsMWT8C-WHIUjsfHZ9OMgl9009Qcdo2LIOBhGJyNK118AOCRmw8StlADDjuzkZrFHFDTHnSgp2x651wwrNM6-FYFCOlP0izZx6n*pCjcMIM1sjOh0zwAye*FM5lPnHiVJ1eER3KmM*q6VpyCU*uNyTeYkaDpcFOMdgP3br0HlsWO--*jeaUPVnSjP8RejdrEQgq7oLsXM4rjjf14GhyWBa0H8kj*LODz42UoKrM32r4Fm6VjEAoEeRrjmHVUkwbwAptLOsfmREJTSdtNToMx6t4NqBXWm0mJ24vXdY9Txp83rH49pTmZE1wbEupTi18B1Tw..'}}
Traceback (most recent call last):
File "m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\downloader.py", line 82, in run
self.crawl(keyword, args.early_stop)
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\tiktoker.py", line 38, in crawl
raise Exception('body not found')
Exception: body not found
C:\Users\pc\Desktop\media-scraper>python -m mediascraper.twitter twitter
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Crawling...
Traceback (most recent call last):
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\pc\Desktop\media-scraper\mediascraper\twitter.py", line 18, in <module>
tasks = scraper.scrape(username)
File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 392, in scrape
done = self.scrollToBottom()
File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 87, in scrollToBottom
last_height, new_height = self._driver.execute_script("return document.body.scrollHeight"), 0
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 636, in execute_script
'args': converted_args})['value']
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: {"errorMessage":"Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js 'nonce-MDk3YWFmNWEtOGRkZS00NGQzLWE2MjMtZjUzNzhjMGZhZGJl'\".\n","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"112","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:55938","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"script\": \"return document.body.scrollHeight\", \"args\": [], \"sessionId\": \"7c44f2b0-7a92-11ec-a2dc-d38583c010ce\"}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/7c44f2b0-7a92-11ec-a2dc-d38583c010ce/execute"}}
Screenshot: available via screen
I also tried setting the web driver's permissions to 777 for convenience.
C:\Users\pc\Desktop\media-scraper>chmod 777 webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
'chmod' is not recognized as an internal or external command, operable program or batch file.
I wish this tool was working because it would be the best tool on the internet
Greetings to all
As stated in the title, I can only download pictures. I downloaded from 4 different accounts and it's still the same result.
My testing account was always realDonaldTrump, and mediascraper.twitter works fine on it. However, for other accounts, for instance sigrid_ig, videos are not captured by mediascraper.twitter. The HTML structure seems to differ between accounts, so it needs to be handled case by case.
The code is still not downloading videos from Twitter. What should I do to make it work?
Hi, I tried downloading Twitter images, but it saves each image in nested subfolders like this:
(media-scraper\download\twitter\username\username\username\username)
Is there any way to fix this and just download all images into one folder? Thanks.
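The nesting pattern suggests the username is being appended to a save path that already ends with it, so each run adds another level. A guard that makes the join idempotent would prevent that (a hypothetical sketch, not media-scraper's actual path-building code):

```python
import os

def join_once(base, username):
    """Append `username` to `base` only when it's not already the last
    path component, preventing username/username/... nesting."""
    if os.path.basename(os.path.normpath(base)) == username:
        return base
    return os.path.join(base, username)
```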
When I try to scrape an Instagram account:
PS C:\Users\User\media-scraper> python -m mediascraper.instagram sigridupdating
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Python\Python37\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Crawling...
Traceback (most recent call last):
File "C:\Python\Python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Python\Python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\User\media-scraper\mediascraper\instagram.py", line 16, in <module>
tasks = scraper.scrape(username)
File "C:\Users\User\media-scraper\mediascrapers.py", line 238, in scrape
data = self.getJsonData(username)
File "C:\Users\User\media-scraper\mediascrapers.py", line 227, in getJsonData
content = self._driver.find_element_by_tag_name('pre').text
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 530, in find_element_by_tag_name
return self.find_element(by=By.TAG_NAME, value=name)
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
'value': value})['value']
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with tag name 'pre'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"90","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:54137","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"tag name\", \"value\": \"pre\", \"sessionId\": \"70cf5fb0-6c21-11e9-975b-e51852ab1518\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/70cf5fb0-6c21-11e9-975b-e51852ab1518/element"}}
Screenshot: available via screen
Any idea what is causing this?