elvisyjlin / media-scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
License: MIT License
Hi,
I cloned the repo, and the code blocks or modules for Twitter scraping don't seem to exist. Is it disabled?
Also, is there a reason why, if I delete all files from a folder (or the entire folder), let's say download\twitter\kheshig, and try to re-crawl the account, it says some files already exist and skips them? This could be a useful feature if you don't want the crawler to re-download pictures/videos you deleted, so you don't have to keep deleting things every time you use the crawler, but is there a way to delete the stored record of downloaded files? Only 44 of the files were downloaded out of 100 (the amount the crawler should have registered).
python -m mediascraper.twitter kheshig
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Logging in as "Kheshig"...
Crawling...
90 media are found.
Downloading...
49%|██████████████████████████▉ | 44/90 [00:04<00:04, 9.50it/s]The file download/twitter\kheshig\DfWRZtAX0AE6aos.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRaD3XUAU7DVc.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRahRXkAAOAEN.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRa4PWsAEjGpO.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRWnWXkAIbYv9.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRXY_WAAIOoe6.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRR0cXcAErec4.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRSd_XkAANXsq.jpg exists. Skip it.
The file download/twitter\kheshig\DfWRS55X4AEGjTs.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQY6sXUAI6lNx.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQWRtXUAIZUwt.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQUF3W4AAzG3S.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQPoYXkAA6lQw.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQNU-XcAApfXR.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQIyOXUAASqVr.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQG06X4AANiOp.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQEspXcAIx_bC.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQCf6X4AAlqcF.jpg exists. Skip it.
The file download/twitter\kheshig\DfWQA-SWAAAh8ED.jpg exists. Skip it.
The file download/twitter\kheshig\DfWP8ocWAAELICF.jpg exists. Skip it.
The file download/twitter\kheshig\DfWP5HlWsAAwcU5.jpg exists. Skip it.
The file download/twitter\kheshig\DfWP1yyWAAACjNB.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMr6MWAAUgcB_.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMk8nW4AAtnSc.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMZ0bW0AAI1pX.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMaOvW0AQnPPs.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMazDX0AUMO8p.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMbOMX0AAYlZb.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMScQXcAolR1z.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMOBIW4AANRLS.jpg exists. Skip it.
The file download/twitter\kheshig\DfWMObHW0AELSx-.jpg exists. Skip it.
The file download/twitter\kheshig\DfWLTwNX4AAYMFn.jpg exists. Skip it.
The file download/twitter\kheshig\DfWJ_nIWkAIW7Ok.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEoQ8WkAE6s0l.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEJd1XkAE6Edp.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEKmeW0AIpDNs.jpg exists. Skip it.
The file download/twitter\kheshig\DfWELsWWAAA7xcn.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEBQMW0AEQVUW.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEBy8XUAIbLEx.jpg exists. Skip it.
The file download/twitter\kheshig\DfWECRkWsAAee42.jpg exists. Skip it.
The file download/twitter\kheshig\DfWEDTPXUAAGRiw.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD8NMWsAAbwwa.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD8u2XUAEFpVb.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD9JMW0AElLi_.jpg exists. Skip it.
The file download/twitter\kheshig\DfWD9kBX4AAcS-e.jpg exists. Skip it.
100%|███████████████████████████████████████████████████████| 90/90 [00:04<00:00, 19.04it/s]
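For what it's worth, the skip messages above are consistent with a plain existence check before each download. Since the folder was emptied beforehand, the 44 files written earlier in the same run may be what triggers the later skips, i.e. the 90 crawled media entries could contain duplicate URLs rather than stale state. A minimal sketch of such a guard (a hypothetical helper, not media-scraper's actual code):

```python
import os

def should_download(path):
    """Return True when no file exists at `path` yet.

    Sketch of the skip-if-exists guard suggested by the log above;
    not taken from media-scraper itself.
    """
    if os.path.exists(path):
        print('The file {} exists. Skip it.'.format(path))
        return False
    return True
```

Under this behavior, deleting the folder does reset the check, so repeated skips within a single run would point to duplicate media entries rather than a persistent record of past downloads.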
I have been using a .bat file that lists multiple Twitter accounts to run the program. The problem is that if a user deactivates their account, the program breaks and stops. This shouldn't be the case: the program should skip that account, continue crawling the others, and at the end report which accounts no longer exist.
Error produced when you run a .bat file containing a deactivated account:
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Logging in as "Kheshig"...
Crawling...
Traceback (most recent call last):
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascraper\twitter.py", line 18, in <module>
tasks = scraper.scrape(username)
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascrapers.py", line 384, in scrape
done = self.scrollToBottom()
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascrapers.py", line 87, in scrollToBottom
last_height, new_height = self._driver.execute_script("return document.body.scrollHeight"), 0
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 627, in execute_script
'args': converted_args})['value']
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 312, in execute
self.error_handler.check_response(response)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 237, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: {"errorMessage":"Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src https://abs.twimg.com https://ssl.google-analytics.com https://ajax.googleapis.com http://www.google-analytics.com about:\".\n","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"112","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:50385","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"script\": \"return document.body.scrollHeight\", \"args\": [], \"sessionId\": \"afc32870-757b-11e8-9639-3b1da092f52a\"}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/afc32870-757b-11e8-9639-3b1da092f52a/execute"}}
Screenshot: available via screen.
Message: no such element: Unable to locate element: {"method":"tag name","selector":"pre"}
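One way to get the requested keep-going behavior is to wrap each account in a try/except so that one failure (e.g. a deactivated account) doesn't abort the whole batch, then summarize the failures at the end. A sketch around the existing scraper.scrape() call, under the assumption that a deactivated account surfaces as an exception like the one above:

```python
def scrape_all(scraper, usernames):
    """Scrape every account, skipping any that raise, and report
    the failures afterwards (a sketch, not current media-scraper code)."""
    failed = []
    for name in usernames:
        try:
            scraper.scrape(name)
        except Exception as exc:
            print('Skipping "{}": {}'.format(name, exc))
            failed.append(name)
    if failed:
        print('Accounts that no longer exist or failed:', ', '.join(failed))
    return failed
```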
Hello, you should update the URLs below. These no longer work:
GFYCAT_MP4 = 'https://giant.gfycat.com/{}.mp4'
GFYCAT_WEBM = 'https://giant.gfycat.com/{}.webm'
PS C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper> python m-scraper.py rq instagram aeso_barrosmelo
Namespace(credential_file=None, early_stop=False, keywords=['aeso_barrosmelo'], save_path=None)
Instagramer Task: aeso_barrosmelo
Traceback (most recent call last):
File "m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper\m_scraper\rq\downloader.py", line 82, in run
self.crawl(keyword, args.early_stop)
File "C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper\m_scraper\rq\instagramer.py", line 46, in crawl
tasks, end_cursor, has_next, length, user_id, rhx_gis, csrf_token = get_first_page(username)
File "C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper\m_scraper\rq\utils\instagram.py", line 52, in get_first_page
rhx_gis = shared_data['rhx_gis']
KeyError: 'rhx_gis'
PS C:\Users\Ego Brain\Documents\Python\ScrapingSocial\media-scraper>
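The KeyError happens because Instagram stopped including `rhx_gis` in its shared-data JSON (it was removed around 2019), so the direct indexing in instagram.py fails. A defensive lookup would at least give a clearer message (a sketch of a partial fix only, since the request-signing logic that consumed `rhx_gis` is obsolete as well):

```python
def get_rhx_gis(shared_data):
    """Fetch rhx_gis if Instagram still serves it, else return None
    with a clear warning (sketch; the downstream request signing
    would also need updating)."""
    value = shared_data.get('rhx_gis')
    if value is None:
        print("Warning: 'rhx_gis' is no longer present in Instagram's "
              "shared data; the scraping logic needs updating.")
    return value
```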
Tried using login credentials for pixiv scraper:
python m-scraper.py rq pixiv USERID -c credentials.json
I tried a couple of times with different USERIDs and got this error:
Namespace(keywords=[], credential_file='credentials.json', save_path=None, early_stop=False)
Logging in pixiv account...
Traceback (most recent call last):
File "D:\media-scraper-master\media-scraper-master\m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "D:\media-scraper-master\media-scraper-master\m_scraper\rq\downloader.py", line 72, in run
self.login(username, password)
File "D:\media-scraper-master\media-scraper-master\m_scraper\rq\pixiver.py", line 27, in login
post_key = soup.select('input[name==post_key]')[0]['value']
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\element.py", line 1973, in select
results = soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\__init__.py", line 144, in select
return compile(select, namespaces, flags, **kwargs).select(tag, limit)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\__init__.py", line 67, in compile
return cp._cached_css_compile(pattern, ns, cs, flags)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 222, in _cached_css_compile
).process_selectors(),
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 1159, in process_selectors
return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 985, in parse_selectors
key, m = next(iselector)
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\soupsieve\css_parser.py", line 1152, in selector_iter
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 5
line 1:
input[name==post_key]
^
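The selector `input[name==post_key]` in pixiver.py is malformed CSS: an attribute selector takes a single `=`, which is exactly what soupsieve rejects above. A minimal reproduction of the corrected selector, using BeautifulSoup with a stand-in form since pixiv's real login page isn't available here:

```python
from bs4 import BeautifulSoup

# Stand-in for pixiv's login form; the real page supplies post_key.
html = '<form><input name="post_key" value="abc123"></form>'
soup = BeautifulSoup(html, 'html.parser')

# 'input[name==post_key]' raises SelectorSyntaxError as shown above;
# a CSS attribute selector uses one '=', optionally with quotes:
post_key = soup.select('input[name=post_key]')[0]['value']
```

Whether pixiv's current login page still exposes a `post_key` input at all is a separate question; the syntax fix only removes the soupsieve error.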
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
It happens when I'm using this:
python3 -m mediascraper.general [WEB PAGE 1] [WEB PAGE 2] ...
When running the following command:
python3 -m mediascraper.twitter nerdcity
I get the following error:
Starting PhantomJS web driver...
./webdriver/phantomjsdriver_2.1.1_linux64/phantomjs
/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Traceback (most recent call last):
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/User/Desktop/git/media-scraper/mediascraper/twitter.py", line 18, in <module>
tasks = scraper.scrape(username)
File "/home/User/Desktop/git/media-scraper/mediascrapers.py", line 379, in scrape
self._connect('{}/{}/media'.format(self.base_url, username))
File "/home/User/Desktop/git/media-scraper/mediascrapers.py", line 51, in _connect
self._driver.get(url)
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 319, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 374, in execute
return self._request(command_info[0], url, body=data)
File "/home/User/.local/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
resp = http.request(method, url, body=body, headers=headers)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/request.py", line 72, in request
**urlopen_kw)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/poolmanager.py", line 324, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/User/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/home/User/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
The program doesn't seem to work, and I'm not sure what the cause is...
C:\Users\dolap_000\Desktop\media-scraper-master>python -m mediascraper.twitter [3347813921]
Starting PhantomJS web driver...
C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Logging in as "Dolapofreeman"...
Traceback (most recent call last):
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascraper\twitter.py", line 14, in <module>
scraper.login('credentials.json')
File "C:\Users\dolap_000\Desktop\media-scraper-master\mediascrapers.py", line 410, in login
username.send_keys(credentials['username'])
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 479, in send_keys
'value': keys_to_typing(value)})
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 312, in execute
self.error_handler.check_response(response)
File "C:\Users\dolap_000\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 237, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidElementStateException: Message: {"errorMessage":"Element is not currently interactable and may not be manipulated","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"182","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:62282","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{"text": "Dolapofreeman", "value": ["D", "o", "l", "a", "p", "o", "f", "r", "e", "e", "m", "a", "n"], "id": ":wdc:1523809525391", "sessionId": "96a881f0-40c9-11e8-a71a-c10ae27e62ae"}","url":"/value","urlParsed":{"anchor":"","query":"","file":"value","directory":"/","path":"/value","relative":"/value","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/value","queryKey":{},"chunks":["value"]},"urlOriginal":"/session/96a881f0-40c9-11e8-a71a-c10ae27e62ae/element/:wdc:1523809525391/value"}}
Screenshot: available via screen
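`InvalidElementStateException` usually means the username field wasn't ready for input yet (still hidden or not yet rendered when send_keys ran). One generic workaround is to retry for a short window instead of failing on the first attempt; a sketch that works with any object exposing send_keys(), not tied to media-scraper's actual login code:

```python
import time

def send_keys_when_ready(element, text, timeout=10.0, pause=0.5):
    """Keep retrying send_keys until the element accepts input or the
    timeout expires (a sketch; Selenium's explicit waits are the more
    idiomatic tool when a real WebDriver is available)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            element.send_keys(text)
            return True
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(pause)
```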
The Twitter media scraper does not download any videos.
selenium.common.exceptions.WebDriverException: Message: Service ./webdriver/phantomjsdriver_2.1.1_mac64/phantomjs unexpectedly exited. Status code was: -9
macOS 10.14.5
python 3.7 (macports)
C:\Users\Freeman\Desktop\media-scraper-master>python -m mediascraper.twitter [783214]
Starting PhantomJS web driver...
C:\Users\Freeman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Traceback (most recent call last):
File "C:\Users\Freeman\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Freeman\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Freeman\Desktop\media-scraper-master\mediascraper\twitter.py", line 18, in <module>
scraper.username(username)
AttributeError: 'TwitterScraper' object has no attribute 'username'
Hi there,
Is it possible to add a feature that scrapes all users of TikTok and then downloads every video of every user into a folder named after that user? For example, suppose A to Z are all the members of TikTok. The script would first scrape all members by username, create a folder A (or whatever the username is), then check all available videos and download them into that folder, and so on, downloading every video from every user of TikTok into folders named by username.
I hope you understand what feature I'm requesting.
Videos should be downloaded in the maximum available good quality.
Thank you so much for any help.
Hello,
Is it possible to add this feature: scrape all media from twitter.com/USER/likes?
Thank you.
python3 -m mediascraper.twitter
Doesn't work.
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/mediascraper/twitter.py", line 12, in <module>
scraper = mediascrapers.TwitterScraper(scroll_pause = 1.0, mode='normal', debug=False)
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/mediascrapers.py", line 378, in __init__
super().__init__(**kwargs)
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/mediascrapers.py", line 38, in __init__
self._driver = seleniumdriver.get('PhantomJS')
File "/Volumes/500GB/Users/Shared/moved from ~/Downloads/media-scraper/util/seleniumdriver.py", line 36, in get
driver = webdriver.PhantomJS(executable_path=source, service_log_path=join(path, 'phantomjs.log'), service_args=["--remote-debugger-port=9000", "--web-security=false"])
AttributeError: module 'selenium.webdriver' has no attribute 'PhantomJS'
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/ /Downloads/media-scraper-master/mediascraper/twitter.py", line 14, in <module>
scraper.login('credentials.json')
File "/Users/ /Downloads/media-scraper-master/mediascrapers.py", line 444, in login
username = [u for u in usernames if u.get_attribute('class') == 'js-username-field email-input js-initial-focus'][0]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
Received the following error: "AttributeError: 'MediaScraper' object has no attribute 'connect'" after running the following code listed in the README.
import mediascraper
scraper = mediascrapers.MediaScraper()
scraper.connect(URL)
scraper.scrape(path=SAVE_PATH)
I noticed that the scrape method should actually take the URL argument, and that the connect method belongs to the Scraper class, not the MediaScraper class. I'm unsure how to fix it myself, but I hope this helps.
hi
Windows 7
Python 3.7.6
pip 21.3.1
It does not work on Reddit, Twitter, Instagram, or TikTok.
After cloning the repository and running it from cmd, I wanted to try it on Instagram, TikTok, Twitter, and Reddit. This is the problem I see on Instagram; I will paste the output of each problem below.
C:\Users\pc\Desktop\media-scraper>python -m mediascraper.instagram instagram
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Crawling...
Traceback (most recent call last):
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\pc\Desktop\media-scraper\mediascraper\instagram.py", line 16, in <module>
tasks = scraper.scrape(username)
File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 262, in scrape
tasks += (task[0], username, task[1])
IndexError: list index out of range
C:\Users\pc\Desktop\media-scraper>python m-scraper.py rq instagram instagram
Namespace(credential_file=None, early_stop=False, keywords=['instagram'], save_path=None)
Instagramer Task: instagram
Traceback (most recent call last):
File "m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\downloader.py", line 82, in run
self.crawl(keyword, args.early_stop)
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\instagramer.py", line 46, in crawl
tasks, end_cursor, has_next, length, user_id, rhx_gis, csrf_token = get_first_page(username)
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\utils\instagram.py", line 52, in get_first_page
rhx_gis = shared_data['rhx_gis']
KeyError: 'rhx_gis'
C:\Users\pc\Desktop\media-scraper>python m-scraper.py rq tiktok tiktok
Namespace(credential_file=None, early_stop=False, keywords=['tiktok'], save_path=None)
{'statusCode': 10000, 'verifyConfig': {'code': 10000, 'type': 'verify', 'subtype': 'slide', 'fp': 'verify_dd9489ac31d2e6b50a4f9ed75b5240f2', 'region': 'va', 'detail': 'vEyCkJEKBnSe-zq257GFQJrLW03-aOs8awmNop3PD5IGQA4kjoDDIU6NDQKG7BnEsMWT8C-WHIUjsfHZ9OMgl9009Qcdo2LIOBhGJyNK118AOCRmw8StlADDjuzkZrFHFDTHnSgp2x651wwrNM6-FYFCOlP0izZx6n*pCjcMIM1sjOh0zwAye*FM5lPnHiVJ1eER3KmM*q6VpyCU*uNyTeYkaDpcFOMdgP3br0HlsWO--*jeaUPVnSjP8RejdrEQgq7oLsXM4rjjf14GhyWBa0H8kj*LODz42UoKrM32r4Fm6VjEAoEeRrjmHVUkwbwAptLOsfmREJTSdtNToMx6t4NqBXWm0mJ24vXdY9Txp83rH49pTmZE1wbEupTi18B1Tw..'}}
Traceback (most recent call last):
File "m-scraper.py", line 36, in <module>
scraper.run(sys.argv[3:])
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\downloader.py", line 82, in run
self.crawl(keyword, args.early_stop)
File "C:\Users\pc\Desktop\media-scraper\m_scraper\rq\tiktoker.py", line 38, in crawl
raise Exception('body not found')
Exception: body not found
C:\Users\pc\Desktop\media-scraper>python -m mediascraper.twitter twitter
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Crawling...
Traceback (most recent call last):
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\pc\Desktop\media-scraper\mediascraper\twitter.py", line 18, in <module>
tasks = scraper.scrape(username)
File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 392, in scrape
done = self.scrollToBottom()
File "C:\Users\pc\Desktop\media-scraper\mediascrapers.py", line 87, in scrollToBottom
last_height, new_height = self._driver.execute_script("return document.body.scrollHeight"), 0
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 636, in execute_script
'args': converted_args})['value']
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\pc\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: {"errorMessage":"Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js 'nonce-MDk3YWFmNWEtOGRkZS00NGQzLWE2MjMtZjUzNzhjMGZhZGJl'\".\n","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"112","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:55938","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"script\": \"return document.body.scrollHeight\", \"args\": [], \"sessionId\": \"7c44f2b0-7a92-11ec-a2dc-d38583c010ce\"}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/7c44f2b0-7a92-11ec-a2dc-d38583c010ce/execute"}}
Screenshot: available via screen
I also tried setting the web driver's permissions to 777 for convenience.
C:\Users\pc\Desktop\media-scraper>chmod 777 webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
'chmod' is not recognized as an internal or external command, operable program or batch file.
I wish this tool was working because it would be the best tool on the internet
Greetings to all
As stated in the title, I can only download pictures. I downloaded from 4 different accounts and it's still the same result.
My testing account was always realDonaldTrump, and mediascraper.twitter works fine on it. However, for other accounts, for instance sigrid_ig, videos are not captured by mediascraper.twitter. The HTML structure seems to differ between accounts, so it needs to be handled case by case.
The code is still not downloading videos from Twitter. What should I do to make it work?
Hi, I tried downloading Twitter images, but it saves each image in nested subfolders like this:
(media-scraper\download\twitter\username\username\username\username)
Is there any way to fix this and just download all images into one folder? Thanks.
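The nesting pattern suggests the username is being appended to a save path that already ends with it, so each run adds another level. A guard that makes the join idempotent would prevent that (a hypothetical sketch, not media-scraper's actual path-building code):

```python
import os

def join_once(base, username):
    """Append `username` to `base` only when it's not already the last
    path component, preventing username/username/... nesting."""
    if os.path.basename(os.path.normpath(base)) == username:
        return base
    return os.path.join(base, username)
```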
When I try to scrape an Instagram account:
PS C:\Users\User\media-scraper> python -m mediascraper.instagram sigridupdating
Starting PhantomJS web driver...
.\webdriver/phantomjsdriver_2.1.1_win32/phantomjs.exe
C:\Python\Python37\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Crawling...
Traceback (most recent call last):
File "C:\Python\Python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Python\Python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\User\media-scraper\mediascraper\instagram.py", line 16, in <module>
tasks = scraper.scrape(username)
File "C:\Users\User\media-scraper\mediascrapers.py", line 238, in scrape
data = self.getJsonData(username)
File "C:\Users\User\media-scraper\mediascrapers.py", line 227, in getJsonData
content = self._driver.find_element_by_tag_name('pre').text
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 530, in find_element_by_tag_name
return self.find_element(by=By.TAG_NAME, value=name)
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
'value': value})['value']
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with tag name 'pre'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"90","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:54137","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"tag name\", \"value\": \"pre\", \"sessionId\": \"70cf5fb0-6c21-11e9-975b-e51852ab1518\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/70cf5fb0-6c21-11e9-975b-e51852ab1518/element"}}
Screenshot: available via screen
Any idea what is causing this?