snscrape

snscrape is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches and returns the discovered items, e.g. the relevant posts.

The following services are currently supported:

  • Facebook: user profiles, groups, and communities (aka visitor posts)
  • Instagram: user profiles, hashtags, and locations
  • Mastodon: user profiles and toots (single or thread)
  • Reddit: users, subreddits, and searches (via Pushshift)
  • Telegram: channels
  • Twitter: users, user profiles, hashtags, searches (live tweets, top tweets, and users), tweets (single or surrounding thread), list posts, communities, and trends
  • VKontakte: user profiles
  • Weibo (Sina Weibo): user profiles

Requirements

snscrape requires Python 3.8 or higher. The Python package dependencies are installed automatically when you install snscrape.

Note that one of the dependencies, lxml, also requires libxml2 and libxslt to be installed.

Installation

pip3 install snscrape

If you want to use the development version:

pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

Usage

CLI

The generic syntax of snscrape's CLI is:

snscrape [GLOBAL-OPTIONS] SCRAPER-NAME [SCRAPER-OPTIONS] [SCRAPER-ARGUMENTS...]

snscrape --help and snscrape SCRAPER-NAME --help provide details on the options and arguments. snscrape --help also lists all available scrapers.

The default output of the CLI is the URL of each result.

Some noteworthy global options are:

  • --jsonl to get output as JSONL. This includes all information extracted by snscrape (e.g. message content, datetime, images; details vary by scraper).
  • --max-results NUMBER to only return the first NUMBER results.
  • --with-entity to get an item on the entity being scraped, e.g. the user or channel. This is not supported on all scrapers. (You can use this together with --max-results 0 to only fetch the entity info.)

Examples

Collect all tweets by Jason Scott (@textfiles):

snscrape twitter-user textfiles

It's usually useful to redirect the output to a file for further processing, e.g. in bash using the filename twitter-@textfiles:

snscrape twitter-user textfiles >twitter-@textfiles

To get the latest 100 tweets with the hashtag #archiveteam:

snscrape --max-results 100 twitter-hashtag archiveteam
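
To fetch only the profile information (the entity) of a user, without any posts, the --with-entity and --max-results global options mentioned above can be combined:

snscrape --jsonl --max-results 0 --with-entity twitter-user textfiles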

Library

It is also possible to use snscrape as a library in Python, but this is currently undocumented.
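
Since the interface is undocumented and subject to change, the following is only a minimal sketch; the module and class names (snscrape.modules.twitter.TwitterUserScraper) reflect the current layout, while the attribute names on the yielded items may differ between versions.

import itertools

import snscrape.modules.twitter as sntwitter

scraper = sntwitter.TwitterUserScraper('textfiles')
for tweet in itertools.islice(scraper.get_items(), 10):
    print(tweet.url)  # items also carry content, date, and other metadata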

Issue reporting

If you discover an issue with snscrape, please report it at https://github.com/JustAnotherArchivist/snscrape/issues. If you use the CLI, please run snscrape with -vv and include the log output in the issue. If you use snscrape as a module, please enable debug-level logging using import logging; logging.basicConfig(level = logging.DEBUG) (before using snscrape at all) and include the log output in the issue.
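
For module usage, that means running something like this before anything from snscrape is imported or called:

import logging
logging.basicConfig(level = logging.DEBUG)

import snscrape.modules.twitter  # all subsequent requests are now logged at debug level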

Dump files

In some cases, debugging may require more information than is available in the log. The CLI has a --dump-locals option that enables dumping all local variables within snscrape based on important log messages (rather than, by default, only on crashes). Note that the dump files may contain sensitive information in some cases and could potentially be used to identify you (e.g. if the service includes your IP address in its response). If you prefer to arrange a file transfer privately, just mention that in the issue.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Contributors

accentusoft, caseyho, edsu, engkimo, hgrsd, ivan, jackdallas, jodizzle, justanotherarchivist, kelcheone, lukpier, mrunderline, novucs, own3dh2so4, peterk, quentinwolf, thetechrobo


snscrape's Issues

"KeyError: 'data-expanded-url'" crash on encountering certain tweets

Example tweet: https://twitter.com/Europarl_PL/status/1128652934521159681

Example command that's crashing:

snscrape -v twitter-search --max-position TWEET-1128654559075885058-1129149757925023744 '#TellEurope'

Crash:

Traceback (most recent call last):
  File ".../bin/snscrape", line 11, in <module>
    load_entry_point('snscrape', 'console_scripts', 'snscrape')()
  File ".../snscrape/cli.py", line 107, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File ".../snscrape/modules/twitter.py", line 106, in get_items
    yield from self._feed_to_items(feed)
  File ".../snscrape/modules/twitter.py", line 50, in _feed_to_items
    outlinks.append(a['data-expanded-url'])
  File ".../lib/python3.6/site-packages/bs4/element.py", line 1071, in __getitem__
    return self.attrs[key]
KeyError: 'data-expanded-url'

I don't know what's going on here. The tweet has a u-hidden class when I access it through Firefox but not when snscrape gets it. UA magic?
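
A defensive fix would be to use bs4's Tag.get, which returns None for a missing attribute, instead of item access; a sketch (whether these links should be skipped or handled differently is a separate question):

url = a.get('data-expanded-url')
if url is not None:
    outlinks.append(url)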

Support for media sets

Follow-up to #36

snscrape should produce the /media/set/ URL, fetch it, and also produce the URL for each photo in the set.

Fails to extract all content on Facebook

Facebook being Facebook, there are a lot of different URL formats for posts. There's /username/posts/ID and photos/videos, which snscrape discovers. But there are also at least three more which are currently not found: /permalink.php?..., /notes/profile-slug/note-slug/ID, and /events/ID (which are not page-specific but still part of the feed).

Support for Telegram

2019-06-02 15:04:23 UTC  <@JAA> Relevant URLs: https://t.me/s/Telegram/ -> https://t.me/telegram/91 -> https://t.me/s/telegram/91 which then links to older posts.
2019-06-02 15:05:18 UTC  <@JAA> There's also this: https://t.me/s/Telegram?before=91

Block invalid Twitter usernames

Yes, yet another bug in Twitter's search. This time, it involves invalid usernames that look like domains. For example, https://twitter.com/search?q=from:archiveteam.org returns various Tweets containing links to archiveteam.org, which contradicts the search documentation (hidden behind "operators" on search-home). An account with the username archiveteam.org does not exist, so that search should not return any results. In fact, such an account can't exist: Twitter usernames must match [A-Za-z0-9_]{1,15} per Twitter's help page.

Because this unexpectedly produces tweets e.g. on snscrape twitter-user archiveteam.org, snscrape should directly block such invalid usernames.
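
A minimal validation sketch based on that pattern (the helper name is hypothetical):

import re

def is_valid_twitter_username(name):
    # Per Twitter's help page: 1 to 15 characters from [A-Za-z0-9_]
    return re.fullmatch(r'[A-Za-z0-9_]{1,15}', name) is not None

assert is_valid_twitter_username('archiveteam')
assert not is_valid_twitter_username('archiveteam.org')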

Facebook: JSONDecodeError due to empty response

While scraping FWPthailand, I just got this exception:

Traceback (most recent call last):
  File ".../bin/snscrape", line 11, in <module>
    load_entry_point('snscrape==0.1.3', 'console_scripts', 'snscrape')()
  File ".../lib/python3.6/site-packages/snscrape/cli.py", line 59, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File ".../lib/python3.6/site-packages/snscrape/modules/facebook.py", line 59, in get_items
    response = json.loads(spuriousForLoopPattern.sub('', r.text))
  File ".../lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File ".../lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".../lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Facebook apparently sends a completely empty response there. I can reproduce this at the moment.
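
Given the responseOkCallback convention visible in other tracebacks (a (success, message) tuple), one possible guard is to reject the empty body before it reaches the JSON parser so the usual retry logic can kick in; a sketch with illustrative names:

def _check_nonempty_callback(self, r):
    if not r.text.strip():
        return False, 'empty response body'
    return True, None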

Only return posts newer than X

It would be useful to be able to collect only results that are newer than a certain date. For example, this would make it possible to periodically archive an account's new messages without re-retrieving the entire post history.

Envisioned usage example: snscrape --since 2018-01-01 twitter-user username

This requires #9. Once that is implemented, this is a trivial date comparison in the CLI loop.
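
Once items carry a datetime, that CLI-side filter could look roughly like this (the item attributes are assumptions, and the break assumes results arrive newest-first):

import datetime

since = datetime.datetime(2018, 1, 1, tzinfo = datetime.timezone.utc)
for item in scraper.get_items():
    if item.date < since:
        break  # everything after this point is older
    print(item.url)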

Facebook "Ignoring odd link: #"

Scraping https://www.facebook.com/petroporoshenko/ produces lots of

WARNING  snscrape.modules.facebook  Ignoring odd link: #

snscrape already handles those href="#" links in one particular case (user added photos to an album), but it looks like there are others. I haven't yet looked into what the cause is on this particular profile.

Support for Mastodon

Mastodon and Pleroma are both federated alternative social networks that would be useful to support (especially Pleroma, as the website is loaded with JavaScript, so an archivebot won't grab a user). Ideally, users, hashtags, threads, and entire instance timelines would be supported.

Support for Twitter threads

Twitter threads are a common way to convey more information on Twitter than fits into a single message. snscrape should support scraping these.

This could be either a discovery of the original tweeter's replies to their own initial tweet or a recursive discovery of all replies to a tweet. The latter would of course be much more expensive.

Return information on the source page (e.g. correct capitalisation of a username or the user ID)

SNS are typically case-preserving, i.e. a user can sign up with any particular capitalisation but can later also be found using a different capitalisation. For example, https://twitter.com/textfiles and https://twitter.com/Textfiles both return Jason Scott's Twitter profile, with the correct capitalisation being the lower-case "textfiles".

Another issue becomes apparent on Facebook and Google+: on both sites, you can refer to a user either by their username or by a user ID. For example, in the case of Google+, https://plus.google.com/+theeldersorg and https://plus.google.com/114101993040150430579 return the same page. The post URLs can be varied in the same way, e.g. https://plus.google.com/+theeldersorg/posts/eZ9siFsegrc and https://plus.google.com/114101993040150430579/posts/eZ9siFsegrc are both valid links to The Elders' most recent post on Google+. Google+ is actually quite inconsistent in when it returns usernames/handles and IDs.

It would be good if snscrape could return information on a source page so that the caller can adapt accordingly. In the above examples, this would include the information that "textfiles" is the correct capitalisation and that +theeldersorg and 114101993040150430579 are equivalent, respectively. More information could be included, of course, e.g. profile picture, description, signup date, or whatever else is available on the platform. The caller could then for example also print the canonical profile URL or replace usernames and user IDs in URLs so that you could e.g. archive both with another tool (so it's easier to find the post). This would also need to be exposed on the CLI somehow.

Scraping certain Twitter users produces no or too few results

For some users, snscrape is unable to list any tweets at all, or any tweets before some point in time. This seems to be a bug in Twitter's search engine.

In some cases, this appears to be temporary. For example, when I tried to scrape https://twitter.com/NewsweekUK on 2018-02-08, it did not produce any results. I confirmed in the browser back then that the search for from:NewsweekUK (which is what snscrape uses) indeed turned up empty. For this particular user, the issue seems to have been fixed.

Other examples:

I believe that this issue is unfixable from snscrape's side. As a workaround, however, snscrape could fall back to scraping the user profile page if it finds no results on the search. This would only yield the 3200 most recent tweets, but that's still better than zero.

ERROR snscrape.modules.twitter Content type of ... is not JSON

While running snscrape twitter-user jmestepa, I observed:

2018-09-28 10:58:51.418  ERROR  snscrape.modules.twitter  Content type of https://twitter.com/i/search/timeline?f=tweets&vertical=default&lang=en&q=from%3Ajmestepa&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-329031970229207042-1045516283784171520 is not JSON
2018-09-28 11:02:36.135  ERROR  snscrape.modules.twitter  Content type of https://twitter.com/i/search/timeline?f=tweets&vertical=default&lang=en&q=from%3Ajmestepa&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-18781362769-1045516283784171520 is not JSON

That looks like a transient error on a bad response that snscrape should probably re-fetch, if it doesn't already.

From the error message, I assume the worst case of "it gave up after that error", but if that is not the case, maybe that could be made more obvious?

Edit: on other accounts too:

2018-09-28 11:09:36.836  ERROR  snscrape.modules.twitter  Content type of https://twitter.com/i/search/timeline?f=tweets&vertical=default&lang=en&q=from%3Arepublicoftogo&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-1020278804193972224-1045610783923744768 is not JSON

Provide a cleaned URL

Some services include garbage in their URLs. For example, Instagram posts linked on the profile page have a useless taken-by parameter carrying the username, and Facebook recently (a few weeks ago) started including some (most likely tracking) parameters __xts__, __tn__, and eid in the post links. While the Instagram case is not too problematic, since that parameter has a meaning and is constant, the Facebook one is definitely undesired since those parameters are carried along when forwarding a link to another person or software and can probably be used to identify the scraping user. When using snscrape for archival purposes, it can also make it very difficult to find the archived page later.

The Item instances yielded by the modules should carry both the original and the cleaned URL, and the CLI should provide options to print either or both of these variants. The default should presumably be to print the cleaned URL.

What exactly is garbage and what isn't still needs to be figured out though. For example, photo posts on Facebook typically include a type=3 parameter, which I believe determines the way the picture is displayed. I'm not sure if this should be stripped or not.
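
For the Facebook case, a cleaning sketch using only the standard library (the parameter list is illustrative, per the discussion above):

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

trackingParams = {'__xts__', '__tn__', 'eid'}

def clean_url(url):
    parts = urlsplit(url)
    # Facebook uses bracketed keys like __xts__[0], so compare by base name
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values = True)
             if k.split('[')[0] not in trackingParams]
    return urlunsplit(parts._replace(query = urlencode(query)))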

AttributeError when scraping facebook user

Similar to #35:

2019-05-01 22:25:38.717  INFO  snscrape.modules.facebook  Retrieving next page
2019-05-01 22:25:38.717  INFO  snscrape.base  Retrieving https://www.facebook.com/pages_reaction_units/more/?page_id=9094598058&cursor=%7B%22timeline_cursor%22%3A%22timeline_unit%3A1%3A00000000001319942759%3A04611686018427387904%3A09223372036854775704%3A04611686018427387904%22%2C%22timeline_section_cursor%22%3A%7B%22profile_id%22%3A9094598058%2C%22start%22%3A1325404800%2C%22end%22%3A1357027199%2C%22query_type%22%3A8%2C%22filter%22%3A1%7D%2C%22has_next_page%22%3Atrue%7D&surface=www_pages_home&unit_count=8&__a=1
2019-05-01 22:25:38.717  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36', 'Accept-Language': 'en-US,en;q=0.5'}
2019-05-01 22:25:39.275  DEBUG  urllib3.connectionpool  https://www.facebook.com:443 "GET /pages_reaction_units/more/?page_id=9094598058&cursor=%7B%22timeline_cursor%22%3A%22timeline_unit%3A1%3A00000000001319942759%3A04611686018427387904%3A09223372036854775704%3A04611686018427387904%22%2C%22timeline_section_cursor%22%3A%7B%22profile_id%22%3A9094598058%2C%22start%22%3A1325404800%2C%22end%22%3A1357027199%2C%22query_type%22%3A8%2C%22filter%22%3A1%7D%2C%22has_next_page%22%3Atrue%7D&surface=www_pages_home&unit_count=8&__a=1 HTTP/1.1" 200 None
2019-05-01 22:25:39.279  DEBUG  snscrape.base  https://www.facebook.com/pages_reaction_units/more/?page_id=9094598058&cursor=%7B%22timeline_cursor%22%3A%22timeline_unit%3A1%3A00000000001319942759%3A04611686018427387904%3A09223372036854775704%3A04611686018427387904%22%2C%22timeline_section_cursor%22%3A%7B%22profile_id%22%3A9094598058%2C%22start%22%3A1325404800%2C%22end%22%3A1357027199%2C%22query_type%22%3A8%2C%22filter%22%3A1%7D%2C%22has_next_page%22%3Atrue%7D&surface=www_pages_home&unit_count=8&__a=1 retrieved successfully
Traceback (most recent call last):
  File "/home/user/env/bin/snscrape", line 11, in <module>
    load_entry_point('snscrape==0.1.3', 'console_scripts', 'snscrape')()
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/cli.py", line 83, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/modules/facebook.py", line 112, in get_items
    yield from self._soup_to_items(soup, baseUrl)
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/modules/facebook.py", line 60, in _soup_to_items
    href = entryA.get('href')
AttributeError: 'NoneType' object has no attribute 'get'

The Facebook user is 'leopoldolopezoficial'. This one also seems to be reproducible.

twitter: KeyError: 'content-type'

This happens (infrequently) when scraping a twitter-user; it looks spurious and not related to a specific user:

Traceback (most recent call last):
  File "/home/grab/sns-venv/bin/snscrape", line 11, in <module>
    load_entry_point('snscrape==0.1.3', 'console_scripts', 'snscrape')()
  File "/home/grab/sns-venv/lib/python3.7/site-packages/snscrape/cli.py", line 59, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/grab/sns-venv/lib/python3.7/site-packages/snscrape/modules/twitter.py", line 63, in get_items
    responseOkCallback = self._check_json_callback)
  File "/home/grab/sns-venv/lib/python3.7/site-packages/snscrape/base.py", line 99, in _get
    return self._request('GET', *args, **kwargs)
  File "/home/grab/sns-venv/lib/python3.7/site-packages/snscrape/base.py", line 72, in _request
    success, msg = responseOkCallback(r)
  File "/home/grab/sns-venv/lib/python3.7/site-packages/snscrape/modules/twitter.py", line 29, in _check_json_callback
    if r.headers['content-type'] != 'application/json;charset=utf-8':
  File "/home/grab/sns-venv/lib/python3.7/site-packages/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-type'
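
The crash itself has a straightforward defensive fix: use the (case-insensitive) headers mapping's .get so that a missing header counts as "not JSON" instead of raising. A sketch, with illustrative return values:

def _check_json_callback(self, r):
    if r.headers.get('content-type') != 'application/json;charset=utf-8':
        return False, 'content type is not JSON'
    return True, None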

Support for Reddit

Should support users, subreddits, and maybe threads (i.e. discovering the post URLs for each post in a thread; perhaps only for those that aren't displayed on the normal thread page due to deep nesting or too many comments).

This would have to go through the Pushshift API since Reddit's own API is garbage.

Twitter: "RuntimeError: Unable to find min-position" for empty users

Example: https://twitter.com/Eric_Andrieu

Traceback (most recent call last):
  File ".../bin/snscrape", line 11, in <module>
    load_entry_point('snscrape', 'console_scripts', 'snscrape')()
  File ".../snscrape/cli.py", line 111, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File ".../snscrape/modules/twitter.py", line 99, in get_items
    feed, maxPosition = self._get_feed_from_html(r.text, True)
  File ".../snscrape/modules/twitter.py", line 40, in _get_feed_from_html
    raise RuntimeError('Unable to find min-position')
RuntimeError: Unable to find min-position

Support for excluding replies

It would be useful to be able to get a list of posts that aren't replies (to other people). Threads should still be included, though.
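
In library terms, the filter might look like this once items expose reply metadata (inReplyToUser is a hypothetical attribute name):

for tweet in scraper.get_items():
    # Keep self-replies so threads stay intact; skip replies to other users
    if tweet.inReplyToUser is not None and tweet.inReplyToUser != tweet.user:
        continue
    print(tweet.url)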

Return proper items from modules

The current modules simply yield URLItem instances, i.e. they ignore all the other information about the post. To make the scraper more useful, they should instead yield items which also contain the post text, username, date, and potentially other metadata.

The output format of the snscrape CLI should remain the URL list by default, but it should also be possible to customise that. The available fields would vary by scraper then. Envisioned usage example: snscrape twitter-search --format '{utcdate:%Y-%m-%d %H:%M:%SZ} {user} {text}' 'word' would print lines of the date in UTC and ISO-8601 format, the tweeter, and the text message for each tweet found when searching for "word". (The text message might need some sanitisation for this, otherwise linebreaks might make the output confusing.)
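
The envisioned option maps naturally onto Python's str.format; a sketch, with the field names taken from the example above and hypothetical item attributes:

fmt = '{utcdate:%Y-%m-%d %H:%M:%SZ} {user} {text}'
for item in scraper.get_items():
    print(fmt.format(utcdate = item.date, user = item.user, text = item.text.replace('\n', ' ')))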

GUI

Is there any possible way to get a GUI wrapper for the average noob who wants to scrape an account?

Add a test suite

Tests for the modules should cover both recorded traffic and the current service; the former for regression tests, the latter for breaking server-side changes. This might require creating test accounts whose contents don't change with time.

instagram: JSONDecodeError

On two separate occasions across a couple thousand Instagram users, I've come across some sort of JSON error:

Case 1:

Traceback (most recent call last):
  File "/home/user/bin/snscrape_process.py", line 53, in <module>
    main()
  File "/home/user/bin/snscrape_process.py", line 47, in main
    for post in posts:
  File "/home/user/bin/snscrape_process.py", line 38, in get_posts
    for item in scraper.get_items():
  File "/home/user/snscrape/snscrape/modules/instagram.py", line 61, in get_items
    response = json.loads(r.text)
  File "/home/user/.pyenv/versions/3.7.0/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/user/.pyenv/versions/3.7.0/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/user/.pyenv/versions/3.7.0/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 37597 (char 37596)

Case 2:

Traceback (most recent call last):
  File "/home/user/bin/snscrape_process.py", line 53, in <module>
    main()
  File "/home/user/bin/snscrape_process.py", line 47, in main
    for post in posts:
  File "/home/user/bin/snscrape_process.py", line 38, in get_posts
    for item in scraper.get_items():
  File "/home/user/snscrape/snscrape/modules/instagram.py", line 35, in get_items
    response = json.loads(jsonData)
  File "/home/user/.pyenv/versions/3.7.0/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/user/.pyenv/versions/3.7.0/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/user/.pyenv/versions/3.7.0/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 29228 (char 29227)

(In the tracebacks, snscrape_process.py is the custom script I'm running snscrape from, i.e., I'm not using the CLI directly.)

I'm not sure how reproducible these issues are; a quick test suggests not very.

If it helps, I can share the particular accounts I was scraping from and/or the logs (I was logging at INFO level).

Support for Twitter user profile pages

As a partial workaround for #4, the Twitter module should fall back to scraping the user profile page if the search returns no results.

As a partial workaround for #8, specifically requesting retweets could also trigger this scrape (deduplicated with the search scrape).

Twitter user scrapes do not include retweets

Because of how snscrape discovers tweets (through the search page rather than the user profile, because the latter is limited to 3200 results while the former is not), it can't discover retweets. Or at least I haven't found any way to do that. The search term suggestions I found online are all several years old and no longer work.

I believe this is unfixable, but as a workaround, snscrape could at least retrieve the retweets contained among the user's 3200 most recent tweets by scraping the user profile page if requested. See #5.

Support for Pleroma

@ealgase on #43:

Mastodon and Pleroma are both federated alternative social networks that would be useful to support (especially Pleroma, as the website is loaded with JavaScript, so an archivebot won't grab a user). Ideally, users, hashtags, threads, and entire instance timelines would be supported.

Add a --fast option that uses string or regex extraction instead of HTML parsing

I haven't done any profiling, but it is likely that the parsing takes up a significant fraction of the total runtime. String/regex extraction is more sensitive to subtle changes in the markup but much more efficient than full parsing. A --fast option (or similar) that enables such processing instead of the full parser could be useful in particular for large scrapes.

(Although I've had this idea before, credit goes to /u/dmn002 for bringing it to my attention again.)
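
For illustration, tweet permalinks could be pulled out of the raw search HTML with a single regex instead of a parse tree; the attribute name here is an assumption about Twitter's current markup, which is exactly the kind of detail that makes this approach brittle:

import re

permalinkPattern = re.compile(r'data-permalink-path="(/[^"]+/status/\d+)"')

def fast_extract_permalinks(html):
    return permalinkPattern.findall(html)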

Support TOR proxy settings

Rate limiting on e.g. Instagram makes it hard to monitor accounts for new posts. If proxy settings could be passed to snscrape it would be easier to scrape more frequently.
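
requests, which snscrape uses for HTTP (see the urllib3/requests frames in the tracebacks above), supports SOCKS proxies when the requests[socks] extra is installed, so the feature mostly needs a way to pass a proxies mapping through to the session; a sketch of the underlying mechanism:

import requests

proxies = {
    'http': 'socks5h://127.0.0.1:9050',  # default Tor SOCKS port; socks5h resolves DNS through Tor
    'https': 'socks5h://127.0.0.1:9050',
}
session = requests.Session()
session.proxies.update(proxies)
print(session.get('https://check.torproject.org/').status_code)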

Limit frame locals dumping to snscrape modules

9e65385 introduces dumping the locals in all frames of get_items and below. This should be restricted to snscrape modules to avoid potentially dumping huge amounts of data from third-party libraries that are hardly useful for debugging anyway.
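
A sketch of the restriction based on each frame's module name (the helper name is hypothetical):

def _is_snscrape_frame(frame):
    module = frame.f_globals.get('__name__', '')
    return module == 'snscrape' or module.startswith('snscrape.')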

Not able to install it on raspbian

Hi,

I compiled and installed Python 3.6 on Raspbian (a Debian-stretch-based distro) and tried to install snscrape via pip in a virtualenv:

$ virtualenv --python=python3.6 snscrape
$ cd snscrape
$ . bin/activate
$ pip install snscrape

But I'm always getting some compilation error.

I already did:

$ sudo apt-get install python3-libxml2 python3-lxml python-libxml2 python-lxml

Do we have to install the libxml2 and lxml dependencies via pip? I assume that if that were the case, they would be installed automatically with the snscrape package.

I apologize, I am still a noob.
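
(For reference: pip builds lxml from source here, which needs the C development headers rather than the Debian Python bindings; on a Debian-based system, the fix is presumably something like the following, though the exact package names are an assumption:)

$ sudo apt-get install libxml2-dev libxslt1-dev python3-dev
$ pip install snscrape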

AttributeError on extracting tweet date

Here's the error, from the tail end of a log file with increased verbosity:

2019-05-02 00:33:05.481  INFO  snscrape.modules.twitter  Retrieving scroll page TWEET-1103001670152269824-1123744738421747713
2019-05-02 00:33:05.482  INFO  snscrape.base  Retrieving https://twitter.com/i/search/timeline?f=tweets&vertical=default&lang=en&q=%23MaduroRegime&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&qf=off&max_position=TWEET-1103001670152269824-1123744738421747713
2019-05-02 00:33:05.482  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.2584.142 Safari/537.36'}
2019-05-02 00:33:05.953  DEBUG  urllib3.connectionpool  https://twitter.com:443 "GET /i/search/timeline?f=tweets&vertical=default&lang=en&q=%23MaduroRegime&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&qf=off&max_position=TWEET-1103001670152269824-1123744738421747713 HTTP/1.1" 200 19292
2019-05-02 00:33:05.956  DEBUG  snscrape.base  https://twitter.com/i/search/timeline?f=tweets&vertical=default&lang=en&q=%23MaduroRegime&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&qf=off&max_position=TWEET-1103001670152269824-1123744738421747713 retrieved successfully
Traceback (most recent call last):
  File "/home/user/env/bin/snscrape", line 11, in <module>
    load_entry_point('snscrape==0.1.3', 'console_scripts', 'snscrape')()
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/cli.py", line 83, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/modules/twitter.py", line 83, in get_items
    yield from self._feed_to_items(feed)
  File "/home/user/env/lib/python3.7/site-packages/snscrape-0.1.3-py3.7.egg/snscrape/modules/twitter.py", line 38, in _feed_to_items
    date = datetime.datetime.fromtimestamp(int(tweet.find('a', 'tweet-timestamp').find('span', '_timestamp')['data-time']), datetime.timezone.utc)
AttributeError: 'NoneType' object has no attribute 'find'

The hashtag being collected was 'MaduroRegime'. It also seems to be reproducible, at least on my end, and at least within a short time frame.
