twitter_scraping's Introduction

Twitter Scraper

Twitter makes it hard to get all of a user's tweets if they have more than 3,200, which is as far back as the official API's user timeline will go. This is a way to get around that using Python, Selenium, and Tweepy.

Essentially, we use Selenium to open a browser and automatically visit Twitter's search page, searching for a single user's tweets on a single day. If we want all tweets from 2015, we check all 365 days/pages. This would be a nightmare to do manually, so the scrape.py script does it all for you: input a date range and a Twitter handle, then wait for it to finish.
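
For a sense of how that per-day loop works, here is a minimal sketch (the URL format and values are illustrative assumptions, not the exact code in scrape.py):

import datetime
from selenium import webdriver

driver = webdriver.Safari()  # scrape.py also supports Chrome() / Firefox()
day = datetime.date(2015, 1, 1)
while day < datetime.date(2016, 1, 1):
    since = day.isoformat()
    until = (day + datetime.timedelta(days=1)).isoformat()
    # Search for one user's tweets on one day
    driver.get('https://twitter.com/search?q=from%3Arealdonaldtrump'
               '%20since%3A{}%20until%3A{}&src=typed_query'.format(since, until))
    # ... scroll the page and collect tweet ids here ...
    day += datetime.timedelta(days=1)
driver.quit()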

The scrape.py script collects tweet ids. If you know a tweet's id number, you can get all the information available about that tweet using Tweepy - text, timestamp, number of retweets / replies / favorites, geolocation, etc. Tweepy uses Twitter's API, so you will need to get API keys. Once you have them, you can run the get_metadata.py script.
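
Roughly, that lookup step works like this (a sketch, not the exact contents of get_metadata.py; the key-file field names are assumptions):

import json
import tweepy

with open('api_keys.json') as f:
    keys = json.load(f)  # field names assumed; see sample_api_keys.json

auth = tweepy.OAuthHandler(keys['consumer_key'], keys['consumer_secret'])
auth.set_access_token(keys['access_token'], keys['access_token_secret'])
api = tweepy.API(auth)

with open('all_ids.json') as f:
    ids = json.load(f)

tweets = []
for i in range(0, len(ids), 100):  # statuses_lookup accepts at most 100 ids per call
    tweets.extend(api.statuses_lookup(ids[i:i + 100]))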

Requirements

  • basic knowledge of how to use a terminal
  • Safari 10+ with the 'Allow Remote Automation' option enabled in Safari's Develop menu, so WebDriver can control Safari
  • python3
    • to check, in your terminal, enter python3
    • if you don't have it, check YouTube for installation instructions
  • pip or pip3
    • to check, in your terminal, enter pip or pip3
    • if you don't have it, again, check YouTube for installation instructions
  • selenium (3.0.1)
    • pip3 install selenium
  • tweepy (3.5.0)
    • pip3 install tweepy

Running the scraper

  • open up scrape.py and edit the user, start, and end variables (see the sketch after this list), and save the file
  • run python3 scrape.py
  • you'll see a browser pop up and output in the terminal
  • do some other fun task until it finishes
  • once it's done, it outputs all the tweet ids it found into all_ids.json
  • every time you run the scraper with different dates, it will add the new ids to the same file
    • it automatically removes duplicates so don't worry about small date overlaps
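
For reference, the variables at the top of scrape.py look roughly like this (the values are examples; the exact defaults in the file may differ):

import datetime

user = 'realdonaldtrump'               # twitter handle, without the @
start = datetime.datetime(2015, 1, 1)  # first day to search
end = datetime.datetime(2016, 1, 1)    # day to stop searching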

Troubleshooting the scraper

  • do you get a no such file error? you need to cd to the directory of scrape.py
  • do you get a driver error when you try and run the script?
    • open scrape.py and change the driver to use Chrome() or Firefox() (the relevant lines are shown after this list)
      • if neither work, google the error (you probably need to install a new driver)
  • does it seem like it's not collecting tweets for days that have tweets?
    • open scrape.py and change the delay variable to 2 or 3
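
For reference, the two lines in question near the top of scrape.py look like this (they also appear verbatim in a traceback quoted further down this page); swap the driver class or raise the delay here:

delay = 1  # time to wait on each page load before reading the page
driver = webdriver.Firefox()  # options are Chrome() Firefox() Safari()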

Getting the metadata

  • first you'll need to get twitter API keys
  • put your keys into the sample_api_keys.json file (see the sketch after this list)
  • change the name of sample_api_keys.json to api_keys.json
  • open up get_metadata.py and edit the user variable (and save the file)
  • run python3 get_metadata.py
  • this will get metadata for every tweet id in all_ids.json
  • it will create 4 files
    • username.json (master file with all metadata)
    • username.zip (a zipped file of the master file with all metadata)
    • username_short.json (smaller master file with relevant metadata fields)
    • username.csv (csv version of the smaller master file)
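
The key file is plain JSON; its contents look roughly like this (the field names are an assumption based on what Tweepy needs, so check sample_api_keys.json for the exact names):

{
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}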

twitter_scraping's People

Contributors

aboutaaron, bpb27

twitter_scraping's Issues

`statuses_lookup` sleep time

When scraping metadata with get_metadata.py, there's a 6-second sleep between each api.statuses_lookup(id_batch) call.

The Twitter API allows up to 300 calls per 15-minute window for application auth, or 900 for user auth (mentioned here). That means a call every 3 seconds for application auth, or one every second for user auth.

Is there a reason that 6 seconds is used rather than 3 or 1, since the example keys are for using the API through user authentication?

I understand that 150 requests per window is their standard 'if we aren't clear, use that' value, but I'm pretty sure we could speed up metadata scraping if I've understood their docs correctly.
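
For what it's worth, the minimum sleep is just the window length divided by the per-window call limit (using the figures cited above):

window = 15 * 60       # seconds in a rate-limit window
print(window / 900)    # user auth: 1.0 second between calls
print(window / 300)    # application auth: 3.0 seconds between calls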

Output locations

How do I change the location that scrape.py and get_metadata.py output the files to?

Issue with Firefox in Selenium.

Please see the following error:

WebDriverException                        Traceback (most recent call last)
<ipython-input-10-8c8793c3187e> in <module>()
     14 # only edit these if you're having problems
     15 delay = 1  # time to wait on each page load before reading the page
---> 16 driver = webdriver.Firefox()  # options are Chrome() Firefox() Safari()
     17 
     18 

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.pyc in __init__(self, firefox_profile, firefox_binary, timeout, capabilities, proxy, executable_path, options, service_log_path, firefox_options, service_args, desired_capabilities, log_path, keep_alive)
    172                 command_executor=executor,
    173                 desired_capabilities=capabilities,
--> 174                 keep_alive=True)
    175 
    176         # Selenium remote

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
    155             warnings.warn("Please use FirefoxOptions to set browser profile",
    156                           DeprecationWarning, stacklevel=2)
--> 157         self.start_session(capabilities, browser_profile)
    158         self._switch_to = SwitchTo(self)
    159         self._mobile = Mobile(self)

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in start_session(self, capabilities, browser_profile)
    250         parameters = {"capabilities": w3c_caps,
    251                       "desiredCapabilities": capabilities}
--> 252         response = self.execute(Command.NEW_SESSION, parameters)
    253         if 'sessionId' not in response:
    254             response = response['value']

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.pyc in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

WebDriverException: Message: newSession

How to get all_ids.json

Hi! Thank you so much for this - I really hope it works. I've tried, coded, and searched the internet for a solution, and so far yours seems to be the only one. However, I do not know how to get the all_ids.json file without actually knowing the IDs of the tweets I need (and for that, I'd first have to scrape the timeline in order to obtain the IDs themselves, right?). I also want to download DJT's tweets. Do you have a suggestion for how I can get the IDs, or could you maybe post your all_ids.json file (as you're also scraping DJT's posts)? That would be so great…

URL

Doesn't work in 2021. Does anybody know what the problem is? I think it's the URL.

Can't get the IDs

When I run scrape.py, the final JSON file it creates is blank, without any ids in it. It was working last week. Does anyone know how to solve this?

Can't get all of them "LatencyInfo vector size 102 is too big."

I get an error while running scrape.py. It continuously gives the same error; I think this happens when the number of tweets in a day is too large. It got 58k ids while I was trying to get 163k from a user. How can I solve this problem?

[4524:1340:0520/180809.731:ERROR:latency_info.cc(164)] Display::DrawAndSwap, LatencyInfo vector size 102 is too big.

Metadata problem

I've followed all the steps, but when I try to get the metadata there's an issue.

python3 get_metadata.py

Traceback (most recent call last):
  File "get_metadata.py", line 25, in <module>
    with open('all_ids.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'all_ids.json'
[luan@labsalpha twitter_scraping-master]$

Character encoding for output files

Hello,

I'm a poor linguist trying his hand at Python, and this whole project has been a great resource to me. I'm currently trying to compile a Twitter corpus for my work, but when I run the script to get the metadata, I encounter this error:

Traceback (most recent call last):
  File "C:/Users/timho/Documents/Programming/Twitter-Corpus/twitter_scraping-master/get_metadata.py", line 93, in <module>
    f.writerow([x["favorite_count"], x["source"], x["text"], x["in_reply_to_screen_name"], x["is_retweet"], x["created_at"], x["retweet_count"], x["id_str"]])
  File "C:\Users\XXX\TwitterCorpus\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f60e' in position 115: character maps to <undefined>

It's obviously related to character encoding; the character causing the problem here is the sunglasses emoji. However, I'm afraid I'm too new to Python to deal with this problem myself. Would greatly appreciate any help on how to fix this!
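
A sketch of one likely fix, assuming get_metadata.py opens the CSV with the platform default encoding: pass an explicit UTF-8 encoding instead, so characters like emoji survive on Windows (whose default here is cp1252).

import csv

# Force UTF-8 on the output file rather than relying on the OS default.
with open('username.csv', 'w', encoding='utf-8', newline='') as out:
    f = csv.writer(out)
    # ... the f.writerow(...) calls from get_metadata.py go here ...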

(Comment) A different way to get tweet text

If, like me, you need the tweet text, you don't even need to use Tweepy. The text can be extracted from each tweet in the same way the IDs are queried for.

This goes at the beginning

# This is similar to the ID selector
text_selector = 'p.tweet-text'

This goes after ID logging

# Get the text
text = tweet.find_element_by_css_selector(text_selector).text

# Needed for inconsistent tweeters (like @realDonaldTrump in your code)
text = text.replace(u'–', '-')
text = text.replace(u'—', '-')
text = text.replace(u'―', '-')
text = text.replace(u'“', '"')
text = text.replace(u'”', '"')
text = text.replace(u'’', "'")
text = text.replace(u'‘', "'")
text = text.replace(u'…', '...')

# Remove hyperlinks (optional)
text = text.split("http")[0]

# Remove leading and trailing spaces
text = text.strip(' ')

# Remove colon that preceded any hyperlink (optional)
if text.endswith(':'):
    text = text[:-1]

# Pad the string to the 140 character limit (useful for training recurrent neural networks)
text = text.ljust(140, ' ')

URL src parameter is malformed

Twitter has updated how they format the src parameter, they no longer use ...&src=typd.
The correct format is ...&src=typed_query.
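
With that change, a full search URL looks roughly like this (illustrative values):

https://twitter.com/search?q=from%3Arealdonaldtrump%20since%3A2015-01-01%20until%3A2015-01-02&src=typed_query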

License

Hi!

Would you add a license to this?

Can't write the file?

Hi there - huge thanks for this. For me it scrapes all the tweets and then says:

Traceback (most recent call last):
  File "scrape.py", line 82, in <module>
    except FileNotFoundError:
NameError: name 'FileNotFoundError' is not defined

I have just installed Python for the first time so I'm guessing it could be something to do with my setup?
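
A note on this one: FileNotFoundError only exists in Python 3, so this NameError usually means the script was run under Python 2. Running it with python3 fixes it; alternatively, a small compatibility shim near the top of scrape.py would too (a sketch):

# FileNotFoundError is Python 3 only; alias it to IOError on Python 2.
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError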

What is the difference between the total count and the scraped tweets count?

I was looking over your code. It looks great, but I had a question about what you're actually doing:

all_ids = ids + json.load(f)
data_to_write = list(set(all_ids))
print('tweets found on this scrape: ', len(ids))
print('total tweet count: ', len(data_to_write))

When I run the code, my output summary is below:

tweets found on this scrape: 1349
total tweet count: 6098

But then what does this mean? As the tweets are being scraped from the Twitter DOM, I can understand where the top number comes from. I assume there may be duplicate tweets (like if someone posted a tweet on day 1 and someone else posted a reply on day 2), but then where does the second line come from?

How does 1349 turn into 6098? Surely after removing duplicates the total tweet count should be less than 1349? I also copied the IDs from the JSON into Excel and tried to remove the duplicates (hoping it would reduce from 6098 down to 1349), but there were no duplicates :(

Scraping "restricted" accounts

Accounts like https://twitter.com/jackiec57 are unscrapable in search unless the user has logged into the browser and turned off sensitivity filtering. Scrapy can't scrape them, as it does not allow cookies; can Selenium?

Some ideas to reduce the repetitive need to click "show" on restricted accounts:

  1. Login https://gist.github.com/momota10/969d904b4cad239da2a5c00df1ad87e7
  2. Go to https://twitter.com/settings/safety
  3. Check "show sensitive content" in
    <input name="search-settings-nsfw" checked="" class="SearchSettings-toggleInput SearchSettings-toggleInput--sensitive" type="checkbox">
  4. Click "submit" <button id="settings_save" class="EdgeButton EdgeButton--primary EdgeButton--medium" type="submit" disabled="">Save changes</button>

full_text

The code works fine except it doesn't show extended text. Is there a workaround to get the full text?
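
In recent Tweepy 3.x releases the usual workaround is to pass tweet_mode='extended' and read full_text instead of text (a sketch, reusing the api and id_batch names from get_metadata.py):

# Request untruncated tweets; Tweepy version must support tweet_mode.
tweets = api.statuses_lookup(id_batch, tweet_mode='extended')
for tweet in tweets:
    print(tweet.full_text)  # .text is truncated at 140 chars; .full_text is not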

get_metadata.py SyntaxError: invalid syntax

Hi all, I am very new here; this is my first experience. Just trying this script, everything goes fine up to creating the "all_ids.json" file. But when I go to the next step, get_metadata.py, the command returns the following error:

python3 Desktop/twitter_scraping-master/get_metadata.py
Traceback (most recent call last):
  File "Desktop/twitter_scraping-master/get_metadata.py", line 1, in <module>
    import tweepy
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/__init__.py", line 17, in <module>
    from tweepy.streaming import Stream, StreamListener
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/streaming.py", line 358
    def _start(self, async):
                     ^
SyntaxError: invalid syntax

I have no idea how to fix this error. Please give me some clue about what to do. Thanks a lot!

include:retweets does not work

Hi, it seems that the search results do not include retweets even though the request contains include:retweets.

Any ideas? Has Twitter disabled this kind of request?

crawl tweets backwards

Hey,

I'm trying to get the most recent 1000 tweets from a user and would like to go backwards, from most recent to oldest, until I either hit the end date or 1000 tweets.

Can you give some pointers on how to achieve this?
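
One way to sketch it, assuming the per-day loop described in the README and a hypothetical collect_ids_for_day() helper wrapping the Selenium search (not part of scrape.py): iterate the days in reverse and stop once enough ids have been collected.

import datetime

ids = []
day = datetime.date.today()
end_date = datetime.date(2015, 1, 1)  # example cutoff
while day >= end_date and len(ids) < 1000:
    ids.extend(collect_ids_for_day(day))  # hypothetical helper
    day -= datetime.timedelta(days=1)
ids = ids[:1000]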
