twitter_scraping's Introduction

Twitter Scraper

Twitter makes it hard to get all of a user's tweets if they have more than 3,200, which is as far back as the official API's user timeline will go. This is a way to get around that using Python, Selenium, and Tweepy.

Essentially, we use Selenium to open a browser and automatically visit Twitter's search page, searching for a single user's tweets on a single day. If we want all tweets from 2015, we check all 365 days/pages. This would be a nightmare to do manually, so the scrape.py script does it all for you: input a date range and a Twitter handle, then wait for it to finish.
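
For a sense of how that per-day loop works, here is a minimal sketch (the URL format and values are illustrative assumptions, not the exact code in scrape.py):

import datetime
from selenium import webdriver

driver = webdriver.Safari()  # scrape.py also supports Chrome() / Firefox()
day = datetime.date(2015, 1, 1)
while day < datetime.date(2016, 1, 1):
    since = day.isoformat()
    until = (day + datetime.timedelta(days=1)).isoformat()
    # Search for one user's tweets on one day
    driver.get('https://twitter.com/search?q=from%3Arealdonaldtrump'
               '%20since%3A{}%20until%3A{}&src=typed_query'.format(since, until))
    # ... scroll the page and collect tweet ids here ...
    day += datetime.timedelta(days=1)
driver.quit()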

The scrape.py script collects tweet ids. If you know a tweet's id number, you can get all the information available about that tweet using Tweepy - text, timestamp, number of retweets / replies / favorites, geolocation, etc. Tweepy uses Twitter's API, so you will need to get API keys. Once you have them, you can run the get_metadata.py script.
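
Roughly, that lookup step works like this (a sketch, not the exact contents of get_metadata.py; the key-file field names are assumptions):

import json
import tweepy

with open('api_keys.json') as f:
    keys = json.load(f)  # field names assumed; see sample_api_keys.json

auth = tweepy.OAuthHandler(keys['consumer_key'], keys['consumer_secret'])
auth.set_access_token(keys['access_token'], keys['access_token_secret'])
api = tweepy.API(auth)

with open('all_ids.json') as f:
    ids = json.load(f)

tweets = []
for i in range(0, len(ids), 100):  # statuses_lookup accepts at most 100 ids per call
    tweets.extend(api.statuses_lookup(ids[i:i + 100]))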

Requirements

  • basic knowledge of how to use a terminal
  • Safari 10+ with the 'Allow Remote Automation' option enabled in Safari's Develop menu, so WebDriver can control Safari
  • python3
    • to check, in your terminal, enter python3
    • if you don't have it, check YouTube for installation instructions
  • pip or pip3
    • to check, in your terminal, enter pip or pip3
    • if you don't have it, again, check YouTube for installation instructions
  • selenium (3.0.1)
    • pip3 install selenium
  • tweepy (3.5.0)
    • pip3 install tweepy

Running the scraper

  • open up scrape.py and edit the user, start, and end variables (see the sketch after this list), and save the file
  • run python3 scrape.py
  • you'll see a browser pop up and output in the terminal
  • do some other fun task until it finishes
  • once it's done, it outputs all the tweet ids it found into all_ids.json
  • every time you run the scraper with different dates, it will add the new ids to the same file
    • it automatically removes duplicates so don't worry about small date overlaps
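
For reference, the variables at the top of scrape.py look roughly like this (the values are examples; the exact defaults in the file may differ):

import datetime

user = 'realdonaldtrump'               # twitter handle, without the @
start = datetime.datetime(2015, 1, 1)  # first day to search
end = datetime.datetime(2016, 1, 1)    # day to stop searching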

Troubleshooting the scraper

  • do you get a no such file error? you need to cd to the directory of scrape.py
  • do you get a driver error when you try and run the script?
    • open scrape.py and change the driver to use Chrome() or Firefox() (the relevant lines are shown after this list)
      • if neither work, google the error (you probably need to install a new driver)
  • does it seem like it's not collecting tweets for days that have tweets?
    • open scrape.py and change the delay variable to 2 or 3
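
For reference, the two lines in question near the top of scrape.py look like this (they also appear verbatim in a traceback quoted further down this page); swap the driver class or raise the delay here:

delay = 1  # time to wait on each page load before reading the page
driver = webdriver.Firefox()  # options are Chrome() Firefox() Safari()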

Getting the metadata

  • first you'll need to get twitter API keys
  • put your keys into the sample_api_keys.json file (see the sketch after this list)
  • change the name of sample_api_keys.json to api_keys.json
  • open up get_metadata.py and edit the user variable (and save the file)
  • run python3 get_metadata.py
  • this will get metadata for every tweet id in all_ids.json
  • it will create 4 files
    • username.json (master file with all metadata)
    • username.zip (a zipped file of the master file with all metadata)
    • username_short.json (smaller master file with relevant metadata fields)
    • username.csv (csv version of the smaller master file)
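
The key file is plain JSON; its contents look roughly like this (the field names are an assumption based on what Tweepy needs, so check sample_api_keys.json for the exact names):

{
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}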

twitter_scraping's People

Contributors

aboutaaron, bpb27

twitter_scraping's Issues

`statuses_lookup` sleep time

When scraping metadata with get_metadata.py, there's a 6-second sleep between each api.statuses_lookup(id_batch) call.

The Twitter API allows up to 300 calls per 15-minute window for application auth, or 900 for user auth (mentioned here). That means a call every 3 seconds for application auth, or one every second for user auth.

Is there a reason that 6 seconds is used rather than 3 or 1, since the example keys are for using the API through user authentication?

I understand that 150 requests per window is their standard 'if we aren't clear, use that' value, but I'm pretty sure we could speed up metadata scraping if I've understood their docs correctly.
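
For what it's worth, the minimum sleep is just the window length divided by the per-window call limit (using the figures cited above):

window = 15 * 60       # seconds in a rate-limit window
print(window / 900)    # user auth: 1.0 second between calls
print(window / 300)    # application auth: 3.0 seconds between calls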

Output locations

How do I change the location that scrape.py and get_metadata.py output the files to?

Issue with Firefox in Selenium.

Please see the following error:

WebDriverException                        Traceback (most recent call last)
<ipython-input-10-8c8793c3187e> in <module>()
     14 # only edit these if you're having problems
     15 delay = 1  # time to wait on each page load before reading the page
---> 16 driver = webdriver.Firefox()  # options are Chrome() Firefox() Safari()
     17 
     18 

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.pyc in __init__(self, firefox_profile, firefox_binary, timeout, capabilities, proxy, executable_path, options, service_log_path, firefox_options, service_args, desired_capabilities, log_path, keep_alive)
    172                 command_executor=executor,
    173                 desired_capabilities=capabilities,
--> 174                 keep_alive=True)
    175 
    176         # Selenium remote

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
    155             warnings.warn("Please use FirefoxOptions to set browser profile",
    156                           DeprecationWarning, stacklevel=2)
--> 157         self.start_session(capabilities, browser_profile)
    158         self._switch_to = SwitchTo(self)
    159         self._mobile = Mobile(self)

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in start_session(self, capabilities, browser_profile)
    250         parameters = {"capabilities": w3c_caps,
    251                       "desiredCapabilities": capabilities}
--> 252         response = self.execute(Command.NEW_SESSION, parameters)
    253         if 'sessionId' not in response:
    254             response = response['value']

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.pyc in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

WebDriverException: Message: newSession

How to get all_ids.json

Hi! Thank you so much for this - I really hope it works. I've tried, coded, and searched the internet for a solution, and so far yours seems to be the only one. However, I do not know how to get the all_ids.json file without actually knowing the IDs of the tweets I need (and for that, I'd first have to scrape the timeline in order to obtain the IDs themselves, right?). I also want to download DJT's tweets. Do you have a suggestion for how I can get the IDs, or could you maybe post your all_ids.json file (as you're also scraping DJT's posts)? That would be so great…

URL

Doesn't work in 2021. Does anybody know what the problem is? I think it's the URL.

Can't get the IDs

When I run scrape.py, the final JSON file it creates is blank, without any ids in it. It was working last week. Does anyone know how to solve this?

Can't get all of them "LatencyInfo vector size 102 is too big."

I get an error while running scrape.py. It continuously gives the same error; I think this happens when the number of tweets in a day is too large. It got 58k ids while I was trying to get 163k from a user. How can I solve this problem?

[4524:1340:0520/180809.731:ERROR:latency_info.cc(164)] Display::DrawAndSwap, LatencyInfo vector size 102 is too big.

Metadata problem

I've followed all the steps, but when I try to get the metadata there's an issue.

python3 get_metadata.py

Traceback (most recent call last):
  File "get_metadata.py", line 25, in <module>
    with open('all_ids.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'all_ids.json'
[luan@labsalpha twitter_scraping-master]$

Character encoding for output files

Hello,

I'm a poor linguist trying his hand at Python, and this whole project has been a great resource to me. I'm currently trying to compile a Twitter corpus for my work, but when I run the script to get the metadata, I encounter this error:

Traceback (most recent call last):
  File "C:/Users/timho/Documents/Programming/Twitter-Corpus/twitter_scraping-master/get_metadata.py", line 93, in <module>
    f.writerow([x["favorite_count"], x["source"], x["text"], x["in_reply_to_screen_name"], x["is_retweet"], x["created_at"], x["retweet_count"], x["id_str"]])
  File "C:\Users\XXX\TwitterCorpus\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f60e' in position 115: character maps to <undefined>

It's obviously related to character encoding; the character causing the problem here is the sunglasses emoji. However, I'm afraid I'm too new to Python to deal with this problem myself. Would greatly appreciate any help on how to fix this!
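
A sketch of one likely fix, assuming get_metadata.py opens the CSV with the platform default encoding: pass an explicit UTF-8 encoding instead, so characters like emoji survive on Windows (whose default here is cp1252).

import csv

# Force UTF-8 on the output file rather than relying on the OS default.
with open('username.csv', 'w', encoding='utf-8', newline='') as out:
    f = csv.writer(out)
    # ... the f.writerow(...) calls from get_metadata.py go here ...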

(Comment) A different way to get tweet text

If, like me, you need the tweet text, you don't even need to use Tweepy. The text can be extracted from each tweet in the same way the IDs are queried for.

This goes at the beginning

# This is similar to the ID selector
text_selector = 'p.tweet-text'

This goes after ID logging

# Get the text
text = tweet.find_element_by_css_selector(text_selector).text

# Needed for inconsistent tweeters (like @realDonaldTrump in your code)
text = text.replace(u'–', '-')
text = text.replace(u'—', '-')
text = text.replace(u'―', '-')
text = text.replace(u'“', '"')
text = text.replace(u'”', '"')
text = text.replace(u'’', "'")
text = text.replace(u'‘', "'")
text = text.replace(u'…', '...')

# Remove hyperlinks (optional)
text = text.split("http")[0]

# Remove leading and trailing spaces
text = text.strip(' ')

# Remove colon that preceded any hyperlink (optional)
if text.endswith(':'):
    text = text[:-1]

# Pad the string to the 140 character limit (useful for training recurrent neural networks)
text = text.ljust(140, ' ')

URL src parameter is malformed

Twitter has updated how they format the src parameter, they no longer use ...&src=typd.
The correct format is ...&src=typed_query.
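
With that change, a full search URL looks roughly like this (illustrative values):

https://twitter.com/search?q=from%3Arealdonaldtrump%20since%3A2015-01-01%20until%3A2015-01-02&src=typed_query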

License

Hi!

Would you add a license to this?

Can't write the file?

Hi there - huge thanks for this. For me it scrapes all the tweets and then says:

Traceback (most recent call last):
  File "scrape.py", line 82, in <module>
    except FileNotFoundError:
NameError: name 'FileNotFoundError' is not defined

I have just installed Python for the first time so I'm guessing it could be something to do with my setup?
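
A note on this one: FileNotFoundError only exists in Python 3, so this NameError usually means the script was run under Python 2. Running it with python3 fixes it; alternatively, a small compatibility shim near the top of scrape.py would too (a sketch):

# FileNotFoundError is Python 3 only; alias it to IOError on Python 2.
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError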

What is the difference between the total count and the scraped tweets count?

I was looking over your code. It looks great, but I had a question about what you're actually doing:

all_ids = ids + json.load(f)
data_to_write = list(set(all_ids))
print('tweets found on this scrape: ', len(ids))
print('total tweet count: ', len(data_to_write))

When I run the code, my output summary is below:

tweets found on this scrape: 1349
total tweet count: 6098

But then what does this mean? As the tweets are being scraped from the Twitter DOM, I can understand where the top number comes from. I assume there may be duplicate tweets (like if someone posted a tweet on day 1 and someone else posted a reply on day 2), but then where does the second line come from?

How does 1349 turn into 6098? Surely after removing duplicates the total tweet count should be less than 1349? I also copied the IDs from the JSON into Excel and tried to remove the duplicates (hoping it would reduce from 6098 down to 1349), but there were no duplicates :(

Scraping "restricted" accounts

Accounts like https://twitter.com/jackiec57 are unscrapable in search unless the user has logged into the browser and turned off sensitivity filtering. Scrapy can't scrape them, as it does not allow cookies; can Selenium?

Some ideas to reduce the repetitive need to click "show" on restricted accounts:

  1. Login https://gist.github.com/momota10/969d904b4cad239da2a5c00df1ad87e7
  2. Go to https://twitter.com/settings/safety
  3. Check "show sensitive content" in
    <input name="search-settings-nsfw" checked="" class="SearchSettings-toggleInput SearchSettings-toggleInput--sensitive" type="checkbox">
  4. Click "submit" <button id="settings_save" class="EdgeButton EdgeButton--primary EdgeButton--medium" type="submit" disabled="">Save changes</button>

full_text

The code works fine except it doesn't show extended text. Is there a workaround to get the full text?
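
In recent Tweepy 3.x releases the usual workaround is to pass tweet_mode='extended' and read full_text instead of text (a sketch, reusing the api and id_batch names from get_metadata.py):

# Request untruncated tweets; Tweepy version must support tweet_mode.
tweets = api.statuses_lookup(id_batch, tweet_mode='extended')
for tweet in tweets:
    print(tweet.full_text)  # .text is truncated at 140 chars; .full_text is not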

get_metadata.py SyntaxError: invalid syntax

Hi all, I am very new here; this is my first experience. Just trying this script, everything goes fine up to creating the "all_ids.json" file. But when I go to the next step, get_metadata.py, the command returns the following error:

python3 Desktop/twitter_scraping-master/get_metadata.py
Traceback (most recent call last):
  File "Desktop/twitter_scraping-master/get_metadata.py", line 1, in <module>
    import tweepy
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/__init__.py", line 17, in <module>
    from tweepy.streaming import Stream, StreamListener
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/streaming.py", line 358
    def _start(self, async):
                     ^
SyntaxError: invalid syntax

I have no idea how to fix this error. Please give me some clue about what to do. Thanks a lot!

include:retweets does not work

Hi, it seems that the search results do not include retweets even though the request contains include:retweets.

Any ideas? Has Twitter disabled this kind of request?

crawl tweets backwards

Hey,

I'm trying to get the most recent 1000 tweets from a user and would like to go backwards, from most recent to oldest, until I either hit the end date or 1000 tweets.

Can you give some pointers on how to achieve this?
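
One way to sketch it, assuming the per-day loop described in the README and a hypothetical collect_ids_for_day() helper wrapping the Selenium search (not part of scrape.py): iterate the days in reverse and stop once enough ids have been collected.

import datetime

ids = []
day = datetime.date.today()
end_date = datetime.date(2015, 1, 1)  # example cutoff
while day >= end_date and len(ids) < 1000:
    ids.extend(collect_ids_for_day(day))  # hypothetical helper
    day -= datetime.timedelta(days=1)
ids = ids[:1000]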
