bpb27 / twitter_scraping
Grab all a user's tweets (and get past 3200 limit)
When scraping metadata with get_metadata.py, there's a 6-second sleep between each api.statuses_lookup(id_batch) call.
The Twitter API allows up to 300 calls per 15-minute window for application auth, or 900 for user auth (mentioned here). That works out to one call every 3 seconds for application auth, or one every second for user auth.
Is there a reason 6 seconds is used rather than 3 or 1, given that the example keys use the API through user authentication?
I understand that 150 requests per window is Twitter's standard 'if we aren't clear, use that' value, but I'm pretty sure we could speed up metadata scraping if I've understood their docs correctly.
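For reference, the arithmetic works out like this (a minimal sketch; the constants are Twitter's documented limits, not values taken from the repo):

```python
# Derive the minimum sleep between statuses_lookup calls from the
# documented limits: 300 calls / 15 min (app auth), 900 (user auth).
WINDOW_SECONDS = 15 * 60  # one rate-limit window

def min_sleep(calls_per_window):
    """Seconds to wait between calls to stay inside one window."""
    return WINDOW_SECONDS / calls_per_window

print(min_sleep(300))  # app auth: 3.0 seconds per call
print(min_sleep(900))  # user auth: 1.0 second per call
```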
Hello,
I'm a poor linguist trying his hand at Python, and this whole project has been a great resource to me. I'm currently trying to compile a Twitter corpus for my work, but when I run the script to get the metadata, I encounter this error:
Traceback (most recent call last):
  File "C:/Users/timho/Documents/Programming/Twitter-Corpus/twitter_scraping-master/get_metadata.py", line 93, in <module>
    f.writerow([x["favorite_count"], x["source"], x["text"], x["in_reply_to_screen_name"], x["is_retweet"], x["created_at"], x["retweet_count"], x["id_str"]])
  File "C:\Users\XXX\TwitterCorpus\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f60e' in position 115: character maps to <undefined>
It's obviously related to character encoding; the character causing the problem here is the sunglasses emoji. However, I'm afraid I'm too new to Python to deal with this problem myself. Would greatly appreciate any help on how to fix this!
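In case it helps: on Windows, open() defaults to the cp1252 codec, which can't encode emoji. Forcing UTF-8 when opening the output CSV is the usual fix (a sketch, not the repo's actual code; the filename here is made up):

```python
import csv

# Opening the file with an explicit UTF-8 encoding lets writerow()
# handle emoji and other non-cp1252 characters.
with open('tweets.csv', 'w', newline='', encoding='utf-8') as out:
    f = csv.writer(out)
    f.writerow(['deal with it \U0001f60e'])  # would crash under cp1252
```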
Twitter has updated how they format the src parameter; they no longer use ...&src=typd.
The correct format is ...&src=typed_query.
If, like me, you need the tweet text, you don't even need to use tweepy. The text can be extracted from each tweet the same way the IDs are queried.
This goes at the beginning:
# This is similar to the ID selector
text_selector = 'p.tweet-text'
This goes after the ID logging:
# Get the text
text = tweet.find_element_by_css_selector(text_selector).text
# Normalize typographic punctuation (needed for inconsistent tweeters, like @realDonaldTrump in your code)
text = text.replace(u'–', '-')
text = text.replace(u'—', '-')
text = text.replace(u'―', '-')
text = text.replace(u'“', '"')
text = text.replace(u'”', '"')
text = text.replace(u'’', "'")
text = text.replace(u'‘', "'")
text = text.replace(u'…', '...')
# Remove hyperlinks (optional)
text = text.split("http")[0]
# Remove leading and trailing spaces
text = text.strip(' ')
# Remove the colon that preceded any hyperlink (optional)
if text.endswith(':'):
    text = text[:-1]
# Pad the string to the 140-character limit (useful for training recurrent neural networks)
text = text.ljust(140, ' ')
Doesn't work in 2021. Does anybody know why? I think it's the URL.
Please see the following error:
WebDriverException Traceback (most recent call last)
<ipython-input-10-8c8793c3187e> in <module>()
14 # only edit these if you're having problems
15 delay = 1 # time to wait on each page load before reading the page
---> 16 driver = webdriver.Firefox() # options are Chrome() Firefox() Safari()
17
18
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.pyc in __init__(self, firefox_profile, firefox_binary, timeout, capabilities, proxy, executable_path, options, service_log_path, firefox_options, service_args, desired_capabilities, log_path, keep_alive)
172 command_executor=executor,
173 desired_capabilities=capabilities,
--> 174 keep_alive=True)
175
176 # Selenium remote
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
155 warnings.warn("Please use FirefoxOptions to set browser profile",
156 DeprecationWarning, stacklevel=2)
--> 157 self.start_session(capabilities, browser_profile)
158 self._switch_to = SwitchTo(self)
159 self._mobile = Mobile(self)
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in start_session(self, capabilities, browser_profile)
250 parameters = {"capabilities": w3c_caps,
251 "desiredCapabilities": capabilities}
--> 252 response = self.execute(Command.NEW_SESSION, parameters)
253 if 'sessionId' not in response:
254 response = response['value']
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in execute(self, driver_command, params)
319 response = self.command_executor.execute(driver_command, params)
320 if response:
--> 321 self.error_handler.check_response(response)
322 response['value'] = self._unwrap_value(
323 response.get('value', None))
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.pyc in check_response(self, response)
240 alert_text = value['alert'].get('text')
241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
243
244 def _value_or_default(self, obj, key, default):
WebDriverException: Message: newSession
I've followed all the steps,
but when I try to get the metadata there's an issue.
python3 get_metadata.py
Traceback (most recent call last):
File "get_metadata.py", line 25, in <module>
with open('all_ids.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'all_ids.json'
[luan@labsalpha twitter_scraping-master]$
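A likely cause, assuming get_metadata.py opens 'all_ids.json' with a relative path: the file is looked up in the current working directory, not next to the script. A small diagnostic (illustrative; the helper name is mine, not from the repo):

```python
import os

def find_ids_file(name='all_ids.json'):
    """Return the path if the IDs file exists in the current directory, else None."""
    return name if os.path.exists(name) else None

# Run scrape.py first (it writes all_ids.json), then run get_metadata.py
# from the same directory so this lookup succeeds.
```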
Hey,
I'm trying to get a user's most recent 1000 tweets and would like to go backwards, from most recent to oldest, until either hitting the end date or 1000 tweets.
Can you give some pointers on how to achieve this?
I get an error while running scrape.py. It continuously gives the same error; I think it happens when the number of tweets in a day is too large. It gets 58k IDs while I'm trying to get 163k from a user. How can I solve the problem?
[4524:1340:0520/180809.731:ERROR:latency_info.cc(164)] Display::DrawAndSwap, LatencyInfo vector size 102 is too big.
Hi, it seems that the search result does not return retweets, although the request contains include:retweets.
Any ideas? Has Twitter disabled this kind of request?
Is it possible to use this script with another browser, just by swapping in a different driver?
Hi! Thank you so much for this. I really hope it works; I've tried, coded, and searched the internet for a solution, and so far yours seems to be the only one. However, I do not know how to get the all_ids.json file without actually knowing the IDs of the tweets I need (and for that, I'd first have to scrape the timeline in order to obtain the IDs themselves, right?). I also want to download DJT's tweets. Do you have a suggestion for how I can get the IDs, or could you maybe post your all_ids.json file (as you're also scraping DJT's posts)? That would be so great…
I was looking over your code. it looks great but I had a question about what you're actually doing:
all_ids = ids + json.load(f)
data_to_write = list(set(all_ids))
print('tweets found on this scrape: ', len(ids))
print('total tweet count: ', len(data_to_write))
When I run the code, my output summary is below:
tweets found on this scrape: 1349
total tweet count: 6098
But then what does this mean? As the tweets are being scraped from the Twitter DOM, I can understand where the top number comes from. I assume there may be duplicate tweets (like if someone posted a tweet on day 1 and someone else posted a reply on day 2), but then where does the second line come from?
How does 1349 turn into 6098? Surely after removing duplicates the total tweet count should be less than 1349?
I also copied the IDs from the JSON into Excel and tried to remove the dupes (hoping it would reduce from 6098 down to 1349), but there were no duplicates :(
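If I read the snippet right, the answer is that json.load(f) pulls in every ID saved by previous runs, so the "total" line also counts old scrapes. A toy reconstruction (the numbers are made up):

```python
previous_runs = [1, 2, 3, 4]   # stands in for json.load(f) -> the old all_ids.json
ids = [3, 4, 5]                # stands in for this scrape's IDs

all_ids = ids + previous_runs
data_to_write = list(set(all_ids))  # dedupes across runs, not within this one

print('tweets found on this scrape: ', len(ids))   # 3
print('total tweet count: ', len(data_to_write))   # 5 (includes earlier runs)
```

So the total can exceed this run's count, and the JSON itself never contains duplicates, which matches what Excel showed.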
Hi there - huge thanks for this. For me it scrapes all the tweets and then says:
Traceback (most recent call last):
File "scrape.py", line 82, in <module>
except FileNotFoundError:
NameError: name 'FileNotFoundError' is not defined
I have just installed Python for the first time so I'm guessing it could be something to do with my setup?
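Likely yes: FileNotFoundError was only added in Python 3.3, so catching it under Python 2 raises exactly this NameError. A version-portable sketch of the catch (illustrative, not the repo's code):

```python
# IOError/OSError exist on both interpreters, and FileNotFoundError
# subclasses OSError in Python 3, so this works either way.
try:
    with open('definitely_missing_ids.json') as f:
        data = f.read()
except (IOError, OSError):
    data = '[]'  # fall back to an empty ID list
```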
Accounts like https://twitter.com/jackiec57 are unscrapable in search,
unless the user has logged into the browser and turned off the sensitivity filter.
Scrapy can't scrape them, as it does not allow cookies; can Selenium?
Some ideas to reduce the repetitive need to click "show" on restricted accounts:
<input name="search-settings-nsfw" checked="" class="SearchSettings-toggleInput SearchSettings-toggleInput--sensitive" type="checkbox">
<button id="settings_save" class="EdgeButton EdgeButton--primary EdgeButton--medium" type="submit" disabled="">Save changes</button>
How do I change the location that scrape.py and get_metadata.py output their files to?
When I run the scrape.py, the final JSON created is blank without any ids in it. It was working last week. Does anyone know how to solve it?
Wouldn't this script be a lot faster if it loaded chunks of weeks or months rather than of single days?
e.g.
Current behaviour:
Potentially better behaviour:
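Agreed that it could be. As a sketch of the idea (not the repo's code), the daily loop could emit one search window per week instead:

```python
from datetime import date, timedelta

def weekly_windows(start, end):
    """Yield (since, until) pairs covering [start, end) one week at a time."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=7), end)
        yield cur, nxt
        cur = nxt

windows = list(weekly_windows(date(2017, 1, 1), date(2017, 2, 1)))
print(len(windows))  # 5 windows instead of 31 daily page loads
```

The trade-off is that a busy week may exceed what one search page load returns, so the chunk size would need tuning per account.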
The code works fine, except it doesn't show extended text. Is there a way to get the full text?
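If the metadata step uses tweepy, requesting tweet_mode='extended' and reading full_text should return untruncated text. A sketch (the helper name is mine; api would be a tweepy.API instance set up as in get_metadata.py):

```python
def lookup_full_text(api, id_batch):
    """Return untruncated texts for a batch of status IDs via extended mode."""
    tweets = api.statuses_lookup(id_batch, tweet_mode='extended')
    return [t.full_text for t in tweets]
```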
Hi all, I am very new here; this is my first time trying this script.
Everything goes fine up to creating the "all_ids.json" file.
But when I go to the next step, get_metadata.py, the command returns the following error:
python3 Desktop/twitter_scraping-master/get_metadata.py
Traceback (most recent call last):
File "Desktop/twitter_scraping-master/get_metadata.py", line 1, in <module>
import tweepy
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/__init__.py", line 17, in <module>
from tweepy.streaming import Stream, StreamListener
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/streaming.py", line 358
def _start(self, async):
^
SyntaxError: invalid syntax
I have no idea how to fix this error.
Please give me a clue what to do.
Thanks a lot!
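The cause is almost certainly Python 3.7: async became a hard keyword there, so tweepy releases older than 3.7 fail to import with exactly this SyntaxError, and upgrading tweepy (pip3 install --upgrade tweepy) should fix it. A quick check of the keyword change:

```python
import keyword

# On Python 3.7+, `async` is a reserved keyword, so the old
# `def _start(self, async):` signature in tweepy/streaming.py cannot parse.
print(keyword.iskeyword('async'))  # True on 3.7+
```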
Hi!
Would you add a license to this?