bpb27 / twitter_scraping
Grab all a user's tweets (and get past 3200 limit)
When scraping metadata with get_metadata.py, there's a 6-second sleep between each api.statuses_lookup(id_batch) call.
The Twitter API allows up to 300 calls per 15-minute window for application auth, or 900 for user auth (mentioned here). That works out to one call every 3 seconds for application auth, or one every second for user auth.
Is there a reason 6 seconds is used rather than 3 or 1, given that the example keys use the API through user authentication?
I understand that 150 requests per window is Twitter's standard 'if we aren't clear, use that' value, but I'm pretty sure we could speed up metadata scraping if I've understood their docs correctly.
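For reference, the arithmetic works out like this (a minimal sketch; the constants are Twitter's documented limits, not values taken from the repo):

```python
# Derive the minimum sleep between statuses_lookup calls from the
# documented limits: 300 calls / 15 min (app auth), 900 (user auth).
WINDOW_SECONDS = 15 * 60  # one rate-limit window

def min_sleep(calls_per_window):
    """Seconds to wait between calls to stay inside one window."""
    return WINDOW_SECONDS / calls_per_window

print(min_sleep(300))  # app auth: 3.0 seconds per call
print(min_sleep(900))  # user auth: 1.0 second per call
```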
Hello,
I'm a poor linguist trying his hand at Python, and this whole project has been a great resource to me. I'm currently trying to compile a Twitter corpus for my work, but when I run the script to get the metadata, I encounter this error:
Traceback (most recent call last):
  File "C:/Users/timho/Documents/Programming/Twitter-Corpus/twitter_scraping-master/get_metadata.py", line 93, in <module>
    f.writerow([x["favorite_count"], x["source"], x["text"], x["in_reply_to_screen_name"], x["is_retweet"], x["created_at"], x["retweet_count"], x["id_str"]])
  File "C:\Users\XXX\TwitterCorpus\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f60e' in position 115: character maps to <undefined>
It's obviously related to character encoding; the character causing the problem here is the sunglasses emoji. However, I'm afraid I'm too new to Python to deal with this problem myself. Would greatly appreciate any help on how to fix this!
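In case it helps: on Windows, open() defaults to the cp1252 codec, which can't encode emoji. Forcing UTF-8 when opening the output CSV is the usual fix (a sketch, not the repo's actual code; the filename here is made up):

```python
import csv

# Opening the file with an explicit UTF-8 encoding lets writerow()
# handle emoji and other non-cp1252 characters.
with open('tweets.csv', 'w', newline='', encoding='utf-8') as out:
    f = csv.writer(out)
    f.writerow(['deal with it \U0001f60e'])  # would crash under cp1252
```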
Twitter has updated how they format the src parameter; they no longer use ...&src=typd.
The correct format is ...&src=typed_query.
If, like me, you need the tweet text, you don't even need to use tweepy. The text can be extracted from each tweet the same way the IDs are queried.
This goes at the beginning:
# This is similar to the ID selector
text_selector = 'p.tweet-text'
This goes after the ID logging:
# Get the text
text = tweet.find_element_by_css_selector(text_selector).text
# Normalize typographic punctuation (needed for inconsistent tweeters, like @realDonaldTrump in your code)
text = text.replace(u'–', '-')
text = text.replace(u'—', '-')
text = text.replace(u'―', '-')
text = text.replace(u'“', '"')
text = text.replace(u'”', '"')
text = text.replace(u'’', "'")
text = text.replace(u'‘', "'")
text = text.replace(u'…', '...')
# Remove hyperlinks (optional)
text = text.split("http")[0]
# Remove leading and trailing spaces
text = text.strip(' ')
# Remove the colon that preceded any hyperlink (optional)
if text.endswith(':'):
    text = text[:-1]
# Pad the string to the 140-character limit (useful for training recurrent neural networks)
text = text.ljust(140, ' ')
Doesn't work in 2021. Does anybody know why? I think it's the URL.
Please see the following error:
WebDriverException Traceback (most recent call last)
<ipython-input-10-8c8793c3187e> in <module>()
14 # only edit these if you're having problems
15 delay = 1 # time to wait on each page load before reading the page
---> 16 driver = webdriver.Firefox() # options are Chrome() Firefox() Safari()
17
18
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.pyc in __init__(self, firefox_profile, firefox_binary, timeout, capabilities, proxy, executable_path, options, service_log_path, firefox_options, service_args, desired_capabilities, log_path, keep_alive)
172 command_executor=executor,
173 desired_capabilities=capabilities,
--> 174 keep_alive=True)
175
176 # Selenium remote
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
155 warnings.warn("Please use FirefoxOptions to set browser profile",
156 DeprecationWarning, stacklevel=2)
--> 157 self.start_session(capabilities, browser_profile)
158 self._switch_to = SwitchTo(self)
159 self._mobile = Mobile(self)
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in start_session(self, capabilities, browser_profile)
250 parameters = {"capabilities": w3c_caps,
251 "desiredCapabilities": capabilities}
--> 252 response = self.execute(Command.NEW_SESSION, parameters)
253 if 'sessionId' not in response:
254 response = response['value']
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in execute(self, driver_command, params)
319 response = self.command_executor.execute(driver_command, params)
320 if response:
--> 321 self.error_handler.check_response(response)
322 response['value'] = self._unwrap_value(
323 response.get('value', None))
/home/himanshusainie97/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.pyc in check_response(self, response)
240 alert_text = value['alert'].get('text')
241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
243
244 def _value_or_default(self, obj, key, default):
WebDriverException: Message: newSession
I've followed all the steps,
but when I try to get the metadata there's an issue.
python3 get_metadata.py
Traceback (most recent call last):
File "get_metadata.py", line 25, in <module>
with open('all_ids.json') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'all_ids.json'
[luan@labsalpha twitter_scraping-master]$
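A likely cause, assuming get_metadata.py opens 'all_ids.json' with a relative path: the file is looked up in the current working directory, not next to the script. A small diagnostic (illustrative; the helper name is mine, not from the repo):

```python
import os

def find_ids_file(name='all_ids.json'):
    """Return the path if the IDs file exists in the current directory, else None."""
    return name if os.path.exists(name) else None

# Run scrape.py first (it writes all_ids.json), then run get_metadata.py
# from the same directory so this lookup succeeds.
```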
Hey,
I'm trying to get a user's most recent 1000 tweets and would like to go backwards, from most recent to oldest, until either hitting the end date or 1000 tweets.
Can you give some pointers on how to achieve this?
I get an error while running scrape.py. It continuously gives the same error; I think it happens when the number of tweets in a day is too large. It gets 58k IDs while I'm trying to get 163k from a user. How can I solve the problem?
[4524:1340:0520/180809.731:ERROR:latency_info.cc(164)] Display::DrawAndSwap, LatencyInfo vector size 102 is too big.
Hi, it seems that the search result does not return retweets, although the request contains include:retweets.
Any ideas? Has Twitter disabled this kind of request?
Is it possible to use this script with another browser, just by swapping in a different driver?
Hi! Thank you so much for this. I really hope it works; I've tried, coded, and searched the internet for a solution, and so far yours seems to be the only one. However, I do not know how to get the all_ids.json file without actually knowing the IDs of the tweets I need (and for that, I'd first have to scrape the timeline in order to obtain the IDs themselves, right?). I also want to download DJT's tweets. Do you have a suggestion for how I can get the IDs, or could you maybe post your all_ids.json file (as you're also scraping DJT's posts)? That would be so great…
I was looking over your code. it looks great but I had a question about what you're actually doing:
all_ids = ids + json.load(f)
data_to_write = list(set(all_ids))
print('tweets found on this scrape: ', len(ids))
print('total tweet count: ', len(data_to_write))
When I run the code, my output summary is below:
tweets found on this scrape: 1349
total tweet count: 6098
But then what does this mean? As the tweets are being scraped from the Twitter DOM, I can understand where the top number comes from. I assume there may be duplicate tweets (like if someone posted a tweet on day 1 and someone else posted a reply on day 2), but then where does the second line come from?
How does 1349 turn into 6098? Surely after removing duplicates the total tweet count should be less than 1349?
I also copied the IDs from the JSON into Excel and tried to remove the dupes (hoping it would reduce from 6098 down to 1349), but there were no duplicates :(
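If I read the snippet right, the answer is that json.load(f) pulls in every ID saved by previous runs, so the "total" line also counts old scrapes. A toy reconstruction (the numbers are made up):

```python
previous_runs = [1, 2, 3, 4]   # stands in for json.load(f) -> the old all_ids.json
ids = [3, 4, 5]                # stands in for this scrape's IDs

all_ids = ids + previous_runs
data_to_write = list(set(all_ids))  # dedupes across runs, not within this one

print('tweets found on this scrape: ', len(ids))   # 3
print('total tweet count: ', len(data_to_write))   # 5 (includes earlier runs)
```

So the total can exceed this run's count, and the JSON itself never contains duplicates, which matches what Excel showed.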
Hi there - huge thanks for this. For me it scrapes all the tweets and then says:
Traceback (most recent call last):
File "scrape.py", line 82, in <module>
except FileNotFoundError:
NameError: name 'FileNotFoundError' is not defined
I have just installed Python for the first time so I'm guessing it could be something to do with my setup?
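Likely yes: FileNotFoundError was only added in Python 3.3, so catching it under Python 2 raises exactly this NameError. A version-portable sketch of the catch (illustrative, not the repo's code):

```python
# IOError/OSError exist on both interpreters, and FileNotFoundError
# subclasses OSError in Python 3, so this works either way.
try:
    with open('definitely_missing_ids.json') as f:
        data = f.read()
except (IOError, OSError):
    data = '[]'  # fall back to an empty ID list
```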
Accounts like https://twitter.com/jackiec57 are unscrapable in search,
unless the user has logged into the browser and turned off the sensitivity filter.
Scrapy can't scrape them, as it does not allow cookies; can Selenium?
Some ideas to reduce the repetitive need to click "show" on restricted accounts:
<input name="search-settings-nsfw" checked="" class="SearchSettings-toggleInput SearchSettings-toggleInput--sensitive" type="checkbox">
<button id="settings_save" class="EdgeButton EdgeButton--primary EdgeButton--medium" type="submit" disabled="">Save changes</button>
How do I change the location that scrape.py and get_metadata.py output their files to?
When I run the scrape.py, the final JSON created is blank without any ids in it. It was working last week. Does anyone know how to solve it?
Wouldn't this script be a lot faster if it loaded chunks of weeks or months rather than of single days?
e.g.
Current behaviour:
Potentially better behaviour:
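Agreed that it could be. As a sketch of the idea (not the repo's code), the daily loop could emit one search window per week instead:

```python
from datetime import date, timedelta

def weekly_windows(start, end):
    """Yield (since, until) pairs covering [start, end) one week at a time."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=7), end)
        yield cur, nxt
        cur = nxt

windows = list(weekly_windows(date(2017, 1, 1), date(2017, 2, 1)))
print(len(windows))  # 5 windows instead of 31 daily page loads
```

The trade-off is that a busy week may exceed what one search page load returns, so the chunk size would need tuning per account.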
The code works fine, except it doesn't show extended text. Is there a way to get the full text?
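If the metadata step uses tweepy, requesting tweet_mode='extended' and reading full_text should return untruncated text. A sketch (the helper name is mine; api would be a tweepy.API instance set up as in get_metadata.py):

```python
def lookup_full_text(api, id_batch):
    """Return untruncated texts for a batch of status IDs via extended mode."""
    tweets = api.statuses_lookup(id_batch, tweet_mode='extended')
    return [t.full_text for t in tweets]
```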
Hi all, I am very new here; this is my first time trying this script.
Everything goes fine up to creating the "all_ids.json" file.
But when I go to the next step, get_metadata.py, the command returns the following error:
python3 Desktop/twitter_scraping-master/get_metadata.py
Traceback (most recent call last):
File "Desktop/twitter_scraping-master/get_metadata.py", line 1, in <module>
import tweepy
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/__init__.py", line 17, in <module>
from tweepy.streaming import Stream, StreamListener
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tweepy/streaming.py", line 358
def _start(self, async):
^
SyntaxError: invalid syntax
I have no idea how to fix this error.
Please give me a clue what to do.
Thanks a lot!
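The cause is almost certainly Python 3.7: async became a hard keyword there, so tweepy releases older than 3.7 fail to import with exactly this SyntaxError, and upgrading tweepy (pip3 install --upgrade tweepy) should fix it. A quick check of the keyword change:

```python
import keyword

# On Python 3.7+, `async` is a reserved keyword, so the old
# `def _start(self, async):` signature in tweepy/streaming.py cannot parse.
print(keyword.iskeyword('async'))  # True on 3.7+
```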
Hi!
Would you add a license to this?