Altimis / Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...
License: MIT License
I want to know the meaning of to_account and from_account.
Also, how can I get a Twitter user's tweets, including the text and the links of pictures and videos?
Thanks again!
I see that you visit the URL first, then click the Latest button based on display_type.
Couldn't you simply append &f=live to the URL to get the latest tweets?
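A sketch of what I mean, assuming Twitter's standard search-URL parameters (build_search_url is hypothetical, not Scweet's code):

```python
# Build the search URL directly instead of clicking the "Latest" tab.
from urllib.parse import quote

def build_search_url(words, since, until, display_type="Top"):
    query = quote(f"{words} since:{since} until:{until}")
    url = f"https://twitter.com/search?q={query}&src=typed_query"
    if display_type.lower() == "latest":
        url += "&f=live"  # &f=live selects the "Latest" tab without clicking
    return url

print(build_search_url("tesla", "2020-01-01", "2020-01-05", "Latest"))
```

This would also sidestep the language-dependent link text of the Latest/Top buttons.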
Unsure about this, but for a few accounts, some with emojis in the location field, the join date is not picked up.
@akduluneda
@x_born_to_die_x
@av_ahmet
@pabu232
Hi, I am trying to run the example script:
from scweet import scrap
from user import get_user_information, get_following, get_followers
following, followers, join_date, location, website, description = get_user_information("realDonaldTrump")
print('following:'+following+'\nfollowers: '+followers+'\njoin_date: '+join_date+ '\nlocation: ' +location + '\nwebsite: '+website + '\ndescription: '+description)
And am facing the following issue:
Traceback (most recent call last):
File "user_data.py", line 4, in <module>
following, followers, join_date, location, website, description = get_user_information("realDonaldTrump")
File "D:\Twitter\Scweet\Scweet\user.py", line 8, in get_user_information
driver = utils.init_driver(headless=headless)
File "D:\Twitter\Scweet\Scweet\utils.py", line 187, in init_driver
chromedriver_path = chromedriver_autoinstaller.install()
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\__init__.py", line 15, in install
chromedriver_filepath = utils.download_chromedriver(cwd)
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\utils.py", line 166, in download_chromedriver
chrome_version = get_chrome_version()
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\utils.py", line 117, in get_chrome_version
version = process.communicate()[0].decode('UTF-8').strip().split()[-1]
IndexError: list index out of range
Please help!! Thank you.
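In case it helps others hitting this: the IndexError means chromedriver_autoinstaller could not detect an installed Chrome version. A workaround sketch while that detection fails; the fallback path is an assumption, use wherever you downloaded the driver manually:

```python
# Fall back to a manually downloaded chromedriver when auto-detection fails.
# The manual_path default is an assumption -- point it at your own copy.
import shutil

def resolve_chromedriver(manual_path=r"D:\Twitter\drivers\chromedriver.exe"):
    # Prefer a chromedriver already on PATH, else use the manual copy.
    found = shutil.which("chromedriver")
    return found if found else manual_path
```

The resulting path can then be passed to webdriver.Chrome via executable_path, as utils.py already does with the auto-installed one.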
This is one of the features https://github.com/rishi-raj-jain/twitterUsernamefromUserID provides.
I would hope that @rishi-raj-jain can help this library out by integrating the username-to-id feature into this library.
Utility:
Hi, thank you very much for this exceptional code!!
I just found a little problem, can you help me to understand if I'm missing something?
In the output CSV, I found all the information except for the following:
Since there is no package install script, I tried to put the whole Scweet folder in the same directory as the notebook.
following = get_users_following(users=temp, verbose=0, headless = True, wait=1)
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-9-xxxxxxxxxxxx> in <module>
----> 1 from Scweet.scweet import scrap
~\{BLAH}\Scweet\scweet.py in <module>
5 import pandas as pd
6
----> 7 from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
8
9
ModuleNotFoundError: No module named 'utils'
I cloned the repo into the directory as structured. In my main.py I imported scrap, but it throws an error.
File "C:\Users\User\Projects\twitter-scrape\Scweet\Scweet\scweet.py", line 7, in <module>
from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
ModuleNotFoundError: No module named 'utils'
The code on that line is
from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
If I edit the import to be relative, as below, it fixes the issue:
from .utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
Similarly in utils.py
import const
to
from . import const
If words contains only a single word, such as words = ("John"), the query becomes (J%20OR%20o%20OR%20h%20OR%20n)%20, because ("John") is just a string in Python (not a tuple) and gets iterated character by character.
Line 150 in d540228
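To illustrate the cause, a minimal sketch (join_words is a hypothetical stand-in for the query-building at the referenced line):

```python
# Parentheses alone don't make a tuple: ("John") is the string "John",
# so joining it iterates character by character.
words_str = ("John")      # just the string "John"
words_tuple = ("John",)   # trailing comma makes a one-element tuple

def join_words(words):
    # Mimics how a word list is OR-joined into the search query.
    return "(" + " OR ".join(words) + ")"

print(join_words(words_str))    # (J OR o OR h OR n)
print(join_words(words_tuple))  # (John)
```

Guarding with a check like isinstance(words, str) and wrapping the value in a list would fix the single-word case.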
Thank you so much for this repository! I have spent nearly two weeks researching how to crawl tweets with replies, but repositories like TWINT didn't work.
Do you know TWINT? I'm a developer from China; even after setting up a proxy, TWINT keeps reporting errors:
WARNING:root:Error retrieving https://twitter.com/: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='twitter.com', port=443): Read timed out. (read timeout=10)",),), retrying
I saw you said that Twitter has blocked all crawlers during this period. Is Twitter unusable for that reason, or is it because I am in China and have set up the proxy incorrectly?
My Code:
from scweet import scrap
from user import get_user_information, get_users_following, get_users_followers
data = scrap(words = 'world', start_date="2020-02-07", max_date="2020-02-09",interval=1, limit=10,
headless=False, proxy="socks5://127.0.0.1:7891")
Running the code in non-headless mode, I can see an error when searching:
But if I run the query from https://twitter.com/search manually, everything works fine:
Hi, thanks for your excellent tool! I ran into a problem and would like to know how to solve it. When I crawl users' following lists, after getting info for 5 accounts I only get "Found 0 following" for the remaining users, whose following lists are actually not empty. I set the wait param to 3, and I even waited 20 seconds after crawling several users. The username and password work normally on the Twitter website. How can I solve this?
Thank you so much!
Hi Altimis, been using your tool and working well. Although just noticed 2 issues:
Issue 1: The last tweet collected was a reply to the chain of tweets from the user account I am scraping, rather than the chain itself. Some images for clarity:
Actual tweet chain I was looking for:
Issue 2: I also noticed a large chunk of tweets was not scraped from mid-November to the beginning of December. Not sure why, as tweets were available.
Any thoughts?
Hi
Wondering if you could help with the following. When executing the script, I get this ImportError:
File "scweet.py", line 9, in <module>
from .utils import init_driver
ImportError: attempted relative import with no known parent package
Thanks
Hi Altimis, thanks a stack for putting this together. I think I will get really good use out of this if I can get it to work. I fear my lack of experience is probably half the problem, but when I copy your code into Jupyter Lab and change the run line (data = scrap(....)) to the following:
data=scrap(max_date=2020-1-5,start_date=2020-1-1, to_account="tesla", interval=1,limit=10)
(I've also tried using the format 2020-01-05.)
I get an error:
"usage: ipykernel_launcher.py [-h] [--words WORDS]
[--from_account FROM_ACCOUNT]
[--to_account TO_ACCOUNT] --max_date MAX_DATE
--start_date START_DATE [--interval INTERVAL]
[--navig NAVIG] [--lang LANG]
[--headless HEADLESS] [--limit LIMIT]
[--display_type DISPLAY_TYPE] [--resume RESUME]
[--proxy PROXY]
ipykernel_launcher.py: error: the following arguments are required: --max_date, --start_date
An exception has occurred, use %tb to see the full traceback."
I assume this means I can't run it from a notebook?
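You can: the argparse error happens because scweet.py parses sys.argv at module level, and in a notebook sys.argv holds ipykernel's own arguments. A restructuring sketch, an assumption rather than the repo's current layout, that keeps argparse for the CLI but lets notebooks call scrap() directly:

```python
# Keep argparse inside main() so importing scrap never touches sys.argv.
import argparse

def scrap(start_date, max_date, **kwargs):
    # ... the actual scraping logic would live here ...
    return f"scraping {start_date} -> {max_date}"

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--start_date", required=True)
    parser.add_argument("--max_date", required=True)
    args = parser.parse_args(argv)  # argv=None falls back to sys.argv for CLI use
    return scrap(args.start_date, args.max_date)

# In a notebook, skip argparse entirely and pass string dates:
# data = scrap(start_date="2020-01-01", max_date="2020-01-05")
```

Note the dates must be quoted strings; an unquoted 2020-1-5 is evaluated as arithmetic in Python.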
Secondly, the imports import datetime, Keys and from selenium.common.exceptions import NoSuchElementException don't seem to be used? In PyCharm they just go grey.
Finally, if it's not too much to ask, could you add a few more lines to the readme.txt with more examples? I used your exact example in the Python terminal and couldn't get it to execute.
I see it working for others in the issues commentary so thanks for the code and appreciate the work given the loss of old school scrapers.
Hi @Altimis. I've tested Scweet several times on my machine (Linux), but I only get 1000 tweets. I think this is a limitation on Twitter's side; I don't know whether the same happens on other OSes.
Maybe it needs testing on Windows/Mac, and the description should note that Scweet retrieves at most 1000 tweets, presumably due to this Twitter limitation.
my query
python3 scweet.py --words "covid19//covid-19" --lang="id" --max_date 2020-03-19 --start_date 2020-03-18 --interval 1 --navig chrome --display_type Latest --headless True
It seems to me there is a small omission: while --hashtag is used as a command-line argument in the README file, it is not parsed in scweet.py. The following should be added, and the value passed in the call of scrap():
parser.add_argument('--hashtag', type=str, help='Hashtag', default=None)
hashtag = args.hashtag
Hi Altimis, thank you for your effort!
The script runs smoothly on my machine. The only exception is the get_users_following and get_users_followers functions.
I created a .env file in the main directory and inserted my username and password, but it still couldn't return a user's followers or following.
I am not sure which format to use for the credentials in the .env file. Can you help me?
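For what it's worth, python-dotenv expects plain KEY=VALUE lines. A sketch of the file; the variable names here are an assumption, so check which names Scweet's code actually reads (e.g. in utils.py or const.py):

```
# .env in the project root (no quotes, no spaces around =)
SCWEET_USERNAME=your_twitter_handle
SCWEET_PASSWORD=your_password
```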
Thanks!
Hi,
if the user has a birthday displayed on the profile, it is scraped instead of the join date.
Regards,
Hi, thank you very much for your tools.
Unfortunately, I cannot scrape following.
This is my error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[contains(@href,"/following")]/span[1]/span[1]"}
(Session info: headless chrome=88.0.4324.150)
best regards
Take a look at https://github.com/5hirish/tweet_scrapper#features and see if Scweet covers the information.
Also, as a side note, do some comparison with https://github.com/twintproject/twint regarding missing features.
Related to #35
If you set lang to anything other than "en", driver.find_element_by_link_text(display_type).click() won't work, right? Because the text displayed for Latest and Top will be different.
Hi,
I get an error with the Selenium driver when I run the app. Any ideas? Thanks a lot for your help.
pip3 install -r requirements.txt
pip3 install selenium
Requirement already satisfied: selenium in /usr/local/lib/python3.7/dist-packages (3.141.0)
Requirement already satisfied: urllib3 in /usr/lib/python3/dist-packages (from selenium) (1.24.1)
python scweet.py --words "excellente//car" --to_account "tesla" --max_date 2020-01-05 --start_date 2020-01-01 --limit 10 --interval 1 --navig chrome --display_type Latest --lang="en" --headless True
Traceback (most recent call last):
File "scweet.py", line 5, in <module>
from selenium.webdriver.common.keys import Keys
ImportError: No module named selenium.webdriver.common.keys
Hello,
Thank you for the tool. I am a research student working with Twitter data. For my research, I am trying to scrape data from user profiles. Unfortunately, because Scweet pulls data from the search tab instead of the actual user profile, I can only capture about a tenth of the tweets. Is there some way to make it scrape from the user profile itself?
Thank you in advance!
Line 53 in 87138a4
Currently, this line concatenates the tweet text with the embedded text in the tweet. This happens for replies, which could be useful, but also for all embedded text from linked websites/videos/etc.
We are using this tool to scrape Swedish data, and I found it useful to change this line to not include the embedded texts, since they are often in English.
I would suggest an argument to Scweet to enable/disable the reply/embedded text.
Or maybe add the reply text in a separate column in the CSV file, for easier processing?
Thank you for this tool!
data = scrap(words = 'fosun pharma', start_date="2020-01-08", max_date="2020-12-31",interval=1,
headless=False, proxy="socks5://127.0.0.1:7891")
I'm using this code to query tweets about Fosun Pharma during the year 2020.
The operation works amazingly, but past a certain threshold, such as a consistent query over about 120 days, Twitter refuses to serve results and shows a search error.
I split my query into 4-month chunks to work around this. I'm wondering if there is a better way to handle this situation and recover from Twitter's query error.
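In case it's useful, a sketch of automating that splitting with per-window retries; scrape_window is a hypothetical stand-in for a call to scrap() over one sub-range:

```python
# Split a long date range into smaller windows and retry each window,
# so one Twitter search error doesn't lose the whole run.
from datetime import date, timedelta

def date_chunks(start, end, days=30):
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield cur, nxt
        cur = nxt

def scrape_all(start, end, scrape_window, retries=3):
    results = []
    for lo, hi in date_chunks(start, end):
        for attempt in range(retries):
            try:
                results.extend(scrape_window(lo, hi))
                break
            except RuntimeError:
                # back off / rotate proxy here before retrying the window
                continue
    return results
```

Smaller windows also help because a failed window only needs to be re-scraped on its own.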
Altimis, I finally got everything working. Amazing work; thank you for building this out, it will make my analysis tenfold easier. I am using the CMD control as per your original usage. Some issues I encountered along the way:
"sanity check failed on NUMPY" was due to the numpy package having issues with the OS check, so I jumped one version back and it worked like a dream.
Great work!
Thank you so much for this repository! But I ran into a problem.
I try to run scrap(start_date="2020-01-11", max_date="2020-01-16", words = "tesla", interval=1, headless=True, display_type="Top", hashtag=None)
As you know, there are thousands of tweets about Tesla each day, but I got only 360+ tweets from "2020-01-11" to "2020-01-16".
Could you help me please? Thanks a lot!
How can a newbie run it on Ubuntu? It shows the following error there (complete traceback):
`Traceback (most recent call last):
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './drivers/chromedriver.exe': './drivers/chromedriver.exe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "twitter_Scraper.py", line 314, in <module>
data=scrap(words, to_account, from_account, start_date,max_date,interval,navig,lang, headless, limit,display_type,resume)
File "twitter_Scraper.py", line 162, in scrap
driver=init_driver(navig, headless)
File "twitter_Scraper.py", line 127, in init_driver
driver = webdriver.Chrome(options=options,executable_path=browser_path)
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
`
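For what it's worth, the path in that traceback is the Windows chromedriver binary; on Ubuntu you need the Linux chromedriver (no .exe). A small sketch of selecting the binary name per OS, assuming the same ./drivers layout from the traceback:

```python
# Pick the right chromedriver binary name for the current OS, since the
# hard-coded './drivers/chromedriver.exe' only exists on Windows.
import platform
from pathlib import Path

def driver_path(base="./drivers"):
    name = "chromedriver.exe" if platform.system() == "Windows" else "chromedriver"
    return str(Path(base) / name)
```

The Linux binary also needs to be executable (chmod +x drivers/chromedriver) before Selenium can launch it.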
Because the @RealDonaldTrump account was permanently suspended by Twitter.
Lines 39 to 40 in d540228
Thanks for developing this very useful scraper!
I used the from_account option but still received tweets posted by other accounts replying to it. How can I fix this?
Also, I am trying to scrape only the tweets posted by specific accounts, not their replies to others. Is there a way to filter those out?
Thank you for your help!
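Until there is a built-in filter, a post-processing sketch over the scraped rows; the keys "UserScreenName" and "Embedded_text" are assumptions, so check your CSV header:

```python
# Keep only original tweets authored by one account, dropping replies.
def only_original_from(rows, account):
    out = []
    for row in rows:
        handle = row["UserScreenName"].lstrip("@").casefold()
        if handle != account.casefold():
            continue  # keep only tweets authored by the target account
        if row["Embedded_text"].startswith("@"):
            continue  # drop replies (tweet text starts with a mention)
        out.append(row)
    return out
```

The same filter works on a csv.DictReader over Scweet's output file.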
Change the data features: the below may be useful for some users. Add a feature that can also retrieve user information from tweets.
I am looking for some hot-topic content on Twitter, and I want to analyze how one hot topic grows. But it seems that I can only search within the from_account I give.
I would really like a way to do a global search without the from_account limitation. Could you help me?
Since Twint already has that covered, why not make Scweet just as accessible?
Take a look at https://pypi.org/
Given that https://github.com/Altimis/Scweet/blob/master/setup.py exists, the documentation could be updated accordingly.
When you scrape followers or following, you should use a set to store the results rather than a list, or you will get wrong results, because some entries are repeated.
Just change follows_elem = [] to follows_elem = set(); maybe you can find other solutions.
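One alternative sketch: dict.fromkeys deduplicates like a set but keeps first-seen order, which a plain set would lose and which may matter for follower lists:

```python
# Dedupe scraped handles while preserving the order they were found in.
def dedupe(handles):
    return list(dict.fromkeys(handles))

print(dedupe(["@a", "@b", "@a", "@c", "@b"]))  # ['@a', '@b', '@c']
```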
As demonstrated in the repos below, recursive graph mining of Twitter following (AKA friends) is doable with the API.
However, doing this without the API may be tricky, as it needs a breadth-first search that prevents repeated scraping.
As demonstrated in the bottom links, three degrees of following/friends is enough to analyze a community.
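A sketch of such a breadth-first crawl; get_following here is a hypothetical stand-in for a scraper call:

```python
# BFS over the following graph with a visited set so no account is
# scraped twice, bounded to max_depth degrees from the seed.
from collections import deque

def crawl_following(seed, get_following, max_depth=3):
    visited = {seed}
    queue = deque([(seed, 0)])
    edges = []
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue
        for friend in get_following(user):
            edges.append((user, friend))
            if friend not in visited:  # prevents repeated scraping
                visited.add(friend)
                queue.append((friend, depth + 1))
    return edges
```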
Hi,
Thank you so much for the wonderful work! Your code is really good to use. Would you add the tweet's "location" field to the final CSV file? That would be a great help. Thanks again!
I hope Scweet can add instructions for proxy settings, so that I can use Scweet in China. Thank you!
Hi, I've been using twint to fetch followers, following, and favorites, but it's not working there anymore. Are you also working on this?
Hi,
When I try to use Scweet to scrape today, it doesn't work, although it always worked before (at least a few weeks ago). I also tried the example Jupyter notebook, which doesn't work either. No matter what query I use, the output is simply "scroll 0", "scroll 1", and then it moves on to the next date, and nothing is stored in the data object.
I'm wondering if Scweet is now somehow blocked by Twitter, or if anyone has an idea of what is happening.
Thanks!
Great library; it should be on PyPI.
Hi,
Is it possible to replicate the location function (from Twint)? Or is that not possible because of the upgrades made by Twitter?
Thank you!
Is it possible to remove case sensitivity in the user search? I tried using .casefold() in user.py but didn't get very far.
For example, searching for the username 'user' won't work if the Twitter handle is @User. I have a large database of usernames, but it is all lowercase.
Thank you for your time.
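A sketch of matching handles case-insensitively on the caller's side; find_user is hypothetical, not part of user.py:

```python
# Normalize both sides with casefold() before comparing handles.
def find_user(handle, known_handles):
    target = handle.lstrip("@").casefold()
    for h in known_handles:
        if h.lstrip("@").casefold() == target:
            return h  # return the handle as Twitter spells it
    return None

print(find_user("user", ["@Alice", "@User", "@Bob"]))  # @User
```

casefold() is preferred over lower() because it also folds characters like the German ß.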
mkdir: cannot create directory '/run/user/0': Permission denied
Scraping on headless mode.
Traceback (most recent call last):
File "scweet.py", line 155, in <module>
data = scrap(start_date, max_date, words, to_account, from_account, interval, navig, lang, headless, limit,
File "scweet.py", line 22, in scrap
driver = init_driver(navig, headless, proxy)
File "/home/wew/Pictures/Scweet/Scweet/utils.py", line 201, in init_driver
driver = webdriver.Chrome(options=options, executable_path=chromedriver_path)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/chrome/webdriver.py", line 76, in __init__
RemoteWebDriver.__init__(
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /snap/bin/chromium is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
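For reference, this DevToolsActivePort error commonly appears when Chrome runs as root or from the snap Chromium build. A sketch of the flags usually suggested for it; whether Scweet exposes a hook for extra options is an assumption to verify:

```python
# Flags often needed for headless Chrome as root or inside containers;
# pass each one to a selenium ChromeOptions via add_argument().
CHROME_FLAGS = [
    "--headless",
    "--no-sandbox",                   # required when Chrome runs as root
    "--disable-dev-shm-usage",        # avoid the tiny /dev/shm in containers
    "--remote-debugging-port=9222",   # common DevToolsActivePort workaround
]

def apply_flags(options, flags=CHROME_FLAGS):
    # options: a selenium ChromeOptions instance (anything with add_argument)
    for flag in flags:
        options.add_argument(flag)
    return options
```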
Can you make it so that words is not required when scraping tweets from a specific account? It would be awesome for that to be optional rather than mandatory.
Hi, thank you for your effort.
I am a beginner trying to use your module in Google Colab with the example you provided, but it failed. I installed the module either by git cloning or via pip. It seems the package installed under the name 'Scweet' (capital letter), but I still cannot import scrap. Could anyone let me know how to import it successfully? Thanks in advance.