Altimis / Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...
License: MIT License
I want to know the meaning of to_account and from_account.
Also, how can I get a Twitter user's tweets, including the text and the links of pictures and videos?
Thanks again!
I see that you visit the URL first, then click the Latest button based on display_type.
Couldn't you simply append &f=live to the URL to get the latest tweets?
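A sketch of what I mean, assuming Twitter's standard search-URL parameters (build_search_url is hypothetical, not Scweet's code):

```python
# Build the search URL directly instead of clicking the "Latest" tab.
from urllib.parse import quote

def build_search_url(words, since, until, display_type="Top"):
    query = quote(f"{words} since:{since} until:{until}")
    url = f"https://twitter.com/search?q={query}&src=typed_query"
    if display_type.lower() == "latest":
        url += "&f=live"  # &f=live selects the "Latest" tab without clicking
    return url

print(build_search_url("tesla", "2020-01-01", "2020-01-05", "Latest"))
```

This would also sidestep the language-dependent link text of the Latest/Top buttons.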
Unsure about this, but for a few accounts, some with emojis in the location field, the join date is not picked up.
@akduluneda
@x_born_to_die_x
@av_ahmet
@pabu232
Hi, I am trying to run the example script:
from scweet import scrap
from user import get_user_information, get_following, get_followers
following, followers, join_date, location, website, description = get_user_information("realDonaldTrump")
print('following:'+following+'\nfollowers: '+followers+'\njoin_date: '+join_date+ '\nlocation: ' +location + '\nwebsite: '+website + '\ndescription: '+description)
And am facing the following issue:
Traceback (most recent call last):
File "user_data.py", line 4, in <module>
following, followers, join_date, location, website, description = get_user_information("realDonaldTrump")
File "D:\Twitter\Scweet\Scweet\user.py", line 8, in get_user_information
driver = utils.init_driver(headless=headless)
File "D:\Twitter\Scweet\Scweet\utils.py", line 187, in init_driver
chromedriver_path = chromedriver_autoinstaller.install()
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\__init__.py", line 15, in install
chromedriver_filepath = utils.download_chromedriver(cwd)
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\utils.py", line 166, in download_chromedriver
chrome_version = get_chrome_version()
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\utils.py", line 117, in get_chrome_version
version = process.communicate()[0].decode('UTF-8').strip().split()[-1]
IndexError: list index out of range
Please help!! Thank you.
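In case it helps others hitting this: the IndexError means chromedriver_autoinstaller could not detect an installed Chrome version. A workaround sketch while that detection fails; the fallback path is an assumption, use wherever you downloaded the driver manually:

```python
# Fall back to a manually downloaded chromedriver when auto-detection fails.
# The manual_path default is an assumption -- point it at your own copy.
import shutil

def resolve_chromedriver(manual_path=r"D:\Twitter\drivers\chromedriver.exe"):
    # Prefer a chromedriver already on PATH, else use the manual copy.
    found = shutil.which("chromedriver")
    return found if found else manual_path
```

The resulting path can then be passed to webdriver.Chrome via executable_path, as utils.py already does with the auto-installed one.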
This is one of the features https://github.com/rishi-raj-jain/twitterUsernamefromUserID provides.
I would hope that @rishi-raj-jain can help this library out by integrating the username-to-id feature into this library.
Utility:
Hi, thank you very much for this exceptional code!!
I just found a little problem, can you help me to understand if I'm missing something?
In the output CSV, I found all the information except for the following:
Since there is no package install script, I tried to put the whole Scweet folder in the same directory as the notebook.
following = get_users_following(users=temp, verbose=0, headless = True, wait=1)
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-9-xxxxxxxxxxxx> in <module>
----> 1 from Scweet.scweet import scrap
~\{BLAH}\Scweet\scweet.py in <module>
5 import pandas as pd
6
----> 7 from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
8
9
ModuleNotFoundError: No module named 'utils'
I cloned the repo into the directory as structured. In my main.py I imported scrap, but it throws an error.
File "C:\Users\User\Projects\twitter-scrape\Scweet\Scweet\scweet.py", line 7, in <module>
from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
ModuleNotFoundError: No module named 'utils'
The code on that line is
from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
If I edit the import to be relative, as below, it fixes the issue:
from .utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
Similarly in utils.py
import const
to
from . import const
If words contains only a single word, such as words = ("John"), the query becomes (J%20OR%20o%20OR%20h%20OR%20n)%20, because ("John") is just a string in Python (not a tuple) and gets iterated character by character.
Line 150 in d540228
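To illustrate the cause, a minimal sketch (join_words is a hypothetical stand-in for the query-building at the referenced line):

```python
# Parentheses alone don't make a tuple: ("John") is the string "John",
# so joining it iterates character by character.
words_str = ("John")      # just the string "John"
words_tuple = ("John",)   # trailing comma makes a one-element tuple

def join_words(words):
    # Mimics how a word list is OR-joined into the search query.
    return "(" + " OR ".join(words) + ")"

print(join_words(words_str))    # (J OR o OR h OR n)
print(join_words(words_tuple))  # (John)
```

Guarding with a check like isinstance(words, str) and wrapping the value in a list would fix the single-word case.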
Thank you so much for this repository! I have spent nearly two weeks researching how to crawl tweets with replies, but repositories like TWINT didn't work.
Do you know TWINT? I'm a developer from China; even after setting up a proxy, TWINT keeps reporting errors:
WARNING:root:Error retrieving https://twitter.com/: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='twitter.com', port=443): Read timed out. (read timeout=10)",),), retrying
I saw you said that Twitter has blocked all crawlers during this period. Is Twitter unusable for that reason, or is it because I am in China and have set up the proxy incorrectly?
My Code:
from scweet import scrap
from user import get_user_information, get_users_following, get_users_followers
data = scrap(words = 'world', start_date="2020-02-07", max_date="2020-02-09",interval=1, limit=10,
headless=False, proxy="socks5://127.0.0.1:7891")
Running the code in non-headless mode, I can see an error when searching:
But if I run the query from https://twitter.com/search manually, everything works fine:
Hi, thanks for your excellent tool! I ran into a problem and would like to know how to solve it. When I crawl users' following lists, after getting info for 5 accounts I only get "Found 0 following" for the remaining users, whose following lists are actually not empty. I set the wait param to 3, and I even waited 20 seconds after crawling several users. The username and password work normally on the Twitter website. How can I solve this?
Thank you so much!
Hi Altimis, been using your tool and working well. Although just noticed 2 issues:
Issue 1: The last tweet collected was a reply to the chain of tweets from the user account I am scraping, rather than the chain itself. Some images for clarity:
Actual tweet chain I was looking for:
Issue 2: I also noticed a large chunk of tweets was not scraped from mid-November to the beginning of December. Not sure why, as tweets were available.
Any thoughts?
Hi
Wondering if you could help with the following. When executing the script, I get this ImportError:
File "scweet.py", line 9, in <module>
from .utils import init_driver
ImportError: attempted relative import with no known parent package
Thanks
Hi Altimis, thanks a stack for putting this together. I think I will get really good use out of this if I can get it to work. I fear my lack of experience is probably half the problem, but when I copy your code into Jupyter Lab and change the run line (data = scrap(....)) to the following:
data=scrap(max_date=2020-1-5,start_date=2020-1-1, to_account="tesla", interval=1,limit=10)
(I've also tried using the format 2020-01-05.)
I get an error:
"usage: ipykernel_launcher.py [-h] [--words WORDS]
[--from_account FROM_ACCOUNT]
[--to_account TO_ACCOUNT] --max_date MAX_DATE
--start_date START_DATE [--interval INTERVAL]
[--navig NAVIG] [--lang LANG]
[--headless HEADLESS] [--limit LIMIT]
[--display_type DISPLAY_TYPE] [--resume RESUME]
[--proxy PROXY]
ipykernel_launcher.py: error: the following arguments are required: --max_date, --start_date
An exception has occurred, use %tb to see the full traceback."
I assume this means I can't run it from a notebook?
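You can: the argparse error happens because scweet.py parses sys.argv at module level, and in a notebook sys.argv holds ipykernel's own arguments. A restructuring sketch, an assumption rather than the repo's current layout, that keeps argparse for the CLI but lets notebooks call scrap() directly:

```python
# Keep argparse inside main() so importing scrap never touches sys.argv.
import argparse

def scrap(start_date, max_date, **kwargs):
    # ... the actual scraping logic would live here ...
    return f"scraping {start_date} -> {max_date}"

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--start_date", required=True)
    parser.add_argument("--max_date", required=True)
    args = parser.parse_args(argv)  # argv=None falls back to sys.argv for CLI use
    return scrap(args.start_date, args.max_date)

# In a notebook, skip argparse entirely and pass string dates:
# data = scrap(start_date="2020-01-01", max_date="2020-01-05")
```

Note the dates must be quoted strings; an unquoted 2020-1-5 is evaluated as arithmetic in Python.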
Secondly, the imports import datetime, Keys and from selenium.common.exceptions import NoSuchElementException don't seem to be used? In PyCharm they just go grey.
Finally, if it's not too much to ask, could you add a few more lines to the readme.txt with more examples? I used your exact example in the Python terminal and couldn't get it to execute.
I see it working for others in the issues commentary so thanks for the code and appreciate the work given the loss of old school scrapers.
Hi @Altimis. I've tested Scweet several times on my machine (Linux), but I only get 1000 tweets. I think this is a limitation on Twitter's side; I don't know whether the same happens on other OSes.
Maybe it needs testing on Windows/Mac, and the description should note that Scweet retrieves at most 1000 tweets, presumably due to this Twitter limitation.
my query
python3 scweet.py --words "covid19//covid-19" --lang="id" --max_date 2020-03-19 --start_date 2020-03-18 --interval 1 --navig chrome --display_type Latest --headless True
It seems to me there is a small omission: while --hashtag is used as a command-line argument in the README file, it is not parsed in scweet.py. The following should be added, and the value passed in the call of scrap():
parser.add_argument('--hashtag', type=str, help='Hashtag', default=None)
hashtag = args.hashtag
Hi Altimis, thank you for your effort!
The script runs smoothly on my machine. The only exception is the get_users_following and get_users_followers functions.
I created a .env file in the main directory and inserted my username and password, but it still couldn't return a user's followers or following.
I am not sure which format to use for the credentials in the .env file. Can you help me?
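For what it's worth, python-dotenv expects plain KEY=VALUE lines. A sketch of the file; the variable names here are an assumption, so check which names Scweet's code actually reads (e.g. in utils.py or const.py):

```
# .env in the project root (no quotes, no spaces around =)
SCWEET_USERNAME=your_twitter_handle
SCWEET_PASSWORD=your_password
```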
Thanks!
Hi,
if the user has a birthday displayed on the profile, it is scraped instead of the join date.
Regards,
Hi, thank you very much for your tools.
Unfortunately, I cannot scrape following.
This is my error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[contains(@href,"/following")]/span[1]/span[1]"}
(Session info: headless chrome=88.0.4324.150)
best regards
Take a look at https://github.com/5hirish/tweet_scrapper#features and see if Scweet covers the information.
Also, as a side note, do some comparison with https://github.com/twintproject/twint regarding missing features.
Related to #35
If you set lang to anything other than "en", driver.find_element_by_link_text(display_type).click() won't work, right? Because the text displayed for Latest and Top will be different.
Hi,
I get an error with the Selenium driver when I run the app. Any ideas? Thanks a lot for your help.
pip3 install -r requirements.txt
pip3 install selenium
Requirement already satisfied: selenium in /usr/local/lib/python3.7/dist-packages (3.141.0)
Requirement already satisfied: urllib3 in /usr/lib/python3/dist-packages (from selenium) (1.24.1)
python scweet.py --words "excellente//car" --to_account "tesla" --max_date 2020-01-05 --start_date 2020-01-01 --limit 10 --interval 1 --navig chrome --display_type Latest --lang="en" --headless True
Traceback (most recent call last):
File "scweet.py", line 5, in <module>
from selenium.webdriver.common.keys import Keys
ImportError: No module named selenium.webdriver.common.keys
Hello,
Thank you for the tool. I am a research student working with Twitter data. For my research, I am trying to scrape data from user profiles. Unfortunately, because Scweet pulls data from the search tab instead of the actual user profile, I can only capture about a tenth of the tweets. Is there some way to make it scrape from the user profile itself?
Thank you in advance!
Line 53 in 87138a4
Currently, this line concatenates the tweet text with the embedded text in the tweet. This happens for replies, which could be useful, but also for all embedded text from linked websites/videos/etc.
We are using this tool to scrape Swedish data, and I found it useful to change this line to not include the embedded texts, since they are often in English.
I would suggest an argument to Scweet to enable/disable the reply/embedded text.
Or maybe add the reply text in a separate column in the CSV file, for easier processing?
Thank you for this tool!
data = scrap(words = 'fosun pharma', start_date="2020-01-08", max_date="2020-12-31",interval=1,
headless=False, proxy="socks5://127.0.0.1:7891")
I'm using this code to query tweets about Fosun Pharma during the year 2020.
The operation works amazingly, but past a certain threshold, such as a consistent query over about 120 days, Twitter refuses to serve results and shows a search error.
I split my query into 4-month chunks to work around this. I'm wondering if there is a better way to handle this situation and recover from Twitter's query error.
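In case it's useful, a sketch of automating that splitting with per-window retries; scrape_window is a hypothetical stand-in for a call to scrap() over one sub-range:

```python
# Split a long date range into smaller windows and retry each window,
# so one Twitter search error doesn't lose the whole run.
from datetime import date, timedelta

def date_chunks(start, end, days=30):
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield cur, nxt
        cur = nxt

def scrape_all(start, end, scrape_window, retries=3):
    results = []
    for lo, hi in date_chunks(start, end):
        for attempt in range(retries):
            try:
                results.extend(scrape_window(lo, hi))
                break
            except RuntimeError:
                # back off / rotate proxy here before retrying the window
                continue
    return results
```

Smaller windows also help because a failed window only needs to be re-scraped on its own.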
Altimis, I finally got everything working. Amazing work; thank you for building this out, it will make my analysis tenfold easier. I am using the CMD control as per your original usage. Some issues I encountered along the way:
"sanity check failed on NUMPY" was due to the numpy package having issues with the OS check, so I jumped one version back and it worked like a dream.
Great work!
Thank you so much for this repository! But I ran into a problem.
I try to run scrap(start_date="2020-01-11", max_date="2020-01-16", words = "tesla", interval=1, headless=True, display_type="Top", hashtag=None)
As you know, there are thousands of tweets about Tesla each day, but I got only 360+ tweets from "2020-01-11" to "2020-01-16".
Could you help me please? Thanks a lot!
How can a newbie run it on Ubuntu? It shows the following error there (complete traceback):
`Traceback (most recent call last):
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './drivers/chromedriver.exe': './drivers/chromedriver.exe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "twitter_Scraper.py", line 314, in <module>
data=scrap(words, to_account, from_account, start_date,max_date,interval,navig,lang, headless, limit,display_type,resume)
File "twitter_Scraper.py", line 162, in scrap
driver=init_driver(navig, headless)
File "twitter_Scraper.py", line 127, in init_driver
driver = webdriver.Chrome(options=options,executable_path=browser_path)
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
`
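For what it's worth, the path in that traceback is the Windows chromedriver binary; on Ubuntu you need the Linux chromedriver (no .exe). A small sketch of selecting the binary name per OS, assuming the same ./drivers layout from the traceback:

```python
# Pick the right chromedriver binary name for the current OS, since the
# hard-coded './drivers/chromedriver.exe' only exists on Windows.
import platform
from pathlib import Path

def driver_path(base="./drivers"):
    name = "chromedriver.exe" if platform.system() == "Windows" else "chromedriver"
    return str(Path(base) / name)
```

The Linux binary also needs to be executable (chmod +x drivers/chromedriver) before Selenium can launch it.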
Because the @RealDonaldTrump account was permanently suspended by Twitter.
Lines 39 to 40 in d540228
Thanks for developing this very useful scraper!
I used the from_account option but still received tweets posted by other accounts replying to it. How can I fix this?
Also, I am trying to scrape only the tweets posted by specific accounts, not their replies to others. Is there a way to filter those out?
Thank you for your help!
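Until there is a built-in filter, a post-processing sketch over the scraped rows; the keys "UserScreenName" and "Embedded_text" are assumptions, so check your CSV header:

```python
# Keep only original tweets authored by one account, dropping replies.
def only_original_from(rows, account):
    out = []
    for row in rows:
        handle = row["UserScreenName"].lstrip("@").casefold()
        if handle != account.casefold():
            continue  # keep only tweets authored by the target account
        if row["Embedded_text"].startswith("@"):
            continue  # drop replies (tweet text starts with a mention)
        out.append(row)
    return out
```

The same filter works on a csv.DictReader over Scweet's output file.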
Change the data features: the below may be useful for some users. Add a feature that can also retrieve user information from tweets.
I am looking for some hot-topic content on Twitter, and I want to analyze how one hot topic grows. But it seems that I can only search within the from_account I give.
I would really like a way to do a global search without the from_account limitation. Could you help me?
Since Twint already has that covered, why not make Scweet just as accessible?
Take a look at https://pypi.org/
Given that https://github.com/Altimis/Scweet/blob/master/setup.py exists, the documentation could be updated accordingly.
When you scrape followers or following, you should use a set to store the results rather than a list, or you will get wrong results, because some entries are repeated.
Just change follows_elem = [] to follows_elem = set(); maybe you can find other solutions.
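One alternative sketch: dict.fromkeys deduplicates like a set but keeps first-seen order, which a plain set would lose and which may matter for follower lists:

```python
# Dedupe scraped handles while preserving the order they were found in.
def dedupe(handles):
    return list(dict.fromkeys(handles))

print(dedupe(["@a", "@b", "@a", "@c", "@b"]))  # ['@a', '@b', '@c']
```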
As demonstrated in the repos below, recursive graph mining of Twitter following (AKA friends) is doable with the API.
However, doing this without the API may be tricky, as it needs a breadth-first search that prevents repeated scraping.
As demonstrated in the bottom links, three degrees of following/friends is enough to analyze a community.
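A sketch of such a breadth-first crawl; get_following here is a hypothetical stand-in for a scraper call:

```python
# BFS over the following graph with a visited set so no account is
# scraped twice, bounded to max_depth degrees from the seed.
from collections import deque

def crawl_following(seed, get_following, max_depth=3):
    visited = {seed}
    queue = deque([(seed, 0)])
    edges = []
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue
        for friend in get_following(user):
            edges.append((user, friend))
            if friend not in visited:  # prevents repeated scraping
                visited.add(friend)
                queue.append((friend, depth + 1))
    return edges
```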
Hi,
Thank you so much for the wonderful work! Your code is really good to use. Would you add the tweet's "location" field to the final CSV file? That would be a great help. Thanks again!
I hope Scweet can add instructions for proxy settings, so that I can use Scweet in China. Thank you!
Hi, I've been using twint to fetch followers, following, and favorites, but it's not working there anymore. Are you also working on this?
Hi,
When I try to use Scweet to scrape today, it doesn't work, although it always worked before (at least a few weeks ago). I also tried the example Jupyter notebook, which doesn't work either. No matter what query I use, the output is simply "scroll 0", "scroll 1", and then it moves on to the next date, and nothing is stored in the data object.
I'm wondering if Scweet is now somehow blocked by Twitter, or if anyone has an idea of what is happening.
Thanks!
Great library; it should be on PyPI.
Hi,
Is it possible to replicate the location function (from Twint)? Or is that not possible because of the upgrades made by Twitter?
Thank you!
Is it possible to remove case sensitivity in the user search? I tried using .casefold() in user.py but didn't get very far.
For example, searching for the username 'user' won't work if the Twitter handle is @User. I have a large database of usernames, but it is all lowercase.
Thank you for your time.
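A sketch of matching handles case-insensitively on the caller's side; find_user is hypothetical, not part of user.py:

```python
# Normalize both sides with casefold() before comparing handles.
def find_user(handle, known_handles):
    target = handle.lstrip("@").casefold()
    for h in known_handles:
        if h.lstrip("@").casefold() == target:
            return h  # return the handle as Twitter spells it
    return None

print(find_user("user", ["@Alice", "@User", "@Bob"]))  # @User
```

casefold() is preferred over lower() because it also folds characters like the German ß.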
mkdir: cannot create directory '/run/user/0': Permission denied
Scraping on headless mode.
Traceback (most recent call last):
File "scweet.py", line 155, in <module>
data = scrap(start_date, max_date, words, to_account, from_account, interval, navig, lang, headless, limit,
File "scweet.py", line 22, in scrap
driver = init_driver(navig, headless, proxy)
File "/home/wew/Pictures/Scweet/Scweet/utils.py", line 201, in init_driver
driver = webdriver.Chrome(options=options, executable_path=chromedriver_path)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/chrome/webdriver.py", line 76, in __init__
RemoteWebDriver.__init__(
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /snap/bin/chromium is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
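For reference, this DevToolsActivePort error commonly appears when Chrome runs as root or from the snap Chromium build. A sketch of the flags usually suggested for it; whether Scweet exposes a hook for extra options is an assumption to verify:

```python
# Flags often needed for headless Chrome as root or inside containers;
# pass each one to a selenium ChromeOptions via add_argument().
CHROME_FLAGS = [
    "--headless",
    "--no-sandbox",                   # required when Chrome runs as root
    "--disable-dev-shm-usage",        # avoid the tiny /dev/shm in containers
    "--remote-debugging-port=9222",   # common DevToolsActivePort workaround
]

def apply_flags(options, flags=CHROME_FLAGS):
    # options: a selenium ChromeOptions instance (anything with add_argument)
    for flag in flags:
        options.add_argument(flag)
    return options
```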
Can you make it so that words is not required when scraping tweets from a specific account? It would be awesome for that to be optional rather than mandatory.
Hi, thank you for your effort.
I am a beginner trying to use your module in Google Colab with the example you provided, but it failed. I installed the module either by git cloning or via pip. It seems the package installed under the name 'Scweet' (capital letter), but I still cannot import scrap. Could anyone let me know how to import it successfully? Thanks in advance.