
scweet's Issues

Thanks for your work!

I want to know the meaning of the to_account and from_account parameters.
Also, how can I get a Twitter user's tweets, including the text and the links of any pictures and videos?
Thanks again!

IndexError: list index out of range

Hi, I am trying to run the example script:
from scweet import scrap
from user import get_user_information, get_following, get_followers

following, followers, join_date, location, website, description = get_user_information("realDonaldTrump")

print('following:'+following+'\nfollowers: '+followers+'\njoin_date: '+join_date+ '\nlocation: ' +location + '\nwebsite: '+website + '\ndescription: '+description)

And I am facing the following issue:

Traceback (most recent call last):
File "user_data.py", line 4, in
following, followers, join_date, location, website, description = get_user_information("realDonaldTrump")
File "D:\Twitter\Scweet\Scweet\user.py", line 8, in get_user_information
driver = utils.init_driver(headless=headless)
File "D:\Twitter\Scweet\Scweet\utils.py", line 187, in init_driver
chromedriver_path = chromedriver_autoinstaller.install()
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller_init_.py", line 15, in install
chromedriver_filepath = utils.download_chromedriver(cwd)
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\utils.py", line 166, in download_chromedriver
chrome_version = get_chrome_version()
File "D:\Twitter\Scweet\scweet-venv\lib\site-packages\chromedriver_autoinstaller\utils.py", line 117, in get_chrome_version
version = process.communicate()[0].decode('UTF-8').strip().split()[-1]
IndexError: list index out of range

Please help!! Thank you.
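A workaround sketch (not Scweet's code): the IndexError means chromedriver_autoinstaller could not detect an installed Chrome version on this machine. A first check is whether Chrome itself is installed and detectable; if autodetection keeps failing, Selenium can be pointed at a manually downloaded driver. The fallback path below is an example, not a real location.

from chromedriver_autoinstaller import utils
from selenium import webdriver

# If this prints an empty value or raises, Chrome is missing or not detectable.
print("Detected Chrome version:", utils.get_chrome_version())

# Manual fallback: download the chromedriver matching your Chrome version
# and point Selenium at it directly (Selenium 3.x style, as used here).
driver = webdriver.Chrome(executable_path=r"D:\Twitter\drivers\chromedriver.exe")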

Search or Get Following/Followers by User ID

This is one of the features https://github.com/rishi-raj-jain/twitterUsernamefromUserID provides.
I would hope that @rishi-raj-jain can help this library out by integrating the username-to-id feature into this library.

Utility:

  • some datasets provide users as numeric IDs instead of handles, because IDs are stable under frequent name and avatar changes.
  • conversely, Scweet currently only scrapes users by handle, and cannot save data keyed by ID for stability.

NoSuchElementException on get_users_following()

I'm trying to run the Example.ipynb file and I'm encountering this error once I try to run the get_users_following code cell.

[screenshot of the NoSuchElementException traceback]

I've adjusted my .env file to reflect an active Twitter account already. Any ideas on what I should debug? I'm running this on Python 3.9 on a Jupyter notebook.
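A debugging sketch (standard Selenium waits, not Scweet's API): wait explicitly for the element before giving up. If the wait times out, the likely causes are a failed login or Twitter serving a different page layout than the XPaths expect.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)  # driver as returned by Scweet's init_driver
elem = wait.until(EC.presence_of_element_located(
    (By.XPATH, '//a[contains(@href, "/following")]')))  # example selector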

Not collecting some fields in the CSV

Hi, thank you very much for this exceptional code!!

I just found a little problem, can you help me to understand if I'm missing something?

In the output CSV, I found all the information except for the following fields (see the sketch after this list):

  • 'Emojis' : emojis existing in tweet
  • 'Comments' : number of comments
  • 'Likes' : number of likes
  • 'Retweets' : number of retweets
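A sketch of how these fields could be read, assuming Twitter's then-current markup exposed the counters via data-testid attributes with the number in aria-label (an assumption about the DOM, and exactly the kind of selector that silently breaks when Twitter changes its frontend, which would explain the empty columns):

def get_count(card, testid):
    # card: a tweet element as in Scweet's utils; testid: 'reply', 'retweet' or 'like'
    try:
        el = card.find_element_by_xpath('.//div[@data-testid="{}"]'.format(testid))
        return el.get_attribute('aria-label').split()[0]  # e.g. "42 Likes" -> "42"
    except Exception:
        return ''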

Utils are unusable

Since there is no package install step, I put the whole Scweet folder in the main directory of the ipynb.
following = get_users_following(users=temp, verbose=0, headless = True, wait=1)

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-9-xxxxxxxxxxxx> in <module>
----> 1 from Scweet.scweet import scrap

~\{BLAH}\Scweet\scweet.py in <module>
      5 import pandas as pd
      6 
----> 7 from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
      8 
      9 

ModuleNotFoundError: No module named 'utils'
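A quick workaround sketch (the relative-import fix in the next issue is the proper solution): put the inner Scweet folder on sys.path before importing, so its absolute import utils resolves.

import sys
sys.path.insert(0, './Scweet/Scweet')  # the folder containing scweet.py and utils.py (adjust)

from scweet import scrap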

ModuleNotFoundError: No module named 'utils'. Use import by relative path to fix?

I cloned the repo into the directory as structured.

[screenshot of the directory structure]

In my main.py I imported scrap, but it throws an error.

File "C:\Users\User\Projects\twitter-scrape\Scweet\Scweet\scweet.py", line 7, in <module>
    from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images
ModuleNotFoundError: No module named 'utils'

The code on that line is

from utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images

If I edit it to a relative import like this, it fixes the issue:

from .utils import init_driver, get_last_date_from_csv, log_search_page, keep_scroling, dowload_images

Similarly, in utils.py, change import const to from . import const.

Some questions

Thank you so much for this repository! I have spent nearly two weeks researching how to crawl tweets with replies, but every repository I tried, like TWINT, didn't work.
Do you know TWINT? I'm a developer from China. Even after setting up a proxy, TWINT still keeps reporting errors:

WARNING:root:Error retrieving https://twitter.com/: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='twitter.com', port=443): Read timed out. (read timeout=10)",),), retrying

I saw you said that Twitter has blocked all crawlers during this period. Is Twitter unusable for this reason, or is it because I am in China and set up my proxy incorrectly?

Twitter search not working via Scweet with Selenium

My Code:

from scweet import scrap
from user import get_user_information, get_users_following, get_users_followers

data = scrap(words = 'world', start_date="2020-02-07", max_date="2020-02-09",interval=1, limit=10,
      headless=False, proxy="socks5://127.0.0.1:7891")

Running the code in non-headless mode, I can see an error when searching:
[screenshot of Twitter's search error page]

But if I run the query from https://twitter.com/search manually, everything works fine:

[screenshot of the same query succeeding in a normal browser]

Found 0 following when crawling following information

Hi, thanks for your excellent tool! I've run into a problem and want to know how to solve it. When I crawl users' following lists, after getting info for 5 accounts I only get "Found 0 following" for the remaining users, whose following lists are actually not empty. I set the "wait" param to 3, and I even waited 20s after crawling several users. The username and password work normally on the Twitter website. How can I solve this?
Thank you so much!
[screenshot of the console output]

Not collecting last batch of tweets

Hi Altimis, I've been using your tool and it's working well, although I just noticed 2 issues:

Issue 1: The last tweet collected was the reply to the chain of tweets from the user account I am scraping, rather than the chain of tweets itself. Some images for clarity:

Twitter user with last tweet:
[screenshot of the Twitter user account]

Output from scraping:
[screenshot of the output]

Actual tweet chain I was looking for:
[screenshot 1]
[screenshot 2]

Issue 2: I also noticed that a large chunk of tweets was not scraped from mid-November to the beginning of December. Not sure why, as there were tweets available.

No scraping

Here is the code input:
[screenshot of the code input]

Any thoughts?

Can't get code to execute in Notebook / PyCharm

Hi Altimis, thanks a stack for putting this together; I think I will get really good use out of this if I can get it to work. I fear my lack of experience is probably half the problem, but when I copy your code into Jupyter Lab and change the run line (data = scrap(....)) to the following:

data=scrap(max_date=2020-1-5,start_date=2020-1-1, to_account="tesla", interval=1,limit=10)

(I've also tried the format 2020-01-05.)

I get this error:

"usage: ipykernel_launcher.py [-h] [--words WORDS]
[--from_account FROM_ACCOUNT]
[--to_account TO_ACCOUNT] --max_date MAX_DATE
--start_date START_DATE [--interval INTERVAL]
[--navig NAVIG] [--lang LANG]
[--headless HEADLESS] [--limit LIMIT]
[--display_type DISPLAY_TYPE] [--resume RESUME]
[--proxy PROXY]
ipykernel_launcher.py: error: the following arguments are required: --max_date, --start_date
An exception has occurred, use %tb to see the full traceback."

I assume this means I can't run it from a notebook?
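Two things are likely going on here (a hedged reading, with a sketch): scweet.py parses command line arguments when it runs as a script, which is what trips ipykernel inside a notebook, and the unquoted dates in the call above are evaluated by Python as integer subtraction (2020-1-5 is just 2014). Calling the function directly with string dates avoids both:

from Scweet.scweet import scrap

data = scrap(start_date="2020-01-01", max_date="2020-01-05", to_account="tesla",
             interval=1, limit=10, headless=True)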

Secondly, the imports import datetime, Keys, and from selenium.common.exceptions import NoSuchElementException don't seem to exist? When using PyCharm they just go grey.

Finally, if it's not too much to ask, could you add a few lines to the readme with some more examples? I used your exact example in the Python terminal and couldn't get it to execute.

I see it working for others in the issues commentary, so thanks for the code, and I appreciate the work given the loss of old-school scrapers.

No Module named scweet

Hello, thank you very much for the repository!
I'm having an issue with importing scweet.
I've successfully installed it, but the module is not found.
Do you have any idea how to solve this problem?
Thank you for your help!

[screenshots of the failed import]

max retrieve 1000 tweets

Hi @Altimis. I've tested Scweet several times on my machine (Linux), but I only ever get 1000 tweets. I think this is a limitation from Twitter, though I don't know whether the same happens on other OSes.
Maybe it needs testing on Windows/Mac, and the description should note that Scweet retrieves at most 1000 tweets due to (I think) a Twitter limitation.

My query:
python3 scweet.py --words "covid19//covid-19" --lang="id" --max_date 2020-03-19 --start_date 2020-03-18 --interval 1 --navig chrome --display_type Latest --headless True

--hashtag is missing from the command line argument

It seems to me that there is a small omission: while --hashtag is used as a command line argument in the README file, it is not parsed in the file scweet.py.

parser.add_argument('--hashtag', type=str, help='Hashtag', default=None)
hashtag = args.hashtag

These lines should be added and the value used in the call of scrap(), as in the sketch below.
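A sketch of the full patch (parser and argument names follow the existing ones in scweet.py; scrap() already accepts a hashtag parameter, as other issues here show):

parser.add_argument('--hashtag', type=str, help='Hashtag', default=None)
args = parser.parse_args()

data = scrap(start_date=args.start_date, max_date=args.max_date, words=args.words,
             hashtag=args.hashtag, interval=args.interval, headless=args.headless,
             display_type=args.display_type, lang=args.lang, limit=args.limit)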

.env credentials format

Hi Altimis, thank you for your effort!

The script runs smoothly on my machine. The only exception is the get_users_following and get_users_followers functions.

I created a .env file in the main directory and inserted my username and password, but it still couldn't return a user's followers or following.

I am not sure which format for the credentials I should use in the .env file. Can you help me?

Thanks!
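A quick sanity-check sketch: the usual .env format is one KEY=value pair per line, with no quotes and no spaces around the =. The exact variable names Scweet expects are an assumption here; check the repo's example .env for the canonical keys.

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current working directory
print(os.getenv("SCWEET_USERNAME"))  # should print your username, not None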


problem in following

Hi, thank you very much for your tool.

Unfortunately, I cannot scrape following.
This is my error:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[contains(@href,"/following")]/span[1]/span[1]"}
(Session info: headless chrome=88.0.4324.150)

best regards

display_type depends on lang

Related to #35

If you set lang to something other than "en", driver.find_element_by_link_text(display_type).click() won't work, right? Because the text displayed for Latest and Top will be different.
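A locale-independent alternative sketch (the tablist markup is an assumption about Twitter's DOM at the time, not a stable contract):

# Select the results tab by position instead of by its localized label.
tabs = driver.find_elements_by_xpath('//div[@role="tablist"]//a')
if len(tabs) > 1:
    tabs[1].click()  # assumed ordering: 0 = Top, 1 = Latest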

Error when running scweet

Hi,

I have an error with the Selenium driver when I run the app. Any idea? Thanks a lot for your help.

pip3 install -r requirements.txt
pip3 install selenium
Requirement already satisfied: selenium in /usr/local/lib/python3.7/dist-packages (3.141.0)
Requirement already satisfied: urllib3 in /usr/lib/python3/dist-packages (from selenium) (1.24.1)

python scweet.py --words "excellente//car" --to_account "tesla" --max_date 2020-01-05 --start_date 2020-01-01 --limit 10 --interval 1 --navig chrome --display_type Latest --lang="en" --headless True
Traceback (most recent call last):
File "scweet.py", line 5, in
from selenium.webdriver.common.keys import Keys
ImportError: No module named selenium.webdriver.common.keys
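A likely cause (an assumption, since pip3 reports selenium installed yet the script cannot import it): the bare python command resolves to Python 2 while the packages were installed into Python 3 — the Python 2-style "ImportError: No module named ..." wording, rather than Python 3's ModuleNotFoundError, points the same way. Running the script with the matching interpreter should fix it:

python3 scweet.py --words "excellente//car" --to_account "tesla" --max_date 2020-01-05 --start_date 2020-01-01 --limit 10 --interval 1 --navig chrome --display_type Latest --lang="en" --headless True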

User profile tweets not captured

Hello,

Thank you for the tool. I am a research student working with Twitter data. For my research, I am trying to scrape data from user profiles. Unfortunately, because Scweet pulls data from the search tab instead of the actual user profile, I am only able to capture 1/10th of the tweets. Is there some way to make it scrape data from the user profile itself?

Thank you in advance!

Argument to exclude embedded/reply text

text = comment + responding

Currently, this line concatenates the tweet text with the embedded text in the tweet. This happens for replies, which could be useful, but also for all embedded text from linked websites/videos/etc.

We are using this tool to scrape Swedish data, and I found it useful to change this line to not include the embedded texts, since they are often in English.
I would suggest an argument to Scweet to enable/disable the reply/embedded text, as in the sketch below.
Or maybe add the reply text in a separate column in the CSV file, for easier processing?
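A sketch of what the suggested option could look like (the parameter name include_embedded is hypothetical, not Scweet's API):

if include_embedded:
    text = comment + responding
else:
    text = comment
# or: keep `responding` as its own value and write it to a separate CSV column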

Thank you for this tool!

The best way to handle Twitter's search error?

data = scrap(words = 'fosun pharma', start_date="2020-01-08", max_date="2020-12-31",interval=1,
      headless=False, proxy="socks5://127.0.0.1:7891")

I'm using this code to query tweets about Fosun Pharma during the year 2020.

The operation works amazingly, but past some threshold, like after a consistent query of 120 days, Twitter refuses to return results and shows a search error.

[screenshot of Twitter's search error]

I split my query into 4-month chunks to keep things working. I'm wondering whether there is a better way to handle this situation and recover from Twitter's query error.
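One approach sketch (assuming scrap() returns a DataFrame, as the examples above suggest): split the range into chunks and retry a chunk with a backoff pause when it comes back empty.

import time
import pandas as pd
from Scweet.scweet import scrap

chunks = [("2020-01-08", "2020-04-30"), ("2020-05-01", "2020-08-31"),
          ("2020-09-01", "2020-12-31")]
frames = []
for start, end in chunks:
    for attempt in range(3):
        df = scrap(words='fosun pharma', start_date=start, max_date=end,
                   interval=1, headless=True)
        if df is not None and len(df) > 0:
            frames.append(df)
            break
        time.sleep(60 * (attempt + 1))  # back off before retrying this chunk
data = pd.concat(frames, ignore_index=True)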

Not an issue but thanks!

Altimis, I finally got everything working. Amazing work; thank you for building this out, it will make my analysis tenfold easier. I am using the CMD control as per your original usage. One issue I encountered along the way:

"sanity check failed on NUMPY" was due to the numpy package having issues with the OS check, so I jumped one version back and it worked like a dream.

Great work!

didn't get all of the tweets

Thank you so much for this repository! But I've run into a problem.
I tried to run scrap(start_date="2020-01-11", max_date="2020-01-16", words="tesla", interval=1, headless=True, display_type="Top", hashtag=None).
There are thousands of tweets about Tesla each day, but I got only 360+ tweets from "2020-01-11" to "2020-01-16".
Could you help me please? Thanks a lot!

Drivers Issue in Ubuntu

How can a newbie run it on Ubuntu? It shows the following error there (complete traceback):

Traceback (most recent call last):
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './drivers/chromedriver.exe': './drivers/chromedriver.exe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "twitter_Scraper.py", line 314, in
data=scrap(words, to_account, from_account, start_date,max_date,interval,navig,lang, headless, limit,display_type,resume)
File "twitter_Scraper.py", line 162, in scrap
driver=init_driver(navig, headless)
File "twitter_Scraper.py", line 127, in init_driver
driver = webdriver.Chrome(options=options,executable_path=browser_path)
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in init
self.service.start()
File "/home/sami/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

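A sketch of the underlying problem: './drivers/chromedriver.exe' is a Windows binary, so on Ubuntu Selenium needs a Linux chromedriver instead (the path below is an example; adjust to wherever you placed the downloaded driver):

from selenium import webdriver

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')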

"from_account=" function returning tweets posted by other accounts

Thanks for developing this very useful scraper!
I used the from_account parameter but still received tweets posted by other accounts replying to that account. How can I fix this?
Also, I am trying to scrape only the tweets posted by specific accounts, not their responses to others. Is there a way to filter those out?

Thank you for your help!
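Until there is a built-in filter, a post-hoc sketch (the 'Handle' column name is taken from the schema mentioned in another issue here; verify against your own CSV header):

own_tweets = data[data["Handle"].str.casefold() == "@tesla".casefold()]  # example account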

Request add new feature

Change the data features

  • UserName -> UserScreenName
  • Handle -> UserName

The below may be useful for some users. Add a feature that can retrieve this from tweets:

  • tweet-url

User information

  • date user joined
  • user no. of following
  • user no. of followers
  • user is verified?

I found a bug

When you search followers or following, you should use a set to store the results, not a list, or you will get wrong results, because some entries are repeated.

Just change follows_elem = [] to follows_elem = set(); maybe you can find other solutions. A sketch follows.
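A sketch of the suggested fix (variable names follow the issue; the XPath variable is a stand-in for whatever user.py actually uses):

follows_elem = set()
for elem in driver.find_elements_by_xpath(follow_xpath):  # follow_xpath: hypothetical
    follows_elem.add(elem.get_attribute('href'))  # a set silently drops repeats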

Experimental: Recursive Following (AKA Friend) Tree Function

As demonstrated in the repos below, recursive graph mining of Twitter following (AKA friends) is doable with the API.

However, doing this without the API may be tricky, as it needs a breadth-first search function that prevents repeated scraping.
As demonstrated in the bottom links, three degrees of following/friends is enough to analyze the community.
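A sketch of such a traversal (assuming, per the examples in this tracker, that get_users_following returns a dict mapping each input user to a list of handles):

from collections import deque
from Scweet.user import get_users_following

def following_tree(seed, depth=3):
    visited, edges = {seed}, []
    queue = deque([(seed, 0)])
    while queue:
        user, d = queue.popleft()
        if d >= depth:
            continue
        found = get_users_following(users=[user], verbose=0, headless=True, wait=2)
        for friend in found.get(user, []):
            edges.append((user, friend))
            if friend not in visited:  # the visited set prevents re-scraping
                visited.add(friend)
                queue.append((friend, d + 1))
    return edges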

Add "location" feature

Hi,

Thank you so much for the wonderful work! Your code is really good to use. Would you add the tweet's "location" as a feature in the final csv file? That would be a great help. Thanks again!

Followings and followers.

Hi, I've been using twint to fetch followers, following, and favorites, but it's not working there. Are you also working on these?

Scweet doesn't work now (but I used it weeks ago)

Hi,

When I try to use Scweet to scrape today, it doesn't work, although it always worked before, at least a few weeks ago. I also tried the example Jupyter notebook, which doesn't work either. No matter what query I use, the output is simply scroll 0, scroll 1, and then it moves on to the next date. Nothing is stored in the data object.

I'm wondering whether Scweet is now somehow blocked by Twitter, or whether anyone has an idea about what is happening.

Thanks!

Username case sensitive

Is it possible to remove case sensitivity in the user search? I tried using .casefold() in user.py but didn't get very far.

For example, searching for the username 'user' won't work if the actual Twitter handle has different casing, such as @User. I have a large database of usernames, but it is all in lowercase.

Thank you for your time.
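A sketch (Twitter handles are case-insensitive, so normalizing both sides is safe; the helper below is hypothetical, not user.py's code):

def same_handle(a, b):
    return a.lstrip('@').casefold() == b.lstrip('@').casefold()

same_handle('@User', 'user')  # True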

Cannot run the module

mkdir: cannot create directory '/run/user/0': Permission denied
Scraping on headless mode.
Traceback (most recent call last):
File "scweet.py", line 155, in
data = scrap(start_date, max_date, words, to_account, from_account, interval, navig, lang, headless, limit,
File "scweet.py", line 22, in scrap
driver = init_driver(navig, headless, proxy)
File "/home/wew/Pictures/Scweet/Scweet/utils.py", line 201, in init_driver
driver = webdriver.Chrome(options=options, executable_path=chromedriver_path)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/chrome/webdriver.py", line 76, in init
RemoteWebDriver.init(
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in init
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /snap/bin/chromium is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
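A commonly suggested workaround sketch for this DevToolsActivePort crash (standard Selenium/Chrome flags, not Scweet-specific; running as root and snap-installed Chromium both tend to require them):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')             # needed when running Chrome as root
options.add_argument('--disable-dev-shm-usage')  # avoids /dev/shm exhaustion in containers
driver = webdriver.Chrome(options=options)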

Scrape all tweets from a specific account

Can you make it so that words is not a requirement when scraping tweets from a specific account? It would be awesome for that to be optional rather than mandatory.

Importing the module

Hi, Thank you for your effort.

I am a beginner trying to use your module in Google Colab with the example you provided, but it failed. I installed the module both by git cloning and by pip. It seems the package installed under the name 'Scweet' (capital letter), but I still cannot import scrap. Could anyone please let me know how to import it successfully? Thanks in advance.

[screenshots of the install and the failed import]
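A sketch for Colab, based on the import path shown in another issue here (the package name is capitalized and scrap lives in the scweet submodule):

# In a Colab cell, first run:  !pip install Scweet
from Scweet.scweet import scrap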
