bisguzar / twitter-scraper
Scrape the Twitter Frontend API without authentication.
License: MIT License
/Users/zachadams/Desktop/scraper/venv/bin/Python /Users/zachadams/Desktop/scraper/scape.py
Traceback (most recent call last):
File "/Users/zachadams/Desktop/scraper/scape.py", line 10, in <module>
trends = get_trends()
File "/Users/zachadams/Desktop/scraper/venv/lib/python3.7/site-packages/twitter_scraper-0.4.0-py3.7.egg/twitter_scraper/modules/trends.py", line 9, in get_trends
File "/Users/zachadams/Desktop/scraper/venv/lib/python3.7/site-packages/requests-2.23.0-py3.7.egg/requests/models.py", line 898, in json
return complexjson.loads(self.text, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I could install the library on Python 3.5, but it doesn't work there; I know support starts at 3.6+.
Is there any simple tweak or code change that could serve my purpose?
from twitter_scraper import get_tweets
next(get_tweets("socgaudenti"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../python3.6/site-packages/twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File ".../python3.6/site-packages/twitter_scraper.py", line 35, in gen_tweets
text = tweet.find('.tweet-text')[0].full_text
IndexError: list index out of range
It seems to be triggered by the "Tagged users" at the bottom of https://twitter.com/socgaudenti/status/975725303808086016 (I'm new to twitter)
A simple fix could be discarding those elements.
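A minimal sketch of that fix: skip page items with no `.tweet-text` match instead of indexing blindly. `FakeElement` is a hypothetical stand-in for the requests-html element that `gen_tweets` iterates over, just enough to demonstrate the guard.

```python
class FakeElement:
    """Hypothetical stand-in for a requests-html element."""
    def __init__(self, text=None):
        self._text = text

    def find(self, selector):
        # Real elements return a list of matches; empty when absent.
        return [] if self._text is None else [self]

    @property
    def full_text(self):
        return self._text


def extract_texts(tweets):
    """Yield tweet text, skipping page items (such as the "Tagged
    users" box) that have no '.tweet-text' element."""
    for tweet in tweets:
        matches = tweet.find('.tweet-text')
        if not matches:  # guard instead of matches[0] -> IndexError
            continue
        yield matches[0].full_text
```

The same one-line guard (`if not tweet.find('.tweet-text'): continue`) would slot directly into `gen_tweets`.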
When I run the basic example from your readme.md with the random user https://twitter.com/preta_6, whose first tweet is only a picture, I get this error:
File "twitterTest2.py", line 3, in <module>
for tweet in get_tweets('preta_6', pages=1):
File "/home/username/twitterToMastodon/env/lib/python3.7/site-packages/twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File "/home/username/twitterToMastodon/env/lib/python3.7/site-packages/twitter_scraper.py", line 42, in gen_tweets
replies = int(interactions[0].split(" ")[0].replace(comma, "").replace(dot,""))
ValueError: invalid literal for int() with base 10: '1\xa0256'
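The `'1\xa0256'` value is a count using a non-breaking space as the thousands separator. A sketch of a locale-agnostic parser (not the project's actual code) that keeps only digits, so commas, dots, and non-breaking spaces are all handled:

```python
import re


def parse_count(raw):
    """Parse interaction counts such as '6,967', '8.393', or
    '1\xa0256' by keeping digits only, so any locale's thousands
    separator works. Not suitable for abbreviated counts like '1.2K'."""
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else 0
```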
Simply put, sometimes you don't want retweets, only tweets made by the actual account in question. A simple flag could solve this.
I believe retweets have the `retweeted` class on the first `<div>` under the `<li>` tags.
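Until such a flag exists, retweets can be filtered after the fact using the `isRetweet` field the tweet dicts already carry (visible in the sample output elsewhere in this thread). A sketch:

```python
def own_tweets(tweets):
    """Keep only tweets authored by the account itself, dropping
    retweets via the 'isRetweet' field of each tweet dict."""
    return (t for t in tweets if not t.get('isRetweet'))
```

Usage would look like `for t in own_tweets(get_tweets('someuser', pages=2)): ...`.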
from twitter_scraper import get_tweets
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "E:\twitter\env\lib\site-packages\twitter_scraper.py", line 2, in <module>
from requests_html import Session, HTML
ImportError: cannot import name 'Session'
I manually installed requests-html through pip.
It would be great if you could get the trending hashtags for a specific location. I think that's not implemented, right?
Not sure how far you want to go to obfuscate your collection efforts here from twitter powers that be, but would be really cool to implement these things. Want me to fork and do a pull request? What is the contribution guide? Is that the direction you want to take this?
The latest version uses https://github.com/kennethreitz/requests-html instead of requests, but it's missing from setup.py.
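If it helps, a sketch of the setup.py fix (the version number and module layout here are assumptions, not the project's actual metadata):

```python
# setup.py (sketch): declare requests-html so pip installs it
# alongside the package.
from setuptools import setup

setup(
    name="twitter-scraper",
    version="0.3.0",  # hypothetical next release
    py_modules=["twitter_scraper"],
    install_requires=[
        "requests-html",  # the dependency currently missing
    ],
)
```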
(test-hxkKlP5o)
$ pipenv install twitter-scraper
Installing twitter-scraper
Collecting twitter-scraper
Using cached twitter_scraper-0.2.0-py2.py3-none-any.whl
Collecting pyquery (from twitter-scraper)
Using cached pyquery-1.4.0-py2.py3-none-any.whl
Collecting requests (from twitter-scraper)
Using cached requests-2.18.4-py2.py3-none-any.whl
Collecting lxml>=2.1 (from pyquery->twitter-scraper)
Using cached lxml-4.1.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting cssselect>0.7.9 (from pyquery->twitter-scraper)
Using cached cssselect-1.0.3-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->twitter-scraper)
Using cached chardet-3.0.4-py2.py3-none-any.whl
Collecting idna<2.7,>=2.5 (from requests->twitter-scraper)
Using cached idna-2.6-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests->twitter-scraper)
Using cached certifi-2018.1.18-py2.py3-none-any.whl
Collecting urllib3<1.23,>=1.21.1 (from requests->twitter-scraper)
Using cached urllib3-1.22-py2.py3-none-any.whl
Installing collected packages: lxml, cssselect, pyquery, chardet, idna, certifi, urllib3, requests, twitter-scraper
Successfully installed certifi-2018.1.18 chardet-3.0.4 cssselect-1.0.3 idna-2.6 lxml-4.1.1 pyquery-1.4.0 requests-2.18.4 twitter-scraper-0.2.0 urllib3-1.22
Adding twitter-scraper to Pipfile's [packages]…
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (6e2d92)!
(test-hxkKlP5o)
$ python
Python 3.6.3 (default, Feb 23 2018, 10:26:19)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from twitter_scraper import get_tweets
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/cuongnv/.local/share/virtualenvs/test-hxkKlP5o/lib/python3.6/site-packages/twitter_scraper.py", line 2, in <module>
from requests_html import Session, HTML
ModuleNotFoundError: No module named 'requests_html'
>>>
What are the basic differences between this repo and Twint's API-free scraping?
from twitter_scraper import get_tweets
for tweet in get_tweets('la_UPC', pages=1):
print(tweet['text'])
Traceback (most recent call last):
File ".../main.py", line 156, in <module>
for tweet in get_tweets('la_UPC', pages=1):
File ".../twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File ".../twitter_scraper.py", line 35, in gen_tweets
text = tweet.find('.tweet-text')[0].full_text
IndexError: list index out of range
Test user: NoDaysOffCrypto
My code:
```python
from twitter_scraper import get_tweets

for tweet in get_tweets('NoDaysOffCrypto', pages=1):
    if tweet['tweetId'] == '1209138991347326976':
        print(tweet)
```
Output:
{'tweetId': '1209138991347326976', 'isRetweet': False, 'time': datetime.datetime(2019, 12, 23, 23, 49, 34), 'text': 'The best #Cryptoassets wallet in the #cryptocurrencymarket to store your #Cryptocurrency with upgraded Security Protocols\nManage your #BTC, #ETH #XRP #TRX #XRP #DOGE with Huobi Wallet App\nHows that for a post? Lol @huobiwallet \nFor Doge: D6eSXqBiJutuUP55eXWn6ceB5DG6QS4q9 https://twitter.com/HuobiWallet/status/1204645286863200256\xa0…', 'replies': 5, 'retweets': 14, 'likes': 37, 'entries': {'hashtags': ['#Cryptoassets', '#cryptocurrencymarket', '#Cryptocurrency', '#BTC', '#ETH', '#XRP', '#TRX', '#XRP', '#DOGE', '#BTC', '#ETH', '#XRP', '#TRX', '#XRP', '#DOGE'], 'urls': [], 'photos': [], 'videos': []}}
full text https://twitter.com/NoDaysOffCrypto/status/1209138991347326976
The address D6eSXqBiJutuUP55eXWn6ceB5DG6QS4q9 in the scraped text is missing its last character, an 'x'.
This works perfectly on my computer; however, I'm trying to use it on a server (on which I do not have sudo permissions) and I can't import it. I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "twitter_scraper/__init__.py", line 1, in <module>
from .modules.profile import Profile
File "twitter_scraper/modules/profile.py", line 27
"Referer": f"https://twitter.com/{username}",
^
SyntaxError: invalid syntax
Since, as I said, I don't have sudo permissions, I have to install it from pip.
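That SyntaxError means the server's Python predates 3.6, where f-strings were introduced. One possible workaround (a sketch, with an example username, not the project's actual code) is rewriting the f-string with `str.format()`:

```python
# f-strings need Python >= 3.6; str.format() works on older versions.
# The line from profile.py rewritten without an f-string:
username = "bisguzar"  # example value
headers = {"Referer": "https://twitter.com/{}".format(username)}
```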
Example usage:
for t in get_tweets(twitter, pages=n):
...
Error:
Getting 4 tweets from ggreenwald
Traceback (most recent call last):
File "bot.py", line 37, in <module>
main()
File "bot.py", line 24, in main
tweets = combine_tweets()
File "bot.py", line 14, in combine_tweets
for t in get_tweets(twitter, pages=n):
File "/home/ben/local/bot/env/lib/python3.6/site-packages/twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File "/home/ben/local/bot/env/lib/python3.6/site-packages/twitter_scraper.py", line 57, in gen_tweets
video_id = tmp[:tmp.index('.jpg')]
ValueError: substring not found
This happens when the user does not exist or the account is private.
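A defensive sketch for that spot (the function name is mine; `tmp` is the thumbnail-URL string the library slices): use `str.find`, which returns -1 instead of raising, so missing users and private accounts yield `None` rather than a ValueError.

```python
def video_id_from_thumbnail(tmp):
    """Return the video id portion of a thumbnail URL, or None when
    '.jpg' is absent (e.g. nonexistent or private accounts), instead
    of letting str.index raise ValueError."""
    marker = tmp.find('.jpg')  # find returns -1 rather than raising
    return tmp[:marker] if marker != -1 else None
```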
Trace-back logs
Traceback (most recent call last):
File "test.py", line 10, in <module>
for tweet in get_tweets(user, pages=1):
File "/home/aldnav/Pro/twitter-scraper/twitter_scraper.py", line 35, in get_tweets
yield from gen_tweets(pages)
File "/home/aldnav/Pro/twitter-scraper/twitter_scraper.py", line 20, in gen_tweets
d = pq(r.json()['items_html'])
KeyError: 'items_html'
Traceback (most recent call last):
File "/home/aldnav/.virtualenvs/twitter-scraper-Y8df1NKw/lib/python3.6/site-packages/pyquery/pyquery.py", line 95, in fromstring
result = getattr(etree, meth)(context)
File "src/lxml/etree.pyx", line 3230, in lxml.etree.fromstring (src/lxml/etree.c:81070)
File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument (src/lxml/etree.c:121250)
File "src/lxml/parser.pxi", line 1752, in lxml.etree._parseDoc (src/lxml/etree.c:119804)
File "src/lxml/parser.pxi", line 1066, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/etree.c:113546)
File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
File "src/lxml/parser.pxi", line 709, in lxml.etree._handleParseResult (src/lxml/etree.c:109447)
File "src/lxml/parser.pxi", line 638, in lxml.etree._raiseParseError (src/lxml/etree.c:108301)
File "<string>", line 17
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 17, column 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/aldnav/Pro/twitter-scraper/twitter_scraper.py", line 20, in gen_tweets
d = pq(r.json()['items_html'])
File "/home/aldnav/.virtualenvs/twitter-scraper-Y8df1NKw/lib/python3.6/site-packages/pyquery/pyquery.py", line 255, in __init__
elements = fromstring(context, self.parser)
File "/home/aldnav/.virtualenvs/twitter-scraper-Y8df1NKw/lib/python3.6/site-packages/pyquery/pyquery.py", line 99, in fromstring
result = getattr(lxml.html, meth)(context)
File "/home/aldnav/.virtualenvs/twitter-scraper-Y8df1NKw/lib/python3.6/site-packages/lxml/html/__init__.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/home/aldnav/.virtualenvs/twitter-scraper-Y8df1NKw/lib/python3.6/site-packages/lxml/html/__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
Thanks for your nice work.
It would be nice if you could provide a new release so that the PyPI package gets upgraded as well.
Best regards
Hi,
I would like to extract likes and retweets against all tweets.
Please help...
When using the script, the tweets returned sometimes contain "pic.twitter.com" references which are not part of the original tweet.
For example, from Donald Trump, the returned tweet text of 2018-03-17 15:00 gives: "Happy #StPatricksDaypic.twitter.com/4vVsW2smhB". The pic.twitter.com reference is a data element html tag, marked as hidden. The html parser returns it however as being part of the tweet text.
How to resolve this?
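The cleaner fix would be dropping the hidden element before reading `full_text`, but as a post-processing workaround one could strip the fragment from the returned text (a sketch, using the example from this report):

```python
import re


def strip_hidden_pic_links(text):
    """Drop 'pic.twitter.com/...' fragments that are hidden in the
    rendered page but survive in the parsed tweet text."""
    return re.sub(r"\s*pic\.twitter\.com/\S+", "", text)
```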
As you commented in the source code, the same function can also be used for searching.
Please expose the internal gen_tweets() in case someone would like to use it directly.
If you like the idea but have no time for it, I am able to contribute, since I am doing this for my company.
P.S.: I see the module tweets is importing mechanicalsoup without using it.
With some tweets, the number of likes/reactions is zero.
In a previous issue the example "6,967 replies 10,161 retweets 38,094 likes" was given and a fix was made to remove the ',' character. However, in my language setting the separator character is a dot '.'.
Could this be added?
Some tweets are retweeted by users via the "Retweet with Comment" option, and we need to access the text of the retweeted tweet. An example of such a tweet is given in the screenshot below. It is not a regular retweet, and it has its own unique tweet ID. Could you add a new field to tweets that retrieves these texts when they are present?
Can we get all the tweets of a particular user, going back to 2006?
E.g:
>>> for tweet in get_tweets('realDonaldTrump', pages=1):
...     print(tweet)
...
{'text': 'Possible progress being made in talks with North Korea. For the first time in many years, a serious effort is being made by all parties concerned. The World is watching and waiting! May be false hope, but the U.S. is ready to go hard in either direction!', 'replies': 0, 'retweets': 0, 'likes': 0}
Source:
Possible progress being made in talks with North Korea. For the first time in many years, a serious effort is being made by all parties concerned. The World is watching and waiting! May be false hope, but the U.S. is ready to go hard in either direction!
6,967 replies 10,161 retweets 38,094 likes
Since the latest Twitter update, I have been encountering multiple errors. I think it's related to the new HTML on Twitter pages; the scraper probably needs to be reconfigured.
Hi
The profile function fails in some profiles. I made a few changes in the previous PR: pr
However, the errors continue. For example:
>>> profile = Profile('atifceylan')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/es/CODERLAB/twitter-scraper/twitter_scraper/modules/profile.py", line 30, in __init__
self.__parse_profile(page)
File "/home/es/CODERLAB/twitter-scraper/twitter_scraper/modules/profile.py", line 79, in __parse_profile
self.tweets_count = int(q.attrs["title"].split(' ')[0].replace(',',''))
ValueError: invalid literal for int() with base 10: '8.393'
>>>
I'll do my best.
Some accounts use special links which are displayed as buttons with images, and in the entries obtained from calling get_tweets(), these links cannot be seen in the "urls" field. The specialized link I am talking about can be seen in the image below; it is not displayed explicitly but rather wrapped in an image button.
I am interested in adding a new attribute to the Profile class that shows whether the profile is verified. I also checked what happens when a profile has no location: it returns an arbitrary value where it should return None. In addition, location hasn't been added to the dict yet.
Hi there,
This project is growing, so we need better and detailed documentation for it. I would gladly approve any PR about documentation, please create PR if you are interested.
Hi,
Can you add a method to download tweets with_replies ?
Thanks.
I set up pipenv and was trying the first approach using Markov. I got as far as installing Markovify, but then how do I run the virtualenv and start typing commands? If I type pipenv run or pipenv shell after installing, I still don't see how to get to the interpreter to type Python commands.
This is what I get when I navigate to the directory, but as you can see there are a number of activate files. When I run any of them, the terminal opens and disappears.
I'm running Python 3.7+ on Windows 10.
How can I search by tag?
Each tweet can contain entries.
Actual: no entries are extracted.
Expected: entries (links/pics/videos) are extracted.
Can I get top tweets by hashtag?
Hello, I noticed an IndexError: list index out of range error coming from this line sometimes:
text = tweet.find('.tweet-text')[0].full_text
code:
from twitter_scraper import get_tweets
for tweet in get_tweets('VuduFans', pages=1):
print(tweet['text'].encode('ascii', 'ignore').decode())
gives error:
Traceback (most recent call last):
File "./twitter_error.py", line 3, in <module>
for tweet in get_tweets('VuduFans', pages=1):
File "/usr/local/lib/python3.6/site-packages/twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File "/usr/local/lib/python3.6/site-packages/twitter_scraper.py", line 35, in gen_tweets
text = tweet.find('.tweet-text')[0].full_text
IndexError: list index out of range
I'm getting this error when I try to run:
Traceback (most recent call last):
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\pyquery\pyquery.py", line 95, in fromstring
result = getattr(etree, meth)(context)
File "src\lxml\etree.pyx", line 3213, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1764, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1126, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 710, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 17
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 17, column 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\USER\Desktop\Twitter Stock Market\current.py", line 16, in <module>
for tweet in get_tweets('trump', pages=3):
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\twitter_scraper.py", line 26, in gen_tweets
url='bunk', default_encoding='utf-8')
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_html.py", line 419, in __init__
element=PyQuery(html)('html') or PyQuery(f'<html>{html}</html>')('html'),
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\pyquery\pyquery.py", line 255, in __init__
elements = fromstring(context, self.parser)
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\pyquery\pyquery.py", line 99, in fromstring
result = getattr(lxml.html, meth)(context)
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\lxml\html\__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
Show if a profile retweets someone else's tweet.
Eg:
@kennethreitz retweeted @CommitStrip:
Why does it take such a long time? ...
For example, from my Twitter profile I want to get back only "Wrote about my workflow in making workflows", as well as the link attached to the tweet.
It would be even more awesome if you could also get back the link to the media (image/video) that is attached (embedded) to the tweet.
I really hope it is possible to do. Thank you for sharing this.
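Judging by the sample output earlier in this thread, each tweet dict already carries an 'entries' field with urls/photos/videos keys, so the attached media can be collected like this (a sketch based on that output, not an official API):

```python
def tweet_attachments(tweet):
    """Pull attached links and media out of a tweet dict's 'entries'
    field (keys as shown in the sample output: urls, photos, videos)."""
    entries = tweet.get('entries', {})
    return {key: entries.get(key, []) for key in ('urls', 'photos', 'videos')}
```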
I've picked an account that posts a lot of emojis, and the emojis are simply omitted.
Hello,
is it possible to only scrape the first tweet of a profile (not pinned/retweeted)? This would make my program faster. It would also be nice to have an option to see who is tagged in the tweet.
Thanks for your time,
Thoosje
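Since get_tweets() is a generator, stopping after the first suitable tweet is cheap even without a dedicated option. A sketch (the `isRetweet` field appears in the sample output above; pinned tweets are not distinguishable from the dict fields shown in this thread, so only retweets are filtered here):

```python
from itertools import islice


def first_own_tweet(tweets, scan_limit=40):
    """Return the first tweet that is not a retweet, or None if the
    first `scan_limit` items are all retweets."""
    for tweet in islice(tweets, scan_limit):
        if not tweet.get('isRetweet'):
            return tweet
    return None
```

Called as `first_own_tweet(get_tweets('someuser', pages=1))`, this only fetches as many items as it needs to inspect.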
Hi
related to #75
I'm interested in this project and I'm Korean.
Many people in Korea (including the president of the Republic of Korea) enjoy Twitter, so it might be helpful to translate the documents into Korean.
Is it okay if I create a new translate-ko docs directory, make a translated version, and then open a PR?
So, I checked the get_tweets function with the default pages argument of 25 and different hashtags, and as output I got just 25 identical pages of tweets. I did this:
from twitter_scraper import get_tweets
for tweet in get_tweets('#brexit'):
print(tweet['text'], tweet['time'])
When I define the number of pages, it doesn't really change anything.
I am fetching account information from 10 different sources, and only one of the accounts shows wrong information about the number of followers and followings: the numbers are swapped. I have tried different numbers of sources, but there is one specific source giving wrong information, "Breaking911". The account has 0 followings and 570k followers, yet these numbers are displayed as 1 follower and 570k followings. I believe there could be a bug for accounts with 0 followings.
Is it possible to retrieve a list with all followers of a profile?
Hi,
I tried the example in your README file (USAGE), but I get the following error:
ValueError: invalid literal for int() with base 10: '2\xa0571'
Could you help me fix it?
Thanks
script:
from twitter_scraper import get_tweets
for twitter_acct in [ "xbox", "Minecraft", "VuduFans", "minecraftearth" ]:
for tweet in get_tweets(twitter_acct, pages=1):
print(tweet['text'])
causes exception:
Traceback (most recent call last):
File "test.py", line 3, in <module>
for tweet in get_tweets(twitter_acct, pages=1):
File "/usr/local/lib/python3.6/site-packages/twitter_scraper.py", line 78, in get_tweets
yield from gen_tweets(pages)
File "/usr/local/lib/python3.6/site-packages/twitter_scraper.py", line 35, in gen_tweets
text = tweet.find('.tweet-text')[0].full_text
IndexError: list index out of range
Hopefully you can reproduce it on your side