caolvchong-top / twitter_download

Twitter image/video crawler; one-click download
There's a very handy extension, Twitter Media Downloader (version 0.1.5.4), which can batch-download media incrementally, but I don't want to open every user's page each day and press download XD, so I'm hoping you'll take a look~ https://chrome.google.com/webstore/detail/twitter-media-downloader/cblpjenafgeohmnjknfhpdbdljfkndig
Suggestion: start filenames with the time of the (user's) action on Twitter.
Videos/images crawled from retweets get filenames beginning with "retweet", so the downloaded files can't be browsed in chronological order.
(I don't know whether highlights and likes behave the same way; I couldn't test before my API access got rate-limited.)
Which parameters exactly are needed? Please let me know, ideally with an example.
It seems text-only tweets can't be fetched. What would I need to change?
I don't know Python; I opened the file in IDLE and ran it with F5, and it immediately threw this error:
Traceback (most recent call last):
  File "C:\Users\Administrator\Downloads\main.py", line 4, in <module>
    import httpx
ModuleNotFoundError: No module named 'httpx'
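`ModuleNotFoundError: No module named 'httpx'` simply means the httpx dependency isn't installed in the Python environment that ran the script; `python -m pip install httpx` fixes it. As a sketch of how a script can fail with a friendlier message, here is a hypothetical `check_dependency` helper (not part of main.py):

```python
import importlib.util
import sys

def check_dependency(module_name: str) -> bool:
    """Return True if module_name is importable; otherwise print an install hint."""
    if importlib.util.find_spec(module_name) is None:
        print(f"Missing dependency: run  {sys.executable} -m pip install {module_name}")
        return False
    return True

# Example: 'json' ships with Python, so it is always found.
check_dependency("json")
```

Calling this for `httpx` at startup would turn the traceback into a one-line instruction.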
Hello!
Backing up media from other accounts with your tool works perfectly.
May I raise a feature request?
I work in the visual arts, so I "like" a large number of tweets posted by others and keep them as reference material in my own likes timeline. I used to rely on a third-party tool to crawl my likes timeline and download the media locally for organizing.
After Twitter's 2023 API policy change that tool shut down, so I can no longer back up the media in my likes timeline.
Reading your code, I see you have already implemented crawling for the media and highlights timelines. While looking around, I noticed that the API for my own likes timeline starts with:
https://twitter.com/i/api/graphql/-fbTO1rKPa3nO6-XIRgEFQ/Likes?variables=
which seems to differ from the other timelines in your code only in the "Likes" segment before "variables".
I don't know how to code, so please forgive my amateur observations.
If possible, I hope you could add a feature to crawl and back up one's own likes timeline.
Also, since artists accumulate reference material daily, likes can easily number in the tens of thousands, so I'd like to ask:
If the daily API quota is 1,000 calls and each call returns 20 tweets, then at most 20,000 tweets can be returned per day.
Suppose I have 30,000 tweets to crawl and fetch tweets 1–20,000 on day one.
Once the quota resets on day two, can the crawl resume from tweet 20,001 and continue until all 30,000 are done?
Thanks for sharing your work.
Best regards
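The resume-across-days question comes down to how the GraphQL timelines paginate: each response carries a cursor marking where the next page starts, so persisting the last cursor lets a later run pick up where the previous one stopped. A minimal sketch of the idea, with a fake `fetch_page` standing in for the real Likes API call and every name here (`STATE_FILE`, `crawl`, the 20-per-page maths) hypothetical:

```python
import json
from pathlib import Path

STATE_FILE = Path("likes_cursor.json")  # hypothetical state file for the cursor
STATE_FILE.unlink(missing_ok=True)      # start the demo fresh

def fetch_page(cursor):
    """Stand-in for one Likes API call: returns (tweets, next_cursor)."""
    start = cursor or 0
    tweets = list(range(start, min(start + 20, 50)))  # pretend there are 50 likes
    next_cursor = tweets[-1] + 1 if tweets else None
    return tweets, next_cursor

def crawl(max_calls):
    """Crawl up to max_calls pages, resuming from the saved cursor."""
    cursor = json.loads(STATE_FILE.read_text())["cursor"] if STATE_FILE.exists() else None
    got = []
    for _ in range(max_calls):
        tweets, cursor = fetch_page(cursor)
        got.extend(tweets)
        STATE_FILE.write_text(json.dumps({"cursor": cursor}))  # persist progress
        if not tweets:
            break
    return got

day1 = crawl(2)  # "today": first 40 likes
day2 = crawl(2)  # "tomorrow": resumes at like 40 from the saved cursor
```

The same pattern applied to the real endpoint would let a 30,000-tweet likes timeline be drained over two days of quota.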
Could you add an optional download window based on tweet publication time? If no time is given, download everything; if given, require a start timestamp and an end timestamp. Thanks for taking a look.
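Such a time window is straightforward to apply once each tweet's creation timestamp is in hand. A sketch, where the tweet dicts and the `created_at` field name are assumptions and `None` means "no limit":

```python
def in_window(tweets, start_ts=None, end_ts=None):
    """Keep tweets whose created_at falls inside [start_ts, end_ts]; None means unbounded."""
    lo = start_ts if start_ts is not None else float("-inf")
    hi = end_ts if end_ts is not None else float("inf")
    return [t for t in tweets if lo <= t["created_at"] <= hi]

tweets = [{"id": 1, "created_at": 1700000000},
          {"id": 2, "created_at": 1705000000},
          {"id": 3, "created_at": 1710000000}]

print([t["id"] for t in in_window(tweets)])                          # no limits: all three
print([t["id"] for t in in_window(tweets, 1701000000, 1709000000)])  # only id 2
```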
Any chance of a Scrapy version? Scrapy feels more extensible and systematic, whether for crawling images or tweet text for research.
For reference:
Image filenames follow the pattern: twitterId-publishTime-img1.jpg
Video filenames follow the pattern: twitterId-publishTime-vid.mp4
In the 2023.12.12 version, the same tweet gets downloaded multiple times.
Also, could you put the tweet's status ID (the digits in status/1717859159146951031) into the filename? I'd love to be able to find the tweet from the file.
Awesome work!
Any chance of a one-click, foolproof version?
Failed to fetch info
{"data":{}}
Total time: 1.6001014709472656 s
API calls: 1
Images/videos downloaded: 0
PS F:\IDM下载\twitter_download-main> python .\main.py
Traceback (most recent call last):
  File "F:\IDM下载\twitter_download-main\main.py", line 50, in <module>
    settings = json.load(f)
               ^^^^^^^^^^^^
  File "C:\Anaconda\envs\python3.11\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "C:\Anaconda\envs\python3.11\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Anaconda\envs\python3.11\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 40 column 1 (char 1316)
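"Extra data: line 40 column 1" means `json.load` finished parsing one complete JSON value and then found more text after it — typically a second `{...}` object or stray characters pasted after the closing brace of settings.json. A small reproduction of the error (the settings keys here are made up):

```python
import json

good = '{"cookie": "abc", "save_path": "./down"}'
bad = good + '\n{"cookie": "duplicate"}'  # a second object pasted after the first

json.loads(good)  # parses fine
try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print(e.msg, "at line", e.lineno)  # Extra data at line 2
```

The line/column in the message points at the first character after the valid JSON, so deleting everything from there to the end of the file fixes the load.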
Thank you very much for the suggestion, but a concurrency limit was already added in an earlier update, at [line 281](https://github.com/caolvchong-top/twitter_download/blob/7281a8c8480d867f8d384e8a17a9ba746c9d038c/main.py#L281) of main.py.
photo_lst holds the source URLs of all images/videos in one page (20–100 tweets); waiting two seconds per link adds up when there are hundreds or thousands of files.
Originally posted by @caolvchong-top in #25 (comment)
For example: before the change, with the concurrency limit set to 1, the effective concurrency was not actually 1. When one link failed to download, the next link was launched immediately, which could fail in the same way. At least in my runs, after a failure I had to wait 2 s before the next request would succeed.
So when one link fails and the next one is about to start, there needs to be a wait between requests for the download to succeed.
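The behaviour described above — holding back the next request for a cooldown after a failure — can be sketched with an `asyncio.Semaphore` plus a sleep in the error path. `download_one` below is a mock standing in for the project's `down_save`, and the 2-second cooldown is scaled down so the demo runs quickly:

```python
import asyncio

async def download_one(url, sem, results, fail_wait=0.01):
    """Mock downloader: holds the semaphore for the whole attempt, cools down on failure."""
    async with sem:
        try:
            if "bad" in url:
                raise OSError("download failed")
            results.append(url)
        except OSError:
            await asyncio.sleep(fail_wait)  # cooldown before the next task may start

async def main():
    sem = asyncio.Semaphore(1)  # effective concurrency of 1
    results = []
    urls = ["a.jpg", "bad.jpg", "b.jpg"]
    await asyncio.gather(*(download_one(u, sem, results) for u in urls))
    return results

print(asyncio.run(main()))  # ['a.jpg', 'b.jpg']
```

Because the semaphore is held across the cooldown sleep, no new request starts until the failed one has waited out its delay — exactly the gap the comment above asks for.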
Both accounts can browse the site normally in the browser, yet both fail to connect from the script......
```
D:\runtime\Python\python.exe E:\Downloads\twitter_download-main\main.py
Failed to fetch info
Traceback (most recent call last):
  File "D:\runtime\Python\lib\site-packages\httpcore\_exceptions.py", line 10, in map_exceptions
    yield
  File "D:\runtime\Python\lib\site-packages\httpcore\_backends\sync.py", line 168, in start_tls
    raise exc
  File "D:\runtime\Python\lib\site-packages\httpcore\_backends\sync.py", line 163, in start_tls
    sock = ssl_context.wrap_socket(
  File "D:\runtime\Python\lib\ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "D:\runtime\Python\lib\ssl.py", line 1040, in _create
    self.do_handshake()
  File "D:\runtime\Python\lib\ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
OSError: [Errno 0] Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\runtime\Python\lib\site-packages\httpx\_transports\default.py", line 66, in map_httpcore_exceptions
    yield
  File "D:\runtime\Python\lib\site-packages\httpx\_transports\default.py", line 228, in handle_request
    resp = self._pool.handle_request(req)
  File "D:\runtime\Python\lib\site-packages\httpcore\_sync\connection_pool.py", line 268, in handle_request
    raise exc
  File "D:\runtime\Python\lib\site-packages\httpcore\_sync\connection_pool.py", line 251, in handle_request
    response = connection.handle_request(request)
  File "D:\runtime\Python\lib\site-packages\httpcore\_sync\http_proxy.py", line 289, in handle_request
    connect_response = self._connection.handle_request(
  File "D:\runtime\Python\lib\site-packages\httpcore\_sync\connection.py", line 99, in handle_request
    raise exc
  File "D:\runtime\Python\lib\site-packages\httpcore\_sync\connection.py", line 76, in handle_request
    stream = self._connect(request)
  File "D:\runtime\Python\lib\site-packages\httpcore\_sync\connection.py", line 156, in _connect
    stream = stream.start_tls(**kwargs)
  File "D:\runtime\Python\lib\site-packages\httpcore\_backends\sync.py", line 168, in start_tls
    raise exc
  File "D:\runtime\Python\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "D:\runtime\Python\lib\site-packages\httpcore\_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: [Errno 0] Error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Downloads\twitter_download-main\main.py", line 84, in get_other_info
    response = httpx.get(url, headers=_headers, proxies=proxies).text
  File "D:\runtime\Python\lib\site-packages\httpx\_api.py", line 189, in get
    return request(
  File "D:\runtime\Python\lib\site-packages\httpx\_api.py", line 100, in request
    return client.request(
  File "D:\runtime\Python\lib\site-packages\httpx\_client.py", line 814, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "D:\runtime\Python\lib\site-packages\httpx\_client.py", line 901, in send
    response = self._send_handling_auth(
  File "D:\runtime\Python\lib\site-packages\httpx\_client.py", line 929, in _send_handling_auth
    response = self._send_handling_redirects(
  File "D:\runtime\Python\lib\site-packages\httpx\_client.py", line 966, in _send_handling_redirects
    response = self._send_single_request(request)
  File "D:\runtime\Python\lib\site-packages\httpx\_client.py", line 1002, in _send_single_request
    response = transport.handle_request(request)
  File "D:\runtime\Python\lib\site-packages\httpx\_transports\default.py", line 228, in handle_request
    resp = self._pool.handle_request(req)
  File "D:\runtime\Python\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "D:\runtime\Python\lib\site-packages\httpx\_transports\default.py", line 83, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 0] Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\Downloads\twitter_download-main\main.py", line 287, in <module>
    main(User_info(i))
  File "E:\Downloads\twitter_download-main\main.py", line 266, in main
    if not get_other_info(_user_info):
  File "E:\Downloads\twitter_download-main\main.py", line 93, in get_other_info
    print(response)
UnboundLocalError: local variable 'response' referenced before assignment

Process finished with exit code 1
```
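The final UnboundLocalError in that log is a secondary symptom: `httpx.get` raised a ConnectError before `response` was ever assigned, and the except branch then tried to `print(response)`. The root cause is the TLS handshake failing through the proxy (the `http_proxy.py` frame and `[Errno 0]` point at the local proxy, so checking the proxy port in the settings or trying another node is the first step). A hedged sketch of a more defensive error path — `fetch_info` and `flaky_get` are hypothetical stand-ins, not the project's `get_other_info`:

```python
def flaky_get(url):
    """Stand-in for httpx.get on a broken proxy: always fails to connect."""
    raise ConnectionError("[Errno 0] Error")

def fetch_info(url, getter=flaky_get):
    """Return the response body, or None on a connection error."""
    response = None  # bound before the risky call, so the except path is safe
    try:
        response = getter(url)
    except (ConnectionError, OSError) as exc:
        print(f"Failed to fetch info: {exc}")  # report the exception, not the unset variable
        return None
    return response

print(fetch_info("https://twitter.com/i/api/..."))  # prints the error, then None
```

With this shape the script reports the real connection error instead of crashing on an unassigned local.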
As the title says.
Traceback (most recent call last):
  File "D:\Program Files\yasuobao\twitter_download-main\main.py", line 4, in <module>
    import httpx
ModuleNotFoundError: No module named 'httpx'
What's causing this?
The download finishes but shows this error:
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x000001E5B280C1F0>
Traceback (most recent call last):
  File "C:\Users\59476\miniconda3\envs\py39\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\59476\miniconda3\envs\py39\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\59476\miniconda3\envs\py39\lib\asyncio\base_events.py", line 751, in call_soon
    self._check_closed()
  File "C:\Users\59476\miniconda3\envs\py39\lib\asyncio\base_events.py", line 515, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
smart download complete
Total time: 529.8902490139008 s
API calls: 3
Images/videos downloaded: 161
Process finished with exit code 0
At most 79 tweets can be fetched for a single user.
For example: userName-20240517.csv
I've never worked with Python and my attempt at modifying it failed.
The file I changed is csv_gen.py:
current_time = time.strftime('%Y%m%d', time.localtime())  # format the current date; note that the ':' in '%H:%M' is not allowed in Windows filenames
file_name = f'{screen_name}-{current_time}.csv'  # build the filename, e.g. userName-20240517.csv
file_path = f'{save_path}/{file_name}'  # join the save path
self.f = open(file_path, 'w', encoding='utf-8-sig', newline='')
self.writer = csv.writer(self.f)
Hi, I've found that some accounts fail to download. They share one trait: their tweets have 0 replies (replying is restricted). Can this case be handled? They are not private accounts.
Is that cookie the login cookie?
Can it fetch only videos and skip images? I don't see such a setting in the config file. Also, could the downloaded filenames include the corresponding tweet information?
Exception has occurred: IndexError
list index out of range
  File "I:\huanjingbianliang\123123123\main.py", line 249, in main
    _headers['x-csrf-token'] = re.findall(re_token,_headers['cookie'])[0]
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "I:\huanjingbianliang\123123123\main.py", line 272, in <module>
    main(User_info(i))
IndexError: list index out of range
The old version kept crashing on launch; right after updating it shows this instead, but the cookie was freshly copied and I escaped it too. How do I solve this? I fed it to GPT to try to fix it myself but really can't make sense of it :(
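The IndexError on `re.findall(re_token, _headers['cookie'])[0]` means the cookie string in the settings doesn't contain the token the regex looks for, so `findall` returns an empty list. The pattern below is an assumption modelled on Twitter's `ct0` CSRF cookie (the project's actual `re_token` may differ); a guard turns the crash into a readable message:

```python
import re

CT0_RE = r"ct0=([0-9a-fA-F]+)"  # assumed pattern; the project's re_token may differ

def csrf_token(cookie: str) -> str:
    """Extract the ct0 CSRF token from a cookie header string."""
    matches = re.findall(CT0_RE, cookie)
    if not matches:
        raise ValueError("cookie has no ct0 field: copy the full cookie while logged in")
    return matches[0]

print(csrf_token("auth_token=abc123; ct0=deadbeef42; lang=en"))  # deadbeef42
```

In practice this error usually means the pasted cookie is incomplete (missing the `ct0=...` segment) rather than expired.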
Traceback (most recent call last):
  File "c:\Users\x\Downloads\Compressed\twitter_download-main\twitter_download-main\text_down.py", line 180, in <module>
    text_down(user)
  File "c:\Users\x\Downloads\Compressed\twitter_download-main\twitter_download-main\text_down.py", line 135, in __init__
    self.get_clean_save()
  File "c:\Users\x\Downloads\Compressed\twitter_download-main\twitter_download-main\text_down.py", line 162, in get_clean_save
    _time_stamp = int(raw_text['edit_control']['editable_until_msecs'])
                      ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'editable_until_msecs'
It printed this, and the number of tweets output is clearly short.
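The KeyError means some tweets carry an `edit_control` object without `editable_until_msecs`, so the direct indexing aborts the run and truncates the output. A defensive lookup sketched with `.get`; the field names follow the traceback, while `edit_control_initial` is an assumption about how edited tweets may nest the fields one level deeper:

```python
def editable_until(raw_text):
    """Return the editable-until timestamp in ms, or None when the field is absent."""
    ec = raw_text.get("edit_control", {})
    # Edited tweets may nest the fields one level deeper (assumption about the payload).
    inner = ec.get("edit_control_initial", ec)
    msecs = inner.get("editable_until_msecs")
    return int(msecs) if msecs is not None else None

print(editable_until({"edit_control": {"editable_until_msecs": "1700000000000"}}))
print(editable_until({"edit_control": {}}))  # None instead of a KeyError
```

With a fallback like this, a tweet missing the field is skipped or given a default timestamp instead of killing the whole crawl.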
Could incremental download based on the most recently downloaded content be possible? I often download the same account repeatedly, and it's inconvenient to pick a cut-off time by hand each run.
Thanks for sharing! Does the current code support crawling the text of search results?
My understanding is that the code currently supports: 1. crawling a given user's tweet text, and 2. crawling media for a given user or a searched tag.
What I'd like is: after searching a tag (or a general query), crawl the tweet text and usernames.
The packaged Releases won't download; clicking does nothing. Has GitHub dropped the archive over time, or is it a problem on my end? Could you update it?
main.py
Line 310: await asyncio.gather(*[asyncio.create_task(down_save(url[0], url[1], url[2], order)) for order, url in enumerate(photo_lst)])
A change to improve stability:
interval = 2  # delay between task starts, in seconds
tasks = []
for order, url in enumerate(photo_lst):
    task = asyncio.create_task(down_save(url[0], url[1], url[2], order))
    tasks.append(task)
    await asyncio.sleep(interval)  # stagger task starts instead of launching all at once
await asyncio.gather(*tasks)
Total tweets (incl. retweets): 51
Tweets with images/videos/audio (excl. retweets): 21
<==================>
Starting crawl...
Images/videos downloaded: 0
Images/videos downloaded: 0
xxxxx download complete
It used to work fine, then suddenly stopped downloading anything.
One common scenario needs the tweet's text content alongside the media.
The old Twitter Media Downloader extension used to generate a CSV with each batch download, containing the author's info plus fields such as:
Tweet date, Action date, Display name, Username, Tweet URL, Media type, Media URL, Saved filename, Remarks, Tweet content, Replies, Retweets, Likes
Of these, the author info and Tweet content matter most. Could this be implemented? Is it something you've considered?
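Writing such a sidecar CSV is a small addition once the tweet metadata is in hand. A sketch with `csv.DictWriter` using a subset of the extension's column names (the sample row is made up, and `io.StringIO` stands in for the real file):

```python
import csv
import io

FIELDS = ["Tweet date", "Display name", "Username", "Tweet URL",
          "Media type", "Saved filename", "Tweet content", "Replies", "Retweets", "Likes"]

buf = io.StringIO()  # stands in for the real .csv file
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "Tweet date": "2023-12-12 08:00", "Display name": "Example", "Username": "example",
    "Tweet URL": "https://twitter.com/example/status/1717859159146951031",
    "Media type": "Image", "Saved filename": "example-20231212-img1.jpg",
    "Tweet content": "sample text", "Replies": 1, "Retweets": 2, "Likes": 3,
})
print(buf.getvalue().splitlines()[0])  # the header row
```

One `writerow` call per downloaded tweet, placed next to the media-save step, reproduces the old extension's CSV.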
Can the files be given custom names?
The new API feels more complicated than the old one...
Could it automatically create a folder named after the crawled username and write into it, so save_path doesn't need editing every time? It's also awkward to handle when crawling several usernames at once.
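A per-user subfolder under save_path is a two-line change with `os.makedirs`. A sketch, where `save_path` and `screen_name` mirror the config names and the temp dir keeps the demo self-contained:

```python
import os
import tempfile

def user_dir(save_path: str, screen_name: str) -> str:
    """Return save_path/screen_name, creating the folder if needed."""
    path = os.path.join(save_path, screen_name)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path

base = tempfile.mkdtemp()          # stand-in for the configured save_path
d = user_dir(base, "example_user")
print(os.path.isdir(d))  # True
```

Calling this once per username at the start of each crawl would make a multi-user run drop every account's media into its own folder.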