
mediacrawler-new's Introduction

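Disclaimer:

The original code in this repository comes from the internet; for questions about rights and liability, please consult the original author. This repository exists only as a backup.

All content in this repository is for learning and reference only and must not be used commercially. No person or organization may use the content of this repository for illegal purposes or to infringe the legitimate rights and interests of others. The crawling techniques involved are for learning and research only and must not be used for large-scale crawling of other platforms or for any other illegal activity. This repository accepts no legal liability arising from use of its content. By using the content of this repository, you agree to all terms and conditions of this disclaimer.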

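Repository description

Crawlers for Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, and more.
It can currently scrape videos, images, comments, likes, reposts, and other information from Xiaohongshu, Douyin, Kuaishou, Bilibili, and Weibo.

How it works: playwright serves as a bridge. The browser context is preserved after a successful login, and some encrypted parameters are obtained by evaluating JS expressions in it. This avoids reimplementing the core encryption JS code and greatly lowers the reverse-engineering difficulty.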
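
The idea above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `window.sign` is a hypothetical page-defined signing function, `fetch_signed_params` is a hypothetical helper, and `page` is assumed to be any object with an async `evaluate(expression, arg)` method, such as a Playwright `Page` taken from an already logged-in browser context. `FakePage` exists only so the sketch runs without a browser.

```python
import asyncio
from typing import Any

# Hypothetical JS expression: calls a signing function that the logged-in page
# itself defines. The real function name differs per platform.
SIGN_EXPRESSION = "([uri, data]) => window.sign(uri, data)"


async def fetch_signed_params(page: Any, uri: str, data: dict) -> dict:
    """Ask the live page to compute encrypted parameters for a request.

    `page` is assumed to expose an async `evaluate(expression, arg)` method,
    as a Playwright Page does, so the site's own JS does the signing.
    """
    result = await page.evaluate(SIGN_EXPRESSION, [uri, data])
    # Normalize to a plain dict so callers can merge it into request headers.
    return dict(result or {})


class FakePage:
    """Stand-in for a logged-in page, used only to demonstrate the call shape."""

    async def evaluate(self, expression: str, arg: list) -> dict:
        uri, data = arg
        return {"x-sign": f"signed:{uri}:{len(data)}"}


if __name__ == "__main__":
    headers = asyncio.run(
        fetch_signed_params(FakePage(), "/api/search", {"keyword": "python"})
    )
    print(headers)
```

Because the signing runs inside the page, the Python side never needs to know how the parameters are computed; it only passes the request URI and payload in and reads the result back.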

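Feature list

Platform | Cookie login | QR code login | Specified creator homepage | Keyword search | Crawl by specified video/post ID | Login state cache | Data saving | IP proxy pool | Slider captcha
Xiaohongshu
Douyin
Kuaishou
Bilibili
Weibo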

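Usage

Create and activate a Python virtual environment

# Enter the project root directory
cd MediaCrawler

# Create the virtual environment
python -m venv venv

# Activate the virtual environment on macOS and Linux
source venv/bin/activate

# Activate the virtual environment on Windows
venv\Scripts\activate

Install the dependencies

pip3 install -r requirements.txt

Install the playwright browser drivers

playwright install

Run the crawler

# Comment crawling is disabled by default; enable it in the config file if needed
# Read keywords from the config file, search for related posts, and crawl the post information and comments
python main.py --platform xhs --lt qrcode --type search

# Read the specified post ID list from the config file and fetch the information and comments for those posts
python main.py --platform xhs --lt qrcode --type detail

# Open the corresponding app and scan the QR code to log in

# For usage examples on other platforms, run the command below
python main.py --help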
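
For readers curious how flags like these are typically wired up, here is a minimal argparse sketch. It is an assumption about the shape of the CLI, not the project's real `main.py`: `build_parser` is a hypothetical helper, and the allowed values are limited to those that appear in this document (`xhs` and `bili` as platforms; `qrcode`, `phone`, `cookie` as login types; `search`, `detail`, `creator` as crawl types). The real program may accept more.

```python
import argparse

# Values below are only those mentioned in this document; the real CLI
# may support additional platforms and crawl types.
PLATFORMS = ["xhs", "bili"]
LOGIN_TYPES = ["qrcode", "phone", "cookie"]
CRAWLER_TYPES = ["search", "detail", "creator"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the --platform / --lt / --type flags."""
    parser = argparse.ArgumentParser(description="Hypothetical crawler CLI sketch")
    parser.add_argument("--platform", choices=PLATFORMS, default="xhs",
                        help="which media platform to crawl")
    parser.add_argument("--lt", choices=LOGIN_TYPES, default="qrcode",
                        help="login type")
    # Store under crawler_type so the attribute name does not read like
    # the Python builtin `type`.
    parser.add_argument("--type", dest="crawler_type", choices=CRAWLER_TYPES,
                        default="search", help="crawl type")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.platform, args.lt, args.crawler_type)
```

Using `choices` means an unsupported value is rejected with a usage message before any browser is launched.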

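Data storage

  • Supports saving to relational databases (MySQL, PostgreSQL, etc.)
  • Supports saving to CSV (under the data/ directory)
  • Supports saving to JSON (under the data/ directory)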
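
As an illustration of the CSV and JSON options above, the following sketch writes a list of scraped records to both formats under a `data/` directory. `save_records` and the record fields are hypothetical; the project's own storage code may name files and columns differently.

```python
import csv
import json
from pathlib import Path


def save_records(records: list[dict], name: str, data_dir: str = "data") -> tuple[Path, Path]:
    """Write records to <data_dir>/<name>.csv and <data_dir>/<name>.json."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{name}.csv"
    json_path = out_dir / f"{name}.json"

    # Collect the union of keys in first-seen order, since scraped records
    # do not always carry the same fields.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as empty cells

    with json_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path


if __name__ == "__main__":
    paths = save_records(
        [{"id": "1", "title": "示例"}, {"id": "2", "likes": 3}], "notes"
    )
    print(paths)
```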

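Common runtime errors Q&A

➡️➡️➡️ FAQ

Project code structure

➡️➡️➡️ Project code structure notes

Phone number login notes

➡️➡️➡️ Phone number login notes

References

Statement

This project is for learning purposes only; do not use it for anything else. If you have questions, you can join the group chat to discuss them. The QR code is updated from time to time; if it has expired, star the repo first and check back in a few days.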


mediacrawler-new's People

Contributors

jiji262


mediacrawler-new's Issues

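Error when crawling bili videos

The configuration in base_config.py is as follows:

Basic configuration

PLATFORM = "bili"
KEYWORDS = "python,golang"
LOGIN_TYPE = "cookie" # qrcode or phone or cookie
COOKIES = "xxxxxx"
SORT_TYPE = "popularity_descending" # See the enum values under media_platform.xxx.field for valid values; currently only supported for Xiaohongshu
CRAWLER_TYPE = "download_video" # Crawl type: search (keyword search) | detail (post details) | creator (creator homepage data) | video_download (video download, currently bili only)

List of bvids of the videos to crawl on Bilibili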

BILI_SPECIFIED_ID_LIST = [

"av1204161056",
# "av865189147",
# "BV1Sz4y1U77N",
# "av1204161056",
# ........................

]

The error output is as follows:
[BiliBili] Extracting URL: https://www.bilibili.com/video/av1204161056
[BiliBili] 1204161056: Downloading webpage
[BiliBili] BV18f421U7Wk: Extracting videos in anthology
[BiliBili] Downloading playlist BV18f421U7Wk - add --no-playlist to download just the video BV18f421U7Wk
[download] Downloading playlist: 【全368集】强推!建议所有想学Python的同学,死磕这条视频,2024年字节大佬花了一周时间整理的Python(数据分析)保姆级教程,全程干货无废话!
[BiliBili] Playlist 【全368集】强推!建议所有想学Python的同学,死磕这条视频,2024年字节大佬花了一周时间整理的Python(数据分析)保姆级教程,全程干货无废话!: Downloading 100 items of 100
[download] Downloading item 1 of 100
[BiliBili] Extracting URL: https://www.bilibili.com/video/BV18f421U7Wk?p=1
[BiliBili] 18f421U7Wk: Downloading webpage
[BiliBili] BV18f421U7Wk: Extracting videos in anthology
[BiliBili] 1204161056: Extracting chapters
[BiliBili] Format(s) 1080P 高清, 720P 高清 are missing; you have to login or become premium member to download them. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies
[info] BV18f421U7Wk_p1: Downloading 1 format(s): 100100+30280
ERROR: You have requested merging of multiple formats but ffmpeg is not installed. Aborting due to --abort-on-error
Traceback (most recent call last):
File "D:\code\MediaCrawler-new\main.py", line 62, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "D:\anaconda\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "D:\code\MediaCrawler-new\main.py", line 53, in main
await crawler.start()
File "D:\code\MediaCrawler-new\media_platform\bilibili\core.py", line 93, in start
await self.download_video_given_url(video_id, path=f'./video/{video_id}')
File "D:\code\MediaCrawler-new\media_platform\bilibili\core.py", line 140, in download_video_given_url
result = ydl.download([video_id])
^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 3572, in download
self.__download_wrapper(self.extract_info)(
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 3547, in wrapper
res = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1595, in extract_info
return self.__extract_info(url, self.get_info_extractor(key), download, extra_info, process)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1606, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1762, in __extract_info
return self.process_ie_result(ie_result, download, extra_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1891, in process_ie_result
return self.__process_playlist(ie_result, download)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 2035, in __process_playlist
entry_result = self.__process_iterable_entry(entry, download, collections.ChainMap({
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1606, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 2067, in __process_iterable_entry
return self.process_ie_result(
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1841, in process_ie_result
return self.extract_info(
^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1595, in extract_info
return self.__extract_info(url, self.get_info_extractor(key), download, extra_info, process)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1606, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1762, in __extract_info
return self.process_ie_result(ie_result, download, extra_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1821, in process_ie_result
ie_result = self.process_video_result(ie_result, download=download)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 2982, in process_video_result
self.process_info(new_info)
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 3406, in process_info
self.report_error(f'{msg}. Aborting due to --abort-on-error')
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1073, in report_error
self.trouble(f'{self._format_err("ERROR:", self.Styles.ERROR)} {message}', *args, **kwargs)
File "D:\anaconda\Lib\site-packages\yt_dlp\YoutubeDL.py", line 1012, in trouble
raise DownloadError(message, exc_info)
yt_dlp.utils.DownloadError: ERROR: You have requested merging of multiple formats but ffmpeg is not installed. Aborting due to --abort-on-error
python-BaseException
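
The log above points to two separate things: yt-dlp treated the multi-part video as a playlist of 100 items, and it could not merge the separately downloaded video and audio streams because ffmpeg was not found. Below is a minimal preflight sketch, assuming the download goes through `yt_dlp.YoutubeDL` as the traceback suggests. `build_ydl_opts` is a hypothetical helper and not part of the project; whether to skip the rest of the playlist is a choice for the user.

```python
import shutil
from typing import Optional


def build_ydl_opts(output_dir: str, ffmpeg_path: Optional[str] = None) -> dict:
    """Build yt-dlp options, failing early when ffmpeg cannot be found."""
    # Look on PATH unless an explicit location is given.
    ffmpeg = ffmpeg_path or shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError(
            "ffmpeg not found: install it and add it to PATH, "
            "or pass ffmpeg_path explicitly"
        )
    return {
        # Output filename template inside the target directory.
        "outtmpl": f"{output_dir}/%(title)s.%(ext)s",
        # Download only the requested video, not every part of the anthology.
        "noplaylist": True,
        # Tell yt-dlp where the merger binary lives.
        "ffmpeg_location": ffmpeg,
    }


# Possible usage, assuming yt_dlp is installed:
#   import yt_dlp
#   with yt_dlp.YoutubeDL(build_ydl_opts("./video")) as ydl:
#       ydl.download(["https://www.bilibili.com/video/av1204161056"])
```

Checking for ffmpeg before starting turns a long traceback from deep inside yt-dlp into a single clear message. The separate warning about missing 1080P and 720P formats concerns authentication and is not addressed by this sketch.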
