uupers / bilispider
A web crawler for the bilibili site, built for big-data analysis and research.
License: GNU General Public License v3.0
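For example, bilibili has added a news (资讯) section with tid > 200: 201: 科学科普-科技 (Popular Science, Technology); 203: 热点-资讯 (Hot Topics, News); 204: 环球-资讯 (World, News); 205: 社会-资讯 (Society, News); 206: 综合-资讯 (General, News).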
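At the start of the project we misjudged the size of bilibili's user data and did not design the architecture accordingly. As a result, the VPS disk cannot hold all the data that remains to be crawled.
Enable MongoDB's built-in compression. At regular intervals, transfer the data on the VPS back to a local machine and delete the corresponding data on the VPS. Once all data has been crawled, consolidate and clean it up: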
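[yxlllc] Principles for saving space:
1. Record duplicated data only once (for example, fans and followers).
2. For similar data, record only the parts that differ (for example, the address fragment in face).
3. Record only the data at the top of a logical chain (for example, the level can be derived from the experience points, so record only the experience points).
4. Encode data numerically wherever possible (for example, represent gender as 0, 1, or 2).
5. Compress the database.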
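Principles 1 to 4 can be sketched as a small record-compaction step. This is only an illustration, not the project's actual code: the field names (`mid`, `gender`, `face`, `exp`, `level`, `fans`, `followers`), the gender code table, and the `FACE_PREFIX` value are all assumptions made up for the example.

```python
from typing import Any, Dict

# Hypothetical encodings and prefix, chosen for illustration only.
GENDER_CODES = {"male": 0, "female": 1, "unknown": 2}
FACE_PREFIX = "http://example.invalid/face/"  # assumed shared URL prefix


def compact_user(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a smaller copy of a crawled user record."""
    face = raw["face"]
    # Principle 2: keep only the part of the avatar URL that differs.
    if face.startswith(FACE_PREFIX):
        face = face[len(FACE_PREFIX):]
    return {
        "mid": raw["mid"],
        # Principle 4: store gender as a small integer.
        "gender": GENDER_CODES.get(raw["gender"], GENDER_CODES["unknown"]),
        "face": face,
        # Principle 3: keep exp only; "level" is assumed derivable from it.
        "exp": raw["exp"],
        # Principle 1: "fans" and "followers" are assumed to hold the same
        # value, so only "fans" is kept.
        "fans": raw["fans"],
    }
```

Reading the data back would need the inverse step (re-adding the prefix and recomputing the level), so the prefix and the code table have to be stored somewhere alongside the data.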
No changes to the existing architecture are needed, and everyone can continue running the distributed crawler.
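The periodic "transfer back, then delete" step could look roughly like the sketch below. It is an assumption about how such a job might be written, not the team's implementation. The function name `move_batch` and its parameters are hypothetical. It relies only on `find`, `insert_many`, and `delete_many`, which pymongo collections provide, so `remote` and `local` could be pymongo collections or any objects with the same three methods.

```python
from itertools import islice
from typing import Any


def move_batch(remote: Any, local: Any, batch_size: int = 1000) -> int:
    """Copy up to batch_size documents from remote to local, then delete
    the copied documents from remote. Returns the number moved."""
    batch = list(islice(remote.find({}), batch_size))
    if not batch:
        return 0
    # Insert locally first, so data is removed from the VPS only after
    # the local copy exists.
    local.insert_many(batch)
    ids = [doc["_id"] for doc in batch]
    remote.delete_many({"_id": {"$in": ids}})
    return len(batch)
```

A scheduled job would call `move_batch` repeatedly until it returns 0. If a run fails between the insert and the delete, the next run would try to insert the same `_id` values again, so a real job would need to handle duplicate-key errors.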
PS:
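Migrate the Mongo database to a local machine that has ample disk space and is online around the clock, and use the VPS as a relay server. This plan requires some changes to the existing architecture, including trimming fields.
We are currently proceeding with Plan A.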
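Hello uupers team, has maintenance of the data stopped? I have recently been working on an analysis of bilibili video information. Would it be possible to use your data? Thank you very much!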