hyegyu / capstonedesign2

YouTube Trending Video Analysis


Duration: 2021.09 ~ 2021.11
Goal: analyze the factors that make a video reach the trending list in the growing YouTube market
Solo project
Tools: Python / Jupyter
Libraries and algorithms used
1. Crawling: BeautifulSoup, pytube, selenium
2. Visualization: matplotlib, seaborn, wordcloud
3. Sampling: SMOTE
4. Algorithms: sklearn, KNN, LDA, SVM, K-Means, GMM, SOM, DBSCAN
Final presentation (PDF): https://github.com/hyegyu/CapstoneDesign2/blob/main/%EC%BA%A1%EC%8A%A4%ED%86%A4%EB%94%94%EC%9E%90%EC%9D%B82_%EB%B0%9C%ED%91%9C.pdf

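1. Planned data collection with the Google API

The API provided by Google has a daily quota, so collecting more than 10,000 records was estimated to take about 3 months
- Stopped and looked for alternatives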

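Solution
- Kaggle hosts a dataset that is collected automatically every day on the hour
- https://www.kaggle.com/rsrishav/youtube-trending-video-dataset
- 79,554 rows; 10,923 videos after removing duplicates
- Changing the topic was also considered, but I did not want to use competition data or an already well-known dataset for the capstone project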
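
The drop from 79,554 rows to 10,923 videos is what one would expect when the same video appears in several daily snapshots. The sketch below illustrates that kind of deduplication with the standard library only. The field names `video_id` and `trending_date`, the toy rows, and the choice to keep the latest snapshot per video are assumptions for illustration, not necessarily what the notebook does.

```python
def dedupe_latest(rows):
    """Keep one row per video_id: the one with the latest trending_date."""
    latest = {}
    for row in rows:
        vid = row["video_id"]
        # ISO-formatted dates (YYYY-MM-DD) compare correctly as plain strings.
        if vid not in latest or row["trending_date"] > latest[vid]["trending_date"]:
            latest[vid] = row
    return list(latest.values())


# Toy snapshots (made-up values): video "a" appears on two days.
rows = [
    {"video_id": "a", "trending_date": "2021-09-01", "view_count": 100},
    {"video_id": "b", "trending_date": "2021-09-01", "view_count": 50},
    {"video_id": "a", "trending_date": "2021-09-02", "view_count": 300},
]
unique = dedupe_latest(rows)
print(len(rows), "rows ->", len(unique), "videos")
```

Keeping the latest snapshot preserves the final counts for each video. Keeping the first snapshot instead would describe the video at the moment it entered the list.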

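2. Planned comparison of trending videos with videos that did not trend, for trending classification

This dataset carries the condition that it was "collected every day on the hour"
Needed to collect: videos that were on the trending list on the hour vs. videos with similar tags that were not on the trending list at that time
- Used BeautifulSoup, pytube, Selenium
The collected data had to have the same attributes as the existing data, and only videos uploaded during the same period were used
5,750 videos after removing duplicates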

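Problem
- The data is a time series, and I judged that it is hard to find out from YouTube what the view, like, dislike, and comment counts were at that point in time
- Stopped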

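3. Simple visualization and correlation analysis

The most popular categories are Entertainment, People & Blogs, and Music, in that order
Trending videos are selected by view count plus other criteria set by YouTube
- For this reason, view count is highly correlated with like count, dislike count, and comment count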
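
The correlation check described above can be illustrated with a small Pearson correlation sketch. The column names and all numbers below are made up for illustration and are not values from the dataset. The notebook most likely used a library routine rather than a hand-written function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up counts for five videos, not values from the dataset.
videos = {
    "view_count": [1000, 5000, 20000, 80000, 300000],
    "likes": [40, 260, 900, 4100, 12500],
    "dislikes": [2, 9, 35, 120, 610],
    "comment_count": [5, 30, 140, 400, 2100],
}

for column in ("likes", "dislikes", "comment_count"):
    r = pearson(videos["view_count"], videos[column])
    print(f"view_count vs {column}: r = {r:.3f}")
```

Count data like this is usually heavy-tailed, so a few very large videos can dominate the coefficient. Comparing the result on log-transformed counts is a common sanity check.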

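4. Classification analysis

Class 1: 8,099 / Class 0: 2,824
The classes are not balanced
I proceeded without noticing this at first
Used SVM, LinearSVM, LDA, LSTM, GRU, MLP
- With standardization, SVM, LinearSVM, LDA, and MLP classified reasonably well, but overfitting was severe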
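
To see why the imbalance matters, it helps to compute what a trivial model would score on the class counts reported above. The sketch below is plain arithmetic on those two counts. The "always predict class 1" model is a hypothetical baseline for comparison, not something taken from the notebook.

```python
# Class counts reported above.
n_pos, n_neg = 8099, 2824
total = n_pos + n_neg

# A hypothetical baseline that always predicts the majority class (1).
accuracy = n_pos / total
recall_pos = 1.0  # every class-1 video is predicted as 1
recall_neg = 0.0  # no class-0 video is predicted as 0
balanced_accuracy = (recall_pos + recall_neg) / 2

print(f"total videos:      {total}")
print(f"majority accuracy: {accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

The baseline reaches roughly 74% accuracy without learning anything, so plain accuracy can make a classifier look better than it is. Per-class recall or balanced accuracy gives a fairer picture on data like this.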

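5. Analysis after oversampling

Computing how many days each video stayed on YouTube's trending list gave a 25th percentile of 5 days and a 50th percentile (median) of 7 days
- The data is imbalanced either way, so the analysis was run twice, once with a 5-day split and once with a 7-day split.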
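
The two splits can be pictured as binary labels derived from the number of days on the trending list. In the sketch below, the function name `label_by_days`, the direction of the label (1 means "stayed at least the threshold"), and the toy durations are all assumptions for illustration. The README does not state how the labels were encoded.

```python
def label_by_days(days_on_trending, threshold):
    """Return 1 for videos that stayed at least `threshold` days, else 0."""
    return [1 if d >= threshold else 0 for d in days_on_trending]


# Made-up trending durations in days, not values from the dataset.
days = [2, 3, 5, 5, 6, 7, 7, 8, 10, 14]

for threshold in (5, 7):
    labels = label_by_days(days, threshold)
    positives = sum(labels)
    print(f"threshold {threshold} days: {positives} positive, "
          f"{len(labels) - positives} negative")
```

A split at the 25th percentile leaves roughly a quarter of the videos on one side, while a split at the median is close to balanced by construction, which is one reason to compare both.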
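SMOTE was used for oversampling
KNN, LDA, and SVM were used for classification
- KNN overfit; the other two performed poorly
PCA, K-Means, DBSCAN, SOM, and GMM were used for clustering
- SOM separated the groups comparatively well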
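
For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified, hand-written illustration of that idea using only the standard library. The function name `smote_like`, its parameters, and the toy points are hypothetical. The README does not say which implementation was used; in practice a library such as imbalanced-learn would normally do this.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up 2-D feature vectors for the minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(minority) + len(new_points), "minority samples after oversampling")
```

Oversampling is normally applied to the training split only. Synthetic points that leak into the test split are close copies of training points and can hide overfitting.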
