youtube-channel-transcript-api's Introduction

Youtube Transcript API for Channels and Playlists

Expands upon the youtube-transcript-api and lets users easily request the caption data for all of a channel's (or a playlist's) videos. This requires the YouTube Data API v3.

Install

It is recommended to install this with pip:

pip install youtube_channel_transcript_api

If you install from source, you will have to install the dependencies from the Pipfile with pipenv install. For more information on pipenv, see the pipenv documentation.

API

Integrate this package into your Python 3.6+ application. It is built as an expansion of youtube-transcript-api. For that reason, that package's warnings and use cases mostly apply to this project as well.

The package revolves around creating YoutubeTranscripts objects and then using them to obtain the caption data from all of that channel's or playlist's videos. The package is also built on the YouTube Data API v3, so you will need to set up your own account and use your own API key. See the YouTube Data API documentation for directions on how to set up your account if you don't have one.

There are two types of YoutubeTranscripts objects: YoutubeChannelTranscripts and YoutubePlaylistTranscripts.

To initialize a YoutubePlaylistTranscripts object, call:

YoutubePlaylistTranscripts(<playlist name>, <playlist id>, <youtube data api key>)

To initialize a YoutubeChannelTranscripts object, call:

YoutubeChannelTranscripts(<youtube channel name>, <youtube data api key>)

Note: A YoutubeChannelTranscripts object searches YouTube for the top 5 channels closest to the given name and uses the top match. It then creates a YoutubePlaylistTranscripts object with the data it gets back, so the rest of the two classes' functionality is identical.

You can then either call get_transcripts() to return a dictionary of all transcripts and a list of videos that errored, or call write_transcripts() to write all of the transcripts to JSON files at the given file path.

Here is an example where the package fetches all transcript data from a channel using get_transcripts():

from youtube_channel_transcript_api import YoutubeChannelTranscripts

channel_getter = YoutubeChannelTranscripts('A Youtube Channel', 'Youtube Data API Key here')

videos_data, videos_errored = channel_getter.get_transcripts()

In this instance, videos_data will look like

{
 'video id 1': 
	{ 'title': 'videos title 1',
	  'captions': [
			{
				'text': 'Hey there',
				'start': 7.58,
				'duration': 6.13
			},
			{
				'text': 'how are you',
				'start': 14.08,
				'duration': 7.58
			},
			# ...
		]
	},
 'video id 2': 
	{ 'title': 'videos title 2',
	  'captions': [
			{
				'text': 'Hola there',
				'start': 5.1,
				'duration': 6.13
			},
			{
				'text': 'how are I',
				'start': 12.08,
				'duration': 3.58
			},
			# ...
		]
	},
 #...
}

And videos_errored will look like

[ ['bad video title 1', 'bad video id 1'], ['bad video title 2', 'bad video id 2'] ]
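As a minimal sketch of how the two return values described above might be consumed, the helper below walks both structures and builds one summary line per video. The function name summarize is hypothetical and not part of this package; it only assumes the dictionary and list shapes shown above.

```python
def summarize(videos_data, videos_errored):
    """Build one human-readable line per video from the two return values."""
    lines = []
    for video_id, video in videos_data.items():
        captions = video['captions']
        # The last caption's start + duration approximates where the captions end.
        end = max((c['start'] + c['duration'] for c in captions), default=0.0)
        lines.append(
            f"{video_id}: {video['title']} "
            f"({len(captions)} captions, ends at {end:.2f}s)"
        )
    # Each errored entry is a [title, video id] pair.
    for title, video_id in videos_errored:
        lines.append(f"{video_id}: {title} (no transcript)")
    return lines


if __name__ == '__main__':
    sample_data = {
        'video id 1': {
            'title': 'videos title 1',
            'captions': [
                {'text': 'Hey there', 'start': 1.0, 'duration': 2.0},
                {'text': 'how are you', 'start': 3.0, 'duration': 2.5},
            ],
        },
    }
    sample_errored = [['bad video title 1', 'bad video id 1']]
    for line in summarize(sample_data, sample_errored):
        print(line)
```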

Write Transcripts

The function write_transcripts() writes each transcript out to a file in JSON format. It has one required parameter, file_path, which is where the function will create the necessary directories and files. It writes all the files to the same location, and each file is named after the video's title. It returns a list of videos that have errored, in the format above.

An example would be

from youtube_channel_transcript_api import YoutubeChannelTranscripts

channel_getter = YoutubeChannelTranscripts('A Youtube Channel', 'Youtube Data API Key here')

videos_errored = channel_getter.write_transcripts('/home/user/blah/here/') # don't forget to have that last /
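Because each file is named after the video's title, titles containing characters such as | or ? cannot be used as file names on Windows. One possible workaround, sketched below, is to call get_transcripts() and write the files yourself with sanitized names. The helpers safe_filename and write_sanitized are hypothetical and not part of this package; they only assume the videos_data shape shown earlier.

```python
import json
import re
from pathlib import Path

# Characters Windows rejects in file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title):
    """Replace characters that are invalid in Windows file names with '_'."""
    cleaned = _INVALID.sub('_', title).strip(' .')
    return cleaned or 'untitled'


def write_sanitized(videos_data, out_dir):
    """Write one JSON file per video, named after its sanitized title."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for video_id, video in videos_data.items():
        target = out_path / (safe_filename(video['title']) + '.json')
        with open(target, 'w', encoding='utf-8') as f:
            json.dump(video, f, ensure_ascii=False)
        written.append(target)
    return written
```

Note that two different titles can map to the same sanitized name, in which case the later file overwrites the earlier one; appending the video id to the name would avoid that.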

Shared Parameters

Both get_transcripts() and write_transcripts() accept the same optional parameters.

Languages

youtube-channel-transcript-api lets you request transcripts in your preferred language. To do this, add a languages parameter to the call (it defaults to English).

channel_getter = YoutubeChannelTranscripts('A Youtube Channel', 'Youtube Data API Key here')

videos_data, videos_errored = channel_getter.get_transcripts(languages=['de', 'en'])

The parameter is a list of language codes in descending priority. In this example, it will first try to fetch the German transcript ('de') and then fetch the English transcript ('en') if that fails.
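The descending-priority behaviour can be illustrated with a small sketch. This is not the package's implementation; pick_language and the available set are hypothetical names used only to show how a priority list resolves to a single language.

```python
def pick_language(available, languages=('en',)):
    """Return the first requested language code that is available, else None."""
    for code in languages:
        if code in available:
            return code
    return None


if __name__ == '__main__':
    # German is preferred, but only English and French exist for this video.
    print(pick_language({'en', 'fr'}, ['de', 'en']))
```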

Cookies

Some videos are age-restricted, so this module won't be able to access them without some form of authentication. To authenticate, you will need access to the desired video in a browser, and then you will need to download that page's cookies into a text file. You can use the Chrome extension cookies.txt or the Firefox extension cookies.txt.

Once you have that, you can use it with the module to access age-restricted videos' captions like so.

channel_getter = YoutubeChannelTranscripts('A Youtube Channel', 'Youtube Data API Key here')

videos_data, videos_errored = channel_getter.get_transcripts(cookies='/path/to/your/cookies.txt')
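Before passing a cookies file to get_transcripts(), it can help to confirm that the file parses. The sketch below assumes the file is in the Netscape cookies.txt format that such browser extensions typically export; count_cookies is a hypothetical helper, not part of this package.

```python
import http.cookiejar


def count_cookies(cookies_path):
    """Load a Netscape-format cookies.txt and return how many cookies it holds."""
    jar = http.cookiejar.MozillaCookieJar(cookies_path)
    # Keep session and expired cookies so the count reflects the file contents.
    # load() raises http.cookiejar.LoadError if the file is not in this format.
    jar.load(ignore_discard=True, ignore_expires=True)
    return len(jar)
```

A count of zero, or a LoadError, suggests the export did not work as expected.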

Proxies

You can specify an HTTP/HTTPS proxy, which will be used for the requests to YouTube:

channel_getter = YoutubeChannelTranscripts('A Youtube Channel', 'Youtube Data API Key here')

videos_data, videos_errored = channel_getter.get_transcripts(proxies={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})  

As the proxies dict is passed on to the requests.get(...) call, it follows the format used by the requests library.
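If the proxy username or password contains characters such as @ or :, they need to be percent-encoded before being placed in the URL. The sketch below builds a dict in the format shown above; build_proxies and its parameters are hypothetical and not part of this package.

```python
from urllib.parse import quote


def build_proxies(user, password, host, port):
    """Build a requests-style proxies dict with percent-encoded credentials."""
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return {
        "http": f"http://{credentials}@{host}:{port}",
        "https": f"https://{credentials}@{host}:{port}",
    }


if __name__ == '__main__':
    print(build_proxies('user', 'p@ss', 'proxy.example.com', 8080))
```

The returned dict can then be passed as the proxies argument to get_transcripts() or write_transcripts().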

Just Text

You can request that timestamp information be left out of the returned videos_data, or out of the files written to disk. By default, just_text is set to False.

channel_getter = YoutubeChannelTranscripts('A Youtube Channel', 'Youtube Data API Key here')

videos_data, videos_errored = channel_getter.get_transcripts(just_text=True)

In this example, videos_data will now look like

{
 'video id 1': 
	{ 'title': 'videos title 1',
	  'captions': 'Hey there how are you ...',
	},
 'video id 2': 
	{ 'title': 'videos title 2',
	  'captions': 'Hola there how are I ...',
	},
 #...
}
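If you already have the timestamped output, a similar plain-text form can be derived from it. The sketch below assumes the caption texts are joined with single spaces, which matches the example above but is not confirmed to be how the package does it; captions_to_text is a hypothetical helper.

```python
def captions_to_text(captions):
    """Join the 'text' field of each caption entry into one string."""
    return ' '.join(c['text'].strip() for c in captions)


if __name__ == '__main__':
    captions = [
        {'text': 'Hey there', 'start': 7.58, 'duration': 6.13},
        {'text': 'how are you', 'start': 14.08, 'duration': 7.58},
    ]
    print(captions_to_text(captions))
```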

Warning

This code, in part, uses an undocumented part of the YouTube API, which is called by the YouTube web client, so there is no guarantee that it won't stop working tomorrow if YouTube changes how things work. It also uses the YouTube Data API v3, so it is up to you to make sure you follow all of that API's rules. In addition, you will have to manage your own quota for the YouTube Data API, which is its mechanism for limiting calls.

youtube-channel-transcript-api's People

Contributors

danielcliu, dependabot[bot]

youtube-channel-transcript-api's Issues

OSError: [Errno 22] Invalid argument

#I can get transcripts from some channels really smoothly, but sometimes I get this error message:

Traceback (most recent call last):
  File "C:\Users\coope\PycharmProjects\HelloWorld\app.py", line 7, in <module>
    videos_errored = channel_getter.write_transcripts('C:/Users/coope/Desktop/HHClubNite_Transcripts/') # don't forget to have that last /
  File "C:\Users\coope\PycharmProjects\HelloWorld\venv\lib\site-packages\youtube_channel_transcript_api\transcripts.py", line 52, in write_transcripts
    with open(filepath, 'w') as f:
OSError: [Errno 22] Invalid argument: 'C:/Users/coope/Desktop/HHClubNite_Transcripts/Spencer_Jones_asks_‘Do_You_Remember_My_Mum?’.json'

#Other examples:

OSError: [Errno 22] Invalid argument: 'C:/Users/coope/Desktop/PeepShow_Transcripts/Mark_and_Jez_Play_Doubles_Tennis_|_Peep_Show.json'

OSError: [Errno 22] Invalid argument: "C:/Users/coope/Desktop/Comedyon4_Transcripts/BAD_DAY??_🤔_This_will_make_you_laugh_so_hard_you'll_cry_🤣🔥_Ep._20.json"

#Is the problem possibly that the video titles include irregular characters like emojis, '|' and '?'

#I am new to python, so I could be making a basic error, this is the code I used (the same code works perfectly with some channels):

from youtube_channel_transcript_api import YoutubeChannelTranscripts
channel_getter = YoutubeChannelTranscripts('UCEZVbBq-gdk0_P369jZv0Vw', 'My_API_Key')
videos_errored = channel_getter.write_transcripts('C:/Users/coope/Desktop/HHClubNite_Transcripts/', just_text=True)

Get transcriptions from channel/playlist

Hi,

I would like to get the transcription of this channel/playlist (explore>news>worldnews) : https://www.youtube.com/channel/UCvAvFl2OGsuDSoOo93Kd0nA

I tried:
from youtube_channel_transcript_api import YoutubePlaylistTranscripts
channel_getter = YoutubePlaylistTranscripts('World News', 'UCvAvFl2OGsuDSoOo93Kd0nA', 'myapikey')
videos_data, videos_errored = channel_getter.get_transcripts()

But it didn't work. I get the following error:
404 Client Error: Not Found for url: https://www.googleapis.com/youtube/v3/playlistItems?part=snippet&maxResults=50&playlistId=UCvAvFl2OGsuDSoOo93Kd0nA&key=myapikey&pageToken=

Can you help me, please? Is it because this page https://www.youtube.com/channel/UCvAvFl2OGsuDSoOo93Kd0nA uses a lot of playlists, with videos from other channels?

Thanks,
Cheers,
Camille
