glut23 / webvtt-py

Read, write, convert and segment WebVTT caption files in Python.

License: MIT License

python webvtt hls srt sbv subtitles subtitles-parsing captions

webvtt-py's Introduction

webvtt-py

webvtt-py is a Python library for reading, writing and converting WebVTT caption files. It also features caption segmentation useful when captioning HLS videos.

Documentation is available at http://webvtt-py.readthedocs.io.

Installation

$ pip install webvtt-py

Usage

import webvtt

for caption in webvtt.read('captions.vtt'):
    print(caption.identifier)  # cue identifier if any
    print(caption.start)       # cue start time
    print(caption.end)         # cue end time
    print(caption.text)        # cue payload
    print(caption.voice)       # cue voice span if any

Segmenting for HLS

import webvtt

webvtt.segment('captions.vtt', 'output/path')

Converting captions from other formats

Supported formats:

SubRip (.srt)
YouTube SBV (.sbv)

import webvtt

# use a separate name so the webvtt module is not shadowed
vtt = webvtt.from_srt('captions.srt')
vtt.save()

# alternatively in a single line
webvtt.from_sbv('captions.sbv').save()

CLI

Caption segmentation is also available from the command line:

$ webvtt segment captions.vtt --output output/path

License

Licensed under the MIT License.

webvtt-py's Issues

webvtt from SRT missing --> arrows between start/end times

I'm converting an SRT file to WebVTT.

SRT looks like this:
1
0:00:00,000 --> 0:00:05,000
The halo effect that I was going every metric possible seems like really ambitious.

Converting to WebVTT via the library function returns this:
00:00:00.000 00:00:05.000 The halo effect that I was going every metric possible seems like really ambitious.

Any ideas? Thanks

Python 2.7 support?

Hi, any idea what it would take to make this support Python 2.7? I could potentially do the work but wanted to get your thoughts first.

MalformedCaptionError - After upgrade to 0.4.0

Hi, I recently upgraded the package to 0.4.0 and started getting this error while reading YouTube VTT files. Different files raise different MalformedCaptionErrors: one reports 'Standalone cue identifier', while another reports 'webvtt.exceptions.MalformedCaptionError: --> found in line 471'. I did not face this issue with the old version.

Attaching a couple of the VTT files I'm trying to read, for your reference. (Since VTT files cannot be attached here, I copied them to text files.)

NEW_ DELL Inspiron 15 5000 series 2-in-1 Unboxing!-2VioHGnyAG0.txt
'Review' Dell Inspiron 15 5000 - 5559 _ Best video editing and gaming laptop review 2016-joCJVrcZnxM.en.txt

Please let me know if you need more info.
Thanks !

MalformedCaptionError

MalformedCaptionError: Standalone cue identifier in line 7.

This error appears when reading a VTT file.

Set up statuses on PRs

Currently you have travis-ci running on all pull requests:

https://travis-ci.org/glut23/webvtt-py

But the travis-ci hooks aren't set up to report statuses back to pull requests right now. Doing so would help keep us from merging PRs that break the tests.

MalformedCaptionError

Sometimes a .vtt file contains timing lines with no caption text. The script errors out on them.

For example:
00:22:21.320 --> 00:22:26.520
00:21:13.720 --> 00:21:15.360 line:90% position:50% align:middle

Can this error be caught somehow, or can the empty timestamps be ignored?

BOM in Caption - WEBVTT

Hi,

I would like to thank you first for this handy library.

Like the title suggests, is there a way to generate WEBVTT text/content with BOM?

I followed the documentation but the result was always without any byte order mark.
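
One possible workaround, sketched with the standard library only: Python's 'utf-8-sig' codec prepends the byte order mark when writing. The sketch assumes the WebVTT content is already available as a string; write_with_bom and the sample content are hypothetical and not part of webvtt-py.

```python
import os
import tempfile


def write_with_bom(path, content):
    """Write text to `path` as UTF-8 with a leading byte order mark."""
    # The 'utf-8-sig' codec emits the BOM (EF BB BF) before the first character.
    # newline='\n' keeps line endings untouched on every platform.
    with open(path, 'w', encoding='utf-8-sig', newline='\n') as f:
        f.write(content)


# Hypothetical sample content, for illustration only.
sample = 'WEBVTT\n\n00:00:00.000 --> 00:00:02.000\nHello there\n'
out_path = os.path.join(tempfile.mkdtemp(), 'captions_bom.vtt')
write_with_bom(out_path, sample)

# Read the file back as raw bytes to inspect the first three bytes.
with open(out_path, 'rb') as f:
    raw = f.read()
```

Reading the file back with encoding='utf-8-sig' removes the mark again, so a round trip stays lossless.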

Caption.text showing timestamps and the cue

Hi @glut23, I just had a question. I am trying to read a VTT file I downloaded from YouTube using youtube-dl:

code:
for caption in webvtt:
    if 'cars' in caption.text:
        print(caption.text)

Output:

mileage<00:10:43.230> cars<c.colorE5E5E5><00:10:43.350> obviously<00:10:44.130> don't<00:10:44.310> stay<00:10:44.430> low

The inline timestamps and cue tags are printed along with the text. Am I missing something in my code? I would really appreciate your help.

Thanks in advance !
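
Until the library strips these itself, the payload could be cleaned after reading. This is a minimal sketch under the assumption that only inline timestamps and class spans (c tags) need removing; strip_inline_tags is a hypothetical helper, not a webvtt-py function.

```python
import re

# Matches inline cue timestamps such as <00:10:43.230> (hours optional).
INLINE_TIMESTAMP = re.compile(r'<(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}>')
# Matches opening and closing class spans such as <c.colorE5E5E5>, <c> and </c>.
CLASS_SPAN = re.compile(r'</?c(?:\.[^>\s]+)*>')


def strip_inline_tags(text):
    """Return the cue payload without inline timestamps and <c> spans."""
    text = INLINE_TIMESTAMP.sub('', text)
    return CLASS_SPAN.sub('', text)


# The output line quoted in the report above.
raw = ("mileage<00:10:43.230> cars<c.colorE5E5E5><00:10:43.350> obviously"
       "<00:10:44.130> don't<00:10:44.310> stay<00:10:44.430> low")
clean = strip_inline_tags(raw)
```

Other tags (for example voice, italic or bold spans) are left untouched on purpose, since they may carry information worth keeping.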

webvtt.read_buffer doesn't work after upgrading to 0.4.4

Hi,
Here's the full script I ran with webvtt-py version 0.4.4:

from io import StringIO
import urllib.request
import webvtt

url = 'https://course-recording-q1-2020-taii.s3.eu-west-3.amazonaws.com/us/GMT20200117-205611_AI-Inst--U.transcript.vtt'
response = urllib.request.urlopen(url)
data = response.read() 
text = data.decode('utf-8')
buffer = StringIO(text)

for l in webvtt.read_buffer(buffer):
    print(l.text)

This script prints nothing, but when I print the variable text, it shows a lot of content. I think there is a problem with the read_buffer function in version 0.4.4, because everything worked fine after I downgraded to 0.4.3.
Please review this!

Parser doesn't handle BOM in SRT file

I noticed that when trying to parse an SRT file I get the following error:

----> 1 WebVTT().from_srt('5_IS_URC.srt')

C:\Users\sam\scoop\apps\python\3.5.2\lib\site-packages\webvtt\main.py in f(self, file)
     50         def f(self, file):
     51             self.file = file
---> 52             self._captions = parser_class().read(file).captions
     53             return self
     54

C:\Users\sam\scoop\apps\python\3.5.2\lib\site-packages\webvtt\generic.py in read(self, file)
    116
    117         content = self._read_content(file)
--> 118         self._validate(content)
    119         self._parse(content)
    120

C:\Users\sam\scoop\apps\python\3.5.2\lib\site-packages\webvtt\parsers.py in _validate(self, lines)
     86     def _validate(self, lines):
     87         if len(lines) < 2 or lines[0] != '1' or not self._validate_timeframe_line(lines[1]):
---> 88             raise MalformedFileError('The file does not have a valid format.')
     89
     90     def _is_timeframe_line(self, line):

MalformedFileError: The file does not have a valid format.

On further inspection, this is because my file begins with a BOM:

    def _validate(self, lines):
        if len(lines) < 2 or lines[0] != '1' or not self._validate_timeframe_line(lines[1]):
            import pdb; pdb.set_trace()
            raise MalformedFileError('The file does not have a valid format.')

(Pdb) lines[0]
'\ufeff1'
(Pdb) lines[0] == '1'
False

This could be fixed by stripping the BOM.
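
Two ways the stripping could look, as a standard-library sketch rather than the library's actual fix. strip_bom and the sample bytes are hypothetical.

```python
import codecs


def strip_bom(text):
    """Remove a leading U+FEFF byte order mark from decoded text, if present."""
    return text[1:] if text.startswith('\ufeff') else text


# Bytes of a hypothetical SRT file that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b'1\n00:00:00,000 --> 00:00:05,000\nHello\n'

# Option 1: decode as plain UTF-8, then strip the mark explicitly.
lines_stripped = strip_bom(raw.decode('utf-8')).splitlines()

# Option 2: let the 'utf-8-sig' codec drop the BOM while decoding.
lines_sig = raw.decode('utf-8-sig').splitlines()
```

With either option the first line compares equal to '1', so a check like the one in _validate would pass.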

Opposite of `Timestamp.to_seconds`

Timestamp.to_seconds is there to convert the timestamp into seconds. It might be helpful if there was also the opposite method, to build a timestamp from seconds, for example when you need to do some arithmetic on timestamps, like shifting the timings. Maybe something like

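@classmethod
def from_seconds(cls, secs):
    # Work in whole milliseconds so the fields come out as integers.
    total_ms = round(secs * 1000)
    hours, rem = divmod(total_ms, 3600 * 1000)
    mins, rem = divmod(rem, 60 * 1000)
    secs, msecs = divmod(rem, 1000)
    return cls(hours, mins, secs, msecs)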

EDIT: Also, is there a reason why milliseconds are left out of to_seconds? timedelta.total_seconds returns sub-second precision. It is easy enough to round when one needs integral seconds.
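
In the meantime, shifting can be done outside the library with datetime.timedelta, which keeps sub-second precision. This is a minimal sketch that works on 'HH:MM:SS.mmm' strings; parse_timestamp, format_timestamp and shift are hypothetical helpers and do not touch the Timestamp class.

```python
from datetime import timedelta


def parse_timestamp(ts):
    """Parse 'HH:MM:SS.mmm' into a timedelta, keeping the milliseconds."""
    hours, mins, secs = ts.split(':')
    return timedelta(hours=int(hours), minutes=int(mins), seconds=float(secs))


def format_timestamp(delta):
    """Format a non-negative timedelta as 'HH:MM:SS.mmm'."""
    total_ms = delta // timedelta(milliseconds=1)  # whole milliseconds as int
    hours, rem = divmod(total_ms, 3_600_000)
    mins, rem = divmod(rem, 60_000)
    secs, msecs = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d}.{:03d}'.format(hours, mins, secs, msecs)


def shift(ts, seconds):
    """Shift a timestamp string by a number of seconds."""
    return format_timestamp(parse_timestamp(ts) + timedelta(seconds=seconds))
```

For example, shift('00:00:59.500', 1.5) moves a cue boundary a second and a half later. Negative results are not handled here.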

identifier is None when an SRT file is parsed

I have run the following code:

srt_obj = webvtt.from_srt('dracula.srt')
# srt_obj.save('test.vtt')
print(srt_obj.captions[10].identifier)

At first, I found that the identifiers in the SRT file are not saved in the VTT file, so I did a little inspection. According to the output, the identifier is None.

BTW, I have checked issue #14. I am using 0.4.3.

BytesIO support

Hello,

Could we get support for BytesIO ?

Example:


import webvtt
from io import BytesIO

with open('test.srt', 'rb') as fh:
    buf = BytesIO(fh.read())

webvtt.from_srt(buf)

This is not a real use case, just an example of what I mean.
It would also be nice to be able to save to BytesIO.

Thanks and best all.
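
As a stopgap, a binary buffer can be converted to a text buffer and back with the standard library. The sketch below assumes UTF-8 content; both helpers are hypothetical. The resulting StringIO is the kind of object passed to webvtt.read_buffer in another report on this page; whether from_srt accepts buffers is not confirmed here.

```python
import io


def bytes_to_text_buffer(binary_buffer, encoding='utf-8-sig'):
    """Return a StringIO holding the decoded content of a binary buffer."""
    # 'utf-8-sig' also drops a leading byte order mark, if there is one.
    return io.StringIO(binary_buffer.read().decode(encoding))


def text_to_bytes_buffer(text, encoding='utf-8'):
    """Return a BytesIO holding the encoded text, positioned at the start."""
    return io.BytesIO(text.encode(encoding))


# Hypothetical in-memory SRT content, for illustration only.
binary = io.BytesIO(b'1\n00:00:00,000 --> 00:00:05,000\nHello\n')
text_buffer = bytes_to_text_buffer(binary)
```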

Malformed VTT Cue treated as ok data

https://w3c.github.io/webvtt/#webvtt-cue-block explicitly specifies that VTT Cue ends with the WebVTT line terminator. However, in the test_parse_captions_with_bom test case the last caption (without WebVTT line terminator) is treated as ok data, but is definitely malformed, because there is no WebVTT line terminator at the end.

Reading the Metadata

Hi, I have a few WebVTT files which have metadata at the beginning which looks like this:
WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204); }
::cue(c.colorE5E5E5) { color: rgb(229,229,229); }
##
Accessing caption.start throws an error: webvtt.exceptions.MalformedCaptionError: Caption missing timeframe in line 2.

I have too many files to edit them manually. Can you please help?

MalformedCaptionError

Hi, I'm not sure whether this is related to the other issues so I opened a new one. I have issues with loading a YouTube VTT file:

MalformedCaptionError                     Traceback (most recent call last)
<ipython-input-10-cf8d0e78a7c6> in <module>()
----> 1 WebVTT().read('PdUpXrgzSrI.de.vtt')

~/anaconda3/lib/python3.6/site-packages/webvtt/main.py in f(self, file)
     50         def f(self, file):
     51             self.file = file
---> 52             self._captions = parser_class().read(file).captions
     53             return self
     54 

~/anaconda3/lib/python3.6/site-packages/webvtt/generic.py in read(self, file)
    117         content = self._read_content(file)
    118         self._validate(content)
--> 119         self._parse(content)
    120 
    121         return self

~/anaconda3/lib/python3.6/site-packages/webvtt/parsers.py in _parse(self, lines)
     68                     continue
     69                 if not c.lines:
---> 70                     raise MalformedCaptionError('Caption missing text in line {}.'.format(index + 1))
     71 
     72                 self.captions.append(c)

MalformedCaptionError: Caption missing text in line 12.

WebVTT version is 0.3.3

Could you put a source package on pypi?

Hello, thanks for the package! It looks like since version 0.3.1 you've only been putting binary wheels on pypi, even though it's a pure python package. I wonder if you could upload a source distribution as well? I use some tools that always build from source (as a policy decision), and they end up pulling version 0.3.0 from pypi because it's the last version with a source distribution. "python setup.py sdist" seems to produce a working distribution, so I hope it's not much extra work for you.

MalformedCaptionError: Standalone cue identifier in line 975

hey @glut23
I used webvtt-py 0.4.6 and I get this error:
webvtt.errors.MalformedCaptionError: Standalone cue identifier in line 975.

Is that a bug in the library?
What can I do about it?

according to WebVTT specification hours field is optional when hours is zero

Greetings,

I just started looking at your module. Thanks for writing free software!

I attempted to parse a vtt file that ffmpeg generated and an exception was raised.

The ffmpeg generated file doesn't have the hours field of the timestamp if the hours is zero.

For instance:

01:11.913 --> 01:13.346

whereas the SRT file does:

00:01:11,913 --> 00:01:13,346

This seems to be within the WebVTT spec. From:

https://www.w3.org/TR/webvtt1/

we see:

"""
A WebVTT timestamp consists of the following components, in the given order:

Optionally (required if hours is non-zero):
Two or more ASCII digits, representing the hours as a base ten integer.
A U+003A COLON character (:)
"""

Looking at the regex in your code, the hours field (and its separation colon) is required.

Thanks for looking into this.

-m
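
For illustration, a pattern that treats the hours field as optional could look like the sketch below. It is not the library's regex; TIMESTAMP, TIMING_LINE and parse_timing_line are hypothetical names, and cue settings after the end time are simply ignored.

```python
import re

# Hours (two or more digits plus a colon) are optional, as in the quoted spec text.
TIMESTAMP = r'(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}'
TIMING_LINE = re.compile(r'^\s*({ts})\s+-->\s+({ts})'.format(ts=TIMESTAMP))


def parse_timing_line(line):
    """Return (start, end) from a cue timing line, or None if it does not match."""
    match = TIMING_LINE.match(line)
    return match.groups() if match else None
```

The ffmpeg-style line and the full form both match, while the SRT form with a comma before the milliseconds does not.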

.save() does not save identifiers

Hi,
I am experimenting with doing automatic translations on some VTT files and I noticed that when I save out the file, it does not save the identifier lines in the new file. Even if I do not change the file at all, they are not present in the saved file.

I will take a look and see if I can fix it and let you know.

Transcript file metadata missing

When exporting a transcript of a conversation in Teams as a .vtt file some 'voice' metadata containing the speaker's screen name is present for each caption.

e.g.

WEBVTT

00:00:00.000 --> 00:00:00.800
<v Lisa Simpson>Knock knock</v>

00:00:02.100 --> 00:00:06.500
<v Homer Simpson>Who's there?</v>

00:00:10.530 --> 00:00:11.090
<v Lisa Simpson>Atish</v>

When I use webvtt to convert these captions to jsonl for analysis I'd like to preserve this metadata for context.

current output:

{"start": "00:00:00.000", "end": "00:00:00.800", "text": "Knock knock"}
{"start": "00:00:02.100", "end": "00:00:06.500", "text": "Who's there?"}
{"start": "00:00:10.530", "end": "00:00:11.090", "text": "Atish"}

desired output:

{"start": "00:00:00.000", "end": "00:00:00.800", "text": "Knock knock", "sender_name": "Lisa Simpson"}
{"start": "00:00:02.100", "end": "00:00:06.500", "text": "Who's there?", "sender_name": "Homer Simpson"}
{"start": "00:00:10.530", "end": "00:00:11.090", "text": "Atish", "sender_name": "Lisa Simpson"}

Sample code:

import json
import webvtt

def vtt_to_jsonl(vtt_file, jsonl_file):
  captions = webvtt.read(vtt_file)

  with open(jsonl_file, 'w') as f:
    for caption in captions:
      caption_json = {
        'start': caption.start,
        'end': caption.end,
        'text': caption.text
        #'sender_name': caption.voice
      }
      json.dump(caption_json, f)
      f.write('\n')
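
The README above lists caption.voice for the cue voice span. For raw payloads that still contain the v tag, the speaker could also be split off by hand. This is a minimal standard-library sketch, assuming at most one voice span per cue; VOICE_SPAN, split_voice and cue_to_json are hypothetical names.

```python
import json
import re

# Matches a voice span such as <v Lisa Simpson>Knock knock</v>;
# the closing tag is optional.
VOICE_SPAN = re.compile(r'^<v(?:\.[^\s>]+)*\s+([^>]+)>(.*?)(?:</v>)?$', re.DOTALL)


def split_voice(payload):
    """Return (speaker, text); speaker is None when there is no voice span."""
    match = VOICE_SPAN.match(payload.strip())
    if match:
        return match.group(1).strip(), match.group(2)
    return None, payload


def cue_to_json(start, end, payload):
    """Serialise one cue as a JSON line, adding sender_name when available."""
    speaker, text = split_voice(payload)
    record = {'start': start, 'end': end, 'text': text}
    if speaker is not None:
        record['sender_name'] = speaker
    return json.dumps(record)
```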

MalformedCaptionError: Missing timing cue in line ...

Hello there, I often run into that error (blank line between time-code and text):

00:45:15.640 --> 00:45:16.760 line:91% align:center
Quelqu'un en a parlé ?

00:45:16.960 --> 00:45:20.600 line:79% align:center

Haddock l'a vu sur le pont 4
et m'a demandé mon impression.

00:45:20.920 --> 00:45:22.040 line:91% align:center
Qui est Haddock ?

I did not check yet whether it is a break in the standard from my sources or if your implementation did not take that situation into account.
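
A possible pre-processing step, sketched under the assumption that a blank line directly after a timing line is always stray: drop it before handing the content to the parser. drop_blank_after_timing is a hypothetical helper. Note that a genuinely empty cue would be merged with the following block by this approach.

```python
def drop_blank_after_timing(lines):
    """Remove blank lines that directly follow a cue timing line."""
    cleaned = []
    after_timing = False
    for line in lines:
        if after_timing and not line.strip():
            continue  # skip the stray blank line(s) before the payload
        cleaned.append(line)
        # Only a line containing the arrow is treated as a timing line.
        after_timing = '-->' in line
    return cleaned


# Hypothetical excerpt shaped like the example above.
sample = [
    '00:45:16.960 --> 00:45:20.600 line:79% align:center',
    '',
    'First payload line',
    'Second payload line',
    '',
    '00:45:20.920 --> 00:45:22.040 line:91% align:center',
    'Next cue',
]
fixed = drop_blank_after_timing(sample)
```

The blank line that separates the two cues is kept, because it follows a payload line rather than a timing line.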

Better parsing of srt subtitles to remove double newlines/breaks

I am getting a malformed-caption exception on some of my SRT files because they contain stray double line breaks, which I think break the parser.
I tried fixing it by replacing two or three consecutive line breaks with a single one, but that is not as accurate as a regex or a proper parsing approach would be. I would appreciate it if you could add support for this.

Example subtitle (part of it)

00:01:10.733 --> 00:01:12.272
Aren't you excited?

00:01:14.143 --> 00:01:17.942
Let's find another place 

to hide out this year,

and play video 
games until it blows over.

00:01:17.943 --> 00:01:19.942

That'll get us through half a day, no problem.

IndexError: list index out of range?

% webvtt segment introduction-1.webvtt --output t1
Traceback (most recent call last):
  File "/usr/local/bin/webvtt", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/webvtt/cli.py", line 27, in main
    segment(
  File "/usr/local/lib/python3.9/site-packages/webvtt/cli.py", line 47, in segment
    WebVTTSegmenter().segment(f, output, target_duration, mpegts)
  File "/usr/local/lib/python3.9/site-packages/webvtt/segmenter.py", line 93, in segment
    self._slice_segments(captions)
  File "/usr/local/lib/python3.9/site-packages/webvtt/segmenter.py", line 45, in _slice_segments
    self.segments[i].append(c)
IndexError: list index out of range

does not support unicode characters

Hi,
I have been using this library for some time, and it has difficulty reading Unicode characters.

For example, with languages like fa, ko, ar and el, we get the following error:

raise MalformedFileError('The file does not have a valid format')
webvtt.errors.MalformedFileError: The file does not have a valid format

I have to do a lot of extra work before I can use subtitles in these languages with this library.

Can this problem be solved? :)

srtwriter is writing cuetag into srt

Similar to #4

Changing

f.writelines(['{}\n'.format(l) for l in caption.lines])

to

f.write('{}\n'.format(caption.text))

works for me.

How can I use that?

Can I decode an encrypted VTT file to SRT?

existing metadata header not preserved when writing

Would love to be able to read, update, and write vtt with the metadata header intact. E.g.

WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:126000

00:00:00.000 --> 00:00:02.000
 Hello there
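
Until the writer keeps the header, one workaround is to split it off before editing and put it back afterwards. This is a minimal sketch assuming '\n' line endings and a header block that ends at the first blank line; split_header and join_header are hypothetical helpers.

```python
def split_header(content):
    """Split WebVTT text into (header_lines, body), keeping the header verbatim."""
    # The header block runs from the WEBVTT line up to the first blank line.
    head, _, body = content.partition('\n\n')
    return head.split('\n'), body


def join_header(header_lines, body):
    """Reassemble a file from a preserved header block and a (new) body."""
    return '\n'.join(header_lines) + '\n\n' + body


# Sample shaped like the file in the report above.
original = ('WEBVTT\n'
            'X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:126000\n'
            '\n'
            '00:00:00.000 --> 00:00:02.000\n'
            'Hello there\n')
header, body = split_header(original)
```

After the cues have been updated and serialised, join_header(header, new_body) restores the X-TIMESTAMP-MAP line on top.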

Standalone cue identifier in line 7

Hi, all validators are ok with the file.
Kindly please check. Thank you.
1999-0624-Wie-wir-uns-zu-einer-besseren-Personlichkeit-entwickeln.vtt.zip

Is there a way to access/modify positioning?

webvtt.errors.MalformedCaptionError: Standalone cue identifier in line 26.

Getting this error. I have countless other files of the same type that do not give an error. I don't see anything wrong with the file.

0365_An overview of financial functions in Excel Online.DE_DE.mp4_de.zip

MalformedCaptionError: Invalid Time Format

Hey everyone - I just wanted to share a quick fix for a problem I noticed: webvtt-py does not handle timestamps in the format 0:1:5.2, as opposed to 00:01:05.002.

I have written a regex find-and-replace that converts the format and shared it in this repo: https://github.com/ZhijingEu/VTT_File_Cleaner. There is also an accompanying video tutorial: https://www.youtube.com/watch?v=iZ0pOSL8JZw

Hope this helps someone out there in the future facing this issue
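
For readers who only need the idea, here is a minimal sketch of such a normalisation. It is not the code from the linked repository. It assumes the digits after the dot are a decimal fraction of a second (so .2 means 200 ms), which may differ from how a given file intends them; LOOSE_TIMESTAMP and normalise_timestamps are hypothetical names.

```python
import re

# Loose timestamp: 1+ digit hours, 1-2 digit minutes/seconds, 1-3 digit fraction.
LOOSE_TIMESTAMP = re.compile(r'\b(\d+):(\d{1,2}):(\d{1,2})\.(\d{1,3})\b')


def normalise_timestamps(line):
    """Rewrite loose timestamps in a line as zero-padded HH:MM:SS.mmm."""
    def pad(match):
        hours, mins, secs, frac = match.groups()
        # Assumption: the fraction is a decimal part, so pad it on the right.
        return '{:0>2}:{:0>2}:{:0>2}.{:0<3}'.format(hours, mins, secs, frac)
    return LOOSE_TIMESTAMP.sub(pad, line)
```

Timestamps that are already well formed pass through unchanged, so the function can be applied to every line of a file.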

SRT parser cannot handle empty captions.

If there is a double empty line, this should indicate that the caption slot is empty. Software such as Camtasia will generate SRTs with such empty slots.

A more compatible behaviour would be to parse them as captions with empty content.

MalformedCaptionError: Standalone cue identifier in line 1149

Hi! I got the error below when trying to read a VTT file.
Standalone cue identifier in line 1149.
  File "/Users/anton/repositories/stt-tts/step_3_generate_audio.py", line 54, in <module>
    vtt = webvtt.read("seamanship_1_transformed.vtt")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
webvtt.errors.MalformedCaptionError: Standalone cue identifier in line 1149.
