
captionstransformer's Introduction

Introduction

This package is a set of tools to transform captions from one format to another. It provides a Reader and a Writer for each format, plus a script for command-line use.

Supported formats:

  • sbv Reader and Writer
  • srt Reader and Writer
  • ttml Reader and Writer
  • transcript Reader and Writer

How to use (API)

The provided unit tests contain complete examples. A short one:

from captionstransformer.sbv import Reader
from captionstransformer.ttml import Writer
from StringIO import StringIO
test_content = StringIO(u"""
0:00:03.490,0:00:07.430
>> FISHER: All right. So, let's begin.
This session is: Going Social

0:00:07.430,0:00:11.600
with the YouTube APIs. I am
Jeff Fisher,

0:00:11.600,0:00:14.009
and this is Johann Hartmann,
we're presenting today.

0:00:14.009,0:00:15.889
[pause]
""")
reader = Reader(test_content)

captions = reader.read()
len(captions) == 4
first = captions[0]
type(first.text) == unicode
first.text == u">> FISHER: All right. So, let's begin.\nThis session is: Going Social\n"

# next get a writer
filelike = StringIO()
writer = Writer(filelike)
writer.set_captions(captions)
text = writer.captions_to_text()
text.startswith(u"""<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml"><body><div>""")
writer.write()
writer.close()
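
The same sequence of calls can be wrapped in a small helper. This is only a sketch: `convert` is a hypothetical name that is not part of captionstransformer, and it assumes every Reader and Writer class follows the `read()`, `set_captions()`, `write()` and `close()` calls shown above.

```python
def convert(reader_cls, writer_cls, source, target):
    """Read captions from `source` and write them to `target`.

    `reader_cls` and `writer_cls` are expected to behave like the Reader and
    Writer classes used above (an assumption, not a documented contract).
    Returns the number of captions converted.
    """
    captions = reader_cls(source).read()
    writer = writer_cls(target)
    writer.set_captions(captions)
    writer.write()
    writer.close()
    return len(captions)
```

With the imports from the example above, `convert(Reader, Writer, test_content, filelike)` should perform the same SBV to TTML conversion.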

About Formats

It is quite hard to find simple documentation about existing caption formats. Here is a set of existing named caption formats:

SubViewer (SUB):

00:04:35.03,00:04:38.82
Hello guys... please sit down...

00:05:00.19,00:05:03.47
M. Franklin,[br]are you crazy?

YouTube (SBV):

0:00:03.490,0:00:07.430
FISHER: All right. So, let's begin.
This session is: Going Social

0:00:07.430,0:00:11.600
with the YouTube APIs. I am
Jeff Fisher,

0:00:11.600,0:00:14.009
and this is Johann Hartmann,
we're presenting today.

0:00:14.009,0:00:15.889
[pause]
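
To make the SBV timing line concrete, here is a minimal sketch of how it could be parsed. It does not use captionstransformer; `parse_sbv_time` and `parse_sbv_timing` are hypothetical helper names, and the code assumes the `H:MM:SS.mmm,H:MM:SS.mmm` layout shown in the sample.

```python
from datetime import timedelta

def parse_sbv_time(value):
    """Convert an SBV timestamp such as '0:00:03.490' into a timedelta."""
    hours, minutes, rest = value.strip().split(":")
    seconds, _, fraction = rest.partition(".")
    # Pad or trim the fraction to exactly three digits (milliseconds).
    millis = int(fraction.ljust(3, "0")[:3])
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), milliseconds=millis)

def parse_sbv_timing(line):
    """Split a 'start,end' SBV timing line into two timedeltas."""
    start, end = line.strip().split(",")
    return parse_sbv_time(start), parse_sbv_time(end)

start, end = parse_sbv_timing("0:00:03.490,0:00:07.430")
print(start, end)
```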

SubRip (SRT):

1
00:00:03,490 --> 00:00:07,430
FISHER: All right. So, let's begin.
This session is: Going Social

2
00:00:07,430 --> 00:00:11,600
with the YouTube APIs. I am
Jeff Fisher,

3
00:00:11,600 --> 00:00:14,009
and this is Johann Hartmann,
we're presenting today.

4
00:00:14,009 --> 00:00:15,889
[pause]
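
The SBV and SRT samples differ mainly in their timing lines: SRT pads the hour to two digits, uses a comma before the milliseconds, and separates start and end with ` --> `. The following sketch rewrites one into the other as plain strings. The function names are hypothetical and the code is independent of captionstransformer.

```python
def sbv_to_srt_time(value):
    """Rewrite '0:00:03.490' (SBV) as '00:00:03,490' (SRT)."""
    hours, minutes, rest = value.strip().split(":")
    seconds, _, fraction = rest.partition(".")
    # Pad or trim the fraction to exactly three digits (milliseconds).
    millis = int(fraction.ljust(3, "0")[:3])
    return "%02d:%02d:%02d,%03d" % (int(hours), int(minutes), int(seconds), millis)

def sbv_to_srt_timing(line):
    """Rewrite an SBV 'start,end' timing line as an SRT 'start --> end' line."""
    start, end = line.strip().split(",")
    return "%s --> %s" % (sbv_to_srt_time(start), sbv_to_srt_time(end))

print(sbv_to_srt_timing("0:00:03.490,0:00:07.430"))
```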

Timed Text Markup Language (TTML):

<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml">
  <body region="subtitleArea">
    <div>
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      </p>
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      </p>
    </div>
  </body>
</tt>
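
As an illustration of the TTML structure, the sample above can be read with the standard library alone. This is a sketch, not the library's own Reader: `read_ttml_cues` is a hypothetical name, and the time expressions are returned as raw strings rather than parsed.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"

SAMPLE = """<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml">
  <body region="subtitleArea">
    <div>
      <p xml:id="subtitle1" begin="0.76s" end="3.45s">
        It seems a paradox, does it not,
      </p>
      <p xml:id="subtitle2" begin="5.0s" end="10.0s">
        that the image formed on<br/>
        the Retina should be inverted?
      </p>
    </div>
  </body>
</tt>"""

def read_ttml_cues(xml_text):
    """Return a list of (begin, end, text) tuples, one per <p> element."""
    root = ET.fromstring(xml_text)
    cues = []
    for p in root.iter("{%s}p" % TTML_NS):
        # itertext() also yields the text that follows <br/>;
        # split() and join() collapse the indentation and line breaks.
        text = " ".join("".join(p.itertext()).split())
        cues.append((p.get("begin"), p.get("end"), text))
    return cues

for cue in read_ttml_cues(SAMPLE):
    print(cue)
```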

Transcript:

<?xml version="1.0" encoding="utf-8" ?>
<transcript>
    <text start="10" dur="2">Hi, I&amp;#39;m Emily from Nomensa</text>
    <text start="12" dur="3">and today I&amp;#39;m going to be talking about the order of content on your pages.</text>
    <text start="16" dur="6">Making sure the content on your web pages is presented logically is a really important part of web accessibility.</text>
    <text start="23" dur="2">Page content should be ordered so it makes sense</text>
</transcript>
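
Unlike the other formats, the transcript format stores a start and a duration instead of a start and an end. The sketch below derives the end time from them. It assumes both attributes are in seconds and that the text is entity-escaped twice, as in the sample; `read_transcript` is a hypothetical name, not a library function.

```python
import xml.etree.ElementTree as ET
from html import unescape

SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<transcript>
    <text start="10" dur="2">Hi, I&amp;#39;m Emily from Nomensa</text>
    <text start="23" dur="2">Page content should be ordered so it makes sense</text>
</transcript>"""

def read_transcript(xml_bytes):
    """Return a list of (start, end, text) tuples with times in seconds."""
    cues = []
    for node in ET.fromstring(xml_bytes).iter("text"):
        start = float(node.get("start"))
        end = start + float(node.get("dur"))
        # The XML parser resolves &amp; first; unescape() then resolves &#39;.
        cues.append((start, end, unescape(node.text or "")))
    return cues

for cue in read_transcript(SAMPLE.encode("utf-8")):
    print(cue)
```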

Microsoft SAMI (SAMI, SMI):

<SAMI>
<Head>
   <Title>President John F. Kennedy Speech</Title>
   <SAMIParam>
      Copyright {(C)Copyright 1997, Microsoft Corporation}
      Media {JF Kennedy.wav}
      Metrics {time:ms; duration: 73000;}
      Spec {MSFT:1.0;}
   </SAMIParam>
</Head>

<Body>
   <SYNC Start=0>
      <P Class=ENUSCC ID=Source>Pres. John F. Kennedy
   <SYNC Start=10>
      <P Class=ENUSCC>Let the word go forth,
         from this time and place to friend and foe
         alike that the torch
</Body>
</SAMI>

Credits

Companies

CIRB / CIBG

makinacom

Authors

captionstransformer's People

Contributors

57uff3r, brennanyoung, toutpt


captionstransformer's Issues

Library chokes on milliseconds/frames/ticks

It's neither unusual, nor beyond the specifications for ttml time attributes to include sub-second values (milliseconds, frames, ticks...), but captionstransformer fails to parse these.

Adobe Premiere (for example) includes sub-second values in its timecodes. And while it's not optimized for captions, it's in very common use, and not a wholly unusual source of ttml files.

Looking under the hood, I can see that ttml.py is only looking for '%H:%M:%S' causing the exception "ValueError unconverted data remains".

The %f directive was added to strftime/strptime in Python 2.6 to handle fractional seconds.

You can change ttml.py so that if the conversion fails it will try to convert again, using the %f directive. Something like this:

    def get_date(self, time_str):
        """Parse 'HH:MM:SS', retrying with fractional seconds if needed."""
        try:
            converted_time = datetime.strptime(time_str, '%H:%M:%S')
        except ValueError as err:
            # strptime reports leftover input (e.g. '.490') with this message.
            if 'unconverted data remains' not in str(err):
                raise
            converted_time = datetime.strptime(time_str, '%H:%M:%S.%f')
        return converted_time

It would be wise to add the %f directive to the Writer output as well:

class Writer(core.Writer):
    DOCUMENT_TPL = u"""<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml"><body><div>%s</div></body></tt>"""
    CAPTION_TPL = u"""<p begin="%(start)s" end="%(end)s">%(text)s</p>"""

    def format_time(self, caption):
        """Return start and end time for the given format"""
        #milliseconds now given (remove the [:-3] for microseconds)
        return {'start': caption.start.strftime('%H:%M:%S.%f')[:-3],
                'end': caption.end.strftime('%H:%M:%S.%f')[:-3]}
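
Both snippets above depend on the library's classes. As a standalone check of the idea, the following sketch parses a clock time with or without a fractional part and formats it back with millisecond precision. The function names are hypothetical. Note that %f only covers fractional seconds; frame or tick expressions would still need separate handling.

```python
from datetime import datetime

def parse_clock_time(time_str):
    """Parse 'HH:MM:SS' or 'HH:MM:SS.fff' into a datetime."""
    for fmt in ("%H:%M:%S", "%H:%M:%S.%f"):
        try:
            return datetime.strptime(time_str, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError("unsupported time expression: %r" % time_str)

def format_clock_time(value):
    """Format a datetime as 'HH:MM:SS.mmm' (microseconds trimmed to milliseconds)."""
    return value.strftime("%H:%M:%S.%f")[:-3]

print(format_clock_time(parse_clock_time("00:00:03.490")))
print(format_clock_time(parse_clock_time("00:00:07")))
```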

AttributeError: 'str' object has no attribute 'decode'

Hi,

I'm trying to use captionstransformer because I need to convert SRT to TTML. With the example code below, I get the following error:

Error trace:
Traceback (most recent call last):
  File "srt_to_ttml.py", line 15, in <module>
    captions = reader.read()
  File "C:\Python34\lib\site-packages\captionstransformer-1.2.1-py3.4.egg\captionstransformer\core.py", line 13, in read
    self.rawcontent = self.rawcontent.decode(self.encoding)
AttributeError: 'str' object has no attribute 'decode'

This is the example code that I'm using:
from captionstransformer.srt import Reader
from captionstransformer.ttml import Writer
from io import StringIO
test_content = StringIO(u"""
1
00:00:03,490 --> 00:00:07,430
FISHER: All right. So, let's begin.
This session is: Going Social

00:00:07,430 --> 00:00:11,600
with the YouTube APIs. I am
Jeff Fisher,
""")
reader = Reader(test_content)
captions = reader.read()

# len(captions) == 4
# first = captions[0]
# type(first.text) == unicode
# first.text == u"Jellyfish at the Monterey Aquarium"

# next get a writer

filelike = StringIO()
writer = Writer(filelike)
writer.set_captions(captions)
text = writer.captions_to_text()
text.startswith(u"""

""")
writer.write()
writer.close()

Could you please help me fix this error? I'm using ActiveState Python 3.4.1

Thanks,
Prem
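
The traceback points at `self.rawcontent.decode(self.encoding)` in core.py. On Python 3, content read from `io.StringIO` is already `str`, which has no `decode` method. The guard below is one possible way to handle both cases. It is a sketch only: `to_text` is a hypothetical helper, not the library's actual fix, and other parts of the package may need further changes to run on Python 3.

```python
import io

def to_text(raw, encoding="utf-8"):
    """Return text whether `raw` is bytes or an already decoded str."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw

# Text streams yield str, binary streams yield bytes; both end up as str.
from_text_stream = to_text(io.StringIO(u"[pause]").read())
from_byte_stream = to_text(io.BytesIO(u"[pause]".encode("utf-8")).read())
print(from_text_stream == from_byte_stream)
```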
