subby's People

Contributors

rlaphoenix, vevv

subby's Issues

SDH stripping leftovers

Some SDH lines are left over after stripping

If a sound effect comes directly after a character name, only the character name is stripped
WOMAN: (chuckles) The radio. -> (chuckles) The radio.
MONTENEGRO: (exhales) Okay. -> (exhales) Okay.

Character names that don't begin with a new line don't get stripped
BRENNAN: Cam? MONTENEGRO: Cam? -> Cam? MONTENEGRO: Cam?
the apartment was sanitized;\Nthat's what he does. SWEETS: Oh.
at the bar. BOOTH:\NOh. What do you got?

Character names with alternating capitalisation don't get stripped
McCORD: And fear is what stands
McINTYRE: She couldn't stick\Nthat four months ago.

If a sound effect is followed by a colon, it doesn't get stripped
The Philippines? (chuckling):\NYou know what, Harriet?
So? (whispers):\NSean Nolan

Sound effects that don't occur at the start or the end of a line aren't stripped
The sun is brighter, the air\Nis... (sniffs) is crisper.

If a character name was the only text on a line, it results in a double new line
Now the rules have changed.\NSWEETS:\NThe two of you getting engaged, -> Now the rules have changed.\N\NThe two of you getting engaged,
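
A minimal sketch of patterns that would cover the cases listed above. This is not subby's actual `SDHStripper` logic: `strip_sdh`, `SPEAKER` and `EFFECT` are hypothetical names, and the sketch assumes the `\N` markers have already been turned into real newlines. It will also strip any all-caps word followed by a colon (e.g. `NOTE:`), so it is a starting point rather than a fix.

```python
import re

# Speaker label: an upper-case name of 2+ letters, optionally prefixed
# with "Mc"/"Mac" (McCORD, McINTYRE), followed by a colon. Not anchored
# to the start of a line, so mid-line labels are matched too.
SPEAKER = re.compile(r"\b(?:Mc|Mac)?[A-Z]{2,}:[ \t]*")

# Sound effect in parentheses or brackets, anywhere in the line,
# with an optional trailing colon: "(chuckling):"
EFFECT = re.compile(r"[(\[][^()\[\]\n]*[)\]]:?")


def strip_sdh(text: str) -> str:
    text = EFFECT.sub("", text)
    text = SPEAKER.sub("", text)
    # Collapse leftover double spaces and drop lines that became empty,
    # which avoids the double new line when a label was alone on a line.
    lines = [re.sub(r"[ \t]{2,}", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

For example, `strip_sdh("BRENNAN: Cam? MONTENEGRO: Cam?")` gives `Cam? Cam?`.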

Italic tag apple

There are two problematic types of italics in Monarch.Legacy.of.Monsters.S01E01 (Apple TV).

When there are multiple styles in one line:

00:23:58.021 --> 00:23:59.189 align:middle line:85%,start position:50%,middle
<c.styledotAB9216>- </c><c.styledotAB9216dotitalic>Cate?</c>
<c.styledotAB9216>- Mennem kell.</c>

--->

00:23:58,021 --> 00:23:59,189
- Cate?
- Mennem kell.

And when they use one style for two lines:

00:10:35.469 --> 00:10:38.639 align:middle line:85%,start position:50%,middle
<c.styledotAB9216dotitalic>Bámuljuk itthon a falat,
hogy sose derüljön ki semmi?</c>

--->

00:10:35,469 --> 00:10:38,639
Bámuljuk itthon a falat,
hogy sose derüljön ki semmi?

WEBVTT: https://rentry.co/qhmxpg/raw
srt: https://rentry.co/wi9wow/raw
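
A rough sketch of how such class tags could be mapped to `<i>` tags. It assumes, based only on the two samples above, that a class is italic when one of its `dot`-separated parts is `italic`; `c_tags_to_italics` is a hypothetical helper, not part of subby, and nested `<c>` tags are not handled.

```python
import re

# <c.classA.classB>text</c>; the body may span several lines.
C_TAG = re.compile(r"<c((?:\.[\w-]+)*)>(.*?)</c>", re.DOTALL)


def c_tags_to_italics(cue_text: str) -> str:
    def replace(match: re.Match) -> str:
        classes = [cls for cls in match.group(1).split(".") if cls]
        body = match.group(2)
        # "styledotAB9216dotitalic" -> ["style", "AB9216", "italic"]
        italic = any("italic" in cls.split("dot") for cls in classes)
        return f"<i>{body}</i>" if italic else body

    return C_TAG.sub(replace, cue_text)
```

With the first sample this yields `- <i>Cate?</i>` on the first line and `- Mennem kell.` on the second; with the second sample the whole two-line text ends up inside a single `<i>…</i>` pair.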

Add subtitle capitalization (English)

This would be a nice to have, but it's error-prone. Would require a dictionary, similar to SubtitleEdit, as well as a list of user-provided names, which could be specific to given content (e.g. from IMDB).

Issues with double newlines

Subtitles with double/triple newlines lead to issues with the SDH stripper and certain players

For example

Is that work?
&nbsp;
&nbsp;
Yeah.

and

<i>[ Man ] Really.</i>
&nbsp;
You're welcome.
God.

Which when converted to SRT turn into

Is that work?


Yeah.

and

<i>[ Man ] Really.</i>

You're welcome.
God.

This is as expected. However, when using the SDH stripper, everything after the newlines is stripped, so it becomes

Is that work?

and

<i> Really.</i>

It also seems to cause issues with players such as mpv, which likewise won't display the subs after the newlines (though funnily enough it works if the SRT is external?)
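
One possible workaround, sketched under the assumption that blank lines inside a cue never carry meaning: drop them before the text reaches the stripper or the player. `collapse_blank_lines` is a hypothetical helper, not an existing subby function.

```python
def collapse_blank_lines(cue_text: str) -> str:
    # Treat "&nbsp;" entities and real non-breaking spaces as plain spaces,
    # so lines that only contain them count as blank.
    text = cue_text.replace("&nbsp;", " ").replace("\u00a0", " ")
    lines = [line.strip() for line in text.splitlines()]
    # Keep only lines that still have visible content.
    return "\n".join(line for line in lines if line)
```

Applied to the first sample, this leaves `Is that work?` and `Yeah.` on two consecutive lines.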

Add more selective processing

Currently CommonIssuesFixer is pretty large in scope. It would be good to break it up. Another useful feature would be per-language processing, for example spaces before question marks are always a mistake in English, but not in French.
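
A small sketch of what a per-language rule could look like. The function name, the `language` parameter and the `LANGUAGES_KEEPING_SPACE` set are all assumptions made for illustration; nothing like this exists in `CommonIssuesFixer` as far as this issue states.

```python
import re

# One or more (non-breaking) spaces directly before a question mark.
SPACE_BEFORE_QUESTION_MARK = re.compile(r"[ \u00a0]+\?")

# Languages where a space before "?" is correct typography.
LANGUAGES_KEEPING_SPACE = {"fr"}


def fix_space_before_question_mark(text: str, language: str) -> str:
    # Accept tags such as "fr-FR" and compare only the primary subtag.
    primary = language.split("-")[0].lower()
    if primary in LANGUAGES_KEEPING_SPACE:
        return text
    return SPACE_BEFORE_QUESTION_MARK.sub("?", text)
```

Each fix being its own small function with a language argument would also make it easier to enable them selectively.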

`SDHStripper`: Remove an extra space if there's a space before and also after `[]` and remove hyphen if there's only one left after stripping.

Input:

1
00:02:37,159 --> 00:02:39,159
- [Something] Lorem ipsum dolor sit amet
- [something] consectetur adipiscing elit.

2
00:02:39,240 --> 00:02:41,360
- [something] Quisque ornare dictum metus,
- [something] vitae molestie tortor.

3
00:03:09,080 --> 00:03:14,039
- [grunts]
- [something] vitae molestie tortor.

4
00:13:16,799 --> 00:13:18,759
- Quisque ornare dictum metus,
- [grunts]

Output:

1
00:02:37,159 --> 00:02:39,159
-  Lorem ipsum dolor sit amet
-  consectetur adipiscing elit.

2
00:02:39,240 --> 00:02:41,360
-  Quisque ornare dictum metus,
-  vitae molestie tortor.

3
00:03:09,080 --> 00:03:14,039
-  vitae molestie tortor.

4
00:13:16,799 --> 00:13:18,759
- Quisque ornare dictum metus,

Expected output:

1
00:02:37,159 --> 00:02:39,159
- Lorem ipsum dolor sit amet
- consectetur adipiscing elit.

2
00:02:39,240 --> 00:02:41,360
- Quisque ornare dictum metus,
- vitae molestie tortor.

3
00:03:09,080 --> 00:03:14,039
vitae molestie tortor.

4
00:13:16,799 --> 00:13:18,759
Quisque ornare dictum metus,

I know CommonIssuesFixer does these, but a user may only want to use the stripper, without fixing all sorts of other stuff.
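
A sketch of the clean-up described above, as a standalone step that could run right after stripping. `tidy_stripped_cue` is a hypothetical name and operates on the text of a single cue; it is not how `SDHStripper` is implemented.

```python
import re


def tidy_stripped_cue(text: str) -> str:
    lines = []
    for line in text.split("\n"):
        # "-  Lorem" -> "- Lorem"
        line = re.sub(r" {2,}", " ", line).strip()
        # Drop lines that are empty or only a leftover hyphen.
        if line and line != "-":
            lines.append(line)
    # A dialogue hyphen is pointless if only one hyphenated line is left.
    hyphenated = [line for line in lines if line.startswith("-")]
    if len(hyphenated) == 1:
        lines = [re.sub(r"^-\s*", "", line) for line in lines]
    return "\n".join(lines)
```

Run against the stripped text of the four cues in "Output", this produces the text shown in "Expected output".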

Reimplement flowing subs correction

Some providers use terrible flowing subtitles, probably converted from NA broadcast captions. The previous attempt was highly imperfect and corrupted some otherwise good subs, so it was removed in 94e2b96.

Example:

Original (WebVTT)
00:00:12.812 --> 00:00:13.079 line:84% align:center
(Audience laughter)             
>> Okay, I                      

00:00:13.079 --> 00:00:13.346 line:84% align:center
(Audience laughter)             
>> Okay, I found one of ou      

00:00:13.346 --> 00:00:13.613 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:13.613 --> 00:00:13.880 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:13.880 --> 00:00:14.147 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:14.147 --> 00:00:14.414 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:14.414 --> 00:00:14.681 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:14.681 --> 00:00:14.948 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:14.948 --> 00:00:15.215 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:15.215 --> 00:00:15.482 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:15.482 --> 00:00:15.749 line:84% align:center
(Audience laughter)             
>> Okay, I found one of our     

00:00:15.749 --> 00:00:16.016 line:84% align:center
>> Okay, I found one of our     
bags.                           

00:00:16.016 --> 00:00:16.282 line:84% align:center
>> Okay, I found one of our     
bags.                           

00:00:16.282 --> 00:00:16.549 line:84% align:center
>> Okay, I found one of our     
bags.                           

00:00:16.549 --> 00:00:16.816 line:84% align:center
>> Okay, I found one of our     
bags.                           

00:00:16.816 --> 00:00:17.083 line:84% align:center
bags.                           
>> Oh, that'                    

00:00:17.083 --> 00:00:17.350 line:84% align:center
bags.                           
>> Oh, that's not ours.         

00:00:17.350 --> 00:00:17.617 line:84% align:center
bags.                           
>> Oh, that's not ours.         

00:00:17.617 --> 00:00:17.884 line:84% align:center
bags.                           
>> Oh, that's not ours.         

00:00:17.884 --> 00:00:18.151 line:84% align:center
bags.                           
>> Oh, that's not ours.         

00:00:18.151 --> 00:00:18.418 line:84% align:center
bags.                           
>> Oh, that's not ours.         

00:00:18.418 --> 00:00:18.685 line:84% align:center
bags.                           
>> Oh, that's not ours.         

00:00:18.685 --> 00:00:18.952 line:84% align:center
>> Oh, that's not ours.         

00:00:18.952 --> 00:00:19.219 line:84% align:center
>> Oh, that's not ours.         
>> Sorry.                       

00:00:19.219 --> 00:00:19.486 line:84% align:center
>> Oh, that's not ours.         
>> Sorry.                       

00:00:19.486 --> 00:00:19.753 line:84% align:center
>> Oh, that's not ours.         
>> Sorry.                       

Old subby clean up attempt
6
00:00:12,812 --> 00:00:15,749
(Audience laughter)
Okay, I found one of our

7
00:00:15,749 --> 00:00:16,816
Okay, I found one of our bags.

8
00:00:16,816 --> 00:00:18,685
bags.
Oh, that's not ours.

9
00:00:18,685 --> 00:00:19,753
Oh, that's not ours. Sorry.
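
As a starting point for a reimplementation, here is a sketch of just the first step: folding cues whose text only grows (or repeats) into one cue. It is not the code removed in 94e2b96; the `Cue` class and `merge_growing_cues` are hypothetical, and the sketch does not yet deal with lines scrolling up (the repeated `bags.` above) or check that the cues are contiguous in time.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: str
    end: str
    text: str


def merge_growing_cues(cues: list) -> list:
    merged = []
    for cue in cues:
        # Broadcast captions are padded with trailing spaces; drop them.
        text = "\n".join(line.rstrip() for line in cue.text.split("\n"))
        if merged and text.startswith(merged[-1].text):
            # Same text, or the previous text with more characters typed:
            # keep the earlier start time and extend the end time.
            merged[-1] = Cue(merged[-1].start, cue.end, text)
        else:
            merged.append(Cue(cue.start, cue.end, text))
    return merged
```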

Switch off BeautifulSoup4

BeautifulSoup4 is currently used for TTML subtitle conversion, and while it handles this job well, it's certainly not the fastest option.

incorrect italics tags

Hi, we found some bugs with the italics tags. Could you please fix it?

1.

Original VTT file:

<font color="styledotitalic">Irányítóközpont
az </font>Apollo 12-<font color="styledotitalic">nek. Kilövés engedélyezve.</font>

SRT file:

<i>Irányítóközpont</i>
<i>az Apollo 12-<i>nek. Kilövés engedélyezve.</i</i>

And it should be:

<i>Irányítóközpont
az </i>Apollo 12-<i>nek. Kilövés engedélyezve.</i>

2.

Original VTT file:

- <font color="styledotitalic">Kilövés: tíz, kilenc…</font>
- Visszaszámlálás!

SRT file:

<i>- <i>Kilövés: tíz, kilenc…</i</i>
<i>- Visszaszámlálás!</i>

And it should be:

- <i>Kilövés: tíz, kilenc…</i>
- Visszaszámlálás!

3.

Original VTT file:

Apollo 12<font color="styledotitalic">, visszatérésre felkészülni!</font>

SRT file:

<i>Apollo 12<i>, visszatérésre felkészülni!</i</i>

And it should be:

Apollo 12<i>, visszatérésre felkészülni!</i>

Problem with *

Hey, when converting from ttml2 to srt, subby does that:

<p begin="00:00:00.430" end="00:00:01.820" region="AmazonDefaultRegion" style="AmazonDefaultStyle">*Schnaub*</p>

--->

1
00:00:00,430 --> 00:00:01,820
*Schnaub ♪

Add hyphens in place of speaker names in SDH stripped subtitles

Some SDH subtitles prefix one sentence with a speaker name, and the second one with nothing (or both with a speaker name). After stripping, those look like a single sentence, rather than two separate lines. Subtitle Edit already does this, see the example below:

Source:

1
00:07:24,744 --> 00:07:27,950
MANTIS:
What are you going to do?
Me? He's your brother.

2
00:08:58,109 --> 00:08:59,138
MAN: What was that?
WOMAN: What was that?

3
00:08:59,274 --> 00:09:00,941
NEBULA: What the hell?
BOY: Oh, my God!

subby:

1
00:07:24,744 --> 00:07:27,950
What are you going to do?
Me? He's your brother.

2
00:08:58,109 --> 00:08:59,138
What was that?
What was that?

3
00:08:59,274 --> 00:09:00,941
What the hell?
Oh, my God!

Subtitle Edit:

1
00:07:24,744 --> 00:07:27,950
- What are you going to do?
- Me? He's your brother.

2
00:08:58,109 --> 00:08:59,138
- What was that?
- What was that?

3
00:08:59,274 --> 00:09:00,941
- What the hell?
- Oh, my God!
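
A sketch of a heuristic that reproduces the Subtitle Edit output above. It assumes that a cue with at least one speaker label and exactly two remaining lines is a two-speaker exchange, which will be wrong when one speaker simply has a two-line sentence. `strip_speakers_with_dashes` and `SPEAKER` are hypothetical names, not Subtitle Edit's or subby's actual logic.

```python
import re

# Speaker label at the start of a line, e.g. "MAN: " or "McCORD:".
SPEAKER = re.compile(r"^(?:Mc|Mac)?[A-Z]{2,}:[ \t]*")


def strip_speakers_with_dashes(text: str) -> str:
    lines = []
    removed = 0
    for line in text.split("\n"):
        new_line, count = SPEAKER.subn("", line)
        removed += count
        if new_line.strip():
            lines.append(new_line.strip())
    # Only add dashes if a label was actually removed, two lines remain,
    # and the cue does not already use dialogue dashes.
    already_dashed = any(line.startswith("-") for line in lines)
    if removed and len(lines) == 2 and not already_dashed:
        lines = [f"- {line}" for line in lines]
    return "\n".join(lines)
```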

Expand test coverage

Currently tests are pretty rudimentary and only cover some of the processing. One thing which would particularly benefit from them is SDH stripping.

Switch to srt

srt appears to be a superior library to currently used pysrt. This would also solve the problem described in #12.

new bug with italics tags

Hi, something went wrong with this fix, could you please check it?
#24

There are extraneous opening italics tags at the beginning of a lot of lines (but not all of them).
I think the error occurs when only part of the text is italicized in the subtitle event instead of the entire text, and this affects subsequent subtitle events as well.

Original VTT file:

63
00:03:31.378 --> 00:03:34.965 align:middle line:85%,start position:50%,middle
1. RÉSZ – <c.italic>ŐRÜLT, BOLOND, ŐRÜLT, ZSENIÁLIS</c>

SRT

<i>1. RÉSZ – <i>ŐRÜLT, BOLOND, ŐRÜLT, ZSENIÁLIS</i>

Original VTT file:

106
00:06:00.819 --> 00:06:02.779 align:middle line:85%,start position:50%,middle
A <c.italic>Bárbarátokat</c>
és a <c.italic>Barry Seal: A beszállítót.</c>

SRT

<i>A <i>Bárbarátokat</i>
<i>és a <i>Barry Seal: A beszállítót.</i>

The next subtitle event in the original VTT file:

107
00:06:03.864 --> 00:06:05.490 align:middle line:85%,start position:50%,middle
Azt mondta, egy filmen dolgozik,

SRT

<i>Azt mondta, egy filmen dolgozik,

WebVTT: support styles

Some subtitles use styles, rather than simply italics tags, for formatting.

Example
WEBVTT
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000

STYLE
::cue(.styledotitalic) { font-style:italic }

1
00:00:02.628 --> 00:00:04.546 align:middle line:85%,start position:50%,middle
<c.styledotitalic>Welcome to London Heathrow.</c>
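
A sketch of how the italic classes could be read from the `STYLE` block instead of guessing from class names. It uses a plain regular expression rather than a CSS parser, only understands simple `::cue(.class) { ... }` rules like the one above, and `italic_classes` is a hypothetical helper.

```python
import re

# ::cue(.classname) { declarations }
CUE_RULE = re.compile(r"::cue\(\.([\w-]+)\)\s*\{([^}]*)\}")
ITALIC_DECLARATION = re.compile(r"font-style\s*:\s*italic")


def italic_classes(vtt_text: str) -> set:
    """Return the cue class names whose rule sets font-style to italic."""
    return {
        name
        for name, declarations in CUE_RULE.findall(vtt_text)
        if ITALIC_DECLARATION.search(declarations)
    }
```

The resulting set could then decide which `<c.…>` tags become `<i>` during conversion.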

Some problems with SDHStripper and SMPTEConverter

SDHStripper:

For some reason in this sample, the SDH part ends up being completely removed.

Details about the version and libraries of Python:

Version:
Python 3.10.11
Libraries:
beautifulsoup4 4.12.2
chardet        5.2.0
click          8.1.6
colorama       0.4.6
construct      2.8.8
lxml           4.9.3
pymp4          1.4.0
pysrt          1.1.2
soupsieve      2.4.1
subby          0.1.15
tinycss        0.4

Code executed for this example:

from subby import WebVTTConverter, CommonIssuesFixer, SDHStripper
from pathlib import Path

converter = WebVTTConverter()
fixer = CommonIssuesFixer()
stripper = SDHStripper()

file = Path('test_accessibility.vtt')
file_sdh = Path('test_accessibility_sdh.srt')
file_stripped = Path('test_accessibility_stripped.srt')
srt, _ = fixer.from_srt(converter.from_file(file))

srt.save(file_sdh)
# saved to test_accessibility_sdh.srt

stripped, status = stripper.from_srt(srt)
if status is True:
    print('stripping successful')
    stripped.save(file_stripped)
    # saved to test_accessibility_stripped.srt

Input(test_accessibility.vtt):

WEBVTT

1
00:01:00.000 --> 00:01:01.000 align:middle line:85%,start position:50%,middle
- [Kirk] Spock!
- [blowing whistle]

2
00:02:00.000 --> 00:02:01.000 align:middle line:85%,start position:50%,middle
- [Uhura] Bones.
- [continues shushing]

Output(test_accessibility_stripped.srt) is empty:

Expected Output(test_accessibility_stripped.srt):

1
00:01:00,000 --> 00:01:01,000
-  Spock!

2
00:02:00,000 --> 00:02:01,000
-  Bones.

This seems to happen because of this line. This same line was also changed in a recent commit here.

SMPTEConverter:

For some reason the character & is removed from text.

Details about the version and libraries of Python:

Version:
Python 3.10.11
Libraries:
beautifulsoup4 4.12.2
chardet        5.2.0
click          8.1.6
colorama       0.4.6
construct      2.8.8
lxml           4.9.3
pymp4          1.4.0
pysrt          1.1.2
soupsieve      2.4.1
subby          0.1.15
tinycss        0.4

Code executed for this example:

from subby import SMPTEConverter
from pathlib import Path

converter = SMPTEConverter()
file = Path('test_subtitle.ttml2')

# Convert the TTML2 file
srt = converter.from_file(file)

# srt is pysrt.SubRipFile

output = Path('test_subtitle.srt')
srt.save(output)
# saved to test_subtitle.srt

Input(test_subtitle.ttml2):

<?xml version="1.0" encoding="utf-8"?>
<tt xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling" ttp:version="2" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xml:lang="en-US">
 <head>
  <styling>
   <style xml:id="s0" tts:fontFamily="sansSerif" tts:fontStyle="italic" tts:color="white" tts:fontWeight="normal" tts:fontSize="100%"></style>
   <style xml:id="s1" tts:fontFamily="sansSerif" tts:fontStyle="normal" tts:color="white" tts:fontWeight="normal" tts:fontSize="100%"></style>
  </styling>
  <layout>
   <region xml:id="r0" tts:extent="100% 15%" tts:origin="0% 85%" tts:displayAlign="after" tts:textAlign="center"></region>
   <region xml:id="r1" tts:extent="100% 15%" tts:origin="0% 0%" tts:displayAlign="before" tts:textAlign="center"></region>
  </layout>
 </head>
 <body style="s1">
  <div>
   <p begin="00:01:00.000" end="00:01:01.000" region="r0">The two most important days in your life are<br />the day you are born & the day you find out why.</p>
  </div>
 </body></tt>

Output(test_subtitle.srt):

1
00:01:00,000 --> 00:01:01,000
The two most important days in your life are
the day you are born  the day you find out why.

Expected Output(test_subtitle.srt):

1
00:01:00,000 --> 00:01:01,000
The two most important days in your life are
the day you are born & the day you find out why.

This seems to be related to using html.unescape here. Using data directly on the line instead of unescaped, it keeps the character & but I also don't know if it's correct to do it this way.

Well, that's all I've found so far, I hope it helps in some way.

Unnecessary new line

I bumped into this issue: a third new line, a double hyphen, and an unnecessary hyphen.

Original:

1
00:05:26,800 --> 00:05:31,200
- Öhm... Bocsánat... Mr.Teufel...
de a...   <i>- Mi a baj, Safranek?!</i>

"Fixed":

1
00:05:26,800 --> 00:05:31,200
- - Öhm... Bocsánat... Mr.
- Teufel...
de a... <i>- Mi a baj, Safranek?!</i>

ellipsis characters and three dots

Over the past years the ellipsis character (…) has become standard instead of three separate dots (...) for some languages on multiple streaming platforms.

If there are ellipsis characters in the text, Subby changes them to three dots, but it would be better to keep the original characters.

Could you please make this happen?

Thank you.

Issue with Disney's attempts to make an SRT sub behave like an SSA sub

In a couple of anime, Disney chose to add overlapping sub lines with different positioning to "make a srt a ssa" sub.

00:05:38.129 --> 00:05:43.092 line:5%,start
This goes on top, usually as a forced sub to describe something in the screen
00:05:38.338 --> 00:05:39.339 line:95%,end
Line of dialogue that goes in the bottom at the same time the "forced" sub is on display
00:05:39.340 --> 00:05:42.111 line:95%,end
Next line of dialogue that also goes in the bottom

The combine timecodes function ends up merging the subs in a short timespan, which would completely wipe lines, or at least make them look wiped.

A possible solution is to check neighbouring lines so they won't get merged if they have different positioning.

Fun fact: some of these 'hacky' aka 'bugged' subs won't even show on legit devices with the D+ app
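
A sketch of the proposed check. How subby's combine timecodes step decides what to merge is not shown in this issue, so the merge conditions below (same text, nearly contiguous timing) are assumptions; `Cue`, `can_merge` and `max_gap` are hypothetical, and times are in seconds.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float
    end: float
    text: str
    settings: str = ""  # raw WebVTT cue settings, e.g. "line:95%,end"


def can_merge(first: Cue, second: Cue, max_gap: float = 0.01) -> bool:
    same_text = first.text == second.text
    contiguous = abs(second.start - first.end) <= max_gap
    # The extra guard: never merge cues that are positioned differently.
    same_position = first.settings == second.settings
    return same_text and contiguous and same_position
```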

deleted line breaks

Hi, I found an issue. Subby deletes some line breaks in two-line subtitle events which shouldn't be changed.

Original 1

Támogatják az abortusz jogokat
Massachusettsben

New 1

Támogatják az abortusz jogokat Massachusettsben

Original 2

{\an8}JACK HARPER
RIPORTER

New 2

{\an8}JACK HARPER RIPORTER

Original 3

FEHÉR LAKOSOK
EGYENLŐTLENSÉG

New 3

FEHÉR LAKOSOK EGYENLŐTLENSÉG

In professional subtitles line breaks are intentionally and carefully placed based on the style guide rules of the clients so it would be better to keep them as they are.
Subby creates:

  • too long lines (like New 1)
  • and even nonsensical text when the two lines are not part of the same sentence, just two separate on-screen texts in a two-line forced subtitle event (like New 2 and 3).

Could you please disable this change in Subby?

Thank you so much.
