irt-open-source / scf

Subtitling Conversion Framework

License: Apache License 2.0

XSLT 78.06% Python 6.66% XQuery 14.88% Dockerfile 0.30% CSS 0.10%

scf's Issues

Change requirement for StartBox and EndBox mapping

Refers to module: STLXML2EBU-TT

Currently the processing is not consistent when handling teletext control codes (CC) in the 'TF' (text field) of the TTI block. According to EBU Tech 3360, StartBox/EndBox should close the current span and open a new one. The same applies to color CCs (AlphaBlack, AlphaGreen, ...). For StartBox/EndBox the requirements reflect exactly that. However, for color CCs the requirements state that a new span is created only when the CC changes the subtitle style (foreground or background color). That means there will be no two consecutive spans with the same style.

For StartBox/EndBox the processing should be the same as for color CCs.

For most real world STLXML files this will not be an issue since every line uses only one box.
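
A minimal sketch of the requested rule, written in Python rather than the module's XSLT: a new span is opened only when a control code actually changes the effective style. The token representation and the style names are hypothetical and do not reflect the STLXML schema.

```python
from typing import List, Tuple

# Hypothetical token stream: ("cc", style_name) for a control code that
# selects a style, ("text", content) for character data.
Token = Tuple[str, str]


def build_spans(tokens: List[Token], initial_style: str = "default") -> List[Tuple[str, str]]:
    """Group text into (style, text) spans, opening a span only on a real style change."""
    spans: List[Tuple[str, str]] = []
    current_style = initial_style
    for kind, value in tokens:
        if kind == "cc":
            # Remember the requested style, but do not open a span yet.
            current_style = value
        elif kind == "text" and value:
            if spans and spans[-1][0] == current_style:
                # Same style as the previous span: extend it instead of adding a new one.
                spans[-1] = (current_style, spans[-1][1] + value)
            else:
                spans.append((current_style, value))
    return spans


tokens = [("cc", "green"), ("text", "Hello "), ("cc", "green"), ("text", "world"),
          ("cc", "yellow"), ("text", "!")]
spans = build_spans(tokens)
```

The repeated "green" control code does not split the text, so no two consecutive spans share a style.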

STL2STLXML: Translation of "$" character is missing for ISO6937/2

The dollar sign "$" is not translated in the Text Field element if an STL file uses Character Code Table "00" (from ISO 6937/2-1983) as the encoding for the TTI blocks. No other code point is generated at all.

STL2STLXML: mapping of TTI Blocks containing user data

In the current scf version 0.2.8, TTI blocks from an STL file that contain user data are ignored when transforming from STL to STLXML. To preserve user data (e.g. to allow for round-tripping scenarios) these TTI blocks should be mapped to the STLXML format as well. User data is signalled by an Extension Block Number (EBN) value of FEh. TTI blocks with an EBN set to a reserved value (F0h-FDh) should continue to be ignored as described in requirement no. 208.

The content of the text field (TF) cannot be mapped (proprietary data) and thus should be encoded, e.g. using base64.

Extension Block Number (EBN): FEh --> TTI block contains user data
Extension Block Number (EBN): F0h-FDh --> Reserved Codes

The description of the requirement no. 208 should be updated accordingly.
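
The EBN handling described above could be sketched as follows. This is an illustration under assumptions, not the code of stl2stlxml.py: the function names and the category labels are hypothetical, and only the EBN values come from the text above.

```python
import base64

EBN_USER_DATA = 0xFE               # TTI block contains user data
EBN_RESERVED = range(0xF0, 0xFE)   # F0h-FDh, reserved codes


def classify_ebn(ebn: int) -> str:
    """Return a hypothetical category label for an Extension Block Number."""
    if ebn == EBN_USER_DATA:
        return "user_data"
    if ebn in EBN_RESERVED:
        return "reserved"
    return "subtitle"


def encode_user_data(tf: bytes) -> str:
    """Encode the proprietary text field content as base64 text for the XML output."""
    return base64.b64encode(tf).decode("ascii")
```

A converter following this sketch would skip blocks classified as "reserved", write `encode_user_data(tf)` for "user_data" blocks, and process everything else as subtitle text.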

STL2STLXML: add option to discard UDA field + User Data TTI blocks

There should be an option to discard the user data that can be contained within a TTI block.

STLXML2STL: missing characters

When certain special characters such as °, ² or ³ are used, they are not converted from STLXML to STL.

EBU-TT2EBU-TT-D Subtitle content conversion fails if tt:metadata present

The XSLT EBU-TT2EBU-TT-D calls only the first child of a tt:p element, and then the child element applies templates to its sibling. This works fine if the first child of the tt:p is a tt:span element (because the applied template follows this "sibling" strategy). But if the first child is, for example, tt:metadata, this fails and no further content is processed. As a quick fix, the next-sibling strategy needs to be added to the tt:metadata template. In the long run this sibling strategy needs to be factored out.

STLXML2EBU-TT: handling of content out of boxing

When a subtitle's TF contains content that is not enclosed by StartBox/EndBox element pairs, currently spaces (in the form of space elements) are discarded - but any text is copied. E.g. in case of This is a test. in STL (without enclosing boxing), this results in Thisisatest. in EBU-TT.

For consistency, space and text content outside of boxing should be treated equally, i.e. text should be discarded as well in that case.

An alternative way is to discard neither spaces nor text in case the TF does not use boxing at all (which is sometimes seen in STL files).

STLXML2EBU-TT: fix wrong TCP length check

The current release 1.1 introduced a bug that aborts the conversion when the parameter offsetTCP is used, due to an incorrect length check.

Applying offset when 'subtitle zero' is present leads to termination

Looking at EBU-TT2EBU-TT-D.xslt line 771:

                <xsl:if test="$mediaHours &lt; 0 or $mediaMinutes &lt; 0 or $mediaSeconds &lt; 0 or $mediaFrames &lt; 0">
                    <xsl:message terminate="yes">
                        The chosen offset would result in a negative timestamp for a time value.
                    </xsl:message>
                </xsl:if>

If the source STL file has subtitle zero and the relevant offset is applied this always results in a termination. The preferred behaviour here should be to omit the content elements with negative timestamps from the file and possibly issue a warning message.
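
The preferred behaviour could look roughly like the following Python sketch (the stylesheet itself is XSLT). Times are plain frame counts here, and the tuple layout of a content element is an assumption made for illustration.

```python
from typing import List, Tuple

# Hypothetical content element: (begin_frames, end_frames, text)
Cue = Tuple[int, int, str]


def apply_offset(cues: List[Cue], offset_frames: int) -> Tuple[List[Cue], List[str]]:
    """Subtract the offset; drop elements that would become negative and report them."""
    kept: List[Cue] = []
    notes: List[str] = []
    for begin, end, text in cues:
        new_begin = begin - offset_frames
        new_end = end - offset_frames
        if new_begin < 0 or new_end < 0:
            # Omit the element instead of terminating the whole conversion.
            notes.append(f"dropped element with negative time after offset: {text!r}")
            continue
        kept.append((new_begin, new_end, text))
    return kept, notes


# 10 hours at 25 fps = 900000 frames; the first cue plays the role of 'subtitle zero'.
kept, notes = apply_offset([(0, 50, "zero"), (900000, 900050, "first")], 900000)
```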

STL2STLXML: specific chars with diacritical chars not mapped

A few characters with diacritical characters are not mapped from STL to STLXML. This includes e.g.:

J́
j́
J̃
L̃
M̃
R̃
j̃
l̃
m̃
r̃
E̊
e̊

This is caused by the following conditional check:

scf/modules/STL2STLXML/stl2stlxml.py

Line 182 in da049f4

if combined and len(combined) == 1:
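
Why a length check of 1 excludes these characters can be shown with the standard unicodedata module. This sketch assumes that `combined` is the result of a Unicode composition step; the surrounding code of stl2stlxml.py is not shown here, so that is an assumption.

```python
import unicodedata


def compose(base: str, combining: str) -> str:
    """Compose a base letter and a combining mark using NFC normalization."""
    return unicodedata.normalize("NFC", base + combining)


e_acute = compose("e", "\u0301")  # a precomposed code point exists
j_acute = compose("J", "\u0301")  # no precomposed code point exists
e_ring = compose("E", "\u030A")   # no precomposed code point exists

lengths = (len(e_acute), len(j_acute), len(e_ring))
```

Unicode has no single code point for the letters listed above, so the composed result keeps two code points and a check for a length of exactly 1 rejects it.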

STL2STLXML: Set executable bit for stl2stlxml.py

The executable bit of stl2stlxml.py should be set, as it eases invocation of the script under Linux.

STLXML2EBU-TT and EBU-TT-D2EBU-TT-D: Bug termination based on offsetInSeconds parameter

When setting the offsetInSeconds parameter to specific values, the transformation incorrectly terminates with a message that the value leads to negative time expressions.

STLXML2EBU-TT: set documentCreationDate/documentRevisionDate

The two EBU-TT fields ebuttm:documentCreationDate and ebuttm:documentRevisionDate shall be set/initialized with the current date.

STL2STLXML: Separation of associated TTI blocks due to EBN 0xFE (User Data)

If a TTI block has an EBN of 0xFE (User Data), this causes the last TTI block of the corresponding subtitle set (which has an EBN of 0xFF) to be stored in a different STLXML TTI block. It should instead be stored in the same one.

TT-Edit-List: timecode format support

Currently only timecodes in a certain format (e.g. with a specific number of decimal places) are supported; this limitation should be removed.

STL2STLXML: Mapping of TTI blocks containing comments

In the current scf version 0.2.8, TTI blocks from an STL file that contain comments are ignored when transforming from STL to STLXML. These comments should be mapped as well in order to achieve a "better" XML representation of the STL file.

CF set to 00h --> TTI Block contains subtitle data
CF set to 01h --> TTI Block contains comments

The mapping of the text field (TF) of a comment should be handled in the same way as that of a normal subtitle.

The description of the requirement no. 214 "Comment Flag mapping" should be updated accordingly.

EBU-TT to DXFP

An XSLT translation from EBU-TT to DXFP would be useful for deployments wishing to use Microsoft Smooth Streaming of subtitles.
We believe this would be a fairly simple edit of the EBU-TT to EBU-TT-D template: exchange the EBU-TT-D namespaces and take out a few extensions.

STL2STLXML: don't abort on CPNs not allowed by EBU STL

Currently, if the CPN value set in an STL file is not supported by EBU STL, the behaviour of STL2STLXML depends on Python:

when Python supports the encoding, the conversion continues
when Python doesn't support it, an exception occurs

To unify the behaviour, the conversion shall fall back to CPN 850 if Python does not support the specified CPN (regardless of whether it is allowed by EBU STL or not).
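
A minimal sketch of such a fallback using the standard codecs module. The mapping of a CPN value to a Python codec name by prefixing "cp" is an assumption made for illustration, as is the function name.

```python
import codecs

FALLBACK_CPN = "850"


def resolve_codec(cpn: str) -> str:
    """Return the Python codec name for a CPN, falling back to CPN 850 if unknown."""
    try:
        return codecs.lookup("cp" + cpn.strip()).name
    except LookupError:
        # Python does not know this code page: use the fallback instead of aborting.
        return codecs.lookup("cp" + FALLBACK_CPN).name
```

With this approach the decision depends only on whether Python can resolve the codec, so a file with an unsupported CPN no longer raises an exception.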

STLXML2EBU-TT: Bug unnecessary tt:span creation

NewBackground element writes an unnecessary tt:span element even if the active background is the same as the new background.

STL2STLXML: Translation of "~" character is missing for ISO6937/2

The tilde character "~" is not translated in the Text Field element if an STL file uses Character Code Table "00" (from ISO 6937/2-1983) as the encoding for the TTI blocks. No other code point is generated at all.

STL2STLXML: consider Teletext control codes/0x8F also for non-850 CPNs

When an STL file has a CPN other than 850, the conversion does not handle the Teletext control codes (0x00 to 0x1F) or the 0x8F code; instead they are mapped to STLXML without further processing.

This shall be fixed so that these characters get the same processing as with CPN 850.

STLXML2EBU-TT transformation produces XSD invalid documents

The result of the STLXML2EBU-TT transformation produces content that does not validate against the EBU-TT Part 1 XML Schema.

STLXML2EBU-TT: consider also TCP when applying timecode offset

When the parameter offsetInSeconds is used to specify the used offset within the STLXML input file, that offset is subtracted from the TCI/TCO values when the EBU-TT file is written.

While the TCP relates to the TCI/TCO values, it currently is not modified during the conversion. This should be changed so that the offset parameter also affects the TCP field value.

STLXML2EBU-TT: map CD/RD/RN fields

Currently the fields CD/RD/RN are not mapped from STLXML to EBU-TT. They should be mapped according to EBU-TT Part 2 and EBU-TT Part 1 v1.0.

STLXML2EBU-TT: add offsetInFrames parameter

The STLXML2EBU-TT conversion currently supports the offsetInSeconds parameter, which specifies an offset in seconds that is applied to all TCI/TCO values (compare #32) of the input file. In the EBU-TT result these values are written after the specified offset has been subtracted.

In addition, a similar parameter offsetInFrames shall be added, which has the same effect but is specified as an SMPTE timecode.

STLXML2EBU-TT: terminate on non-merged TTI blocks

As #31 proposes an option for STL2STLXML to not merge TTI blocks of the same subtitle, such a resulting file shall not be used with STLXML2EBU-TT.

To enforce this, STLXML2EBU-TT shall terminate when a TTI block is found with an EBN other than FE (user data) or FF (last TTI block of subtitle set).

STLXML2EBU-TT BUG: HANDLING OF MISSING ENDBOX

When the TF field in a TTI block of an STL file does not finish with the EndBox control code, the content is discarded.

STLXML2EBU-TT: chars before first space/control code

In certain cases, characters at the beginning of a text field are not processed by the transformation from STLXML to EBU-TT. This seems to affect characters that occur before the first space or control code of a subtitle.

This problem should only involve files with Open Subtitles. Teletext subtitles nowadays usually contain a Double Height control code before any subtitle text, so such files are not affected.

EBU-TT-D2EBU-TT-D-Basic-DE: Missing style reference for text not in a tt:span

When text nodes in an EBU-TT-D document are direct children of the tt:p element, they should be "wrapped" in a tt:span element by the transformation to EBU-TT-D-Basic-DE. Although this is done correctly, the resulting span has a style attribute with no value.

Example:

EBU-TT-D Source:

     <tt:p 
            xml:id="sub1"
            region="defaultRegion"
            begin="00:00:00.000"
            end="00:00:02.000">Test text<tt:br/><tt:span>Test text 2nd line</tt:span></tt:p>

Result

       <tt:p xml:id="sub1" style="textCenter" region="bottom" begin="00:00:00.000"
            end="00:00:02.000">
            <tt:span style="">Test text</tt:span>
            <tt:br/>
            <tt:span style="textWhite">Test text 2nd line</tt:span>

Should be

       <tt:p xml:id="sub1" style="textCenter" region="bottom" begin="00:00:00.000"
            end="00:00:02.000">
            <tt:span style="textWhite">Test text</tt:span>
            <tt:br/>
            <tt:span style="textWhite">Test text 2nd line</tt:span>

STL2STLXML: Empty subtitles get lost

EBU STL files containing teletext closed captions often include empty subtitles on purpose.

a) Background
If there are longer periods without dialog or noises to be described, editors place an empty subtitle to signal to the receivers of viewers who might have switched to the channel in the meantime that closed captions are being transmitted. Otherwise the search for teletext page 888, 777, 150 etc. can take much longer.

b) Problem introduced by SCF
Omitting empty subtitles leads to two main problems:

Round tripping issues due to differences in total no. of subtitles, TC of first/last subtitle cross referenced in metadata workflows, etc.
Communication issues between editors creating/checking a file due to the different numbering in the source file and the converted/round tripped one.

c) Expected behavior
SCF should respect every single subtitle no matter whether it is empty or not.

STL2STLXML: Incorrect Unicode mapping of some CCT 00 character code points

Some character code points of character code table 00 are incorrectly mapped to Unicode values. The following list contains the correct mapping:

0xE0 -> 0x03A9 ("Ω" = ohm sign/omega letter)
0xE3 -> 0x1EA1 ("ạ" = "a" with dot below)
0xEB -> 0x1ECD ("ọ" = "o" with dot below)

STL2STLXML: Alternative option for handling TTI Blocks

The STLXML format may be used in different scenarios:
a) as an intermediate format when converting from STL to EBU-TT
b) as a human readable version of the STL file (e.g. for error checking)

According to EBU-TT Part 2, subtitles that span over several TTI blocks are merged into one p-element when an STL file is converted to EBU-TT. In scf, the merging process is done in the STL2STLXML module. That is fine for use case a), but the resulting STLXML file is not an exact XML-representation of the source STL file. This may be a problem, e.g. when comparing the metadata "Total Number of Subtitles" with the number of TTI blocks in the STLXML file.

To support various use cases it may be good to implement an option that allows for an exact 1:1 translation of the TTI blocks.

Calculate offset based on `ebuttm:documentStartOfProgramme`

When converting from timebase="smpte" to timebase="media" it would be ideal to offer a setting (or indeed make it the default) to use the value of TCP or equivalently ebuttm:documentStartOfProgramme as the offset and discard any content that falls before that timecode.

For example, an EBU-TT document is created from an STL document and has:

timebase="smpte",
a 'subtitle zero' at 00:00:00,
an ebuttm:documentStartOfProgramme="10:00:00" and
most of the content falls after 10:00:00.

This should generate an EBU-TT-D document with:

timebase="media",
no subtitle zero
the content is offset backwards by 10:00:00: an element whose begin="10:00:00" in the EBU-TT should have begin="00:00:00" in the EBU-TT-D.
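
The conversion described in this example can be sketched in Python as follows. The sketch assumes non-drop-frame SMPTE timecodes of the form HH:MM:SS:FF and a frame rate of 25 fps; the function names are hypothetical.

```python
from typing import Optional


def smpte_to_frames(tc: str, fps: int = 25) -> int:
    """Convert a non-drop-frame HH:MM:SS:FF timecode to a frame count."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def to_media_time(tc: str, start_of_programme: str, fps: int = 25) -> Optional[str]:
    """Return the media time relative to the start of programme, or None if earlier."""
    delta = smpte_to_frames(tc, fps) - smpte_to_frames(start_of_programme, fps)
    if delta < 0:
        return None  # content before the start of programme is discarded
    total_ms = delta * 1000 // fps
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"
```

For instance, with a start of programme of 10:00:00:00, a 'subtitle zero' at 00:00:00:00 yields None and is dropped, while 10:00:00:00 maps to 00:00:00.000.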

Piping STLXML2STL result into file does not work on Windows

The STLXML2STL result is output to the console even though the redirection operator > is used to save the result to a file.

STLXML2EBU-TT: PARAMETER IS CALLED timecodeFormat INSTEAD of timebase

The STLXML2EBUTT-XSLT accepts one parameter to set the timebase of the documents. Although EBU-TT is restricted in a way that the timebase is equivalent with a specific timecode format (timebase "smpte" only uses frames and "media" milliseconds) this is not the case for TTML in general. Here a subtitle document could have the timebase "media" and nonetheless have a time expression based on frames.

As the current naming of the parameter is misleading this has to be changed.

STLXML2EBU-TT: NEEDLESS ALIGNMENT tt:span

The style creation for inline elements (tt:span) includes the style attribute tts:textAlign.

The attribute tts:textAlign applies only to tt:p elements but not to tt:span elements. Therefore no distinction of text alignment has to be made when creating style references for a tt:span. Consequently, styles that apply only to tt:span do not need any information about text alignment.

Although the current code does not result in incorrect rendering and still produces conformant XML, it needs to be refactored because it is misleading.

STLXML2EBU-TT: use separate tt:div per SGN value

Currently STLXML2EBU-TT ignores the SGN field and puts all subtitles into a single tt:div element.

Therefore the SGN field shall be processed and a separate tt:div element be used per SGN value.
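
A rough Python sketch of the grouping step, which in the stylesheet would precede writing one tt:div per group. The dictionary layout of a subtitle is hypothetical.

```python
from typing import Dict, List


def group_by_sgn(subtitles: List[dict]) -> Dict[str, List[dict]]:
    """Group subtitles by their SGN value, keeping the order of first appearance."""
    groups: Dict[str, List[dict]] = {}
    for subtitle in subtitles:
        groups.setdefault(subtitle["SGN"], []).append(subtitle)
    return groups


subtitles = [
    {"SN": 1, "SGN": "0"},
    {"SN": 2, "SGN": "1"},
    {"SN": 3, "SGN": "0"},
]
groups = group_by_sgn(subtitles)
```

Each key of `groups` would then correspond to one tt:div element containing the subtitles of that group.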

STLXML2STL: composite sequences not correctly mapped

Composite sequences with diacritical characters are not correctly mapped from STLXML to STL. This includes e.g.:

J́
j́
J̃
L̃
M̃
R̃
j̃
l̃
m̃
r̃
E̊
e̊

The reason is the different order of the diacritical combining character. While in Unicode it is a suffix, in EBU STL it is a prefix. So the char order has to be switched in such a case.
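
The order switch can be sketched in Python as below. The byte values for the diacritical marks are an assumed subset based on ISO 6937 and should be checked against the CCT 00 table; the sketch only handles ASCII base letters.

```python
import unicodedata

# Assumed subset of non-spacing diacritic bytes (ISO 6937); verify against CCT 00.
DIACRITIC_BYTES = {
    "\u0301": 0xC2,  # combining acute accent
    "\u0303": 0xC4,  # combining tilde
    "\u030A": 0xCA,  # combining ring above
}


def encode_composite(text: str) -> bytes:
    """Encode text so that each diacritic byte precedes its base letter."""
    out = bytearray()
    # NFD splits precomposed letters into base letter + combining mark(s).
    for ch in unicodedata.normalize("NFD", text):
        if ch in DIACRITIC_BYTES:
            if not out:
                raise ValueError("combining character without a base letter")
            base = out.pop()                 # Unicode order: base first, mark second
            out.append(DIACRITIC_BYTES[ch])  # STL order: diacritic first
            out.append(base)
        else:
            out.extend(ch.encode("ascii"))   # sketch: ASCII base letters only
    return bytes(out)
```

For the input "J" followed by U+0301, the result is the acute byte followed by "J", which is the prefix order expected in EBU STL.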

xslt error

Hi,

Using scf version 0.9.2, I have a problem with a STL :

conversion of STL to STLXML works fine
conversion of STLXML to TTML crashes with

scf-0.9.2/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt:615: validity error : xml:id : attribute value {concat('SGN', .)} is not an NCName
                <tt:div style="defaultStyle" xml:id="{concat('SGN', .)}">
                                                                        ^
scf-0.9.2/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt:892: validity error : xml:id : attribute value {concat('sub', $SN)} is not an NCName
            end="{$end}">
                        ^
xmlXPathCompOpEval: function current-date not found
XPath error : Unregistered function
xmlXPathCompiledEval: evaluation failed
runtime error: file scf-0.9.2/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt line 287 element value-of
XPath evaluation returned no result.

I am not familiar with xslt transformations so I am kind of lost here.

Using scf version 0.2.4 (the version I usually use), it produces only the following error :

xsltproc scf-0.2.4/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt 1491467708_36188FRA_ST.stlxml.xml
scf-0.2.4/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt:879: validity error : xml:id : attribute value {concat('sub', $SN)} is not an NCName
            end="{$end}">

And the output TTML is much bigger but way far from being usable (some text is missing)

The STL is here : https://drive.google.com/open?id=0B60JiOl5bvMNc09ISUd6em9jMTg
A shell script of what I am doing : https://drive.google.com/open?id=0B60JiOl5bvMNZXRYWmZNT1pqWE0
Regards.

STL2STLXML/STLXML2EBU-TT: don't convert trailing spaces in UDA field

If the UDA field does not consist entirely of meaningful data, the remaining bytes are filled with spaces. According to EBU Tech 3360 (ch. 3.10 in v0.9), such trailing spaces should not be converted.
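
In Python this amounts to stripping only the trailing space bytes before the field is mapped; the function name is hypothetical.

```python
def trim_uda(uda: bytes) -> bytes:
    """Remove trailing space padding (0x20) from the UDA field; keep everything else."""
    return uda.rstrip(b" ")
```

Leading and inner spaces are kept, since only the fill bytes at the end of the field are padding.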

Support for embedding STL files

It would be a very useful feature for SCF to add support for handling embedded STL files (BASE64 encoding) during the conversion from STL to EBU-TT and to be able to extract embedded files.

http://bbc.github.io/subtitle-guidelines/#Embedded-STL

Mapping of user data from the GSI block

The user-defined area in the GSI block of an STL file should be mapped to STLXML and further to EBU-TT. The information in this user data field may be worth preserving. Additionally, this is a requirement for lossless round-tripping.

STLXML2EBU-TT error with xsltproc

I'm trying to do the XSLT conversion step from STLXML -> EBU-TT using xsltproc under Ubuntu 16.04 by calling:

xsltproc STLXML2EBU-TT.xslt STLXML.xml > EBU-TT.xml

xsltproc --version
Using libxml 20903, libxslt 10128 and libexslt 817
xsltproc was compiled against libxml 20903, libxslt 10128 and libexslt 817
libxslt 10128 was compiled against libxml 20902
libexslt 817 was compiled against libxml 20902

However, the process terminates with the following error messages:

STLXML2EBU-TT.xslt:669: validity error : xml:id : attribute value {concat('SGN', .)} is not an NCName
        <tt:div style="defaultStyle" xml:id="{concat('SGN', .)}">
                                                                ^
STLXML2EBU-TT.xslt:969: validity error : xml:id : attribute value {concat('sub', $SN)} is not an NCName
            end="{$end}">
                        ^
XPath error : Invalid expression
number($offsetTCP) eq 1
                   ^
compilation error: file STLXML2EBU-TT.xslt line 874 element when
xsl:when : could not compile test expression 'number($offsetTCP) eq 1'
XPath error : Invalid expression
string-length($tcp) ne 8
                    ^
compilation error: file STLXML2EBU-TT.xslt line 876 element if
xsl:if : could not compile test expression 'string-length($tcp) ne 8'

Are there any suggestions on how to solve this issue?

THX!
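
The operators eq and ne in the messages above are XPath 2.0 value comparisons, while xsltproc is built on libxslt, which implements XSLT 1.0 with XPath 1.0. A small Python check such as the following sketch can tell in advance whether a stylesheet declares a version that an XSLT 1.0 processor is unlikely to handle. The sample stylesheet string is hypothetical and not taken from scf.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def stylesheet_version(xslt_source: str) -> str:
    """Return the version attribute of the xsl:stylesheet (or xsl:transform) root."""
    root = ET.fromstring(xslt_source)
    if root.tag not in (f"{{{XSL_NS}}}stylesheet", f"{{{XSL_NS}}}transform"):
        raise ValueError("not an XSLT stylesheet")
    return root.get("version", "")


def needs_xslt2_processor(xslt_source: str) -> bool:
    """True if the declared version is something other than 1.0."""
    return stylesheet_version(xslt_source).strip() not in ("", "1.0")


# Hypothetical sample, for illustration only.
SAMPLE = '<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'
result = needs_xslt2_processor(SAMPLE)
```

If the check returns True, running the transformation with an XSLT 2.0 capable processor instead of xsltproc is the likely way forward.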

STL2STLXML: Translation of characters with macron below is incorrect for ISO6937/2

Characters having a macron below (and therefore prefixed with 0xCC) are incorrectly translated in the Text Field element if an STL file uses Character Code Table "00" (from ISO 6937/2-1983) as the encoding for the TTI blocks. Instead, the code point for the plain character (without the diacritical mark) is generated.

Only output the styles that are actually used

The current STLXML2EBU-TT stylesheet always outputs the whole set of styles for all the different foreground and background colour combinations, even if only a subset of those styles is used. Those are then copied across by EBU-TT2EBU-TT-D and are therefore preserved. It would be a good improvement to insert only those styles into the EBU-TT that are actually used.

TT-Edit-List: default timeBase not supported

The implicit default value media of the ttp:timeBase attribute has to be supported, but currently isn't.

STL2STLXML transformations produce XSD invalid documents

The recent change of the stl2stlxml transformation adds a UDA as new element. As this is not added to the STLXML XML Schema the output of the transformation currently does not validate.

STLXML2STL: conversion aborts if more than two subtitle lines are used

When a subtitle uses more than two lines and every line uses the full 40 bytes of a Teletext line (e.g. 3 lines resulting in 3*40=120 bytes, plus line breaks), the capacity of a single TTI block (112 bytes) is exceeded. Hence the conversion aborts, as the case of multiple TTI blocks is currently not covered. Instead the conversion tries to apply a TTI block padding with a negative length, e.g.:

Stopped at stlxml2stl.xqm, 176/20:
[bin:negative-size] Size '-9' is negative.
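
A possible way to cover the multi-block case, sketched in Python rather than the module's XQuery. The 112-byte capacity comes from the description above; the filler byte 0x8F, the EBN numbering of extension blocks from 0x00 and the EBN 0xFF for the last block are assumptions based on the EBU STL format. A real implementation should also avoid splitting in the middle of a multi-byte sequence.

```python
from typing import List, Tuple

TF_SIZE = 112             # capacity of the text field of one TTI block, in bytes
PAD_BYTE = 0x8F           # assumed filler for unused text field space
EBN_LAST = 0xFF           # assumed EBN of the last TTI block of a subtitle
MAX_EXTENSION_EBN = 0xEF  # assumed highest EBN usable for extension blocks


def split_text_field(tf: bytes) -> List[Tuple[int, bytes]]:
    """Split text field data into (EBN, 112-byte text field) pairs, padding the rest."""
    chunks = [tf[i:i + TF_SIZE] for i in range(0, len(tf), TF_SIZE)] or [b""]
    if len(chunks) - 2 > MAX_EXTENSION_EBN:
        raise ValueError("text field data needs more extension blocks than available")
    blocks: List[Tuple[int, bytes]] = []
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        ebn = EBN_LAST if is_last else index
        padding = bytes([PAD_BYTE]) * (TF_SIZE - len(chunk))  # never negative here
        blocks.append((ebn, chunk + padding))
    return blocks


# 121 bytes exceed one block by 9 bytes, matching the size '-9' in the message above.
blocks = split_text_field(b"A" * 121)
```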

STLXML2EBU-TT TYPO STYLE

When choosing the option to not trim white space the inserted XML contains a typo. The style attribute is called stype instead of style.

STLXML-SplitBlocks: User Data TTI blocks not correctly handled

TTI blocks with User Data (EBN 0xFE) are currently not handled correctly, but are instead processed like normal subtitle content. Thus the Base64-encoded User Data is currently converted to subtitle text.