irt-open-source / scf
Subtitling Conversion Framework
License: Apache License 2.0
Refers to module: STLXML2EBU-TT
Currently the processing is not consistent when handling teletext control codes (CC) in the 'TF' (text field of the TTI block). According to EBU Tech 3360, StartBox/EndBox should close the current span and open a new one. The same applies to color CCs (AlphaBlack, AlphaGreen, ...). For StartBox/EndBox the requirements reflect exactly that. However, for color CCs the requirements state that a new span is created only when the CC changes the subtitle style (foreground or background color). That means there will be no two consecutive spans with the same style.
For StartBox/EndBox the processing should be the same as for color CCs.
For most real world STLXML files this will not be an issue since every line uses only one box.
The Dollar sign "$" is not translated in the Text Field element if an STL file uses as encoding Character Code Table "00" (from ISO 6937/2-1983 for the TTI blocks). No other code point is generated at all.
In the current scf version 0.2.8, TTI blocks from an STL file that contain user data are ignored when transforming from STL to STLXML. To preserve user data (e.g. to allow for round-tripping scenarios) these TTI blocks should be mapped to the STLXML format as well. User data is signalled by an Extension Block Number (EBN) value of FEh. TTI blocks with an EBN set to a reserved value (F0h-FDh) should continue to be ignored as described in requirement no. 208.
The content of the text field (TF) cannot be mapped (proprietary data) and thus should be encoded, e.g. using base64.
Extension Block Number (EBN): FEh --> TTI block contains user data
Extension Block Number (EBN): F0h-FDh --> Reserved Codes
The description of the requirement no. 208 should be updated accordingly.
There should be an option to discard the user data that can be contained within a TTI block.
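The EBN handling described above can be illustrated with a minimal Python sketch. It is not taken from stl2stlxml.py; the function name `map_tti_block`, the `keep_user_data` flag and the returned tuples are hypothetical and only show the decision logic (reserved codes skipped, user data Base64 encoded, optional discarding).

```python
import base64

EBN_USER_DATA = 0xFE               # TTI block contains user data
EBN_RESERVED = range(0xF0, 0xFE)   # F0h-FDh, reserved codes


def map_tti_block(ebn: int, tf: bytes, keep_user_data: bool = True):
    """Return a (kind, payload) tuple, or None if the block is skipped.

    Hypothetical helper: the real module may structure this differently.
    """
    if ebn in EBN_RESERVED:
        return None  # reserved EBN values keep being ignored
    if ebn == EBN_USER_DATA:
        if not keep_user_data:
            return None  # option to discard user data
        # Proprietary TF content cannot be interpreted, so encode it as-is.
        return ("user_data", base64.b64encode(tf).decode("ascii"))
    return ("subtitle", tf)
```

Calling `map_tti_block(0xFE, tf, keep_user_data=False)` models the proposed option to drop user data during conversion.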
When certain special characters such as °, ² or ³ are used, they are not converted from STLXML to STL.
The XSLT EBU-TT2EBU-TT-D calls only the first child of a tt:p element and then the child element applies templates to its sibling. This works fine if the first child of the tt:p is a tt:span element (because the applied template follows this "sibling" strategy). But if the first child is for example tt:metadata this fails and no further content is processed. As a quick fix, the next-sibling strategy needs to be added to the tt:metadata template. In the long run this sibling strategy needs to be factored out.
When a subtitle's TF contains content that is not enclosed by StartBox/EndBox element pairs, spaces (in the form of space elements) are currently discarded, but any text is copied. E.g. "This is a test." in STL (without enclosing boxing) results in "Thisisatest." in EBU-TT.
For consistency, space and text content outside of boxing should be treated equally, i.e. text should be discarded as well in that case.
An alternative way is to discard neither spaces nor text in case the TF does not use boxing at all (which is sometimes seen in STL files).
The current release 1.1 introduced a bug that aborts the conversion when the parameter offsetTCP is used, due to an incorrect length check.
Looking at EBU-TT2EBU-TT-D.xslt line 771:
<xsl:if test="$mediaHours &lt; 0 or $mediaMinutes &lt; 0 or $mediaSeconds &lt; 0 or $mediaFrames &lt; 0">
<xsl:message terminate="yes">
The chosen offset would result in a negative timestamp for a time value.
</xsl:message>
</xsl:if>
If the source STL file has subtitle zero and the relevant offset is applied this always results in a termination. The preferred behaviour here should be to omit the content elements with negative timestamps from the file and possibly issue a warning message.
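The preferred behaviour can be sketched outside of XSLT. The following Python snippet is only an illustration of the "omit and warn" idea, not the stylesheet logic; `apply_offset` and the representation of subtitles as (begin, end) pairs in seconds are assumptions made for this example.

```python
import warnings


def apply_offset(subtitles, offset_seconds):
    """Shift (begin, end) pairs by an offset and drop negative results.

    Hypothetical helper: subtitles are modelled as tuples of seconds.
    """
    shifted = []
    for begin, end in subtitles:
        new_begin = begin - offset_seconds
        new_end = end - offset_seconds
        if new_begin < 0 or new_end < 0:
            # Warn and skip the element instead of terminating.
            warnings.warn(
                f"Omitting subtitle {begin}-{end}: negative timestamp after offset"
            )
            continue
        shifted.append((new_begin, new_end))
    return shifted
```

With this approach a leading subtitle at timecode zero no longer aborts the whole conversion; only that element is left out.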
A few characters with diacritical marks are not mapped from STL to STLXML. This includes e.g.:
J́
j́
J̃
L̃
M̃
R̃
j̃
l̃
m̃
r̃
E̊
e̊
This is caused by the following conditional check:
scf/modules/STL2STLXML/stl2stlxml.py
Line 182 in da049f4
The executable bit of stl2stlxml.py should be set, as it eases invocation of the script under Linux.
When setting the offsetInSeconds parameter to specific values, the transformation incorrectly terminates with a message that the value leads to negative time expressions.
The two EBU-TT fields ebuttm:documentCreationDate and ebuttm:documentRevisionDate shall be set/initialized with the current date.
If a TTI block has an EBN of 0xFE (User Data), this causes the last TTI block of the corresponding subtitle set (which has an EBN of 0xFF) to be stored in a different STLXML TTI block. It should rather be stored in the same one.
Currently only timecodes in a certain format (e.g. a specific number of decimal places) are supported; this limitation should be removed.
In the current scf version 0.2.8, TTI blocks from an STL file that contain comments are ignored when transforming from STL to STLXML. These comments should be mapped as well in order to achieve a "better" XML representation of the STL file.
CF set to 00h --> TTI Block contains subtitle data
CF set to 01h --> TTI Block contains comments
The text field (TF) of a comment should be mapped in the same way as that of a normal subtitle.
The description of the requirement no. 214 "Comment Flag mapping" should be updated accordingly.
An XSLT transformation from EBU-TT to DFXP would be useful for deployments wishing to use Microsoft Smooth Streaming of subtitles.
We believe this would be a fairly simple edit of the EBU-TT to EBU-TT-D template: exchange the namespaces of EBU-TT-D and take out a few extensions.
Currently if a set CPN value is used which is not supported by EBU STL, the behaviour of STL2STLXML depends on Python:
To unify the behaviour, the conversion shall fall back to CPN 850 if Python does not support the specified CPN (regardless of whether it is allowed by EBU STL or not).
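A minimal sketch of such a fallback, using only the standard `codecs` module, could look as follows. The helper name `resolve_codec` and the convention of prefixing the CPN with "cp" are assumptions for illustration; the actual module may derive the codec name differently.

```python
import codecs


def resolve_codec(cpn: str, fallback: str = "cp850") -> str:
    """Return a Python codec name for a CPN string such as '850' or '437'.

    Falls back to code page 850 when Python has no codec for the CPN.
    """
    try:
        return codecs.lookup("cp" + cpn.strip()).name
    except LookupError:
        return fallback
```

`codecs.lookup` raises `LookupError` for unknown encodings, so the fallback applies uniformly regardless of the Python version in use.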
NewBackground element writes an unnecessary tt:span element even if the active background is the same as the new background.
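The intended rule (a new span only when the effective style actually changes) can be sketched in Python. The token model used here, a list of ("fg", color), ("bg", color) and ("text", string) tuples, is a simplification invented for this example and does not mirror the STLXML element structure.

```python
def build_spans(tokens, fg="white", bg="black"):
    """Group text into (style, text) spans, merging runs with equal style.

    tokens: list of ("fg", color), ("bg", color) or ("text", string).
    """
    spans = []
    style = (fg, bg)
    for kind, value in tokens:
        if kind == "fg":
            style = (value, style[1])
        elif kind == "bg":
            style = (style[0], value)
        elif spans and spans[-1][0] == style:
            # Same style as the open span: extend it instead of opening a new one.
            spans[-1] = (style, spans[-1][1] + value)
        else:
            spans.append((style, value))
    return spans
```

A background change to the already active color leaves `style` unchanged, so the following text is appended to the existing span.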
The tilde character "~" is not translated in the Text Field element if an STL file uses as encoding Character Code Table "00" (from ISO 6937/2-1983 for the TTI blocks). No other code point is generated at all.
When an STL file has a CPN other than 850, the conversion does not handle the Teletext control codes (0x00 to 0x1F) or the 0x8F code; instead they are mapped to STLXML without further processing.
This shall be fixed so that these characters get the same processing as with CPN 850.
The result of the STLXML2EBU-TT transformation produces content that does not validate against the EBU-TT Part 1 XML Schema.
When the parameter offsetInSeconds is used to specify the used offset within the STLXML input file, that offset is subtracted from the TCI/TCO values when the EBU-TT file is written.
While the TCP relates to the TCI/TCO values, it currently is not modified during the conversion. This should be changed so that the offset parameter also affects the TCP field value.
Currently the fields CD/RD/RN are not mapped from STLXML to EBU-TT. So they should be mapped according to EBU-TT Part 2 and EBU-TT Part 1 v1.0.
The STLXML2EBU-TT conversion currently supports the offsetInSeconds parameter, which specifies an offset in seconds that affects all TCI/TCO values (compare #32) of the input file. In the EBU-TT result these values are written after the specified offset has been subtracted.
In addition a similar parameter offsetInFrames shall be added which has the same effect but is specified as an SMPTE timecode.
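The arithmetic behind an offset given as an SMPTE timecode can be sketched in Python. This is an illustration only: it assumes non-drop-frame timecodes in the form hh:mm:ss:ff and an integer frame rate (25 by default), and the function names are hypothetical. Negative results are not handled here.

```python
def timecode_to_frames(tc: str, fps: int = 25) -> int:
    """Convert 'hh:mm:ss:ff' to a total frame count (non-drop-frame)."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * fps + ff


def frames_to_timecode(frames: int, fps: int = 25) -> str:
    """Convert a non-negative frame count back to 'hh:mm:ss:ff'."""
    ss, ff = divmod(frames, fps)
    mm, ss = divmod(ss, 60)
    hh, mm = divmod(mm, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


def subtract_offset(tc: str, offset_tc: str, fps: int = 25) -> str:
    """Subtract an SMPTE timecode offset from a TCI/TCO value."""
    frames = timecode_to_frames(tc, fps) - timecode_to_frames(offset_tc, fps)
    return frames_to_timecode(frames, fps)
```

Working on frame counts avoids rounding issues that can occur when a frame-based offset is first converted to fractional seconds.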
As #31 proposes an option for STL2STLXML to not merge TTI blocks of the same subtitle, such a resulting file shall not be used with STLXML2EBU-TT.
To enforce this, STLXML2EBU-TT shall terminate when a TTI block is found with an EBN other than FE (user data) or FF (last TTI block of subtitle set).
When the TF field in a TTI block of an STL file does not finish with the EndBox control code, the content is discarded.
In certain cases, characters at the beginning of a text field are not processed by the transformation from STLXML to EBU-TT. This seems to affect characters that occur before the first space or control code of a subtitle.
This problem should only involve files with Open Subtitles. Teletext subtitles nowadays usually contain a Double Height control code before any subtitle text, so such files are not affected.
See also #46.
When text nodes in an EBU-TT-D document are direct children of the tt:p element, they should be "wrapped" in a tt:span element by the transformation to EBU-TT-Basic-DE. Although this is done correctly, the resulting span has a style attribute with no value.
Example:
EBU-TT-D Source:
<tt:p
xml:id="sub1"
region="defaultRegion"
begin="00:00:00.000"
end="00:00:02.000">Test text<tt:br/><tt:span>Test text 2nd line</tt:span></tt:p>
Result
<tt:p xml:id="sub1" style="textCenter" region="bottom" begin="00:00:00.000"
end="00:00:02.000">
<tt:span style="">Test text</tt:span>
<tt:br/>
<tt:span style="textWhite">Test text 2nd line</tt:span>
Should be
<tt:p xml:id="sub1" style="textCenter" region="bottom" begin="00:00:00.000"
end="00:00:02.000">
<tt:span style="textWhite">Test text</tt:span>
<tt:br/>
<tt:span style="textWhite">Test text 2nd line</tt:span>
EBU STL files containing teletext closed captions often include empty subtitles on purpose.
a) Background
If there are longer periods without dialog or noises to be described, editors place an empty subtitle to signal to the receivers of viewers who might have switched to the channel in the meantime that closed captions are being transmitted. Otherwise the search for teletext page 888, 777, 150 etc. can run much longer.
b) Problem introduced by SCF
Omitting empty subtitles leads to two main problems:
c) Expected behavior
SCF should respect every single subtitle no matter whether it is empty or not.
Some character code points of character code table 00 are incorrectly mapped to Unicode values. The following list contains the correct mapping:
0xE0 -> 0x03A9 ("Ω" = ohm sign/omega letter)
0xE3 -> 0x1EA1 ("ạ" = "a" with dot below)
0xEB -> 0x1ECD ("ọ" = "o" with dot below)
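Expressed as a lookup table, the corrected mapping listed above might look like this in Python. The dictionary name `CCT00_CORRECTIONS` and the helper `decode_cct00_byte` are hypothetical; only the three code points given in the list are included.

```python
# Corrected Unicode targets for three code points of character code table 00,
# exactly as listed in the issue. Names are illustrative only.
CCT00_CORRECTIONS = {
    0xE0: "\u03A9",  # ohm sign / omega letter
    0xE3: "\u1EA1",  # "a" with dot below
    0xEB: "\u1ECD",  # "o" with dot below
}


def decode_cct00_byte(value: int):
    """Return the corrected character for a byte, or None if not listed."""
    return CCT00_CORRECTIONS.get(value)
```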
The STLXML format may be used in different scenarios:
a) as an intermediate format when converting from STL to EBU-TT
b) as a human readable version of the STL file (e.g. for error checking)
According to EBU-TT Part 2, subtitles that span over several TTI blocks are merged into one p-element when an STL file is converted to EBU-TT. In scf, the merging process is done in the STL2STLXML module. That is fine for use case a), but the resulting STLXML file is not an exact XML-representation of the source STL file. This may be a problem, e.g. when comparing the metadata "Total Number of Subtitles" with the number of TTI blocks in the STLXML file.
To support various use cases it may be good to implement an option that allows for an exact 1:1 translation of the TTI blocks.
When converting from timebase="smpte" to timebase="media" it would be ideal to offer a setting (or indeed make it the default) to use the value of TCP or equivalently ebuttm:documentStartOfProgramme as the offset and discard any content that falls before that timecode.
For example, an EBU-TT document is created from an STL document and has timebase="smpte" and ebuttm:documentStartOfProgramme="10:00:00". This should generate an EBU-TT-D document with timebase="media", where content with begin="10:00:00" in the EBU-TT should have begin="00:00:00" in the EBU-TT-D.
The STLXML2STL result is output to the console even though the redirection operator > is used to save the result to a file.
The STLXML2EBUTT-XSLT accepts one parameter to set the timebase of the documents. Although EBU-TT is restricted in a way that the timebase is equivalent with a specific timecode format (timebase "smpte" only uses frames and "media" milliseconds) this is not the case for TTML in general. Here a subtitle document could have the timebase "media" and nonetheless have a time expression based on frames.
As the current naming of the parameter is misleading this has to be changed.
The style creation for inline elements (tt:span) includes the style attribute tts:textAlign.
The attribute tts:textAlign applies only to tt:p elements but not to tt:span elements. Therefore no distinction of text alignment has to be made when creating style references for a tt:span. Consequently styles that apply only to tt:span do not need any information about text alignment.
Although the current code does not result in incorrect rendering and still produces conformant XML, it needs to be refactored because it is misleading.
Currently STLXML2EBU-TT ignores the SGN field and puts all subtitles into a single tt:div element.
Therefore the SGN field shall be processed and a separate tt:div element be used per SGN value.
Composite sequences with diacritical characters are not correctly mapped from STLXML to STL. This includes e.g.:
J́
j́
J̃
L̃
M̃
R̃
j̃
l̃
m̃
r̃
E̊
e̊
The reason is the different order of the diacritical combining character. While in Unicode it is a suffix, in EBU STL it is a prefix. So the char order has to be switched in such a case.
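The reordering can be sketched in Python. The byte values for the diacritics (0xC2 acute, 0xC4 tilde, 0xCA ring) are assumed from ISO 6937 and cover only the marks in the examples above; the helper name `encode_composite` is hypothetical, and base letters are limited to ASCII to keep the sketch short.

```python
import unicodedata

# Assumed subset of ISO 6937 non-spacing diacritic bytes (illustrative only).
COMBINING_TO_STL = {
    "\u0301": 0xC2,  # combining acute accent
    "\u0303": 0xC4,  # combining tilde
    "\u030A": 0xCA,  # combining ring above
}


def encode_composite(text: str) -> bytes:
    """Encode text so that each diacritic byte precedes its base letter."""
    out = bytearray()
    for ch in unicodedata.normalize("NFD", text):
        if ch in COMBINING_TO_STL:
            if not out:
                raise ValueError("combining mark without base character")
            base = out.pop()  # Unicode: base first, mark second
            out += bytes([COMBINING_TO_STL[ch], base])  # STL: mark first
        else:
            out += ch.encode("ascii")  # sketch handles ASCII base letters only
    return bytes(out)
```

For example, "J" followed by U+0301 becomes the byte 0xC2 followed by "J".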
Hi,
Using scf version 0.9.2, I have a problem with an STL file:
scf-0.9.2/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt:615: validity error : xml:id : attribute value {concat('SGN', .)} is not an NCName
<tt:div style="defaultStyle" xml:id="{concat('SGN', .)}">
^
scf-0.9.2/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt:892: validity error : xml:id : attribute value {concat('sub', $SN)} is not an NCName
end="{$end}">
^
xmlXPathCompOpEval: function current-date not found
XPath error : Unregistered function
xmlXPathCompiledEval: evaluation failed
runtime error: file scf-0.9.2/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt line 287 element value-of
XPath evaluation returned no result.
I am not familiar with xslt transformations so I am kind of lost here.
Using scf version 0.2.4 (the version I usually use), it produces only the following error :
xsltproc scf-0.2.4/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt 1491467708_36188FRA_ST.stlxml.xml
scf-0.2.4/modules/STLXML2EBU-TT/STLXML2EBU-TT.xslt:879: validity error : xml:id : attribute value {concat('sub', $SN)} is not an NCName
end="{$end}">
And the output TTML is much bigger but still far from being usable (some text is missing).
The STL is here : https://drive.google.com/open?id=0B60JiOl5bvMNc09ISUd6em9jMTg
A shell script of what I am doing : https://drive.google.com/open?id=0B60JiOl5bvMNZXRYWmZNT1pqWE0
Regards.
If the UDA field does not consist entirely of meaningful data, the remaining bytes are filled up with spaces. According to EBU Tech 3360 (ch. 3.10 in v0.9) such trailing spaces should not be converted.
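As a minimal illustration, trimming the space padding before conversion is a one-liner in Python. The function name `trim_uda` is hypothetical, and the sketch assumes the padding uses the byte 0x20; it only removes trailing spaces and leaves inner spaces untouched.

```python
def trim_uda(raw: bytes) -> bytes:
    """Remove the trailing space padding (0x20) from the raw UDA bytes."""
    return raw.rstrip(b"\x20")
```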
It would be a very useful feature for SCF to add support for handling embedded STL files (BASE64 encoding) during the conversion from STL to EBU-TT and to be able to extract embedded files.
The user-defined area in the GSI block of an STL file should be mapped to STLXML and further to EBU-TT. The information in this user data field may be worth preserving. Additionally this is a requirement for lossless round-tripping.
I'm trying to do the XSLT conversion step from STLXML -> EBU-TT using xsltproc under Ubuntu 16.04 by calling:
xsltproc STLXML2EBU-TT.xslt STLXML.xml > EBU-TT.xml
xsltproc --version
Using libxml 20903, libxslt 10128 and libexslt 817
xsltproc was compiled against libxml 20903, libxslt 10128 and libexslt 817
libxslt 10128 was compiled against libxml 20902
libexslt 817 was compiled against libxml 20902
However, the process terminates with the following error messages:
STLXML2EBU-TT.xslt:669: validity error : xml:id : attribute value {concat('SGN', .)} is not an NCName
<tt:div style="defaultStyle" xml:id="{concat('SGN', .)}">
^
STLXML2EBU-TT.xslt:969: validity error : xml:id : attribute value {concat('sub', $SN)} is not an NCName
end="{$end}">
^
XPath error : Invalid expression
number($offsetTCP) eq 1
^
compilation error: file STLXML2EBU-TT.xslt line 874 element when
xsl:when : could not compile test expression 'number($offsetTCP) eq 1'
XPath error : Invalid expression
string-length($tcp) ne 8
^
compilation error: file STLXML2EBU-TT.xslt line 876 element if
xsl:if : could not compile test expression 'string-length($tcp) ne 8'
Are there any suggestions on how to solve this issue?
THX!
Characters having a macron below (and therefore prefixed with 0xCC) are incorrectly translated in the Text Field element if an STL file uses as encoding Character Code Table "00" (from ISO 6937/2-1983 for the TTI blocks). Instead the code point for the plain character (without the diacritical mark) is generated.
The current STLXML2EBU-TT stylesheet always outputs the whole set of styles for all the different foreground and background colour combinations even if only a subset of those styles are used. Those are then copied across by EBU-TT2EBU-TT-D and are therefore preserved. It would be a good improvement to insert only those styles into the EBU-TT that are actually used.
The implicit default value media of the ttp:timeBase attribute has to be supported, but currently isn't.
The recent change of the stl2stlxml transformation adds a UDA as new element. As this is not added to the STLXML XML Schema the output of the transformation currently does not validate.
When a subtitle uses more than two lines and for every line the 40 bytes of a Teletext line are used (e.g. 3 lines resulting in 3*40=120 bytes, plus line breaks), the capacity of a single TTI block (112 bytes) is exceeded. Hence the conversion aborts, as the case of multiple TTI blocks is currently not covered. Instead the conversion tries to apply a TTI block padding with a negative length, e.g.:
Stopped at stlxml2stl.xqm, 176/20:
[bin:negative-size] Size '-9' is negative.
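How an oversized text field could be distributed over several TTI blocks is sketched below in Python (the actual module is written in XQuery). The sketch assumes that extension blocks are numbered from 00h upwards, that the last block of a subtitle carries EBN FFh, and that unused TF bytes are filled with 0x8F. It splits naively at byte boundaries, so a real implementation would have to avoid cutting through a multi-byte character sequence. The function name is hypothetical.

```python
TF_SIZE = 112   # capacity of the text field in one TTI block, in bytes
UNUSED = 0x8F   # filler for unused TF bytes (assumed)
EBN_LAST = 0xFF  # last TTI block of a subtitle


def split_text_field(tf: bytes):
    """Return a list of (ebn, padded_tf) tuples covering the whole text."""
    chunks = [tf[i:i + TF_SIZE] for i in range(0, len(tf), TF_SIZE)] or [b""]
    if len(chunks) > 0xF0 + 1:
        raise ValueError("text field too long for the available EBN range")
    blocks = []
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        ebn = EBN_LAST if is_last else index
        # The padding length is never negative because each chunk fits TF_SIZE.
        padding = bytes([UNUSED]) * (TF_SIZE - len(chunk))
        blocks.append((ebn, chunk + padding))
    return blocks
```

For the 3*40-byte example above, this yields one full block with EBN 00h and a second, mostly padded block with EBN FFh.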
When choosing the option to not trim white space the inserted XML contains a typo. The style attribute is called stype instead of style.
TTI blocks with User Data (EBN 0xFE) are currently not handled correctly, but are instead processed like normal subtitle content. Thus the Base64 encoded User Data is currently converted to subtitle text.