fraunhoferisst / trend
Traceability Enforcement of Datatransfers (TREND)
Home Page: https://fraunhoferisst.github.io/TREND/
License: Other
Watermark compression is an optional feature of the watermarker library. Many other compression algorithms exist, some of which may perform better.
It should be analyzed whether compression algorithms from scientific publications or state-of-the-art implementations can be used.
Notice: Remember that the build target of the watermarker library can be set to Java or JavaScript. An external library can only be integrated if it is available for both Java and JavaScript.
There should be a pipeline to create auto-generated documentation for the watermarker library. The documentation should be created based on the comments in the source code.
It should further be checked whether publishing it via GitHub Pages in a `gh-pages` branch makes sense or whether there are better options.
In the webinterface, `kvisionVersion` is defined twice.
As shown by #25, Dependabot only updates one of the version numbers.
Steps to reproduce the behavior: inspect a `kvisionVersion` update PR from Dependabot for the webinterface.
Expected behavior: Both version numbers should be updated.
Currently, this repository uses a squash-and-merge strategy with conventional commits and pull requests linked to issues. All of this information is missing from the CONTRIBUTING.md file.
Update the CONTRIBUTING.md file and add all relevant information besides the code style.
The `StartEndSeparatorChars` separator strategy is not very convenient to use, as users of the library have to define the start and end separators themselves.
It would be nice to have default start and end separator characters defined, so that users of the library do not have to look for suitable whitespace characters themselves. I am thinking of something like two additional constants in the `DefaultTranscoding` object (there is already one for the `SingleSeparatorChar` strategy).
Possible extension:
const val SEPARATOR_CHAR = '\u2004' // Three-per-em space
const val START_SEPARATOR_CHAR = '\u2004' // Three-per-em space
const val END_SEPARATOR_CHAR = '\u2005' // Four-per-em space (or whatever replaces normal white spaces best)
In my opinion, a small refactoring of the code would also make it more convenient here. `SingleSeparatorChar` and `StartEndSeparatorChars` should get default arguments in their constructors so that a default instance of these classes is easy to obtain.
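A minimal sketch of what such default arguments could look like. The class and object names below are hypothetical stand-ins (marked with a `Sketch` suffix) and not the library's actual declarations; the check that start and end separators differ is an added assumption.

```kotlin
// Hypothetical stand-ins for illustration; not the library's actual classes.
object DefaultSeparatorsSketch {
    val SEPARATOR_CHAR = '\u2004'        // Three-per-em space
    val START_SEPARATOR_CHAR = '\u2004'  // Three-per-em space
    val END_SEPARATOR_CHAR = '\u2005'    // Four-per-em space
}

/** Single-separator strategy with a default separator character. */
class SingleSeparatorCharSketch(
    val char: Char = DefaultSeparatorsSketch.SEPARATOR_CHAR,
)

/** Start/end strategy with default separator characters. */
class StartEndSeparatorCharsSketch(
    val start: Char = DefaultSeparatorsSketch.START_SEPARATOR_CHAR,
    val end: Char = DefaultSeparatorsSketch.END_SEPARATOR_CHAR,
) {
    init {
        // Assumption: identical start and end characters would defeat the strategy.
        require(start != end) { "Start and end separators must differ" }
    }
}
```

With defaults in place, `StartEndSeparatorCharsSketch()` yields a ready-to-use instance, while callers can still pass their own characters, e.g. `StartEndSeparatorCharsSketch('<', '>')`.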
Why use the `StartEndSeparatorChars` strategy at all?
Adding watermarks to a text using the default `SingleSeparatorChar` separator strategy is relatively inflexible in use cases like watermarking mails or collaboratively created documents, where multiple parties might each want to add their own watermark: When two watermarked texts are merged, the last repetition (fragment) of the first watermark and the first repetition of the second watermark "blur" into one another, as the last repetition of the first watermark most likely does not end exactly on its separation character.
For example, when reading the watermark from this merged text ...
# Contains (complete) watermarks ["AA", "AA"] + fragment
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
# Contains (complete) watermarks ["BB", "BB"] + fragment
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
... it does not result in `["AA", "AA", "BB", "BB"]` but in `["AA", "AA", "A\t\t\u0001", "BB"]`. This is due to the last repetition of the first watermark being fragmented. It does not conclude with a separation character and is thus interpreted as part of the next watermark.
Now one could propose a solution to this problem that removes the watermarks from both texts and re-adds them to the whole text in a combined way (e.g. `"AA;BB"`), but in my eyes that procedure has two main disadvantages:
That's where the `StartEndSeparatorChars` strategy comes in handy, as it prevents watermarks from blurring into one another in the first place.
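The difference between the two strategies can be illustrated with a deliberately simplified model. In this sketch, visible characters (`|`, `<`, `>`) stand in for the invisible whitespace separators, and watermark content is plain text rather than the library's real encoding. All function names are hypothetical, and the exact fragments differ from the library's actual output shown above.

```kotlin
// Hypothetical, simplified model: visible characters replace the invisible
// separator characters, and payloads are plain strings.
val SEPARATOR = '|'
val START = '<'
val END = '>'

/** Splits on a single separator; a trailing fragment merges into whatever follows it. */
fun parseSingleSeparator(stream: String): List<String> =
    stream.split(SEPARATOR).filter { it.isNotEmpty() }

/** Emits only payloads enclosed by START and END; a new START discards any open fragment. */
fun parseStartEnd(stream: String): List<String> {
    val result = mutableListOf<String>()
    var current: StringBuilder? = null
    for (c in stream) {
        when {
            c == START -> current = StringBuilder() // drop an unfinished fragment, if any
            c == END -> {
                current?.let { result.add(it.toString()) }
                current = null
            }
            else -> current?.append(c) // characters outside START..END are ignored
        }
    }
    return result
}
```

In this model, `parseSingleSeparator("AA|AA|A" + "BB|BB|B")` yields `[AA, AA, ABB, BB, B]`, where `ABB` is the blurred repetition, whereas `parseStartEnd("<AA><AA><A" + "<BB><BB><B")` yields `[AA, AA, BB, BB]`.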
Currently, the watermarker library is able to watermark Strings / `.txt` files and `.zip` files. However, the webinterface only supports watermarking Strings. Adding a watermark to a `.zip` or `.txt` file is impossible.
Add an upload functionality to be able to watermark all supported file types of the watermarker library in the webinterface.
The current implementation of the watermarker library has a workaround implemented to prevent a crash caused by a bug in Kotlin. This workaround should be removed once the bug is fixed in Kotlin.
Adjust the usage of the Pako library in `jsMain/kotlin/helper/Compression.kt` as follows as soon as this bug is fixed:
`external object Pako` should change:
fun deflateRaw(data: IntArray, options: Any? = definedExternally): IntArray
->
fun deflateRaw(data: UByteArray, options: Any? = definedExternally): UByteArray
fun inflateRaw(data: IntArray, options: Any? = definedExternally): IntArray
->
fun inflateRaw(data: UByteArray, options: Any? = definedExternally): UByteArray
`Compression.{inflate, deflate}` must be changed accordingly.
Currently, the library has specific Watermark classes depending on the source of the watermark (e.g., `TextWatermark` for watermarks extracted from text files) that can contain additional information (e.g., the positions of the watermarks in the text file).
These specific watermarks can lead to confusion (e.g., is a `TextWatermark` a watermark from a text file or a watermark from any file that contains plain text?). They also lead to naming problems (e.g., how should we name a watermark containing plain text when we already have a `TextWatermark`?).
Remove the file-specific watermarks. The additional information is currently not used anywhere. We can add it later in a better way if required.
Once this is done, we can rename `Textmark` to `TextWatermark`.
The watermarker library is usable in Java and Kotlin. It is possible to use the library in the browser by using a Kotlin frontend like KVision. However, it is currently not possible to import the library into plain JavaScript.
Evaluate what is required to export the library as a JavaScript library.
More information: https://kotlinlang.org/docs/js-to-kotlin-interop.html
Kover is currently used to generate test coverage reports. Codecov should be added to better display the reports and directly include them in pull requests.
This issue is currently blocked by the following upstream issue for integrating Kover with Codecov: Kotlin/kotlinx-kover#16
When changing a watermarked text, the watermark inside the cover text can get destroyed. This can occur by moving sentences inside a text, deleting content, adding new content, or copying existing content.
The overall robustness of the watermarker library needs to be increased.
One possible example:
If a small watermark is included inside an extended cover text (e.g., 10 times), the watermarker library should be able to extract the watermark even if 4 of the 10 watermark repetitions got destroyed.
If a control char is implemented (see #18), the watermarker library needs a strategy if this control char gets destroyed.
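One way to tolerate destroyed repetitions, sketched here as an assumption rather than the library's strategy, is a majority vote over all extracted repetitions. The function name and the `minShare` threshold are hypothetical.

```kotlin
/**
 * Returns the payload that occurs most often among the extracted repetitions,
 * or null if its share of all repetitions is below [minShare].
 */
fun mostFrequentWatermark(repetitions: List<String>, minShare: Double = 0.5): String? {
    if (repetitions.isEmpty()) return null
    // Count identical payloads and pick the most frequent one.
    val best = repetitions.groupingBy { it }.eachCount().maxByOrNull { it.value } ?: return null
    val share = best.value.toDouble() / repetitions.size
    return if (share >= minShare) best.key else null
}
```

With 10 repetitions of which 4 are damaged in different ways, the 6 intact copies reach a share of 0.6 and the intact payload is returned. The vote only helps when damaged copies do not all fail in the same way.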
The following information is displayed in the GitHub actions, like the build of the GitHub page for the documentation:
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: gradle/gradle-build-action@v2. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
See: https://github.com/FraunhoferISST/TREND/actions/runs/9204230070
Update the workflows so that the latest versions are used.
Note that there are changes in the `gradle-build-action`. See the README of its GitHub repository for more information.
Currently, the documentation of the watermarker library is only available as in-line code blocks. People unfamiliar with the project don't know how to use the library in Java and JS code.
Besides the automated documentation generation from code (as discussed in #9), there should be text-based documentation. It should contain additional details and all relevant information on using the watermarker library. Therefore, a Getting Started Guide or Quick Start Guide is needed for newcomers.
Further, it needs to be checked if the structure should be based on existing templates like arc42.
The documentation should be published via GitHub Pages and made available directly via this repository. It might make sense to use the `gh-pages` branch for it.
A framework is needed to create a baseline for the documentation. Existing frameworks like Docusaurus, Just the Docs, Docsify, Nextra, etc., should be checked and evaluated. A necessary requirement is that the documentation should be written in Markdown so that it is easy to integrate inline source code and be independent of the framework itself.
Additionally, a GitHub action pipeline might be needed to deploy and update the documentation directly without manual work.
After the first version of the documentation is created, the pull request template should be updated with the checklist (like a definition of done or acceptance criteria) to check that the documentation is still up-to-date after every code change.
During the development of this project, different architectural decisions were made. For external people, it is hard to understand why these decisions were made and what the background is. Further, discussions about aspects already settled in the past can start again in the future.
To prevent duplicate discussions and be transparent, architecture decision records (ADRs) should be used. There should be an ADR template and a location where all ADRs are stored (like a `docs/adr` folder). All ADRs should have the same structure and be easily accessible (in Markdown format).
Existing open-source examples or templates should be used.
In order to increase the runtime security of GitHub actions, it should be checked if harden-runner can be used.
GitHub issue templates should be used and integrated for this repository. There should be at least three different templates for:
Some of the types we expose are not supported by JavaScript. These types can only be passed to other functions that take them as arguments, but it is not possible to access or modify their data.
To improve the usability of our library in JavaScript it should be evaluated if some unsupported types can be replaced by supported types.
Such a change was already made by changing the type of `placement` in `TextWatermarker` from `Sequence<..>` to `List<..>`.
Related: #40
The watermarker library is able to add a watermark with or without compression. When analyzing a watermarked text or file, the library needs to know which type of watermark is used (compressed, uncompressed, specific format, etc.). This is currently not possible. There might be specific use cases that have specific requirements towards the style, compression, linting or format of the watermark.
Every watermark should start with a 2-digit control character (like a number) that identifies the type of watermark. Using a 2-digit control char instead of a 1-digit one allows a larger namespace for future formats.
Example:
Instead of adding `Test` as a watermark, `00Test` is added if the watermark is uncompressed, and `01Test` is added if the watermark is compressed.
The first control char must be inserted without compression to get it working.
The other components, like the CLI tool and the webinterface, need to be updated after the issue is implemented since it is a breaking change.
Further, a table in the documentation is needed to document the control char and its meaning, for example:
Control Char | Meaning |
---|---|
00 | Uncompressed Watermark |
01 | Compressed Watermark using X compression technique |
02 | Specialized compression for use case Y |
03 | ... |
... | ... |
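The tagging scheme could be sketched as follows. This is only an illustration of the proposal: the enum, its entries and both functions are hypothetical, and the payload is kept as a plain string even for the compressed type.

```kotlin
/** Hypothetical watermark types, keyed by their 2-digit control tag. */
enum class WatermarkType(val tag: String) {
    UNCOMPRESSED("00"),
    COMPRESSED("01");

    companion object {
        /** Looks up a type by its tag, or returns null for unknown tags. */
        fun fromTag(tag: String): WatermarkType? = values().firstOrNull { it.tag == tag }
    }
}

/** Prepends the uncompressed 2-digit tag to the payload. */
fun tagWatermark(type: WatermarkType, payload: String): String = type.tag + payload

/** Splits a tagged watermark into its type and payload; null if the tag is missing or unknown. */
fun parseTaggedWatermark(tagged: String): Pair<WatermarkType, String>? {
    if (tagged.length < 2) return null
    val type = WatermarkType.fromTag(tagged.take(2)) ?: return null
    return type to tagged.drop(2)
}
```

Here `tagWatermark(WatermarkType.UNCOMPRESSED, "Test")` gives `00Test`, and `parseTaggedWatermark("01Test")` gives the pair of `COMPRESSED` and `Test`. Returning null for unknown tags leaves room for a separate strategy when the control char is destroyed.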
When trying to unzip a watermarked zip file, the `unzip` command shows warnings and sometimes even errors:
Archive: multiple_files_watermarked.zip
warning [multiple_files_watermarked.zip]: 32 extra bytes at beginning or within zipfile
(attempting to process anyway)
file #1: bad zipfile offset (local header sig): 32
(attempting to re-compensate)
extracting: a.txt
A
error: invalid zip file with overlapped components (possible zip bomb)
To unzip the file anyway, rerun the command with UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environmnent variable
unzip version:
UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.
Steps to reproduce the behavior: in `samples/`, run `unzip -c multiple_files_watermarked.zip`.
Expected behavior: Files are extracted without warnings or errors.
It might not be possible to prevent a warning. It depends on the implementation of the specific application.
Details about the file format: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
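The "32 extra bytes at beginning" warning can be reproduced in a small check: a ZIP local file header starts with the signature bytes `50 4B 03 04`, so data placed before the first header shifts its offset. The sketch below is a hypothetical helper, not part of the library, and it assumes the signature does not occur by chance inside the prepended data.

```kotlin
// Local file header signature as defined in APPNOTE.TXT ("PK\u0003\u0004").
val LOCAL_FILE_HEADER_SIGNATURE = byteArrayOf(0x50, 0x4B, 0x03, 0x04)

/** Returns the offset of the first local file header signature, or -1 if none is found. */
fun firstLocalHeaderOffset(bytes: ByteArray): Int {
    val sig = LOCAL_FILE_HEADER_SIGNATURE
    for (i in 0..bytes.size - sig.size) {
        if (sig.indices.all { bytes[i + it] == sig[it] }) return i
    }
    return -1
}
```

A result greater than 0 means that bytes precede the first entry, which is what tools like `unzip` report as extra bytes.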
In order to build and run the CLI tool or the webinterface, the watermarker library has to be built locally and published to Maven Local.
It should be checked whether it makes sense to publish the watermarker library directly (on GitHub, Docker Hub, etc.) so that it can be used without an additional build and publish step.
After the watermarker library is published, the `mavenLocal()` repository can be removed from the CLI tool and the webinterface.
In order to strengthen security, actions-permissions should be configured to assign the correct permissions to GitHub actions.
Currently, while there is a function to create the required insert positions for a watermark, there is none to calculate the available insert positions in a text.
TREND/watermarker/src/commonMain/kotlin/fileWatermarker/TextWatermarker.kt
Lines 438 to 442 in 752be7c
Such a function would come in handy, especially when needing to calculate whether or how often a given watermark fits in a given text.
Add a function that calculates the available insert positions in a text to the TextWatermarker class. This function could look like this:
/** Counts the available number of insert positions in a [file] */
fun getAvailableInsertPositions(file: TextFile): Int {
    return placement(file.content).count()
}
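To show how such a count answers the "how often does it fit" question, here is a standalone sketch. It assumes, purely for illustration, that every space character is one insert position; both function names are hypothetical, and the number of positions one repetition requires would come from the library's existing function.

```kotlin
/** Counts insert positions, assuming (for illustration only) that every space is one. */
fun countAvailableInsertPositions(text: String): Int = text.count { it == ' ' }

/** Number of complete watermark repetitions that fit into the available positions. */
fun repetitionsThatFit(availablePositions: Int, requiredPositions: Int): Int {
    require(requiredPositions > 0) { "requiredPositions must be positive" }
    return availablePositions / requiredPositions
}
```

A result of 0 means the watermark does not fit even once, which could be reported before any text is modified.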
When adding the text `Hello World` as a watermark to the text
Test ads asd asd asd as dasmlkjl lk lklk j lkafdas fsdbfsdhf k kjh kjh hkjfhf kjhkj hdkjahsdkj hadkahd kjhaskjhd kjashfdhiu u hj h hahdkja kj kjh kjn nkashdkjhwkjhwhw wqe qw ejkds,m askjhd,mandhakjd asdhc,mxyncndsa da sd asd as asd sa d d d d d d d d d d
and extracting the watermark from the generated watermarked text
Test ads asd asd asd as dasmlkjl lk lklk j lkafdas fsdbfsdhf k kjh kjh hkjfhf kjhkj hdkjahsdkj hadkahd kjhaskjhd kjashfdhiu u hj h hahdkja kj kjh kjn nkashdkjhwkjhwhw wqe qw ejkds,m askjhd,mandhakjd asdhc,mxyncndsa da sd asd as asd sa d d d d d d d d d d
the watermark `Hello Worl$` is shown.
The problem couldn't be reproduced with other texts or watermarks.
The webinterface returns either the successfully watermarked text or an error/warning message. However, there are cases in which a watermark can be extracted even though it is problematic.
This should be changed so that the frontend can return a warning/error message together with a watermarked text, not only one of them.
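One possible shape for such a combined result, sketched with hypothetical type names that are not taken from the webinterface or the library:

```kotlin
enum class Severity { WARNING, ERROR }

/** A single message shown to the user next to the result. */
data class Message(val severity: Severity, val text: String)

/** Carries an optional text and any number of messages at the same time. */
data class WatermarkOutcome(
    val text: String?,
    val messages: List<Message> = emptyList(),
) {
    /** True if a text is present and no message is an error. */
    val isSuccess: Boolean
        get() = text != null && messages.none { it.severity == Severity.ERROR }
}
```

With this shape, the frontend can render `text` whenever it is present and list all `messages` beside it, instead of choosing between the two.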
When first using the webinterface, it is hard for newcomers to understand how it works. Sometimes, people wonder why the "Add Watermark" button cannot be used. Further, the percentage slider is hard to understand.
Different aspects can be implemented to improve the overall usability (incl. some styling adjustments):
The ZipWatermarker currently adds the watermark directly inside the `.zip` file (the archive itself). After a user extracts the archive, the watermark is removed.
To solve this problem, the watermarker library should check all files inside the ZIP archive and include the watermark in all files that have a supported file type.