duplicatedzoneinclinicaltext's Introduction

Duplicate Text Finder

duptextfinder is a Python library for detecting duplicated zones in text. It is primarily meant to detect copy/paste across medical documents. It is intended to be faster than the algorithm in Python's built-in difflib module and more robust to whitespace, newlines and other irrelevant characters.

Installation

duptextfinder can be installed through pip:

pip install duptextfinder

Usage

from pathlib import Path
from duptextfinder import CharFingerprintBuilder, DuplicateFinder

# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]

# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)

# call findDuplicates() on each file
for i, text in enumerate(texts):
    doc_id = f"D{i}"  # avoid shadowing the built-in id()
    duplicates = duplicateFinder.findDuplicates(doc_id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)

WordFingerprintBuilder can be used instead of CharFingerprintBuilder. For more details, refer to the docstrings of DuplicateFinder, CharFingerprintBuilder and WordFingerprintBuilder.
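
To give an intuition of what a character fingerprint is, here is a minimal standard-library sketch. It is not duptextfinder's implementation: the names normalize and shared_fingerprints are hypothetical, and the approach (collapse whitespace, then look up every fixed-length window of one text in an index built from the other) is only an assumption about the general idea.

```python
from collections import defaultdict


def normalize(text):
    """Collapse every whitespace run into one space so layout changes don't matter."""
    return " ".join(text.split())


def shared_fingerprints(source, target, length=15):
    """Return (source_start, target_start) pairs of identical `length`-char windows.

    Offsets refer to the normalized texts, not to the original ones.
    Hypothetical helper for illustration, not part of duptextfinder.
    """
    source, target = normalize(source), normalize(target)
    # Index every window of the source by its content.
    index = defaultdict(list)
    for i in range(len(source) - length + 1):
        index[source[i : i + length]].append(i)
    # Look up every window of the target in that index.
    matches = []
    for j in range(len(target) - length + 1):
        for i in index.get(target[j : j + length], []):
            matches.append((i, j))
    return matches


source = "Patient admitted for chest pain.\nNo fever reported."
target = "Follow-up visit. Patient admitted   for chest pain. Stable."
for source_start, target_start in shared_fingerprints(source, target)[:3]:
    print(source_start, target_start)
```

The extra spaces in the target do not prevent the match because both texts are normalized first. A real duplicate finder would also have to merge overlapping windows into longer spans and map offsets back to the original text, which this sketch leaves out.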

How to run tests

  1. Install the package in editable mode with the test and extra dependencies by running pip install -e ".[tests, ncls, intervaltree]" in the repo directory
  2. Run pytest tests/

About ncls and intervaltree

This tool can be used without any additional dependencies, but performance can be improved by using interval trees. To benefit from this, you will need to install either the ncls package or the intervaltree package.
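
To check which of these optional packages is installed in the current environment, a small sketch like the following can help. The helper available_interval_backends is hypothetical and only inspects the environment; it makes no claim about how duptextfinder itself selects a backend.

```python
from importlib.util import find_spec


def available_interval_backends(candidates=("ncls", "intervaltree")):
    """Return the names of the optional packages that can be imported here."""
    return [name for name in candidates if find_spec(name) is not None]


backends = available_interval_backends()
if backends:
    print("interval tree package(s) found:", ", ".join(backends))
else:
    print("no interval tree package found; consider installing ncls or intervaltree")
```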

duplicatedzoneinclinicaltext's People

Contributors

bastienrance, ghisvail, drfabach, olvb

duplicatedzoneinclinicaltext's Issues

Request for change to more permissive licensing terms

duptextfinder is currently licensed under the AGPLv3 and is a direct dependency of medkit, which is licensed under MIT.

As such, the two licenses are not compatible, and a deployment of medkit would be available under the AGPLv3 by propagation instead of MIT. This is a problem for medkit, which can be solved in either of two ways: turning duptextfinder into a soft dependency, or changing the licensing terms of duptextfinder in favor of an alternative compatible with the MIT license. From a medkit perspective, the former requires more work than the latter.

I was wondering whether the choice of the AGPLv3 for duptextfinder was deliberate, considering it is a library with such a narrow scope. If the copyleft aspect is important, may I suggest using a weaker copyleft license such as the LGPLv3 or the MPLv2? This way, the library would still benefit from modifications of its own codebase being released under reciprocal terms, whilst allowing downstream projects like medkit to be released under their terms of choice.

Thanks.

Improved speed of implementation

I had some problems using this tool: the execution time was too long.
To speed it up, I chose to build fingerprints only from the beginning of words.

Are you interested in this implementation?
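
One possible reading of this proposal, as a rough sketch: instead of taking a fingerprint at every character offset, take one only at offsets where a word starts, which reduces the number of fingerprints to build and compare. The functions all_starts and word_starts are hypothetical and do not come from duptextfinder or from the implementation mentioned in the issue.

```python
import re


def all_starts(text, length):
    """Start offsets of every `length`-char window in the text."""
    return list(range(len(text) - length + 1))


def word_starts(text, length):
    """Start offsets of the `length`-char windows that begin at the start of a word."""
    limit = len(text) - length
    # r"\b\w" matches the first character of each word.
    return [m.start() for m in re.finditer(r"\b\w", text) if m.start() <= limit]


text = "No fever reported. Patient admitted for chest pain."
print(len(all_starts(text, 15)), "windows at every offset")
print(len(word_starts(text, 15)), "windows at word starts only")
```

The trade-off is that a duplicate can then only be detected from a word boundary onwards, so a copied zone that begins mid-word would be reported slightly shorter.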
