belambert / edit-distance
Python library for computing edit distance between arbitrary Python sequences.
License: Apache License 2.0
Instead of outputting an integer edit distance value, is it possible to decompose the edit distance so as to see how many insertions, deletions and substitutions are needed?
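One way to get such a breakdown, independent of this library, is to fill the usual dynamic-programming table and then walk back from the last cell, counting each step. The sketch below is a plain Levenshtein implementation with unit costs; the function name edit_operation_counts is made up for illustration and is not part of edit_distance. When several optimal alignments exist, it reports the counts for just one of them.

```python
def edit_operation_counts(a, b):
    """Return (distance, counts) for one optimal alignment of a and b."""
    n, m = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion from a
                dist[i][j - 1] + 1,         # insertion from b
            )

    # Walk back from the bottom-right cell, preferring the diagonal step.
    counts = {"substitute": 0, "insert": 0, "delete": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitute"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["delete"] += 1
            i -= 1
        else:
            counts["insert"] += 1
            j -= 1
    return dist[n][m], counts
```

The three counts always add up to the returned distance, so the integer value is simply their sum.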
I get extreme memory consumption when trying to align very long sequences with Python 3.6 (strings of about 10k characters). Calling SequenceMatcher.get_opcodes() never terminates, allocating more and more memory (up to 28 GB resident) until interrupted. This happens even with identical a and b, as long as the sequences are long enough.
Minimal example:
from edit_distance.code import SequenceMatcher
a = u"x" * 10000 # dummy text
b = u"x" * 10000 # same
matcher = SequenceMatcher(a, b)
matcher.get_opcodes()
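For scale: a full dynamic-programming table for two sequences of 10,000 items has about 10^8 cells. If only the distance is needed, and not the opcodes, two rows of the table are enough. The sketch below shows that idea in plain Python. It is not how this library works internally, and the name edit_distance_two_rows is hypothetical.

```python
def edit_distance_two_rows(a, b):
    """Levenshtein distance using O(min(len(a), len(b))) extra memory."""
    if len(b) > len(a):
        a, b = b, a  # keep the shorter sequence along the row
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j - 1] + cost,  # match or substitution
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
            ))
        previous = current
    return previous[-1]
```

This only returns the integer distance. Recovering the alignment itself in linear space takes a different technique, such as Hirschberg's algorithm.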
Thanks for the code.
Any idea how to implement list-of-lists functionality?
Instead of comparing:
['a','b','c']
['a','b','c','d']
compare this:
['a','b',['c']]
['a','b',['c','d','e']]
The algorithm would then be allowed to add 'd' and 'e' inside the third element.
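One possible reading of this request is a recursive distance: when both aligned elements are lists, the substitution cost is the edit distance between those two sublists. The sketch below assumes that reading and also assumes that inserting or deleting a whole sublist costs one unit per leaf it contains. Both functions, nested_distance and leaf_count, are hypothetical and not part of edit_distance.

```python
def leaf_count(item):
    """Number of leaves in item; an empty list still counts as 1."""
    if isinstance(item, list):
        return max(1, sum(leaf_count(x) for x in item))
    return 1


def nested_distance(a, b):
    """Edit distance between lists whose elements may themselves be lists."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + leaf_count(a[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + leaf_count(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            if isinstance(x, list) and isinstance(y, list):
                sub = nested_distance(x, y)  # edit inside the element
            elif x == y:
                sub = 0
            else:
                sub = max(leaf_count(x), leaf_count(y))
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,
                dist[i - 1][j] + leaf_count(x),  # delete x
                dist[i][j - 1] + leaf_count(y),  # insert y
            )
    return dist[n][m]
```

With these cost assumptions, the nested example from the question has distance 2 (adding 'd' and 'e' inside the third element), and the flat example has distance 1.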
The insert index returned by get_opcodes() appears to be off by -1.
Example:
from difflib import SequenceMatcher
sm = SequenceMatcher(a='abc', b='abdc')
print(sm.get_opcodes())
from edit_distance import SequenceMatcher
sm = SequenceMatcher(a='abc', b='abdc')
print(sm.get_opcodes())
Output (difflib first, then edit_distance):
[('equal', 0, 2, 0, 2), ('insert', 2, 2, 2, 3), ('equal', 2, 3, 3, 4)]
[['equal', 0, 1, 0, 1], ['equal', 1, 2, 1, 2], ['insert', 1, 1, 2, 3], ['equal', 2, 3, 3, 4]]
The "insert" opcode's index into a should be 2, not 1.
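The difflib documentation states that the first opcode has i1 == j1 == 0 and that each later opcode starts where the previous one ended, in both sequences. A small check for that property makes the discrepancy easy to spot. The helper name first_gap is made up for this sketch, and the second opcode list is copied from the output reported above rather than produced by running the library.

```python
from difflib import SequenceMatcher


def first_gap(opcodes):
    """Return the index of the first opcode that does not start where the
    previous one ended (in either sequence), or None if all are contiguous."""
    prev_i2 = prev_j2 = 0
    for k, (tag, i1, i2, j1, j2) in enumerate(opcodes):
        if i1 != prev_i2 or j1 != prev_j2:
            return k
        prev_i2, prev_j2 = i2, j2
    return None


difflib_ops = SequenceMatcher(a="abc", b="abdc").get_opcodes()

# Copied from the edit_distance output in this report.
reported_ops = [
    ["equal", 0, 1, 0, 1],
    ["equal", 1, 2, 1, 2],
    ["insert", 1, 1, 2, 3],
    ["equal", 2, 3, 3, 4],
]

print(first_gap(difflib_ops))   # None
print(first_gap(reported_ops))  # 2
```

The reported list fails at index 2, the "insert" opcode, because its i1 is 1 while the preceding opcode ended at 2.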
Hi there,
I get an AssertionError thrown by this line of code. What does it mean? Should I be worried? Is there any way to inspect in depth what is happening?
The two arrays of symbols I am comparing are the following:
['that', 'continuous', 'sanction', ':=', '(', 'flee', 'U', 'complain', ')', 'E', 'attendance', 'eye', '^', 'flowery', 'revelation', '^', 'ridiculous', 'destination', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
['continuous', ':=', '(', 'sanction', '^', 'flee', '^', 'attendance', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>', '<EOS>']
Thanks in advance,
Giulio
P.S. I know my data look weird, please don't ask what they are about :-)