Comments (25)
Changed label from "type-bug" to "performance" and "type-feature". difflib doesn't make any promises about speed, and finding "bad cases" is expected - that's not "bug" territory.
In your proposal, please add some high-level comments about the meaning of `_gravity` and what range of values might be useful. The name on its own is just a mystery.
from cpython.
Micro-optimizations like that are increasingly out of style in core Python code. They generally don't help (or even hurt) when running under PyPy, and CPython is making real progress at speeding the common cases too.
Relatedly, nobody abuses the default-argument mechanism to add `local_name=some_global_name_or_const` "arguments" anymore, unless it's absolutely time-critical in a very heavily used library. Otherwise clarity is valued more.
from cpython.
Unfortunately, `smallest_ratio` is not about `alo` or `ahi`: it is about the smallest difference across unrelated comparisons (thus, I was also wrong with my 500k symbol estimation). It has to be much smaller; this one is probably a correct upper bound:
```python
len_a = max(map(len, a))
len_b = max(map(len, b))
smallest_ratio_delta = 2 * 0.99 / (len_a + len_b) ** 2
```
I plan to play around and open PR. My first one here!
from cpython.
Unless I'm hallucinating again, there's a simpler way, which doesn't add new floating-point stress, or even require computing "epsilon". Store the ratio instead as a 2-tuple, `(ratio_just_as_now, distsum)`. Here `distsum` is an int giving the sum of the distance of `i` from its closest endpoint, and likewise for `j`.
Tuple comparison does the rest. If the first components differ (the ratios we use today differ), that settles it. Else (ratios the same), the larger is whichever has the larger `distsum`.
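To make the tuple idea concrete, here's a minimal sketch; the `score` helper and its explicit range arguments are mine, not from the patch:

```python
# Hypothetical helper: score a candidate pair as (ratio, distsum),
# where distsum is the summed distance of i and j from their closest
# range endpoints (bigger means closer to the middle).
def score(ratio, i, ilo, ihi, j, jlo, jhi):
    dist_i = min(i - ilo, ihi - 1 - i)
    dist_j = min(j - jlo, jhi - 1 - j)
    return (ratio, dist_i + dist_j)

# Equal ratios: the pair nearer the midpoints wins on the second component.
assert score(0.8, 5, 0, 10, 5, 0, 10) > score(0.8, 0, 0, 10, 9, 0, 10)
# Different ratios: the first component settles it, whatever distsum says.
assert score(0.9, 0, 0, 10, 0, 0, 10) > score(0.8, 5, 0, 10, 5, 0, 10)
```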
Quick coding of that worked fine. Strength reduction in the inner loop becomes just

```python
distsum += 1 if i <= ihalf else -1
```
It's fine to use a more long-winded way in the outer loop, though. In real life, the most important line of the inner loop is the `real_quick_ratio()` "early out", and I'm loath to add cycles that slow down getting to that very fast test (under PyPy, the method call may even be inlined).
from cpython.
I updated the branch with rolling weights. I think I like it the most; the only subtle point is that the weight is now an integer: instead of computing `delta_a / len_a + delta_b / len_b` I compute only the numerator `delta_a * len_b + delta_b * len_a`. I thought about further simplifying it to `delta_a + delta_b`: is normalization that important?
I've put an `assert weight == 0` at the end of the loop (not in the branch) to check that everything is correct.
from cpython.
> Suppose we have two sequences of wildly different lengths (so normalization could make the most difference) but all element pairs have the same `.ratio()`.

I agree: there is no normalization needed in this case. But suppose we have just two competitors with indices `short_1, long_1` and `short_2, long_2`. We are generally more concerned about equally splitting the shorter text. Then, accurate weighting may be important. Without weights, the values of `long_*` are, in general, bigger, and thus more impactful. We may want the opposite.
> If `(I, J)` is one of those pairs where `I` and `J` are both closest to their respective index midpoints, by convexity the weights of `I` and of `J` alone are each as large as possible.
This one also caught my attention! But this is not always the case, as I hopefully explained above. If we are choosing the best `i` from a set of candidates `{i}` and the best `j` from a set of candidates `{j}` separately (as happens in the code snippet above), then we do not need any weights. But if we are choosing the best pair from a set `{(i, j)}` which is not necessarily a product of some sets `{i}` and `{j}`, then the result does depend on the assigned weights. And the issue is not in the form of the weight function; it is rather that the search space may not be trivial.
And this may not be the end of the story. After some thinking I came up with this statement:
Deep enough into the recursion tree we are guaranteed to have comparable (by size) sub-sequences (provided a specific choice of the weight function, `-abs(i - mid_a) - abs(j - mid_b)` for example). Thus, from the point of view of fixing the complexity scaling, weights are indeed not important.
The `O(N^3)` complexity scaling and the recursion blow-up may happen only if
- the search space `{(i, j)}` includes a specific structure `(i1, j1), (i2, j2), ..., (in, jn)` where both indexes are strictly growing: `i1 < i2 < i3 < ... < in` and `j1 < j2 < j3 < ... < jn`, and
- the weight function promotes the pairs sequentially: we split at `(i1, j1)`, then at `(i2, j2)`, etc.
And it is particularly difficult to find an example where the weight function `-abs(i - mid_a) - abs(j - mid_b)` would do such a cruel thing! The best I came up with is one where the long sequence is exponentially bigger than the short one, so that "deep enough" is actually deeper than the number of elements in the short sequence. But the split is still close to optimal from the long sequence's point of view.
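For illustration, a tiny sketch of that weight function (the range size and midpoint values are my own assumptions) confirming it peaks at the central pairs:

```python
# Hypothetical midpoints for two index ranges 0..9.
mid_a, mid_b = 4.5, 4.5

def weight(i, j):
    # The weight function discussed above: larger (less negative)
    # the closer (i, j) is to the two midpoints.
    return -abs(i - mid_a) - abs(j - mid_b)

pairs = [(i, j) for i in range(10) for j in range(10)]
best = max(pairs, key=lambda p: weight(*p))
assert weight(*best) == -1.0  # the four central pairs tie at -0.5 - 0.5
```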
from cpython.
On a related note, there might be some room for further improvement in the above. I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth. Technically, we should find the longest strictly growing sequence `(i1, j1), (i2, j2), ..., (in, jn)` such that `i1 < i2 < i3 < ... < in` and `j1 < j2 < j3 < ... < jn`, and make `n+1` recursive calls rather than only 2. For the case in the header it is going to be a massive improvement.
Even more, we could avoid recursive calls completely by computing all pairs above the `0.75` threshold and finding the longest chain among them (some research needed with regard to fast ratio lower-bound optimizations).
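Finding that longest chain is the classic longest-increasing-subsequence problem. A hedged sketch (the helper name and data shape are mine, not from any branch):

```python
from bisect import bisect_left

def longest_chain_length(pairs):
    """Length of the longest chain (i1, j1), ..., (in, jn) with
    i1 < i2 < ... < in and j1 < j2 < ... < jn, via an LIS reduction."""
    # Sort by i; for equal i sort j descending, so two pairs sharing
    # an i can never both land in one strictly increasing chain.
    pairs = sorted(pairs, key=lambda p: (p[0], -p[1]))
    tails = []  # tails[k]: smallest last j over chains of length k + 1
    for _, j in pairs:
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return len(tails)

assert longest_chain_length([(0, 0), (1, 1), (2, 2)]) == 3  # the diagonal
assert longest_chain_length([(0, 2), (1, 1), (2, 0)]) == 1  # anti-diagonal
```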
Yet, I believe this PR is good enough.
from cpython.
All true. I think it would also be worthwhile to study other diff libs.
Just to mention, the merged code does change some diff outputs. For example, this one will find only one match, instead of two (as before):

```
cat
dog
close match 1
close match 2
```

vs

```
close match 3
close match 4
kitten
puppy
```
If we want to restore it (and to keep the recursion under control) we can take the approach above for `ratio == max(ratio)`, using the original (greedy) algorithm. Then, we may consider loosening it toward other `ratio >= .75` pairs and using a more educated guess to form the chain `(i1, j1) < (i2, j2) < ... < (in, jn)`. If we properly implement the foundation then neither change is so massive any longer, whether they go in or not.
I, too, have concerns about considering all `ratio >= .75` at once. Not that the diff changes, but computing it might take a lot more time because the "slow" `.ratio` will be called more often in this case.
from cpython.
BTW, I agree this class of inputs is worth addressing. Please open a Pull Request?
Offhand, it seems to me that this would be a faster and much more obvious way to get a value that's 0 at `i = lo` and `i = hi - 1`, and a maximum of 0.5 at the midpoint:

```python
min(i - lo, hi - 1 - i) / (hi - lo)
```
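A quick numeric check of that expression, with assumed bounds `lo = 0`, `hi = 11`:

```python
def midrange_bonus(i, lo, hi):
    # 0 at both endpoints, largest at the midpoint (just under 0.5).
    return min(i - lo, hi - 1 - i) / (hi - lo)

lo, hi = 0, 11                                # indices 0 .. 10, midpoint 5
assert midrange_bonus(lo, lo, hi) == 0.0
assert midrange_bonus(hi - 1, lo, hi) == 0.0
assert midrange_bonus(5, lo, hi) == 5 / 11    # the maximum, a bit below 0.5
```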
from cpython.
Sure that will probably do as well; was just avoiding function calls.
from cpython.
Let's stay away from hard-coded constants, too - especially ones with "mystery names". Constants frequently bite us later, and are rarely needed.
In this case, `.ratio()` is defined to return `2.0*M / T`, where `M` is the number of matches and `T` is the total number of characters. So code like this, once at the start, is a cheap way to determine what's actually required:

```python
lena = ahi - alo
lenb = bhi - blo
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)
```

Defining `lena` and `lenb` locals may also simplify later code. Multiplying by 0.99 makes this a lower bound, to be on the safe side. Again, resist micro-optimization: while years ago it helped a little to avoid the multiply and type in "1.98" by hand, those days are long over. Even CPython does that arithmetic at compile time now.
Then if you have two "midrange bonus" functions that max out at 0.5 for `i` and `j` being at the midpoints of their ranges,

```python
midrange_total_bonus = (midrange_a_bonus + midrange_b_bonus) * smallest_ratio_delta
```

will yield a score bonus less than the difference between any two distinct raw scores (the sum of the two bonuses can't be greater than 1.0).
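A small sketch checking that bound, with made-up slice lengths:

```python
# Assumed slice lengths; any positive values behave the same way.
lena, lenb = 40, 60
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)

# Worst case: both bonuses at their 0.5 maximum, summing to 1.0.
midrange_total_bonus = (0.5 + 0.5) * smallest_ratio_delta

# Two distinct raw .ratio() values differ by at least 2 / (lena + lenb),
# so the bonus can never reorder distinct raw scores.
assert midrange_total_bonus < 2 / (lena + lenb)
```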
from cpython.
"Strength reduction" looks like a worthwhile optimization. Short course: all function calls, and even multiplies, can be removed from the inner loop, by keeping the bonus as a running total, and adjusting it near the top of the loop like so:
bonus += a_inc if i <= ihalf else -a_inc
where the other names are function-level invariants:
ihalf = (alo + ahi - 1) / 2
a_inc = smallest_ratio_delta / lena
So the bonus keeps going up around the loop, until the midpoint is reached, after which it keeps going down.
Similarly for the outer loop. Some details may be tricky. Better names could be picked π.
If you don't want to bother, let me know here, and I'll open a pull request.
from cpython.
Wow - I was hallucinating badly, wasn't I?
It's all yours - I look forward to the PR.
Unless I'm still hallucinating, the square in the new expression isn't needed, is it? If the longest elements ever compared have lengths `len_a` and `len_b`, the smallest non-zero `.ratio()` possible is 2 over their sum.
from cpython.
No problem: I often skip over large chunks and forget to come back and explain.
Given the ratio `2 * some_integer / (len(ai) + len(bj))`, the difference between two arbitrary ratios is simply

```
  2 * some_integer  / (len(ai1) + len(bj1))
- 2 * other_integer / (len(ai2) + len(bj2))
= 2 * [some_integer * (len(ai2) + len(bj2)) - other_integer * (len(ai1) + len(bj1))]
    / [(len(ai1) + len(bj1)) * (len(ai2) + len(bj2))]
= 2 * still_some_integer
    / [(len(ai1) + len(bj1)) * (len(ai2) + len(bj2))]
```

Trying to minimize the absolute value of the above (while requiring it to be non-zero) we just discard all details of the numerator except that it is a non-zero integer: thus, we take `still_some_integer = 1`. For the denominator we should technically take the largest and next-largest `len(ai) + len(bj)`, but we just take the largest possible sum `len(ai) + len(bj)` twice, to simplify computations and slightly worsen our lower bound.
Example (rather my rough understanding than the actual computation):

```
a1, b1 = "abc", "a"
a2, b2 = "abcd", "a"
ratio_1 = 2 * 1/4
ratio_2 = 2 * 1/5
ratio_1 - ratio_2 = 2 * 1/20  # our estimate is 2 * 1/25 instead
```
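The example checks out against difflib itself, whose `.ratio()` is documented as `2.0*M / T`; this verification is mine, not from the thread:

```python
from difflib import SequenceMatcher

r1 = SequenceMatcher(None, "abc", "a").ratio()   # 2 * 1 / (3 + 1)
r2 = SequenceMatcher(None, "abcd", "a").ratio()  # 2 * 1 / (4 + 1)
assert r1 == 0.5 and r2 == 0.4

# The true delta is 2/20 = 0.1; the estimated lower bound 2/25 = 0.08
# is indeed smaller, i.e. safely conservative.
assert 2 / 25 < r1 - r2
```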
from cpython.
Thanks! Got it now. The smallest delta between `.ratio()`s of two sequences of the same lengths may indeed be much larger than the delta between `.ratio()`s of sequences with different (but no larger) lengths.
Your more careful analysis looks right. Using the square to simplify reasoning is fine. It's not like we're pushing the limits of float precision here.
from cpython.
I am slightly concerned by how small the weights can become (but only slightly). We now try to make differences as small as `1 / line_size ** 2 / number_of_lines`. For a file with 10k lines and 1k symbols per line this is `1e-10`, which is ok-ish, yet not so distant from the double precision `1e-16` on a logarithmic scale.
from cpython.
If we end up computing differences so small that they end up getting totally lost to rounding when added to the ratio, I don't think it much matters - just a waste of time.
The interesting cases are near the midpoints of the index ranges, where your factor `number_of_lines` is close to 1 instead.
Alternatives include using `fractions.Fraction` to retain exact ratios and weights internally, or (faster) using `decimal` configured with "sufficient" precision.
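For instance, `fractions.Fraction` keeps even the tiniest weight exact, where a plain float addition can silently drop a small enough bump (the sizes below are the 10k-line example from earlier):

```python
from fractions import Fraction

line_size, number_of_lines = 1000, 10_000
delta = Fraction(1, line_size ** 2 * number_of_lines)  # exactly 1e-10

ratio = Fraction(2, 3)
assert ratio + delta != ratio    # exact arithmetic preserves the bump

# A float loses a sufficiently small bump entirely:
assert 2 / 3 + 1e-20 == 2 / 3
```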
from cpython.
I think I'll back off on "strength reduction". The full-blown `min(i - alo, ahi - 1 - i)` is certainly easier to grasp, and not all that much slower. Although I'd personally precompute `ahi - 1` - saving trips around the eval loop can still matter in innermost loops.
from cpython.
> I thought about further simplifying it to `delta_a + delta_b`: is normalization that important?

I don't even know of a case where "normalization" is helpful. Do you? Suppose we have two sequences of wildly different lengths (so normalization could make the most difference) but all element pairs have the same `.ratio()`. Without normalization, the best-scoring pair will come from the midpoints of both sequences. Nothing can improve on that, but adding normalization makes it more of a puzzle to figure out what happens.
> I've put an `assert weight == 0` at the end of the loop (not in the branch)

Why not in the code? Checking for 0 is cheap.
Note a subtlety with "rolling weights": in the innermost loop, the adjustment has to be done very near the top of the loop. Else there's a conditional (very rarely executed) `continue` that jumps back to the top of the loop and so skips that iteration's adjustment. Something the aforementioned `assert` would have caught.
from cpython.
FYI, this is the cheapest way I've thought of to get a non-rolling weight:

```python
# Function-level invariants.
midi = (alo + ahi - 1) / 2
midj = (blo + bhi - 1) / 2

# Outer loop.
b_weight = -abs(j - midj)
...

# Inner loop.
weight = b_weight - abs(i - midi)
```

It's a float (but with fractional part 0 or 0.5). If there are an odd number of possible indices, it's exactly 0 at the middle one, and gets smaller (more negative) at the same rate for indices moving away from the midpoint. If there are an even number of indices, the two center ones get the largest weight, -0.5.
I don't think the signs, or the possible fractions of 0.5, matter. As in your original post, "convex" is what matters: max out at the central indices, and get smaller the farther from the center.
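A quick check of what those weights look like (my own toy bounds, using the names above):

```python
# Odd number of indices (0 .. 4): exactly 0 at the middle one.
alo, ahi = 0, 5
midi = (alo + ahi - 1) / 2                     # 2.0
weights = [-abs(i - midi) for i in range(alo, ahi)]
assert weights == [-2.0, -1.0, 0.0, -1.0, -2.0]

# Even number of indices (0 .. 3): the two center ones tie at -0.5.
alo, ahi = 0, 4
midi = (alo + ahi - 1) / 2                     # 1.5
weights = [-abs(i - midi) for i in range(alo, ahi)]
assert weights == [-1.5, -0.5, -0.5, -1.5]
```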
from cpython.
I think reasoning rules out that normalization can do any good. Consider what happens without it, assuming the weights are "convex".
The only interesting cases are where the weights make a difference, at identical raw `.ratio()`s. Restrict attention to only those `(i, j)` pairs that reach the maximum raw `.ratio()`. If `(I, J)` is one of those pairs where `I` and `J` are both closest to their respective index midpoints, then by convexity the weights of `I` and of `J` alone are each as large as possible. No other `(i, j)` pair can exceed its weight.
Convexity is scale-invariant, so normalization doesn't hurt this: we can multiply the `i` scores by any non-zero constant, and the `j` scores by any other, and it doesn't make any difference to that outcome.
So don't bother with multiplying at all.
from cpython.
You're right again. I over-generalized from your specific test case, assuming the full Cartesian product of the index ranges would be in play in bad cases.
Well, maybe they are in the current worst cases. The sparser the subset of the full product in play, the harder it looks to contrive bad cases. I'll settle for a massive improvement in the worst cases.

> I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth.

Absolutely so. The recursive calls are doing purely redundant work, finding the same max ratio at (a subset of) the same line pairs over & over & over again.
I knew that at the time, but an obvious quick way to exploit it didn't come to mind. As is, short-circuiting so that only a brand-new max could make it through the gauntlet of "if" tests paid off a lot. Relaxing that to also find pairs that equal the current max was unattractive - and "bad cases" never showed up in real life. But, at the time, the library was overwhelmingly used only to compare files of Python and C code, and human-written prose.
If you'd like to pursue it, have at it!

> Yet, I believe this PR is good enough.

No question that it's a huge improvement as is.
from cpython.
Something like this (wholly untested) might do the trick, preserving short-circuiting as much as possible without missing an "equals the best" case. For each `j`, it remembers the smallest `i` with an equal ratio, provided that `i` is greater than the last such `i` remembered (and not since thrown away by finding a strictly larger ratio).

```python
from operator import gt, ge

max_pairs = []  # index pairs achieving best_ratio
maxi = -1       # `i` index of last pair in max_pairs
for j in range_b:
    searching = True
    for i in range_a:
        cmp = ge if searching and i > maxi else gt
        if cmp(real_quick_ratio, best_ratio) and ...:
            searching = False
            if ratio > best_ratio:
                max_pairs.clear()
                best_ratio = ratio
            max_pairs.append((i, j))
            maxi = i
# The indices in max_pairs all have the same, maximal ratio,
# and are strictly ascending in both positions.
```

It's still a "greedy" approach. In your original test case, I expect `max_pairs` to end up as `[(alo, blo), (alo + 1, blo + 1), (alo + 2, blo + 2), ...]`.
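To convince myself the control flow does what the comment claims, here is a fully concrete toy rendering; the `ratio` argument is a stand-in scorer, and the real code's quick-ratio gauntlet and slice bounds are deliberately elided:

```python
from operator import gt, ge

def find_max_pairs(a, b, ratio):
    """Toy version of the sketch above: collect all equal-best pairs,
    strictly ascending in both indices."""
    best_ratio = 0.0
    max_pairs = []
    maxi = -1
    for j in range(len(b)):
        searching = True
        for i in range(len(a)):
            cmp = ge if searching and i > maxi else gt
            r = ratio(a[i], b[j])
            if cmp(r, best_ratio):
                searching = False
                if r > best_ratio:
                    max_pairs.clear()
                    best_ratio = r
                max_pairs.append((i, j))
                maxi = i
    return max_pairs

# All pairs equally similar: the greedy chain marches down the diagonal.
assert find_max_pairs("abc", "xyz", lambda x, y: 0.8) == [(0, 0), (1, 1), (2, 2)]

# A strictly better pair throws away everything collected before it.
spike = lambda x, y: 0.9 if (x, y) == ("b", "y") else 0.8
assert find_max_pairs("abc", "xyz", spike) == [(1, 1)]
```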
from cpython.
> we could avoid recursive calls completely by computing all pairs above the 0.75 threshold and finding the longest chain above
I'm not sanguine about that. The "synch pair" establishes a very high wall, with no cross-talk between the "before" and "after" segments of the inputs. This is mostly driven by the linear nature of diff output, restricted to "matches", "insert", "delete", and "replace", from first to last lines of both inputs.
For example, suppose we have two thousand-line inputs. The first line of `a` is a very close match to the last line of `b`, but all other pairs are only (say) 90% similar. Ideally it could be useful to show how those 90% similar lines are related to each other, but the output has no way to do that. Instead the best(*) that can be done is:
- The first 999 lines of `b` were inserted.
- Then there's one very similar line.
- Then the last 999 lines of `a` were deleted.
Any ratio less than the unique very best is irrelevant to that outcome.
If the synch pair occurred more toward the middle of the inputs, the recursive calls could do a lot of redundant work, but again ratios between the "before" elements of one input and the "after" elements of the other are useless after the synch pair is picked.
There's also that difflib intends never to create a data structure whose size is `O(len(a) * len(b))`. In the pseudo-code posted before this, `len(max_pairs) <= min(ahi - alo, bhi - blo)`, which is minor.
(*) Of course that's arguable. Could be that it's the "very close match" that should be discounted, in favor of producing markup for the 999 "pretty close" matches. But that would be a Big Change, and this isn't Hard Science™.
from cpython.
I'm not determined to keep all results the same, but your example bothers me enough that I'm going to re-open this issue.
The fundamental goal of this library was to create diffs that, to the extent possible, "make sense" to human eyes. Because, at the time, all Unix-y diff programs too often produced diffs for Python's C code that were more baffling than helpful (e.g., synching up on garbage like blank lines and lone curly braces far apart in the inputs).
Locality seems to matter most to people, so the core approach builds on finding long stretches of consecutive matching lines. Another thing that matters to people is "top to bottom". It just makes no sense to my eyes that, given two blocks of closely matching lines, the program would match the first line of one input's block with the last line of the other's. "Well, that's because they're both closest to the center of their index ranges" sounds like a geek making up technobabble excuses for absurd output.
I believe the pseudocode I gave would restore "first to first, second to second, ..." results, so I'm keen now to pursue it.
About other diff libraries, over the years I haven't found that studying their details was helpful. At a high level, if I had it to do over again I'd start with git's "histogram" diff. Another thing that catches eyes is rare lines, and "histogram" addresses that directly. I think difflib's "junk" heuristics were a mostly-failed attempt toward the same end, by ignoring "too common" elements instead of identifying the rare ones. People generally don't know to use it, and even I rarely made the effort required.
But "histogram" isn't a full diff algorithm either, just an early pass for identifying high-value "synch points". Producing diffs between synch points has to be done by some other algorithm.
And for some tasks, like comparing genome sequences, "histogram" is useless (it won't find any rare elements). Then again, difflib is mostly useless for that too (and completely useless unless "autojunk" is disabled - in which case its generally quadratic runtime is too slow for that task).
from cpython.