
Comments (25)

tim-one commented on June 12, 2024

Changed label from "type-bug" to "performance" and "type-feature". difflib doesn't make any promises about speed, and finding "bad cases" is expected - that's not "bug" territory.

In your proposal, please add some high-level comments about the meaning of "_gravity" and what range of values might be useful. The name on its own is just a mystery.

tim-one commented on June 12, 2024

Micro-optimizations like that are increasingly out of style in core Python code. They generally don't help (or even hurt) when running under PyPy, and CPython is making real progress at speeding the common cases too.

Relatedly, nobody ever abuses the default-argument mechanism to add local_name=some_global_name_or_const "arguments" anymore, unless it's absolutely time-critical in a very heavily used library.

Else clarity is valued more.
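
For reference, the idiom being referred to looks something like this (an illustrative toy, not code from difflib):

def total_length(strings, _len=len):    # `_len=len` binds the builtin as a fast local
    return sum(_len(s) for s in strings)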

pulkin commented on June 12, 2024

Unfortunately, smallest_ratio is not about alo or ahi: it is about the smallest difference across unrelated comparisons (thus, I was also wrong with my 500k-symbol estimate). It has to be much smaller; this one is probably correct (an upper bound):

len_a = max(map(len, a))
len_b = max(map(len, b))
smallest_ratio_delta = 2 * 0.99 / (len_a + len_b) ** 2

I plan to play around and open a PR. My first one here!

tim-one commented on June 12, 2024

Unless I'm hallucinating again, there's a simpler way, which doesn't add new floating-point stress, or even require computing "epsilon". Store the ratio instead as a 2-tuple, (ratio_just_as_now, distsum). Here distsum is an int giving the sum of the distance of i from its closest endpoint, and likewise for j.

Tuple comparison does the rest. If the first components differ (ratios we use today differ), that settles it. Else (ratios the same), the larger is whichever has the larger distsum.

Quick coding of that worked fine. Strength reduction in the inner loop becomes just

                distsum += 1 if i <= ihalf else -1

It's fine to use a more long-winded way in the outer loop, though. In real life, the most important line of the inner loop is the real_quick_ratio() "early out", and I'm loathe to add cycles that slow down getting to that very fast test (under PyPy, the method call may even be inlined).
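
Here is a tiny, self-contained sketch of that 2-tuple idea (not the difflib code; the candidate list and index ranges below are made up). Each pair is scored as (ratio, distsum), and tuple comparison breaks ties:

alo, ahi = 0, 5
blo, bhi = 0, 5
candidates = [(0, 0, 0.8), (2, 2, 0.8), (4, 1, 0.6)]   # hypothetical (i, j, ratio) triples

best_score = (0.0, -1)    # (ratio, distsum); any real candidate beats this
best_pair = None
for i, j, ratio in candidates:
    # Distance of each index from its nearest endpoint; larger means closer to the middle.
    distsum = min(i - alo, ahi - 1 - i) + min(j - blo, bhi - 1 - j)
    score = (ratio, distsum)
    if score > best_score:    # ratios compared first, distsum only on ties
        best_score, best_pair = score, (i, j)

assert best_pair == (2, 2)    # wins the ratio tie by being nearer the midpoints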

pulkin commented on June 12, 2024

I updated the branch with rolling weights. I think I like it the most; the only subtle point is that the weight is now an integer: instead of computing delta_a / len_a + delta_b / len_b I compute only the numerator, delta_a * len_b + delta_b * len_a. I thought about further simplifying it to delta_a + delta_b: is normalization that important?
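
For what it's worth, a tiny self-contained sketch (not from the branch; the numbers are made up) of why the integer numerator orders candidates exactly like the normalized float weight: it is the float weight scaled by the positive constant len_a * len_b.

len_a, len_b = 7, 13
candidates = [(2, 5), (3, 1), (0, 6)]   # hypothetical (delta_a, delta_b) pairs

float_weights = [da / len_a + db / len_b for da, db in candidates]
int_weights = [da * len_b + db * len_a for da, db in candidates]

# Both weightings rank the candidates identically.
assert sorted(range(len(candidates)), key=float_weights.__getitem__) == \
       sorted(range(len(candidates)), key=int_weights.__getitem__)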

I've put an assert weight == 0 at the end of the loop (not in the branch) to check if everything is correct.

pulkin commented on June 12, 2024

Suppose we have two sequences of wildly different lengths (so normalization could make most difference) but all element pairs have the same .ratio().

I agree: there is no normalization needed in this case. But suppose we have just two competitors with indices short_1, long_1 and short_2, long_2. We are generally more concerned about equally splitting the shorter text, so accurate weighting may be important. Without normalization, the long_* contributions are, in general, bigger and thus more impactful; we may want the opposite.

If (I, J) is one of those pairs where I and J are both closest to their respective index midpoints, by convexity the weights of I and of J alone are each as large as possible.

This one also caught my attention! But this is not always the case, as I hopefully explained above. If we are choosing the best_i from a set of candidates {i} and the best_j from a set of candidates {j} separately (as happens in the code snippet above) then we do not need any weights. But if we are choosing the best pair from a set {(i, j)} which is not necessarily a product of some sets {i} and {j}, then the result does depend on the assigned weights. And the issue is not in the form of the weight function; it is rather that the search space may not be trivial.

And this may not be the end of the story. After some thinking I came up with this statement:

Deep enough into the recursion tree we are guaranteed to have comparable (by size) sub-sequences (given a suitable choice of weight function, -abs(i - mid_a) - abs(j - mid_b) for example). Thus, from the point of view of fixing the complexity scaling, weights are indeed not important.

The O(N^3) complexity scaling and the recursion blow up may happen only if

  • the search space {i, j} includes a specific structure (i1, j1), (i2, j2), ... (in, jn) where both indices are strictly increasing, i1 < i2 < i3 < ... < in and j1 < j2 < j3 < ... < jn, and
  • the weight function promotes the pairs sequentially: we split at (i1, j1), then at (i2, j2), etc

And it is particularly difficult to find an example where the weight function -abs(i - mid_a) - abs(j - mid_b) would do such a cruel thing! The best I came up with is a case where the long sequence is exponentially bigger than the short one, so "deep enough" is actually deeper than the number of elements in the short sequence. But even then the splits are essentially optimal from the long sequence's point of view.

pulkin commented on June 12, 2024

On a related note, there might be some room for further improvement in the above. I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth. Technically, we should find the longest strictly increasing sequence of pairs (i1, j1), (i2, j2), ... (in, jn) such that i1 < i2 < i3 < ... < in and j1 < j2 < j3 < ... < jn, and do n+1 recursive calls rather than only two. For the case in the header this is going to be a massive improvement.

Even more, we could avoid recursive calls completely by computing all pairs above the 0.75 threshold and finding the longest chain above (some research needed with regard to fast ratio lower bounds optimizations).
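
To make "finding the longest chain" concrete, here is a minimal, self-contained sketch (not part of the PR; the longest_chain helper and its example input are made up) that extracts the longest subsequence of (i, j) pairs strictly increasing in both coordinates, via patience sorting in O(n log n):

from bisect import bisect_left

def longest_chain(pairs):
    # Sort by i, breaking ties by decreasing j, so two pairs sharing an i can
    # never both land in a chain that is strictly increasing in j.
    pairs = sorted(pairs, key=lambda p: (p[0], -p[1]))
    tails = []                  # tails[k]: index into pairs ending the best chain of length k+1
    tail_js = []                # j values parallel to tails, kept sorted for bisection
    prev = [-1] * len(pairs)    # predecessor links for reconstructing the chain
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_js, j)
        if k == len(tails):
            tails.append(idx)
            tail_js.append(j)
        else:
            tails[k] = idx
            tail_js[k] = j
        prev[idx] = tails[k - 1] if k else -1
    chain, idx = [], tails[-1] if tails else -1
    while idx != -1:
        chain.append(pairs[idx])
        idx = prev[idx]
    return chain[::-1]

# e.g. longest_chain([(0, 0), (1, 2), (1, 1), (2, 3)]) returns [(0, 0), (1, 1), (2, 3)]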

Yet, I believe this PR is good enough.

pulkin commented on June 12, 2024

All true. I think it wouldn't be irrelevant to study other diff libs.

Just to mention, the merged code does change some diff outputs. For example, this one will find only one match, instead of the two it found before.

cat
dog
close match 1
close match 2

vs

close match 3
close match 4
kitten
puppy

If we want to restore the old behavior (and keep the recursion under control) we can take the approach above for ratio == max(ratio) and use the original (greedy) algorithm. Then, we may consider loosening it towards other ratio >= .75 and using a more educated guess to form the chain (i1, j1) < (i2, j2) < ... < (in, jn). If we properly implement the foundation, then neither change is so massive any longer, whether they go in or not.

I, too, have concerns about considering all ratio >= .75 at once. Not that the diff changes, but computing it might take a lot more time because the "slow" .ratio() will be called more often in this case.

tim-one commented on June 12, 2024

BTW, I agree this class of inputs is worth addressing 😄. Please open a Pull Request?

Offhand, it seems to me that this would be a faster and much more obvious way to get a value that's 0 at i=lo and i=hi-1, and a maximum of 0.5 at the midpoint:

min(i - lo, hi - 1 - i) / (hi - lo)
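
For concreteness, a quick illustration of that expression over a 7-wide range (the midrange_bonus helper is a name made up for this sketch):

def midrange_bonus(i, lo, hi):
    # 0 at both ends of range(lo, hi), largest near the midpoint.
    return min(i - lo, hi - 1 - i) / (hi - lo)

print([round(midrange_bonus(i, 0, 7), 3) for i in range(7)])
# [0.0, 0.143, 0.286, 0.429, 0.286, 0.143, 0.0]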

pulkin commented on June 12, 2024

Sure that will probably do as well; was just avoiding function calls.

tim-one commented on June 12, 2024

Let's stay away from hard-coded constants, too - especially ones with "mystery names" 😉. Constants frequently bite us later, and are rarely needed.

In this case, .ratio() is defined to return 2.0*M / T, where M is the number of matches and T is the total number of characters. So code like this, run once at the start, is a cheap way to determine what's actually required:

lena = ahi - alo
lenb = bhi - blo
smallest_ratio_delta = 2 * 0.99 / (lena + lenb)

Defining lena and lenb locals may also simplify later code. Multiplying by 0.99 is to make this a lower bound, to be on the safe side. Again resist micro-optimization: while years ago it helped a little bit to avoid the multiply and type in "1.98" by hand, those days are long over. Even CPython does that arithmetic at compile-time now.

Then if you have two "midrange bonus" functions that max out at 0.5 for i and j being at the midpoints of their ranges,

midrange_total_bonus = (midrange_a_bonus + midrange_b_bonus) * smallest_ratio_delta 

will yield a score bonus less than the difference between any two distinct raw scores (the sum of the two bonuses can't be greater than 1.0).

tim-one commented on June 12, 2024

"Strength reduction" looks like a worthwhile optimization. Short course: all function calls, and even multiplies, can be removed from the inner loop, by keeping the bonus as a running total, and adjusting it near the top of the loop like so:

                bonus += a_inc if i <= ihalf else -a_inc

where the other names are function-level invariants:

        ihalf = (alo + ahi - 1) / 2
        a_inc = smallest_ratio_delta / lena

So the bonus keeps going up around the loop, until the midpoint is reached, after which it keeps going down.

Similarly for the outer loop. Some details may be tricky. Better names could be picked 😉.
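
As a quick sanity check (not from any patch; the numbers are made up), the rolling update reproduces min(i - alo, ahi - 1 - i) scaled by a_inc for an odd-length range; even-length ranges are one place where the "details may be tricky" caveat shows up:

alo, ahi = 0, 7
ihalf = (alo + ahi - 1) / 2
a_inc = 1                 # stand-in for smallest_ratio_delta / lena
bonus = -a_inc            # primed so the first iteration lands on 0
for i in range(alo, ahi):
    bonus += a_inc if i <= ihalf else -a_inc
    assert bonus == min(i - alo, ahi - 1 - i) * a_inc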

If you don't want to bother, let me know here, and I'll open a pull request.

tim-one commented on June 12, 2024

Wow - I was hallucinating badly, wasn't I ☹️?

It's all yours - I look forward to the PR 😄

Unless I'm still hallucinating, the square in the new expression isn't needed, is it? If the longest elements ever compared have lengths len_a and len_b, the smallest non-zero .ratio() possible is 2 over their sum.

pulkin commented on June 12, 2024

No problem: I often skip over large chunks and forget to come back and to explain.

Given the ratio 2 * some_integer / (len(ai) + len(bj)),

the difference between two arbitrary ratios is simply

  2 * some_integer / (len(ai1) + len(bj1))
- 2 * other_integer / (len(ai2) + len(bj2)) =

2 * [some_integer * (...) - other_integer * (...)]
/ [(len(ai1) + len(bj1)) * (len(ai2) + len(bj2))] =

2 * still_some_integer
/ [(len(ai1) + len(bj1)) * (len(ai2) + len(bj2))]

To minimize the absolute value of the above (while requiring it to be non-zero), we discard all details of the numerator except that it is a non-zero integer: thus, we take still_some_integer = 1. For the denominator we should technically take the largest and next-largest len(ai) + len(bj), but we just take the largest possible sum len(ai) + len(bj) twice, to simplify the computation and slightly worsen our lower bound.

Example (my rough understanding rather than the actual computation):

a1, b1 = "abc", "a"
a2, b2 = "abcd", "a"
ratio_1 = 2 * 1/4
ratio_2 = 2 * 1/5
ratio_1 - ratio_2 = 2 * 1/20  # our estimation is 2 * 1/25 instead
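
As a quick sanity check of those numbers against difflib itself:

from difflib import SequenceMatcher

assert SequenceMatcher(None, "abc", "a").ratio() == 2 * 1/4    # 0.5
assert SequenceMatcher(None, "abcd", "a").ratio() == 2 * 1/5   # 0.4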

tim-one commented on June 12, 2024

Thanks! Got it now. The smallest delta between .ratio()s of two sequences of the same lengths may indeed be much larger than the delta between .ratio()s of sequences with different (but no larger) lengths.

Your more careful analysis looks right. Using the square to simplify reasoning is fine. It's not like we're pushing the limits of float precision here 😉.

pulkin commented on June 12, 2024

I am slightly concerned by how small weights can become (but only slightly). We now try to make differences as small as 1 / line_size ** 2 / number_of_lines. For a file with 10k lines and 1k symbols per line this is 1e-11, which is OK-ish, yet not so distant from the double-precision 1e-17 on a logarithmic scale.

tim-one commented on June 12, 2024

If we end up computing differences so small that they get totally lost to rounding when added to the ratio, I don't think it much matters - just a waste of time.

The interesting cases are near the midpoints of the index ranges, where your factor number_of_lines is close to 1 instead.

Alternatives include using fractions.Fraction to retain exact ratios and weights internally, or (faster) to use decimal configured to use "sufficient" precision.
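
A minimal sketch of the fractions.Fraction route (not proposed code; exact_ratio is a made-up helper mirroring the 2*M / T definition of .ratio()):

from fractions import Fraction

def exact_ratio(matches, total_chars):
    # Same 2*M / T definition as difflib's ratio(), but kept exact.
    return Fraction(2 * matches, total_chars)

# Deltas this small never collapse to zero, no matter the input sizes.
assert exact_ratio(1, 4) - exact_ratio(1, 5) == Fraction(1, 10)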

tim-one commented on June 12, 2024

I think I'll back off on "strength reduction". The full-blown min(i - alo, ahi-1 - i) is certainly easier to grasp, and not all that much slower. Although I'd personally precompute ahi-1 😉 - saving trips around the eval loop can still matter in innermost loops.

tim-one commented on June 12, 2024

I thought about further simplifying it to delta_a + delta_b: is normalization that important?

I don't even know of a case where "normalization" is helpful 😉. Do you? Suppose we have two sequences of wildly different lengths (so normalization could make most difference) but all element pairs have the same .ratio(). Without normalization, the best-scoring pair will come from the midpoints of both sequences. Nothing can improve on that, but adding normalization makes it more of a puzzle to figure out what happens.

I've put an assert weight == 0 at the end of the loop (not in the branch)

Why not in the code? Checking for 0 is cheap.

Note a subtlety with "rolling weights": in the innermost loop, the adjustment has to be done very near the top of the loop. Else there's a conditional (very rarely executed) continue that jumps back to the top of the loop and so skips that iteration's adjustment. Something the aforementioned assert would have caught 😉.

tim-one commented on June 12, 2024

FYI, this is the cheapest way I've thought of to get a non-rolling weight:

# Function-level invariants.
midi = (alo + ahi-1) / 2
midj = (blo + bhi-1) / 2

    # Outer loop.
    b_weight = -abs(j - midj)
    ...
        # Inner loop.
        weight = b_weight - abs(i - midi)

It's a float (but with fractional part 0 or 0.5). If there are an odd number of possible indices, it's exactly 0 at the middle one, and gets smaller (more negative) at the same rate for indices moving away from the midpoint.

If there are an even number of indices, the two center ones get the largest weight, -0.5.
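
For illustration (made-up ranges, not from a patch), the per-index weights for a 5-wide and a 4-wide range:

alo, ahi = 0, 5
midi = (alo + ahi - 1) / 2
print([-abs(i - midi) for i in range(alo, ahi)])   # [-2.0, -1.0, 0.0, -1.0, -2.0]

blo, bhi = 0, 4
midj = (blo + bhi - 1) / 2
print([-abs(j - midj) for j in range(blo, bhi)])   # [-1.5, -0.5, -0.5, -1.5]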

I don't think signs, or possible fractions of 0.5, matter. As in your original post, "convex" is what matters: max out at the central indices, and get smaller the farther from the center.

tim-one commented on June 12, 2024

I think reasoning rules out that normalization can do any good. Consider what happens without it, and assuming weights are "convex".

The only interesting cases are where the weights make a difference, at identical raw .ratio()s. Restrict attention to only those (i, j) pairs that reach the maximum raw .ratio(). If (I, J) is one of those pairs where I and J are both closest to their respective index midpoints, by convexity the weights of I and of J alone are each as large as possible. No other (i, j) pair can exceed its weight.

Convexity is scale-invariant, so normalization doesn't hurt this: we can multiply the i scores by any non-zero constant, and the j scores by any other, and it doesn't make any difference to that outcome.

So don't bother with multiplying at all 😄.

tim-one commented on June 12, 2024

You're right again 😄. I over-generalized from your specific test case, assuming the full Cartesian product of the index ranges would be in play in bad cases.

Well, maybe they are in worst current cases. The sparser the subset of the full product in play, the harder it looks to contrive bad cases. I'll settle for a massive improvement in the worst cases.

I think in the case discussed here it is much faster to split into more than two chunks at the same recursion depth.

Absolutely so. The recursive calls are doing purely redundant work, finding the same max ratio at (a subset of) the same line pairs over & over & over again.

I knew that at the time, but an obvious quick way to exploit it didn't come to mind. As is, short-circuiting so that only a brand new max could make it through the gauntlet of "if" tests paid off a lot. Relaxing that to also find pairs that equal the current max was unattractive - and "bad cases" never showed up in real life. But, at the time, the library was overwhelmingly used only to compare files of Python and C code, and human-written prose.

If you'd like to pursue it, have at it!

Yet, I believe this PR is good enough.

No question that it's a huge improvement as is.

tim-one commented on June 12, 2024

Something like this (wholly untested) might do the trick, preserving short-circuiting as much as possible without missing an "equals the best" case. For each j, it remembers the smallest i with an equal ratio, provided that i is greater than the last such i remembered (and not since thrown away by finding a strictly larger ratio).

from operator import gt, ge
max_pairs = [] # index pairs achieving best_ratio
maxi = -1 # `i` index of last pair in max_pairs

for j in range_b:
    searching = True
    for i in range_a:
        cmp = ge if searching and i > maxi else gt
        if cmp(real_quick_ratio, best_ratio) and ...:
            searching = False
            if ratio > best_ratio:
                max_pairs.clear()
                best_ratio = ratio
            max_pairs.append((i, j))
            maxi = i
    # The indices in max_pairs all have the same, maximal ratio,
    # and are strictly ascending in both positions.

It's still a "greedy" approach. In your original test case, I expect max_pairs to end up as [(alo, blo), (alo + 1, blo + 1), (alo + 2, blo + 2), ...].

tim-one commented on June 12, 2024

we could avoid recursive calls completely by computing all pairs above the 0.75 threshold and finding the longest chain above

I'm not sanguine about that. The "synch pair" establishes a very high wall, with no cross-talk between the "before" and "after" segments of the inputs. This is mostly driven by the linear nature of diff output, restricted to "matches", "insert", "delete", and "replace", from first to last lines of both inputs.

For example, suppose we have two thousand-line inputs. The first line of a is a very close match to the last line of b, but all other pairs are only (say) 90% similar. Ideally it could be useful to show how those 90% similar lines are related to each other, but the output has no way to do that. Instead the best(*) that can be done is:

  • The first 999 lines of b were inserted.
  • Then there's one very similar line.
  • Then the last 999 lines of a were deleted.

Any ratio less than the unique very best is irrelevant to that outcome.

If the synch pair occurred more toward the middle of the inputs, the recursive calls could do a lot of redundant work, but again ratios between the "before" elements of one input and the "after" elements of the other are useless after the synch pair is picked.

There's also that difflib intends never to create a data structure whose size is O(len(a) * len(b)). In the pseudo-code posted before this, len(max_pairs) <= min(ahi - alo, bhi - blo), which is minor.

(*) Of course that's arguable. Could be that it's the "very close match" that should be discounted, in favor of producing markup for the 999 "pretty close" matches. But that would be a Big Change, and this isn't Hard Science™.

tim-one commented on June 12, 2024

I'm not determined to keep all results the same, but your example bothers me enough that I'm going to re-open this issue.

The fundamental goal of this library was to create diffs that, to the extent possible, "make sense" to human eyes. Because, at the time, all Unix-y diff programs too often produced diffs for Python's C code that were more baffling than helpful (e.g., synching up on garbage like blank lines and lone curly braces far apart in the inputs).

Locality seems to matter most to people, so the core approach builds on finding long stretches of consecutive matching lines. Another thing that matters to people is "top to bottom". It just makes "no sense" to my eyes that, given two blocks of closely matching lines, the program would match the first line of one input's block with the last line of the other's. "Well, that's because they're both closest to the center of their index ranges" sounds like a geek making up technobabble excuses for absurd output 😉.

I believe the pseudocode I gave would restore "first to first, second to second, ..." results, so I'm keen now to pursue it.

About other diff libraries, over the years I haven't found that studying their details was helpful. At a high level, if I had it to do over again I'd start with git's "histogram" diff. Another thing that catches eyes is rare lines, and "histogram" addresses that directly. I think difflib's "junk" heuristics were a mostly-failed attempt toward the same end, by ignoring "too common" elements instead of identifying the rare ones. People generally don't know to use it, and even I rarely made the effort required.

But "histogram" isn't a full diff algorithm either, just an early pass for identifying high-value "synch points". Producing diffs between synch points has to be done by some other algorithm.

And for some tasks, like comparing genome sequences, "histogram" is useless (it won't find any rare elements). Then again, difflib is mostly useless for that too (and completely useless unless "autojunk" is disabled - in which case its generally quadratic runtime is too slow for that task).
