
Comments (7)

Nytelife26 commented on June 11, 2024

Hi, @patcon. I have reconsidered, indeed. Many other linters provide configurations for special environments, and it can't hurt to broaden proselint's utility. Maybe this could be implemented as a plugin, when such a system is available? Alternatively, if you have suggestions for how this could work in a similarly performant manner to what we have now, I would be open to including this in proselint's core.


ferdnyc commented on June 11, 2024

So, with respect to the fine initial work done by @vqle23 in #1351, I have a different implementation of this that I think solves the major issues that came up in that PR's review process.

I started from these basic premises:

  1. We have a relatively small universe of words in the "pass" list
  2. We'll eventually need to find the position of every non-matching word
  3. No matter how long the input text is, we're going to end up searching through it all the way to the end.
  4. (Tangentially and as a caveat/addendum: At least in the first pass, words shorter than 3 characters aren't worth bothering with.)1

So, even on a super long text, because we eventually need to regex-search the whole thing anyway, it makes more sense to search it for every word at once, rather than building searches for any specific word(s), using re.compile(r"[\w'-]{3,}").finditer(text).2
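For illustration, here's a quick REPL sketch of what that tokenizer yields (the sample sentence is my own):

>>> import re
>>> tokenizer = re.compile(r"[\w'-]{3,}")
>>> [(m.start(), m.end(), m.group(0))
...  for m in tokenizer.finditer("The guide says don't panic")]
[(0, 3, 'The'), (4, 9, 'guide'), (10, 14, 'says'), (15, 20, "don't"), (21, 26, 'panic')]

Each Match already carries its start/end offsets, which is exactly the position data we need for the error tuples.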

Then, we can walk the iterator of matched results (non-overlapping word spans in the input text), and examine each one to (a) exclude any strings containing numerals, then (b) determine whether it appears on the supplied list of permitted words.

Anytime matched in list isn't True, that word can immediately be added to the errors list without any additional processing, because we already have the necessary position data in the .finditer() Match objects we're iterating over.

So, that implementation looks something like this:

import re


def reverse_existence_check(
        text, list, err, msg, ignore_case=True, str=False, offset=0
        ):
    """Find all words in ``text`` that aren't on the ``list``."""
    if ignore_case:
        list = [word.lower() for word in list]

        def allowed_word(match):
            matched = match.group(0).lower()
            return matched in list
    else:
        def allowed_word(match):
            matched = match.group(0)
            return matched in list

    # Match all 3+ character words
    tokenizer = re.compile(r"[\w'-]{3,}")
    # Remove any that contain numerals
    exclusions = re.compile(r'[0-9]')

    errors = [(
        m.start() + 1 + offset,
        m.end() + offset,
        err,
        msg.format(m.group(0)),
        None)
        for m in tokenizer.finditer(text)
        if not exclusions.search(m.group(0))
        and not allowed_word(m)
    ]
    return errors

(A potential further optimization: The function could actually be written as a generator, by using yield instead of building the full list. That would mean consuming its output as an iterator would allow it to operate incrementally over the length of the input text.)
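A minimal sketch of that variant, under the same assumptions as the function above (the name reverse_existence_check_iter and the simplified signature are mine):

import re


def reverse_existence_check_iter(
        text, list, err, msg, ignore_case=True, offset=0):
    """Yield errors for words in ``text`` that aren't on ``list``."""
    if ignore_case:
        list = [word.lower() for word in list]
    tokenizer = re.compile(r"[\w'-]{3,}")  # 3+ character words
    exclusions = re.compile(r'[0-9]')      # skip tokens containing numerals
    for m in tokenizer.finditer(text):
        matched = m.group(0)
        if exclusions.search(matched):
            continue
        candidate = matched.lower() if ignore_case else matched
        if candidate not in list:
            # Emit each error as soon as it's found, so a consumer
            # iterating over this generator processes the text incrementally.
            yield (m.start() + 1 + offset, m.end() + offset,
                   err, msg.format(matched), None)

Callers that want the current behavior can still wrap the call in list(...).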

Iterative processing may not even be necessary, though. I happen to have a text version of "The Hitchhiker's Guide to the Galaxy" (a transcript of the radio plays, I think, not the book version); it's 5019 lines and just shy of 47,000 words.

I imported it as a Python string, and ran the whole thing through the reverse-existence implementations. They produce wildly different results, so it's not an entirely balanced comparison, but running some time trials on how they processed the text at least let me confirm that the basic concept I was working with is sound.

>>> hhgttg = <the text>
>>> # (Note: I shuffled some things around to make the
>>> #  word lists from #1351 accessible to other code)
>>> t1k = proselint.checks.reverse_existence.top1000.WORDS
>>> oldfn = 'proselint.tools.reverse_existence_check'
>>> newfn = 'proselint.tools.my_revex'
>>>
>>> timeit.repeat(
...     f"{oldfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[59.088628315133974, 58.57274700002745]
>>>
>>> timeit.repeat(
...     f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[0.6903484750073403, 0.6852877610363066]
>>>
>>> orig = proselint.tools.reverse_existence_check(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> mine = proselint.tools.my_revex(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> len(orig)
10990
>>> len(mine)
13755

So, scanning ~47,000 words in a slightly noisy prose string, it flags 30% of them in a little over 2/3 of a second, up from flagging 23% in just shy of 60 seconds.

The increase in the number of results is a little surprising, especially when you look at what's actually flagged on either end:

>>> from pprint import pp
>>> pp([x[3] for x in orig[:50]])
['Bad word Douglas',
 'Bad word Hitch',
 'Bad word Guide',
 'Bad word Galaxy',
 'Bad word Fantazy.',
 'Bad word 1990.',
 'Bad word Based',
 'Bad word famous',
 'Bad word series',
 'Bad word Douglas',
 'Bad word N.',
 'Bad word Adams',
 'Bad word born',
 'Bad word Cambridge',
 'Bad word 1952.',
 'Bad word Brentwood',
 'Bad word Essex',
 'Bad word St.',
 'Bad word Cambridge',
 'Bad word English.',
 'Bad word graduation',
 'Bad word years',
 'Bad word material',
 'Bad word radio',
 'Bad word television',
 'Bad word shows',
 'Bad word writing,',
 'Bad word performing',
 'Bad word directing',
 'Bad word London,',
 'Bad word Cambridge',
 'Bad word Edinburgh',
 'Bad word worked',
 'Bad word various',
 'Bad word times',
 'Bad word porter,',
 'Bad word chicken',
 'Bad word cleaner,',
 'Bad word bodyguard,',
 'Bad word radio',
 'Bad word producer',
 'Bad word script',
 'Bad word married,',
 'Bad word Surrey.',
 'Bad word Jonny',
 'Bad word Clare',
 'Bad word Arlingtonians',
 'Bad word tea,',
 'Bad word sympathy,',
 'Bad word sofa']
>>>
>>>
>>> pp([x[3] for x in mine[:50]])
['Bad word Douglas',
 'Bad word Adams',
 'Bad word Hitch',
 'Bad word Hikers',
 'Bad word Guide',
 'Bad word Galaxy',
 'Bad word Fantazy',
 'Bad word Based',
 'Bad word famous',
 'Bad word Radio',
 'Bad word series',
 'Bad word Douglas',
 'Bad word Adams',
 'Bad word born',
 'Bad word Cambridge',
 'Bad word educated',
 'Bad word Brentwood',
 'Bad word Essex',
 "Bad word John's",
 'Bad word Cambridge',
 'Bad word English',
 'Bad word graduation',
 'Bad word years',
 'Bad word contributing',
 'Bad word material',
 'Bad word radio',
 'Bad word television',
 'Bad word shows',
 'Bad word writing',
 'Bad word performing',
 'Bad word directing',
 'Bad word stage',
 'Bad word revues',
 'Bad word London',
 'Bad word Cambridge',
 'Bad word Edinburgh',
 'Bad word Fringe',
 'Bad word worked',
 'Bad word various',
 'Bad word times',
 'Bad word porter',
 'Bad word barn',
 'Bad word builder',
 'Bad word chicken',
 'Bad word shed',
 'Bad word cleaner',
 'Bad word bodyguard',
 'Bad word radio',
 'Bad word producer',
 'Bad word script']

The original implementation is flagging tokens like English. (note the period captured at the end), 1990., N., and St.; my version excludes all of these except English, which it flags without the trailing period. But the original also seems to miss some words, like Adams and Radio early on — I haven't investigated why.3 And it excludes John's entirely, due to the apostrophe.

The code still needs additional testing and cleanup, but I think it's an approach worth getting in shape to submit as a PR, if @Nytelife26 is still open to incorporating the feature?

...Oh, and after writing this, I realized I could speed up the permitted-word lookups even further by transforming the input list into a set up front. That reduced the processing time to...

>>> timeit.repeat(
...     f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[0.1017304239794612, 0.09134247805923223]

...around a tenth of a second.
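For reference, that change is a one-line tweak near the top of the function shown earlier (a sketch; when ignore_case is False, the incoming list can be wrapped in set() the same way):

    if ignore_case:
        # A set gives O(1) average-case membership tests, so each
        # allowed_word() lookup no longer scans the whole word list.
        list = {word.lower() for word in list}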

Notes

  1. That can be revisited, but I think it's a reasonable initial constraint. All of the 1- and 2-letter pronouns, prepositions, etc. are on the Top-1000 list anyway. But obscure stuff like "Mr." or "OK" or "IV" becomes annoying to deal with, and beyond that, the amount of textual "noise" that happens to be two adjacent letters but isn't word content at all is much higher than if you set the lower limit at 3 letters.

  2. You need the hyphen to treat hyphenated terms as one word, and you need the apostrophe to handle possessives and contractions. (Otherwise — especially if words of less than 3 characters are included — you end up checking endless piles of 's', 'ed', 'nt', etc...)
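    To illustrate (a throwaway example of mine):

    >>> import re
    >>> re.findall(r"[\w'-]{3,}", "the well-known editor's note")
    ['the', 'well-known', "editor's", 'note']
    >>> re.findall(r'\w{3,}', "the well-known editor's note")
    ['the', 'well', 'known', 'editor', 'note']

    Without the apostrophe and hyphen in the character class, possessives and hyphenated terms fragment, and their short leftovers fall below the 3-character threshold.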

  3. For the record, here's the first 500 characters of the input text:

    >>> print('\n'.join(
    ... [x.strip() for x in hhgttg[:500].splitlines() if x]))
    Douglas Adams. The Hitch Hikers Guide to Galaxy
    Fantazy. 1990.
    Based on the famous Radio series
    Douglas  N.  Adams was born in Cambridge in 1952. He was
    educated at Brentwood School, Essex  and  St.  John's  College,
    Cambridge where he read English. After graduation he spent several
    years contributing material to radio and television  shows
    as  well  as  writing, performing and sometimes directing stage
    revues in London, Cambridge and on the Edinburgh Fringe. He


Nytelife26 commented on June 11, 2024

@ferdnyc i must say, that's impressive work. the kind of science i know and love.

your performance improvements are beyond what i would've considered acceptable for an implementation. feel free to open a PR, and i'll review it within a few days.

ultimately i am not sure we can make good use of it until the general program and configuration are refactored, but that's for down the road. thank you for your contribution here.


vqle23 commented on June 11, 2024

Hey, this sounds interesting. I can probably work on this within the next week if you haven't.


Nytelife26 commented on June 11, 2024

I can see the potential utility in this, but it feels out of scope for proselint. Can you think of any scenario where this may be useful in checking prose besides in writing tasks that involve special restrictions? I fear this check would be computationally expensive, and unnecessary for the typical use of the program.


patcon commented on June 11, 2024

I feel this sort of processing is most likely to be helpful to people writing linters for ESL (English-as-a-second-language) readers, or for scientific communication aimed at very young readers or those with minimal or no high school education.

I don't want to take anyone's time for granted (we should only work on what we enjoy!), but I will admit it feels arbitrary to put these categories of use outside the realm of what's worth serving. I totally understand if you do not wish to invest your own time, but please reconsider making a blanket judgement that this is too computationally expensive 🙏🏻🙏🏻🙏🏻🙂


Nytelife26 commented on June 11, 2024

@patcon i have good news for you. a year and a half later, thanks to the efforts of @ferdnyc and @vqle23, we have a solid implementation of this and two new optional checks to go with it.

