Comments (7)
Hi, @patcon. I have reconsidered, indeed. Many other linters provide configurations for special environments, and it can't hurt to broaden proselint's utility. Maybe this could be implemented as a plugin, when such a system is available? Alternatively, if you have suggestions for how this could work in a similarly performant manner to what we have now, I would be open to including this in proselint's core.
from proselint.
So, with respect to the fine initial work done by @vqle23 in #1351, I have a different implementation of this that I think solves the major issues that came up in that PR's review process.
I started from these basic premises:
- We have a relatively small universe of words in the "pass" list
- We'll eventually need to find the position of every non-matching word
- No matter how long the input text is, we're going to end up searching through it all the way to the end.
- (Tangentially and as a caveat/addendum: At least in the first pass, words shorter than 3 characters aren't worth bothering with.)[1]
So, even on a super long text, because we eventually need to regular-expression-search the whole thing anyway, rather than building searches for any specific word(s), it makes more sense to just search it for every word, using `re.compile(r"[\w'-]{3,}").finditer(text)`.[2]
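To make that concrete, here's a quick sketch of what a single `finditer()` pass yields (the sample text here is mine, not from the PR):

```python
import re

# One compiled pattern, one scan over the whole text.
tokenizer = re.compile(r"[\w'-]{3,}")
text = "It's a well-known fact, or so I'm told."

# Each Match carries the token AND its position in the input,
# so no second search is ever needed to locate a flagged word.
tokens = [(m.group(0), m.start(), m.end()) for m in tokenizer.finditer(text)]
print(tokens)
# [("It's", 0, 4), ('well-known', 7, 17), ('fact', 18, 22),
#  ("I'm", 30, 33), ('told', 34, 38)]
```

Note that the character class keeps `It's` and `well-known` intact as single tokens, while the `{3,}` quantifier drops `a`, `or`, and `so` before they ever reach the word-list check.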
Then, we can walk the iterator of matched results (non-overlapping word spans in the input text) and examine each one to (a) exclude any strings containing numerals, then (b) determine whether it appears on the supplied list of permitted words. Any time `matched in list` isn't `True`, that word can immediately be added to the errors list without any additional processing, because we already have the necessary position data in the `Match` objects that `.finditer()` is handing us.
So, that implementation looks something like this:
def reverse_existence_check(
    text, list, err, msg, ignore_case=True, str=False, offset=0
):
    """Find all words in ``text`` that aren't on the ``list``."""
    if ignore_case:
        list = [word.lower() for word in list]

        def allowed_word(match):
            matched = match.string[match.start():match.end()].lower()
            return matched in list
    else:
        def allowed_word(match):
            matched = match.string[match.start():match.end()]
            return matched in list

    # Match all 3+ character words
    tokenizer = re.compile(r"[\w'-]{3,}")
    # Remove any that contain numerals
    exclusions = re.compile(r'[0-9]')

    errors = [(
        m.start() + 1 + offset,
        m.end() + offset,
        err,
        msg.format(m.string[m.start():m.end()]),
        None)
        for m in tokenizer.finditer(text)
        if not exclusions.search(m.string[m.start():m.end()])
        and not allowed_word(m)
    ]
    return errors
(A potential further optimization: the function could actually be written as a generator, by using `yield` instead of building the full list. That would mean consuming its output as an iterator would allow it to operate incrementally over the length of the input text.)
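A minimal, self-contained sketch of what that generator variant could look like (the function name and sample inputs here are mine, not from the PR):

```python
import re

def reverse_existence_iter(text, word_list, err, msg,
                           ignore_case=True, offset=0):
    """Lazily yield errors for words in ``text`` not on ``word_list``."""
    permitted = ({w.lower() for w in word_list} if ignore_case
                 else set(word_list))
    tokenizer = re.compile(r"[\w'-]{3,}")   # 3+ character words
    exclusions = re.compile(r"[0-9]")       # skip tokens containing numerals
    for m in tokenizer.finditer(text):
        word = m.group(0)
        if exclusions.search(word):
            continue
        if (word.lower() if ignore_case else word) not in permitted:
            yield (m.start() + 1 + offset, m.end() + offset,
                   err, msg.format(word), None)

# Consuming it incrementally: stop after the first flagged word,
# without scanning the rest of the text.
first = next(reverse_existence_iter(
    "colorless green ideas sleep furiously",
    ["colorless", "green", "ideas"], "demo", "Flagged: {}"))
print(first)
# (23, 27, 'demo', 'Flagged: sleep', None)
```

Since `finditer()` is itself lazy, the generator version does no work beyond the last error its consumer asks for, which matters for editors that only display the first N results.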
Iterative processing may not even be necessary, though. I happen to have a text version of "The Hitchhiker's Guide to the Galaxy" (a transcript of the radio plays, I think, not the book version); it's 5,019 lines and just shy of 47,000 words.
I imported it as a Python string, and ran the whole thing through the reverse-existence implementations. They produce wildly different results, so it's not an entirely balanced comparison, but running some time trials on how they processed the text at least let me confirm that the basic concept I was working with is sound.
>>> hhgttg = <the text>
>>> # (Note: I shuffled some things around to make the
>>> # word lists from #1351 accessible to other code)
>>> t1k = proselint.checks.reverse_existence.top1000.WORDS
>>> oldfn = 'proselint.tools.reverse_existence_check'
>>> newfn = 'proselint.tools.my_revex'
>>>
>>> timeit.repeat(
... f"{oldfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
... number=1, repeat=2, globals=locals())
[59.088628315133974, 58.57274700002745]
>>>
>>> timeit.repeat(
... f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
... number=1, repeat=2, globals=locals())
[0.6903484750073403, 0.6852877610363066]
>>>
>>> orig = proselint.tools.reverse_existence_check(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> mine = proselint.tools.my_revex(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> len(orig)
10990
>>> len(mine)
13755
So, scanning ~47,000 words in a slightly noisy prose string, it flags 30% of them in a little over 2/3 of a second, up from flagging 23% in just shy of 60 seconds.
The increase in the number of results is a little surprising, especially when you look at what's actually flagged on either end:
>>> from pprint import pp
>>> pp([x[3] for x in orig[:50]])
['Bad word Douglas',
'Bad word Hitch',
'Bad word Guide',
'Bad word Galaxy',
'Bad word Fantazy.',
'Bad word 1990.',
'Bad word Based',
'Bad word famous',
'Bad word series',
'Bad word Douglas',
'Bad word N.',
'Bad word Adams',
'Bad word born',
'Bad word Cambridge',
'Bad word 1952.',
'Bad word Brentwood',
'Bad word Essex',
'Bad word St.',
'Bad word Cambridge',
'Bad word English.',
'Bad word graduation',
'Bad word years',
'Bad word material',
'Bad word radio',
'Bad word television',
'Bad word shows',
'Bad word writing,',
'Bad word performing',
'Bad word directing',
'Bad word London,',
'Bad word Cambridge',
'Bad word Edinburgh',
'Bad word worked',
'Bad word various',
'Bad word times',
'Bad word porter,',
'Bad word chicken',
'Bad word cleaner,',
'Bad word bodyguard,',
'Bad word radio',
'Bad word producer',
'Bad word script',
'Bad word married,',
'Bad word Surrey.',
'Bad word Jonny',
'Bad word Clare',
'Bad word Arlingtonians',
'Bad word tea,',
'Bad word sympathy,',
'Bad word sofa']
>>>
>>>
>>> pp([x[3] for x in mine[:50]])
['Bad word Douglas',
'Bad word Adams',
'Bad word Hitch',
'Bad word Hikers',
'Bad word Guide',
'Bad word Galaxy',
'Bad word Fantazy',
'Bad word Based',
'Bad word famous',
'Bad word Radio',
'Bad word series',
'Bad word Douglas',
'Bad word Adams',
'Bad word born',
'Bad word Cambridge',
'Bad word educated',
'Bad word Brentwood',
'Bad word Essex',
"Bad word John's",
'Bad word Cambridge',
'Bad word English',
'Bad word graduation',
'Bad word years',
'Bad word contributing',
'Bad word material',
'Bad word radio',
'Bad word television',
'Bad word shows',
'Bad word writing',
'Bad word performing',
'Bad word directing',
'Bad word stage',
'Bad word revues',
'Bad word London',
'Bad word Cambridge',
'Bad word Edinburgh',
'Bad word Fringe',
'Bad word worked',
'Bad word various',
'Bad word times',
'Bad word porter',
'Bad word barn',
'Bad word builder',
'Bad word chicken',
'Bad word shed',
'Bad word cleaner',
'Bad word bodyguard',
'Bad word radio',
'Bad word producer',
'Bad word script']
The original implementation is flagging words like "English." (but note the period captured at the end), "1990.", "N.", and "St.", all of which except for "English" I exclude. But it seems to be missing some words, like "Adams" and "Radio" early on; I haven't investigated why.[3] And it also excludes "John's", due to the apostrophe.
The code still needs additional testing and cleanup, but I think it's an approach worth getting in shape to submit as a PR, if @Nytelife26 is still open to incorporating the feature?
...Oh, and after writing this, I realized I could massively speed up the permitted-word lookups by transforming the input `list` into a `set` up front. That reduced the processing time to...
>>> timeit.repeat(
... f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
... number=1, repeat=2, globals=locals())
[0.1017304239794612, 0.09134247805923223]
...around a tenth of a second.
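For anyone wondering why that one change helps so much: `in` against a Python list is a linear scan, while `in` against a set is a constant-time hash lookup, so each per-word membership check drops from O(n) to O(1). A toy demonstration (the list contents are made up, and the timings will vary by machine):

```python
import timeit

# Stand-in for the Top-1000 permitted-word list.
words = [f"word{i}" for i in range(1000)]
as_list = list(words)
as_set = set(words)

# Probe a word near the end of the list to show the list's worst case.
t_list = timeit.timeit("'word999' in seq", globals={"seq": as_list},
                       number=10_000)
t_set = timeit.timeit("'word999' in seq", globals={"seq": as_set},
                      number=10_000)

assert t_set < t_list  # hash lookup beats the linear scan
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

Words that *aren't* permitted (the interesting case for this check) always hit the list's worst case, since the scan has to walk all 1,000 entries before failing.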
Notes
1. That can be revisited, but I think it's a reasonable initial constraint. All of the 1- and 2-letter pronouns, prepositions, and so on are on the Top-1000 list anyway. But obscure stuff like "Mr." or "OK" or "IV" becomes annoying to deal with, and beyond that, the amount of textual "noise" that happens to have 2 letters adjacent to each other but isn't word content at all is much higher than if you set the lower limit at 3 letters.
2. You need the hyphen to treat hyphenated terms as one word, and you need the apostrophe to handle possessives and contractions. (Otherwise, especially if words of less than 3 characters are included, you end up checking endless piles of 's', 'ed', 'nt', etc.)
3. For the record, here are the first 500 characters of the input text:
   >>> print('\n'.join(
   ...     [x.strip() for x in hhgttg[:500].splitlines() if x]))
   Douglas Adams. The Hitch Hikers Guide to Galaxy Fantazy. 1990. Based on the famous Radio series Douglas N. Adams was born in Cambridge in 1952. He was educated at Brentwood School, Essex and St. John's College, Cambridge where he read English. After graduation he spent several years contributing material to radio and television shows as well as writing, performing and sometimes directing stage revues in London, Cambridge and on the Edinburgh Fringe. He
from proselint.
@ferdnyc i must say, that's impressive work. the kind of science i know and love.
your performance improvements are beyond what i would've considered acceptable for an implementation. feel free to open a PR, and i'll review it within a few days.
ultimately i am not sure we can make good use of it until the general program and configuration is refactored, but that's for down the road. thank you for your contribution here.
from proselint.
Hey, this sounds interesting. I can probably work on this within the next week if you haven't.
from proselint.
I can see the potential utility in this, but it feels out of scope for proselint. Can you think of any scenario where this may be useful in checking prose besides in writing tasks that involve special restrictions? I fear this check would be computationally expensive, and unnecessary for the typical use of the program.
from proselint.
I feel this sort of processing is most likely to be helpful to people writing linters for audiences like ESL (English-as-a-second-language) readers, or for scientific communication with very young readers or those with minimal or no high-school education.
I don't want to take anyone's time for granted (we should only work on what we enjoy!), but I will admit it feels arbitrary to put these categories of use outside the realm of worth serving. I totally understand if you do not wish to invest your own time, but pls reconsider making a blanket judgement that this is too computationally expensive 🙏🏻🙏🏻🙏🏻🙂
from proselint.
@patcon i have good news for you. a year and a half later, thanks to the efforts of @ferdnyc and @vqle23, we have a solid implementation of this and two new optional checks to go with it.
from proselint.