
Comments (7)

chrisjbryant avatar chrisjbryant commented on May 28, 2024 1

Aha nice.

I was originally using the Damerau-Levenshtein code bundled with ERRANT (rdlextra.py) to do the character alignment, but found SequenceMatcher to be a lot faster. If python-Levenshtein is faster still, then it would definitely be worth upgrading to.
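For readers unfamiliar with it, SequenceMatcher lives in Python's standard difflib module. The following is only a minimal sketch of character-level comparison with it, not a copy of how ERRANT calls it; the helper names `char_similarity` and `char_opcodes` are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def char_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] computed over characters.

    The ratio is 2 * M / T, where M is the number of matched characters
    and T is the combined length of both strings.
    """
    return SequenceMatcher(None, a, b).ratio()


def char_opcodes(a: str, b: str) -> List[Tuple[str, int, int, int, int]]:
    """Return the edit operations that turn a into b.

    Each tuple is (tag, a_start, a_end, b_start, b_end), where tag is one
    of 'equal', 'replace', 'delete' or 'insert'.
    """
    return SequenceMatcher(None, a, b).get_opcodes()


if __name__ == "__main__":
    print(char_similarity("cat", "cats"))
    for op in char_opcodes("cat", "cats"):
        print(op)
```

Note that SequenceMatcher does not compute an edit distance: it finds longest matching blocks, so its ratio is not interchangeable with a Damerau-Levenshtein score.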

As for the rationale behind the custom substitution cost, it's easiest to refer you to the alignment paper, particularly section 3.2 and Table 2.

from errant.

sai-prasanna avatar sai-prasanna commented on May 28, 2024 1

Thanks for the reference. We noticed the speed improvements you made back then (thanks).

BTW, I have been cleaning up the code to make it easily usable as a library: making it pip-installable, adding Python 3 type annotations, etc. If you plan to publish this as a library on PyPI, I will submit a pull request. Otherwise, I would consider publishing it as a separate library.

On another side note, I would like to understand whether homophones can be added as another static rule for substitutions, as they are an easy class of errors for people to make ...
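To make the idea concrete, here is a rough sketch of what a static homophone lookup could look like. Everything in it is hypothetical: the word groups, `HOMOPHONE_SETS`, `build_index` and `are_homophones` are invented for illustration and are not part of ERRANT, and how such a check would feed into the substitution rules is left open.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical, hand-picked homophone groups; a real list would be much longer.
HOMOPHONE_SETS = [
    frozenset({"their", "there", "they're"}),
    frozenset({"to", "too", "two"}),
    frozenset({"your", "you're"}),
]


def build_index(groups: Iterable[FrozenSet[str]]) -> Dict[str, FrozenSet[str]]:
    """Map every word to the homophone group it belongs to."""
    index: Dict[str, FrozenSet[str]] = {}
    for group in groups:
        for word in group:
            index[word] = group
    return index


HOMOPHONE_INDEX = build_index(HOMOPHONE_SETS)


def are_homophones(a: str, b: str) -> bool:
    """Return True if a and b are different words in the same group."""
    a, b = a.lower(), b.lower()
    return a != b and b in HOMOPHONE_INDEX.get(a, frozenset())
```

A static table like this is cheap to query, but it only covers the pairs someone has listed.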

P.S. I think we can consider moving the Damerau-Levenshtein code to Cython for speed improvements. But that's a separate task in itself. I realized how slow Python is compared to Cython when looking at python-Levenshtein.


sai-prasanna avatar sai-prasanna commented on May 28, 2024

@chrisjbryant Check out this refactor: https://github.com/sai-prasanna/errant. It's fairly large, but it is just a basic cleanup: adding types and making it pip-installable without altering the logic. I have yet to test the commands, though.


chrisjbryant avatar chrisjbryant commented on May 28, 2024

Looks like you've been busy! Would be good to make it pip installable, but I've never done that before so have no experience with it.
It would be good to know whether it produces the same results and whether it's faster when you get a chance to test it. I guess it'll also need a new readme on how to use!


sai-prasanna avatar sai-prasanna commented on May 28, 2024

Pip Installation:

It is pip-installable from the repository (it is not yet published to PyPI; I will do that soon). Once we agree on what else to change, we can upload it to PyPI under the name errant (I guess the name is available).

Reproducibility:

I ran the parallel_to_m2 and compare_m2 commands on the CoNLL-2014 dataset sentences and got the same results as the existing repository with spacy 1.9.
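A check of this kind can also be scripted. The sketch below is not part of ERRANT; `parse_m2` and `differing_sentences` are hypothetical helpers. It assumes both M2 files were produced from the same sentences in the same order, with blocks separated by blank lines, an `S ` line for the sentence and `A ` lines for its edits.

```python
import re
from typing import List, Tuple


def parse_m2(text: str) -> List[Tuple[str, List[str]]]:
    """Parse M2 text into (sentence, edit_lines) pairs, in file order."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Skip anything that does not start with a sentence line.
        if not lines or not lines[0].startswith("S "):
            continue
        sentence = lines[0][2:]
        edits = [ln for ln in lines[1:] if ln.startswith("A ")]
        entries.append((sentence, edits))
    return entries


def differing_sentences(old_m2: str, new_m2: str) -> List[str]:
    """Return the sentences whose edit lines differ between two M2 texts."""
    old_entries, new_entries = parse_m2(old_m2), parse_m2(new_m2)
    if len(old_entries) != len(new_entries):
        raise ValueError("M2 inputs contain different numbers of sentences")
    return [
        sentence
        for (sentence, old_edits), (_, new_edits) in zip(old_entries, new_entries)
        if old_edits != new_edits
    ]
```

An empty result means the two runs produced identical edits for every sentence; a non-empty one lists the sentences worth inspecting by hand.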

However, I have set the spacy dependency in setup.py to 2.0 (both versions use the same flags). Some results differ under spacy 2.0. It would be very helpful if you could check that the differences under spacy 2.0 do not degrade the quality.

Using 1.x is not an option for me, at least right now, and it seems spacy 2.1 will be faster when it is released (a nightly build is out). But if you insist on keeping 1.x, we can try something like pip install errant[spacy1x] to support spacy 1.9.

README:

The README has been updated to include information about pip installation and how to run the CLI. I will add an example of programmatic access through the Errant class.


chrisjbryant avatar chrisjbryant commented on May 28, 2024

Aha ok.
From memory, I think spacy 2.0 should technically be slightly more accurate than 1.9 (at least in terms of POS tagging and parsing) but I mainly didn't upgrade because it was slower.

Since spacy 2.1 is supposed to be faster, I was intending to revisit the code when that was released. They also finally fixed the coarse POS mapping in 2.1 apparently, so we'll be able to remove the manually defined tagmap and just use the native spacy token.pos_ property instead I think.

I would insist on keeping the 1.9 version for now as that's also the version we'll be using in the BEA2019 shared task. We can still update ERRANT in the future, but it'd be good to keep the 1.9 version available in case people want to evaluate using the official shared task evaluation procedure.


sai-prasanna avatar sai-prasanna commented on May 28, 2024

Closing this, as it has been addressed in my pull request.

