cjpe's People

Contributors

exploration-lab, rishabh261998, shubhamkumarnigam, vijit-m

cjpe's Issues

consult about the xlnet-based model

Hi, @ShubhamKumarNigam. I am confused about which model you used in "occ_explanations_hierarchical.ipynb" in the Explanation model folder. The notebook loads it as follows:

model_dir = "pth/XLNet_right_model"
tokenizer = XLNetTokenizer.from_pretrained(model_dir)
model = XLNetForSequenceClassification.from_pretrained(model_dir, output_hidden_states=True)

I cannot find this model anywhere in your repo.
Thank you very much!

Incorrect abbreviation substitution/expansion during preprocessing

Hi,
I wanted to work with the ILDC dataset, but I found a serious issue with its preprocessed form.

It seems like during preprocessing, all sequences no<any non-newline character> and co<any non-newline character> have been replaced by the words number and company.

As a result, the dataset looks like:

...stating that the companycept of unequal bargaining power has numberapplication in the case of companymercial companytracts. It then went on to hold- It has been submitted by learned companynsel for the appellant that there should be a cap in the quantum payable in terms of sub-clause 7 of Clause 25-A...

I believe this has a serious effect on the quality and usability of the dataset. It likely introduces more noise than it removes, because it skews the overall n-gram distributions and word frequencies away from reality and corrupts legally relevant terms like counsel (becomes companynsel) or contract (becomes companytract). Perhaps even more severely, it removes almost all negations like "non", "not" or "no " from the texts by converting them to the word number (as in "no application" becomes "numberapplication" or nonstatutory becomes numberstatutory).

I am almost certain the error is in the unescaped period (".") in the regular expression for abbreviation substitution in the file Data/Preprocessing/preprocess.py (at lines 43-46), where
text = re.sub(r" no."," number",text).

Instead of matching the literal sequence no., the regular expression matches no<any non-newline character> and replaces it with the word number. The same applies to the regular expressions for co., nos., and ltd.
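
To make the behaviour concrete, here is a minimal sketch that reproduces the corruption on a made-up sentence. Only the " no." line is quoted from preprocess.py; the " co." line, the function name buggy_substitute and the sample text are my assumptions, written by analogy.

```python
import re


def buggy_substitute(text: str) -> str:
    """Hypothetical reconstruction of the faulty cleaning step."""
    # The unescaped "." matches any character except a newline,
    # so " no " and " non", " not", ... are all rewritten.
    text = re.sub(r" no.", " number", text)
    # Assumed analogous pattern for "co." (not copied from the repo).
    text = re.sub(r" co.", " company", text)
    return text


sample = "the concept has no application in commercial contracts"
print(buggy_substitute(sample))
# prints: the companycept has numberapplication in companymercial companytracts
```

The output shows the same kind of tokens (companycept, numberapplication, companymercial, companytracts) as the dataset excerpt quoted earlier.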

As a fix, I suggest either escaping the period character in the regular expressions or removing these "cleaning" steps from preprocessing altogether.
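
A sketch of the first option, assuming the replacement strings stay the same and only the period is escaped. The function name fixed_substitute and the example sentence are hypothetical; the nos. and ltd. patterns would presumably need the same change.

```python
import re


def fixed_substitute(text: str) -> str:
    """Sketch of the abbreviation expansion with the period escaped."""
    # "\." matches only a literal period, so ordinary words that
    # start with "no" or "co" are left untouched.
    text = re.sub(r" no\.", " number", text)
    text = re.sub(r" co\.", " company", text)
    return text


example = "appeal no. 12 has no application to commercial contracts"
print(fixed_substitute(example))
# prints: appeal number 12 has no application to commercial contracts
```

With the escaped pattern, only the literal abbreviation is expanded, while "no application" and "commercial contracts" pass through unchanged.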

Since I don't have access to the raw, non-preprocessed data, it would be great if you could fix this error and run the raw data through the preprocessing steps again and provide a revised, more usable version of the dataset.

Thanks a lot!
