exploration-lab / cjpe
License: GNU General Public License v3.0
In the classical models folder Sent2Vec embeddings are provided for ILDC_single instead of ILDC_multi.
Hi,
I wanted to work with the ILDC dataset, but I found a serious issue with its preprocessed form.
It seems that during preprocessing, all sequences "no<any non-newline character>" and "co<any non-newline character>" have been replaced by the words "number" and "company".
As a result, the dataset looks like:
...stating that the companycept of unequal bargaining power has numberapplication in the case of companymercial companytracts. It then went on to hold- It has been submitted by learned companynsel for the appellant that there should be a cap in the quantum payable in terms of sub-clause 7 of Clause 25-A...
I believe this seriously harms the quality and usability of the dataset. It likely introduces more noise than it removes: it skews the overall n-gram distributions and word frequencies away from reality, and it corrupts legally relevant terms like counsel (which becomes companynsel) or contract (which becomes companytract). Perhaps even more severely, it removes almost all negations such as "non", "not" or "no " from the texts by converting them to the word number (for example, "no application" becomes "numberapplication" and nonstatutory becomes numberstatutory).
I am almost certain the error is the unescaped period (".") in the regular expressions for abbreviation substitution in the file Data/Preprocessing/preprocess.py (at lines 43-46), for example:
text = re.sub(r" no."," number",text)
Instead of matching the literal sequence "no.", this regular expression matches "no<any non-newline character>" and replaces it with the word "number". The regular expressions for "co.", "nos.", and "ltd." have the same problem.
As a fix, I suggest either escaping the period character in the regular expressions or removing these "cleaning" steps from preprocessing altogether.
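To illustrate the first option, here is a minimal sketch that reproduces the behaviour and shows the escaped variant. The function names and the sample sentence are made up for illustration, and only the "no." and "co." patterns are shown, since I can only guess at the exact replacement strings used for "nos." and "ltd.":

```python
import re


def buggy_clean(text: str) -> str:
    # Hypothetical reproduction of the current behaviour: the unescaped "."
    # matches ANY non-newline character, not just a literal period.
    text = re.sub(r" no.", " number", text)
    text = re.sub(r" co.", " company", text)
    return text


def fixed_clean(text: str) -> str:
    # Escaping the period ("\.") restricts the match to the literal
    # abbreviations " no." and " co.".
    text = re.sub(r" no\.", " number", text)
    text = re.sub(r" co\.", " company", text)
    return text


# Made-up sample sentence, not taken from the dataset.
sample = "the concept has no application to commercial contracts of ABC co. under no. 7"

print(buggy_clean(sample))
print(fixed_clean(sample))
```

With the escaped patterns, "concept", "commercial", "contracts" and "no application" are left untouched, while "co." and "no." are still expanded. Note that even the escaped version would rewrite a sentence that happens to end in the word "no.", so dropping these substitutions entirely may still be the safer choice.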
Since I don't have access to the raw, non-preprocessed data, it would be great if you could fix this error and run the raw data through the preprocessing steps again and provide a revised, more usable version of the dataset.
Thanks a lot!
Hi, @ShubhamKumarNigam. I am confused about which model you used in "occ_explanations_hierarchical.ipynb" in Explanation:
model_dir = "pth/XLNet_right_model"
tokenizer = XLNetTokenizer.from_pretrained(model_dir)
model = XLNetForSequenceClassification.from_pretrained(model_dir, output_hidden_states=True)
I cannot find any such model in your repo.
Thank you very much!!