Issues for cjpe from exploration-lab

ILDC_single Sent2Vec embeddings are provided when the title says ILDC_multi embeddings

In the classical models folder, the Sent2Vec embeddings provided are for ILDC_single rather than ILDC_multi.

Incorrect abbreviation substitution/expansion during preprocessing

Hi,
I wanted to work with the ILDC dataset, but I found a serious issue with its preprocessed form.

It seems that during preprocessing, every sequence of the form no<any non-newline character> or co<any non-newline character> was replaced by the word number or company, respectively.

As a result, the dataset looks like:

...stating that the companycept of unequal bargaining power has numberapplication in the case of companymercial companytracts. It then went on to hold- It has been submitted by learned companynsel for the appellant that there should be a cap in the quantum payable in terms of sub-clause 7 of Clause 25-A...

I believe this seriously affects the quality and usability of the dataset. First, it likely introduces more noise than it removes, because it skews the n-gram distributions and word frequencies away from reality. Second, it corrupts legally relevant terms such as counsel (which becomes companynsel) and contract (which becomes companytract). Third, and perhaps most severely, it removes almost all negations such as "non", "not" or "no " from the texts by converting them to the word number: "no application" becomes "numberapplication", and nonstatutory becomes numberstatutory.

I am almost certain the error is the unescaped period (".") in the regular expressions for abbreviation substitution in the file Data/Preprocessing/preprocess.py (at lines 43-46), for example:
text = re.sub(r" no."," number",text)

Instead of matching the literal sequence no., the regular expression matches no<any non-newline character> and replaces it with the word number. The same applies to the regular expressions for co., nos., and ltd.

As a fix, I suggest either escaping the period character in the regular expressions or removing these "cleaning" steps from preprocessing altogether.
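To illustrate the first option, here is a minimal sketch comparing the unescaped and escaped patterns. Only the " no." substitution is quoted from preprocess.py above; the " co." pattern is an assumption inferred from the observed output, and the function names and sample strings are hypothetical.

```python
import re


def substitute_unescaped(text):
    # "." is unescaped, so it matches any character except a newline.
    text = re.sub(r" no.", " number", text)
    # Assumed pattern for "co.", inferred from the corrupted output.
    text = re.sub(r" co.", " company", text)
    return text


def substitute_escaped(text):
    # "\." matches only a literal period.
    text = re.sub(r" no\.", " number", text)
    text = re.sub(r" co\.", " company", text)
    return text


# Hypothetical sample strings, not taken from the raw dataset.
prose = ("the concept of unequal bargaining power has no application "
         "in the case of commercial contracts")
citation = "civil appeal no. 12 filed by abc co. through counsel"

print(substitute_unescaped(prose))
print(substitute_escaped(prose))
print(substitute_escaped(citation))
```

With the unescaped patterns, the prose sample comes out with the same corruption as the excerpt above (companycept, numberapplication, companymercial, companytracts). With the escaped patterns it is left unchanged, while the genuine abbreviations in the citation sample are still expanded.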

Since I don't have access to the raw, non-preprocessed data, it would be great if you could fix this error, rerun the preprocessing on the raw data, and provide a revised, more usable version of the dataset.

Thanks a lot!

Question about the XLNet-based model

Hi, @ShubhamKumarNigam. I am confused about which model you used in "occ_explanations_hierarchical.ipynb" in the Explanation model folder:
model_dir = "pth/XLNet_right_model"
tokenizer = XLNetTokenizer.from_pretrained(model_dir)
model = XLNetForSequenceClassification.from_pretrained(model_dir, output_hidden_states=True)
I cannot find this model anywhere in your repo.
Thank you very much!
