
brucewlee / lingfeat


[EMNLP 2021] LingFeat - A Comprehensive Linguistic Features Extraction ToolKit for Readability Assessment

License: Creative Commons Attribution Share Alike 4.0 International

Python 100.00%
linguistic-analysis readability-scores flesch-kincaid feature-extraction discourse syntactic-analysis lexical-analysis semantic-analysis spacy nlp

lingfeat's People

Contributors

brucewlee

lingfeat's Issues

Provide a link to the paper

I was curious about the advanced features, and the Readme "highly" suggested that I "read Section 2 and 3 in our EMNLP paper."
However, I couldn't find a link to the paper in the README.MD 😩

It's this one, right? https://arxiv.org/abs/2109.12258
I think it would be nice to include the link in the README.MD for easy access for other users.
(I could open a simple pull request if it's too much trouble.)

Thanks for releasing this here!

KeyError: dtype('float32')

Thank you for your great work. I tested with this passage:
"The commutator is peculiar, consisting of only three segments of a copper ring, while in the simplest of other continuous current generators several times that number exist, and frequently 120! segments are to be found. These three segments are made so as to be removable in a moment for cleaning or replacement. They are mounted upon a metal support, and are surrounded on all sides by a free air space, and cannot, therefore, lose their insulated condition. This feature of air insulation is peculiar to this system, and is very important as a factor in the durability of the commutator. Besides this, the commutator is sustained by supports carried in flanges upon the shaft, which flanges, as an additional safeguard, are coated all over with hard rubber, one of the finest known insulators. It may be stated, without fear of contradiction, that no other commutator made is so thoroughly insulated and protected. The three commutator segments virtually constitute a single copper ring, mounted in free air, and cut into three equal pieces by slots across its face."
When I call the 3 APIs below, I get an error:

WoKF = LingFeat.WoKF_() # Wikipedia Knowledge Features

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features

OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

Do you know how to fix this issue? Thank you.
I installed the library with the second option on Google Colab:
!git clone https://github.com/brucewlee/lingfeat.git
!pip install -r lingfeat/requirements.txt

KeyError Traceback (most recent call last)
in ()
----> 1 WoKF = LingFeat.WoKF_() # Wikipedia Knowledge Features

6 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/ldamodel.py in inference(self, chunk, collect_sstats)
665 # phinorm is the normalizer.
666 # TODO treat zeros explicitly, instead of adding epsilon?
--> 667 eps = DTYPE_TO_EPS[self.dtype]
668 phinorm = np.dot(expElogthetad, expElogbetad) + eps
669

KeyError: dtype('float32')
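For context, the failing line in gensim looks up `self.dtype` in a table keyed by NumPy scalar types; mismatched gensim/NumPy versions are the usual cause reported for this KeyError, and pinning a compatible pair (or upgrading gensim) is the common fix. A rough sketch of the lookup involved, with stand-in epsilon values (not gensim's actual table), showing a normalization that stays robust if the model carries a dtype instance rather than the scalar type:

```python
import numpy as np

# gensim's inference table is keyed by NumPy *scalar types*
# (epsilon values here are stand-ins, not gensim's actual constants)
DTYPE_TO_EPS = {np.float16: 1e-5, np.float32: 1e-7, np.float64: 1e-10}

# a loaded model may report a dtype *instance* such as dtype('float32')
model_dtype = np.dtype('float32')

# normalizing to the scalar type via .type makes the lookup work either way
eps = DTYPE_TO_EPS[np.dtype(model_dtype).type]
```

If the versions on Colab cannot be changed, patching the model's dtype back to `np.float32` after loading has also been reported as a stopgap for this class of error.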

How to apply function .preprocess and others to Pandas df?

Greetings all,

I have a large corpus zipped into a Pandas dataframe, and I'd like to iterate over the text column and record the results of individual functions in separate columns. As far as I can tell, the extractor only accepts a str. I am trying to merge the scores with the metadata included in the dataframe.

For instance, my dataframe is as follows.

df.head()
  docid_field  ...                                         text_field
0    BGSU1001  ...   <ICLE-BG-SUN-0001.1> \nIt is time, that our s...
1    BGSU1002  ...   <ICLE-BG-SUN-0002.1> \nNowadays there is a gr...
2    BGSU1003  ...   <ICLE-BG-SUN-0003.1> \nOnce upon a time there...
3    BGSU1004  ...   <ICLE-BG-SUN-0004.1> \nOur educational system...
4    BGSU1005  ...   <ICLE-BG-SUN-0005.1> \nScience, technology an...

Is there a way to apply a LingFeat function to df['text_field'] and record the scores (say, LingFeat.EnDF_()) as tuples in another column?
I did try

df['LingFeat'] = df['text_field'].apply(lambda x: extractor.pass_text(x))

and the result is

0      <lingfeat.extractor.pass_text object at 0x0000...
1      <lingfeat.extractor.pass_text object at 0x0000...
2      <lingfeat.extractor.pass_text object at 0x0000...
3      <lingfeat.extractor.pass_text object at 0x0000...
4      <lingfeat.extractor.pass_text object at 0x0000...
                       
923    <lingfeat.extractor.pass_text object at 0x0000...
924    <lingfeat.extractor.pass_text object at 0x0000...
925    <lingfeat.extractor.pass_text object at 0x0000...
926    <lingfeat.extractor.pass_text object at 0x0000...
927    <lingfeat.extractor.pass_text object at 0x0000...
Name: LingFeat, Length: 928, dtype: object

I couldn't get any further. How should I do this, if it's possible?
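`pass_text` returns an extractor object, which is why the column fills with `<lingfeat.extractor.pass_text object ...>` reprs: the lambda never calls `preprocess()` or a feature method. One workaround is to do the whole extraction inside the applied function and expand the returned dicts into columns. A sketch with a stand-in extraction function (the real LingFeat calls, untested here, are shown in the comment):

```python
import pandas as pd

# Stand-in for per-document extraction. With LingFeat this would be roughly:
#   f = extractor.pass_text(text); f.preprocess(); return f.EnDF_()
# where EnDF_() returns a dict of named scores.
def extract_features(text):
    return {"n_tokens": len(text.split()), "n_chars": len(text)}

df = pd.DataFrame({
    "docid_field": ["BGSU1001", "BGSU1002"],
    "text_field": ["It is time, that our society changed.",
                   "Nowadays there is a great debate."],
})

# apply() yields one dict per row; .apply(pd.Series) expands each dict
# into columns, and concat merges them back alongside the metadata.
feature_df = df["text_field"].apply(extract_features).apply(pd.Series)
df = pd.concat([df, feature_df], axis=1)
```

Keeping the scores as separate columns (rather than tuples in one column) also makes the later merge with metadata trivial, since everything stays aligned on the index.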

Wrong formula for the Coleman–Liau index

Hi Lee,

There are two places wrong in the formula.

First, the original Coleman–Liau index counts the number of letters per 100 words, whereas the code counts the number of tokens per 100 words.

Second, it is wrong to account for "per 100 words" by dividing the count by 100.
Rather, it should be
$$n_{letters} / (n_{tokens} / 100)$$

As a result, the produced score is always around -15.0.

My installed lingfeat version is 1.00b19. Has this been fixed?
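For reference, the textbook Coleman–Liau index uses letters per 100 words (L) and sentences per 100 words (S). A minimal sketch of the standard formula (this is the published definition, not LingFeat's current code):

```python
def coleman_liau(n_letters: int, n_words: int, n_sentences: int) -> float:
    # L: average number of letters per 100 words
    L = n_letters / n_words * 100
    # S: average number of sentences per 100 words
    S = n_sentences / n_words * 100
    return 0.0588 * L - 0.296 * S - 15.8
```

With the correct per-100-words scaling, typical English prose (roughly 450 letters and 5 sentences per 100 words) lands near a grade-9 score; with L and S near zero, the -15.8 constant dominates, which matches the reported scores around -15.0.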

Error in loading LingFeat

/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5)
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Downloading: https://github.com/yzhangcs/parser/releases/download/v1.1.0/ptb.crf.con.lstm.char.zip to /root/.cache/supar/ptb.crf.con.lstm.char.zip

TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 from lingfeat import extractor

File /opt/conda/lib/python3.10/site-packages/lingfeat/__init__.py:1
----> 1 from lingfeat.extractor import pass_text

File /opt/conda/lib/python3.10/site-packages/lingfeat/extractor.py:63
61 # load models
62 NLP = spacy.load('en_core_web_sm')
---> 63 SuPar = Parser.load('crf-con-en')
66 class pass_text:
68 """
69 Initialize pipeline
70
(...)
76 - self.NLP_doc: spacy pipeline object
77 """

File /opt/conda/lib/python3.10/site-packages/supar/parsers/parser.py:194, in Parser.load(cls, path, reload, src, checkpoint, **kwargs)
192 args = Config(**locals())
193 args.device = 'cuda' if torch.cuda.is_available() else 'cpu'
--> 194 state = torch.load(path if os.path.exists(path) else download(supar.MODEL[src].get(path, path), reload=reload))
195 cls = supar.PARSER[state['name']] if cls.NAME is None else cls
196 args = state['args'].update(args)

File /opt/conda/lib/python3.10/site-packages/supar/utils/fn.py:167, in download(url, reload)
165 sys.stderr.write(f"Downloading: {url} to {path}\n")
166 try:
--> 167 torch.hub.download_url_to_file(url, path, progress=True)
168 except urllib.error.URLError:
169 raise RuntimeError(f"File {url} unavailable. Please try other sources.")

File /opt/conda/lib/python3.10/site-packages/torch/hub.py:630, in download_url_to_file(url, dst, hash_prefix, progress)
628 if hash_prefix is not None:
629 sha256 = hashlib.sha256()
--> 630 with tqdm(total=file_size, disable=not progress,
631 unit='B', unit_scale=True, unit_divisor=1024) as pbar:
632 while True:
633 buffer = u.read(8192)

TypeError: nop() missing 1 required positional argument: 'it'

Feature Naming issue

result["BRich"+n_topic_list_for_naming[i]+"_S"] = feature

result["BClar"+n_topic_list_for_naming[i]+"_S"] = feature

result["BNois"+n_topic_list_for_naming[i]+"_S"] = feature

result["BTopc"+n_topic_list_for_naming[i]+"_S"] = feature

These should be ORich, OClar, ONois, and OTopc, respectively.
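A minimal sketch of the intended naming scheme, using the prefixes and topic counts visible in the feature names above (the feature values here are placeholders):

```python
# OSKF features should carry the 'O' prefix so they don't collide
# with the WBKF features, which already use the 'B' prefix.
prefixes = ["ORich", "OClar", "ONois", "OTopc"]
n_topic_list_for_naming = ["05", "10", "15", "20"]

result = {}
for prefix in prefixes:
    for n in n_topic_list_for_naming:
        result[prefix + n + "_S"] = 0.0  # placeholder feature value
```

This yields the 16 distinct OSKF names (e.g. ORich05_S ... OTopc20_S) instead of duplicating the WBKF ones.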

Duplicate feature names in OSKF and WBKF

Hello, thanks for this great project!

Recently we have been trying to reproduce the experimental results in your paper:

Lee, Bruce W., Yoo Sung Jang, and Jason Lee. "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

We found that the OSKF method in lingfeat returns exactly the same 16 feature names as WBKF. See the example below:

from lingfeat import extractor

text = "When you see the word Amazon, what’s the first thing that springs to mind – the world’s biggest forest, the longest river or the largest internet retailer – and which do you consider most important?"
LingFeat = extractor.pass_text(text)
LingFeat.preprocess()

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features
OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

print('WeeBit Corpus Knowledge Features:', WBKF)
print('OneStopEng Corpus Knowledge Features:', OSKF)

Terminal Output

WeeBit Corpus Knowledge Features:  {'BRich05_S': 1.1274421401321888, 'BRich10_S': 4.858168950304389, 'BRich15_S': 20.647890945896506, 'BRich20_S': 21.932124523445964, 'BClar05_S': 0.5823907653490702, 'BClar10_S': 0.718731752038002, 'BClar15_S': 0.7291195740302404, 'BClar20_S': 0.7486800486626832, 'BNois05_S': 1.5104791224775047, 'BNois10_S': 6.548753840448406, 'BNois15_S': 7.018329580783902, 'BNois20_S': 8.321480132061497, 'BTopc05_S': 3, 'BTopc10_S': 10, 'BTopc15_S': 18, 'BTopc20_S': 23}
OneStopEng Corpus Knowledge Features:  {'BRich05_S': 2.9044833183288574, 'BRich10_S': 3.5476092249155045, 'BRich15_S': 9.398028403520584, 'BRich20_S': 14.846967313438654, 'BClar05_S': 0.00015333294868469238, 'BClar10_S': 0.25143229961395264, 'BClar15_S': 0.6553432226181031, 'BClar20_S': 0.7100768367449443, 'BNois05_S': 1.0000004289882432, 'BNois10_S': 1.4495860709293316, 'BNois15_S': 4.214530509499038, 'BNois20_S': 5.500046277858743, 'BTopc05_S': 2, 'BTopc10_S': 3, 'BTopc15_S': 10, 'BTopc20_S': 15}

According to Appendix B of the above paper, the feature names in OSKF should start with 'O', e.g. 'ORich05_S', 'ORich10_S', etc.

This bug yields only 239 distinct feature names (not the 255 features introduced in the paper). Accordingly, in another open-source project for this paper:

https://github.com/brucewlee/pushingonreadability_traditional_ML

The csv files in Research_Data include only 239 linguistic features, which we believe is caused by these duplicate feature names.
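Until the prefix is fixed upstream, a caller-side workaround is to rename the OSKF keys before merging the two dicts. A sketch, assuming the fix is simply swapping the leading 'B' for 'O' (values below are abbreviated from the terminal output above):

```python
def fix_oskf_names(oskf: dict) -> dict:
    # rename e.g. 'BRich05_S' -> 'ORich05_S' so OSKF no longer
    # overwrites the WBKF features when the dicts are merged
    return {("O" + k[1:]) if k.startswith("B") else k: v
            for k, v in oskf.items()}

oskf = {"BRich05_S": 2.90, "BTopc20_S": 15}
wbkf = {"BRich05_S": 1.13, "BTopc20_S": 23}

# with the rename, merging preserves all features from both sets
merged = {**wbkf, **fix_oskf_names(oskf)}
```

Without the rename, `{**wbkf, **oskf}` would silently drop the 16 WBKF values, which is consistent with the 239-feature csv files in Research_Data.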
