
brucewlee / lingfeat


[EMNLP 2021] LingFeat - A Comprehensive Linguistic Features Extraction ToolKit for Readability Assessment

License: Creative Commons Attribution Share Alike 4.0 International

Python 100.00%
linguistic-analysis readability-scores flesch-kincaid feature-extraction discourse syntactic-analysis lexical-analysis semantic-analysis spacy nlp

lingfeat's People

Contributors

brucewlee

lingfeat's Issues

Provide a link to the paper

I was curious about the advanced features, and the Readme "highly" suggested that I "read Section 2 and 3 in our EMNLP paper."
However, I couldn't find a link to the paper in the README.MD 😩

It's this one, right? https://arxiv.org/abs/2109.12258
I think it would be nice to include the link in the README.MD for easy access for other users.
(I could open a simple pull request if it's too much trouble.)

Thanks for releasing this here!

KeyError: dtype('float32')

Thank you for your great work. I tested with this passage:
"The commutator is peculiar, consisting of only three segments of a copper ring, while in the simplest of other continuous current generators several times that number exist, and frequently 120! segments are to be found. These three segments are made so as to be removable in a moment for cleaning or replacement. They are mounted upon a metal support, and are surrounded on all sides by a free air space, and cannot, therefore, lose their insulated condition. This feature of air insulation is peculiar to this system, and is very important as a factor in the durability of the commutator. Besides this, the commutator is sustained by supports carried in flanges upon the shaft, which flanges, as an additional safeguard, are coated all over with hard rubber, one of the finest known insulators. It may be stated, without fear of contradiction, that no other commutator made is so thoroughly insulated and protected. The three commutator segments virtually constitute a single copper ring, mounted in free air, and cut into three equal pieces by slots across its face."
When I call the 3 APIs below, I get an error:

WoKF = LingFeat.WoKF_() # Wikipedia Knowledge Features

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features

OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

Do you know how to fix this issue? Thank you.
I installed the library with the second option on Google Colab:
!git clone https://github.com/brucewlee/lingfeat.git
!pip install -r lingfeat/requirements.txt

KeyError Traceback (most recent call last)
in ()
----> 1 WoKF = LingFeat.WoKF_() # Wikipedia Knowledge Features

6 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/ldamodel.py in inference(self, chunk, collect_sstats)
665 # phinorm is the normalizer.
666 # TODO treat zeros explicitly, instead of adding epsilon?
--> 667 eps = DTYPE_TO_EPS[self.dtype]
668 phinorm = np.dot(expElogthetad, expElogbetad) + eps
669

KeyError: dtype('float32')
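For context, the failing line in gensim looks up `self.dtype` in a table keyed by NumPy scalar types; mismatched gensim/NumPy versions are the usual cause reported for this KeyError, and pinning a compatible pair (or upgrading gensim) is the common fix. A rough sketch of the lookup involved, with stand-in epsilon values (not gensim's actual table), showing a normalization that stays robust if the model carries a dtype instance rather than the scalar type:

```python
import numpy as np

# gensim's inference table is keyed by NumPy *scalar types*
# (epsilon values here are stand-ins, not gensim's actual constants)
DTYPE_TO_EPS = {np.float16: 1e-5, np.float32: 1e-7, np.float64: 1e-10}

# a loaded model may report a dtype *instance* such as dtype('float32')
model_dtype = np.dtype('float32')

# normalizing to the scalar type via .type makes the lookup work either way
eps = DTYPE_TO_EPS[np.dtype(model_dtype).type]
```

If the versions on Colab cannot be changed, patching the model's dtype back to `np.float32` after loading has also been reported as a stopgap for this class of error.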

How to apply function .preprocess and others to Pandas df?

Greetings all,

I have a large corpus zipped into a Pandas dataframe, and I'd like to iterate over the text column and record the results of individual functions in separate columns. As far as I can tell, the extractor only accepts a str. I am trying to merge the scores with the metadata included in the dataframe.

For instance, my dataframe is as follows.

df.head()
  docid_field  ...                                         text_field
0    BGSU1001  ...   <ICLE-BG-SUN-0001.1> \nIt is time, that our s...
1    BGSU1002  ...   <ICLE-BG-SUN-0002.1> \nNowadays there is a gr...
2    BGSU1003  ...   <ICLE-BG-SUN-0003.1> \nOnce upon a time there...
3    BGSU1004  ...   <ICLE-BG-SUN-0004.1> \nOur educational system...
4    BGSU1005  ...   <ICLE-BG-SUN-0005.1> \nScience, technology an...

Is there a way to apply a LingFeat function to df['text_field'] and record the scores (say, LingFeat.EnDF_()) as tuples in another column?
I did try

df['LingFeat'] = df['text_field'].apply(lambda x: extractor.pass_text(x))

and the result is

0      <lingfeat.extractor.pass_text object at 0x0000...
1      <lingfeat.extractor.pass_text object at 0x0000...
2      <lingfeat.extractor.pass_text object at 0x0000...
3      <lingfeat.extractor.pass_text object at 0x0000...
4      <lingfeat.extractor.pass_text object at 0x0000...
                       
923    <lingfeat.extractor.pass_text object at 0x0000...
924    <lingfeat.extractor.pass_text object at 0x0000...
925    <lingfeat.extractor.pass_text object at 0x0000...
926    <lingfeat.extractor.pass_text object at 0x0000...
927    <lingfeat.extractor.pass_text object at 0x0000...
Name: LingFeat, Length: 928, dtype: object

I couldn't get any further. How should I do this, if it's possible?
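`pass_text` returns an extractor object, which is why the column fills with `<lingfeat.extractor.pass_text object ...>` reprs: the lambda never calls `preprocess()` or a feature method. One workaround is to do the whole extraction inside the applied function and expand the returned dicts into columns. A sketch with a stand-in extraction function (the real LingFeat calls, untested here, are shown in the comment):

```python
import pandas as pd

# Stand-in for per-document extraction. With LingFeat this would be roughly:
#   f = extractor.pass_text(text); f.preprocess(); return f.EnDF_()
# where EnDF_() returns a dict of named scores.
def extract_features(text):
    return {"n_tokens": len(text.split()), "n_chars": len(text)}

df = pd.DataFrame({
    "docid_field": ["BGSU1001", "BGSU1002"],
    "text_field": ["It is time, that our society changed.",
                   "Nowadays there is a great debate."],
})

# apply() yields one dict per row; .apply(pd.Series) expands each dict
# into columns, and concat merges them back alongside the metadata.
feature_df = df["text_field"].apply(extract_features).apply(pd.Series)
df = pd.concat([df, feature_df], axis=1)
```

Keeping the scores as separate columns (rather than tuples in one column) also makes the later merge with metadata trivial, since everything stays aligned on the index.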

Wrong formula for the Coleman–Liau index

Hi Lee,

There are two places wrong in the formula.

First, the original Coleman–Liau index counts the number of letters per 100 words, whereas the code counts the number of tokens per 100 words.

Second, it is wrong to account for "per 100 words" by dividing the count by 100.
Rather, it should be
$$n_{letters} / (n_{tokens} / 100)$$

As a result, the produced score is always around -15.0.

My installed lingfeat version is 1.00b19. Has this been fixed?
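For reference, the textbook Coleman–Liau index uses letters per 100 words (L) and sentences per 100 words (S). A minimal sketch of the standard formula (this is the published definition, not LingFeat's current code):

```python
def coleman_liau(n_letters: int, n_words: int, n_sentences: int) -> float:
    # L: average number of letters per 100 words
    L = n_letters / n_words * 100
    # S: average number of sentences per 100 words
    S = n_sentences / n_words * 100
    return 0.0588 * L - 0.296 * S - 15.8
```

With the correct per-100-words scaling, typical English prose (roughly 450 letters and 5 sentences per 100 words) lands near a grade-9 score; with L and S near zero, the -15.8 constant dominates, which matches the reported scores around -15.0.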

Error in loading LingFeat

/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5)
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Downloading: https://github.com/yzhangcs/parser/releases/download/v1.1.0/ptb.crf.con.lstm.char.zip to /root/.cache/supar/ptb.crf.con.lstm.char.zip

TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 from lingfeat import extractor

File /opt/conda/lib/python3.10/site-packages/lingfeat/__init__.py:1
----> 1 from lingfeat.extractor import pass_text

File /opt/conda/lib/python3.10/site-packages/lingfeat/extractor.py:63
61 # load models
62 NLP = spacy.load('en_core_web_sm')
---> 63 SuPar = Parser.load('crf-con-en')
66 class pass_text:
68 """
69 Initialize pipeline
70
(...)
76 - self.NLP_doc: spacy pipeline object
77 """

File /opt/conda/lib/python3.10/site-packages/supar/parsers/parser.py:194, in Parser.load(cls, path, reload, src, checkpoint, **kwargs)
192 args = Config(**locals())
193 args.device = 'cuda' if torch.cuda.is_available() else 'cpu'
--> 194 state = torch.load(path if os.path.exists(path) else download(supar.MODEL[src].get(path, path), reload=reload))
195 cls = supar.PARSER[state['name']] if cls.NAME is None else cls
196 args = state['args'].update(args)

File /opt/conda/lib/python3.10/site-packages/supar/utils/fn.py:167, in download(url, reload)
165 sys.stderr.write(f"Downloading: {url} to {path}\n")
166 try:
--> 167 torch.hub.download_url_to_file(url, path, progress=True)
168 except urllib.error.URLError:
169 raise RuntimeError(f"File {url} unavailable. Please try other sources.")

File /opt/conda/lib/python3.10/site-packages/torch/hub.py:630, in download_url_to_file(url, dst, hash_prefix, progress)
628 if hash_prefix is not None:
629 sha256 = hashlib.sha256()
--> 630 with tqdm(total=file_size, disable=not progress,
631 unit='B', unit_scale=True, unit_divisor=1024) as pbar:
632 while True:
633 buffer = u.read(8192)

TypeError: nop() missing 1 required positional argument: 'it'

Feature Naming issue

result["BRich"+n_topic_list_for_naming[i]+"_S"] = feature

result["BClar"+n_topic_list_for_naming[i]+"_S"] = feature

result["BNois"+n_topic_list_for_naming[i]+"_S"] = feature

result["BTopc"+n_topic_list_for_naming[i]+"_S"] = feature

These should be ORich, OClar, ONois, and OTopc, respectively.
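A minimal sketch of the intended naming scheme, using the prefixes and topic counts visible in the feature names above (the feature values here are placeholders):

```python
# OSKF features should carry the 'O' prefix so they don't collide
# with the WBKF features, which already use the 'B' prefix.
prefixes = ["ORich", "OClar", "ONois", "OTopc"]
n_topic_list_for_naming = ["05", "10", "15", "20"]

result = {}
for prefix in prefixes:
    for n in n_topic_list_for_naming:
        result[prefix + n + "_S"] = 0.0  # placeholder feature value
```

This yields the 16 distinct OSKF names (e.g. ORich05_S ... OTopc20_S) instead of duplicating the WBKF ones.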

Duplicate feature names in OSKF and WBKF

Hello, thanks for this great project!

Recently we have been trying to reproduce the experimental results in your paper:

Lee, Bruce W., Yoo Sung Jang, and Jason Lee. "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

We found that the OSKF method in lingfeat returns exactly the same 16 feature names as WBKF. See the example below:

from lingfeat import extractor

text = "When you see the word Amazon, what’s the first thing that springs to mind – the world’s biggest forest, the longest river or the largest internet retailer – and which do you consider most important?"
LingFeat = extractor.pass_text(text)
LingFeat.preprocess()

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features
OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

print('WeeBit Corpus Knowledge Features:', WBKF)
print('OneStopEng Corpus Knowledge Features:', OSKF)

Terminal Output

WeeBit Corpus Knowledge Features:  {'BRich05_S': 1.1274421401321888, 'BRich10_S': 4.858168950304389, 'BRich15_S': 20.647890945896506, 'BRich20_S': 21.932124523445964, 'BClar05_S': 0.5823907653490702, 'BClar10_S': 0.718731752038002, 'BClar15_S': 0.7291195740302404, 'BClar20_S': 0.7486800486626832, 'BNois05_S': 1.5104791224775047, 'BNois10_S': 6.548753840448406, 'BNois15_S': 7.018329580783902, 'BNois20_S': 8.321480132061497, 'BTopc05_S': 3, 'BTopc10_S': 10, 'BTopc15_S': 18, 'BTopc20_S': 23}
OneStopEng Corpus Knowledge Features:  {'BRich05_S': 2.9044833183288574, 'BRich10_S': 3.5476092249155045, 'BRich15_S': 9.398028403520584, 'BRich20_S': 14.846967313438654, 'BClar05_S': 0.00015333294868469238, 'BClar10_S': 0.25143229961395264, 'BClar15_S': 0.6553432226181031, 'BClar20_S': 0.7100768367449443, 'BNois05_S': 1.0000004289882432, 'BNois10_S': 1.4495860709293316, 'BNois15_S': 4.214530509499038, 'BNois20_S': 5.500046277858743, 'BTopc05_S': 2, 'BTopc10_S': 3, 'BTopc15_S': 10, 'BTopc20_S': 15}

According to Appendix B of the above paper, the feature names in OSKF should start with 'O', e.g. 'ORich05_S', 'ORich10_S', etc.

This bug yields only 239 distinct feature names (not the 255 features introduced in the paper). Accordingly, in another open-source project for this paper:

https://github.com/brucewlee/pushingonreadability_traditional_ML

The csv files in Research_Data include only 239 linguistic features, which we believe is caused by these duplicate feature names.
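Until the prefix is fixed upstream, a caller-side workaround is to rename the OSKF keys before merging the two dicts. A sketch, assuming the fix is simply swapping the leading 'B' for 'O' (values below are abbreviated from the terminal output above):

```python
def fix_oskf_names(oskf: dict) -> dict:
    # rename e.g. 'BRich05_S' -> 'ORich05_S' so OSKF no longer
    # overwrites the WBKF features when the dicts are merged
    return {("O" + k[1:]) if k.startswith("B") else k: v
            for k, v in oskf.items()}

oskf = {"BRich05_S": 2.90, "BTopc20_S": 15}
wbkf = {"BRich05_S": 1.13, "BTopc20_S": 23}

# with the rename, merging preserves all features from both sets
merged = {**wbkf, **fix_oskf_names(oskf)}
```

Without the rename, `{**wbkf, **oskf}` would silently drop the 16 WBKF values, which is consistent with the 239-feature csv files in Research_Data.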
