gramex-nlg's Issues

Default styling for template display

The template text displayed should come with a default style. We should be able to make text in a Narrative bold, italic, underlined, or part of a bullet list.

Sentiment analysis

Any variable in a narrative can have an optional sentiment. For example:

{% set trend = 'increased' if x > 0 else 'decreased' %}
Sales have {{ trend }}.

Here, the variable {{ trend }} should be annotated with a sentiment, which may be optionally used for generating HTML annotations on the generated template, among other things.
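A minimal sketch of what such an annotation could look like. The `annotate` helper, the `SENTIMENT_CLASSES` table, and the CSS class names are all hypothetical, not part of the current nlg API:

```python
# Hypothetical sketch: attach an optional sentiment to a rendered variable
# and emit an HTML annotation around it. None of these names exist in nlg.
SENTIMENT_CLASSES = {"positive": "text-success", "negative": "text-danger"}

def annotate(value, sentiment=None):
    """Wrap a rendered variable in a span carrying its sentiment class."""
    if sentiment is None:
        return value
    css = SENTIMENT_CLASSES.get(sentiment, "")
    return '<span class="{}">{}</span>'.format(css, value)

# e.g. after {% set trend = 'increased' if x > 0 else 'decreased' %}
trend = "increased"
sentiment = "positive" if trend == "increased" else "negative"
print("Sales have {}.".format(annotate(trend, sentiment)))
```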

Error while using templatize

from nlg.search import templatize
tmpl = templatize(text,fh_args,xdf)
print(tmpl)


TypeError Traceback (most recent call last)
/var/folders/5m/k83dt6yd701d4c79yq73kljh0000gn/T/ipykernel_18869/812605063.py in &lt;module&gt;
----> 1 tmpl = templatize(text,fh_args,xdf)
2 print(tmpl)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/search.py in templatize(text, args, df)
566 {{ df["species"].iloc[1] }} and {{ df["species"].iloc[-1] }}.
567 """
--> 568 dfix, clean_text, infl = _search(text, args, df)
569 return narrative.Nugget(clean_text, dfix, infl, args)
570

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/search.py in _search(text, args, df, copy)
491 # Do this only if needed:
492 # clean_text = utils.sanitize_text(text.text)
--> 493 args = utils.sanitize_fh_args(args, df)
494 # Is this correct?
495 dfs = DFSearch(df)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/utils.py in sanitize_fh_args(args, df)
221 res['_c'] = [c[0] for c in selected]
222 if '_sort' in args:
--> 223 sort, _ = _filter_sort_columns(args['_sort'], columns)
224 res['_sort'] = [c[0] for c in sort]
225 return res

TypeError: _filter_sort_columns() missing 1 required positional argument: 'meta'

Installation has failed: No module named builtins

Hello,
I'm getting the following error message during the installation.

pip install nlg -e .
Obtaining file:///home/xxxx/gramex-nlg
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "&lt;string&gt;", line 1, in &lt;module&gt;
  File "/home/xxxx/gramex-nlg/setup.py", line 9, in &lt;module&gt;
    import builtins
ImportError: No module named builtins

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /home/xxxx/gramex-nlg/
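The builtins module exists only in Python 3, so this traceback means setup.py was executed by a Python 2 interpreter. The fix is to run pip from a Python 3 environment; code that must run on both can use a compatibility import (the `future` package also provides a `builtins` shim on Python 2):

```python
# `builtins` exists only on Python 3; under Python 2 the equivalent module
# is named `__builtin__` (or install `future`, which provides a shim).
try:
    import builtins                     # Python 3
except ImportError:
    import __builtin__ as builtins      # Python 2 fallback

print(builtins.len("abc"))
```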

Thanks in advance for helping.

Regards

James

Unexpected result while creating token using "from_json()" function of Nugget class

Raised by @dikshagupta14

    @classmethod
    def from_json(cls, obj):
        if isinstance(obj, str):
            obj = json.loads(obj)

        text = obj.pop('text')
        obj['text'] = nlp(text)

        tokenlist = obj.pop('tokenmap')
        tokenmap = {}
        for tk in tokenlist:
            index = tk.pop('index')
            if isinstance(index, int):
                token = obj['text'][index]
            elif isinstance(index, (list, tuple)):
                start, end = index
                token = obj['text'][start:end]
            tk.pop('idx')
            tk.pop('text')
            tokenmap[token] = Variable(token, **tk)
        obj['tokenmap'] = tokenmap

        return cls(**obj)

In the above function (from_json(), a method of the nlg.narrative.Nugget class), the line token = obj['text'][index] does not behave as expected for all input texts.

Steps to reproduce.

Code to be used:

from gramex import data
from nlg.utils import load_spacy_model
nlp = load_spacy_model()
from nlg.narrative import Nugget
from nlg.narrative import Variable


def from_json(cls, obj):
        if isinstance(obj, str):
            obj = json.loads(obj)
        text = obj.pop('text')
        obj['text'] = nlp(text)

        tokenlist = obj.pop('tokenmap')
        tokenmap = {}
        for tk in tokenlist:
            index = tk.pop('index')
            if isinstance(index, int):
                print('\nPrinting the object text and the type of it: ')
                print(obj['text'])
                print(type(obj['text']))
                print('\nIndex to be used for splitting the text, ', str(index))
                token = obj['text'][index]
                print('\nPrinting the token after split through index value: ')
                print(token)
            elif isinstance(index, (list, tuple)):
                start, end = index
                token = obj['text'][start:end]
            tk.pop('idx')
            tk.pop('text')
            tokenmap[token] = Variable(token, **tk)
            
        obj['tokenmap'] = tokenmap
        return cls(**obj)

#WORKING SCENARIO INPUT:

#defining input nugget dictionary data
nugget_dict = {
    "text": "CP-899-J",
    "tokenmap": [
        {
            "text": "CP-899-J",
            "index": 0,
            "idx": 0,
            "sources": [
                {
                    "location": "cell",
                    "tmpl": 'df["ABC"].iloc[0]',
                    "type": "doc",
                    "enabled": True,
                }
            ],
            "varname": "",
            "inflections": [],
        }
    ]
}
#calling the from_json function
output_nugget = from_json(Nugget, nugget_dict)
print('\nPrinting the final output: ')
print(output_nugget)

OUTPUT obtained for WORKING scenario:
Printing the object text and the type of it:
CP-899-J
<class 'spacy.tokens.doc.Doc'>

Index to be used for splitting the text, 0

Printing the token after split through index value:
CP-899-J

Printing the final output:
{{ df["ABC"].iloc[0] }}


#NON-WORKING SCENARIO input:

#defining input nugget dictionary data
nugget_dict = {
    "text": "CP-K-J",
    "tokenmap": [
        {
            "text": "CP-K-J",
            "index": 0,
            "idx": 0,
            "sources": [
                {
                    "location": "cell",
                    "tmpl": 'df["ABC"].iloc[0]',
                    "type": "doc",
                    "enabled": True,
                }
            ],
            "varname": "",
            "inflections": [],
        }
    ]
}
#calling the from_json function
output_nugget = from_json(Nugget, nugget_dict)
print('\nPrinting the final output: ')
print(output_nugget)

OUTPUT obtained for NON-WORKING scenario:
Printing the object text and the type of it:
CP-K-J
<class 'spacy.tokens.doc.Doc'>

Index to be used for splitting the text, 0

Printing the token after split through index value:
CP

Printing the final output:
{{ df["ABC"].iloc[0] }}-K-J


Expected behavior:
The output that came in non-working scenario should have been instead like this,


Printing the object text and the type of it:
CP-K-J
<class 'spacy.tokens.doc.Doc'>

Index to be used for splitting the text, 0

Printing the token after split through index value:
CP-K-J

Printing the final output:
{{ df["ABC"].iloc[0] }}


Workaround solution

def from_json(cls, obj):
        if isinstance(obj, str):
            obj = json.loads(obj)
        text = obj.pop('text')
        obj['text'] = nlp(text)

        tokenlist = obj.pop('tokenmap')
        tokenmap = {}
        for tk in tokenlist:
            index = tk.pop('index')
            if isinstance(index, int):
                print('\nPrinting the object text and the type of it: ')
                print(obj['text'])
                print(type(obj['text']))
                print('\nIndex to be used for splitting the text, ', str(index))
                len_text = len(tk['text'])
                token = obj['text'][index]
                if not len(token) == len_text:
                    token = obj['text'][index:len_text]
                print('\nPrinting the token after split through index value: ')
                print(token)
            elif isinstance(index, (list, tuple)):
                start, end = index
                token = obj['text'][start:end]
            tk.pop('idx')
            tk.pop('text')
            tokenmap[token] = Variable(token, **tk)
            
        obj['tokenmap'] = tokenmap
        return cls(**obj)

I have added a check to make sure that the length of the token taken equals the length of the surface text recorded for that token element.

                len_text = len(tk['text'])
                token = obj['text'][index]
                if not len(token) == len_text:
                    token = obj['text'][index:len_text]

After adding this, the output comes as expected for all inputs.
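For context, the root cause is that obj['text'][index] indexes spaCy tokens, not characters: as the output above shows, spaCy keeps "CP-899-J" as a single token but splits "CP-K-J" so that token 0 is just "CP" (here assumed to split as ["CP", "-", "K", "-", "J"]). A pure-Python analogue of the widening check, with a list of strings standing in for the spaCy Doc; note that the workaround above slices the Doc with a character length, which happens to work for these inputs but mixes token indices with character counts, whereas widening token by token is unit-safe:

```python
def resolve_token(tokens, index, expected_text):
    """Return tokens[index], widening to a multi-token span when the single
    token is shorter than the surface text recorded in the nugget config."""
    token = tokens[index]
    if len(token) == len(expected_text):
        return token
    joined = token
    end = index
    # Keep appending subsequent tokens until the recorded text is covered.
    while len(joined) < len(expected_text) and end + 1 < len(tokens):
        end += 1
        joined += tokens[end]
    return joined

print(resolve_token(["CP", "-", "K", "-", "J"], 0, "CP-K-J"))
```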

Not Working

NLG is not working. I am not getting the expected result even when I replicate the same example: I get "James" in the result, and when I change the sort on rating to descending it still gives me the same result.
[Python Script]

pip install nlg doesn't work

When setting up a fresh environment, installation through pip doesn't work.

Reproduce:
pip install nlg

ERROR: Could not find a version that satisfies the requirement nlg (from versions: none)
ERROR: No matching distribution found for nlg

Nugget serialization / deserialization

The nlg.narrative.Nugget class does not support the serialization required for the nlg.webapp.process_text function.

The deserialization also needs to support creation of nugget objects from nugget configs stored on disk.
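A sketch of the serialization contract this implies: a nugget config must survive a JSON round trip to disk. The config shape below mirrors the from_json() input quoted in another issue in this tracker; a matching to_json() would need to emit exactly these fields:

```python
import json

# Hypothetical serialized nugget config, mirroring the from_json() input
# quoted elsewhere in this tracker. A to_json() counterpart would have to
# reduce a live Nugget (spaCy doc, tokenmap, sources) back to this form.
config = {
    "text": "CP-899-J",
    "tokenmap": [{
        "text": "CP-899-J", "index": 0, "idx": 0,
        "sources": [{"location": "cell", "tmpl": 'df["ABC"].iloc[0]',
                     "type": "doc", "enabled": True}],
        "varname": "", "inflections": [],
    }],
}

# Round trip through disk-friendly JSON: everything the pair must preserve.
restored = json.loads(json.dumps(config))
print(restored == config)
```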

Module import error

I am getting an error on this statement - from nlg.search import templatize
Error is: OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
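This is a spaCy setup issue rather than an nlg bug: the en_core_web_sm model must be downloaded once per environment with `python -m spacy download en_core_web_sm`. A guarded loader sketch (the helper name is mine, not part of nlg):

```python
import subprocess
import sys

def ensure_spacy_model(name="en_core_web_sm"):
    """Load a spaCy model, downloading it on first use if it is missing."""
    try:
        import spacy
    except ImportError:
        raise RuntimeError("spaCy itself is missing: pip install spacy")
    try:
        return spacy.load(name)
    except OSError:
        # Equivalent to: python -m spacy download en_core_web_sm
        subprocess.check_call([sys.executable, "-m", "spacy", "download", name])
        return spacy.load(name)
```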

Templatize output post-processing

Summary
Templatize output provides a dictionary of tokens with the corresponding list of templates. We expect the output to be a full template compatible with Tornado.

Steps to reproduce
run a templatize function with valid text, fh_args, and data frame

What is the current bug behavior?
Currently, intermediate output is provided which needs post-processing

What is the expected correct behavior?
A tornado template as an output

Possible fixes
A post-processing functionality encapsulated inside the templatize function.
/label ~bug

Better support for detecting inflections

Currently, an inflection is defined as one of the five following modifications to a word:

  • uppercase
  • lowercase
  • capitalization
  • singularization
  • pluralization.

The detection mechanism works by comparing the lemmas of two words and trying to find which of the above are applied to the first argument to convert it into the second argument. This only detects one inflection at a time, whereas there may be more. For example:

# Say `df` is the actors dataset
# (https://github.com/gramener/gramex-nlg/blob/dev/nlg/tests/data/actors.csv)
from nlg.utils import load_spacy_model
from nlg.grammar import _token_inflections

nlp = load_spacy_model()
text = "James Stewart is the highest rated actor."
doc = nlp(text)
X = nlp('Actors')[0]  # This is df['category'].iloc[0]
Y = doc[-2]
print(_token_inflections(X, Y))
# <function singular>

whereas it should indicate the fact that the inflection is both a singularization and lowercasing.

Also, the API is inconsistent. The first three modifications are Python string methods, but the other two are NLG functions. So the detector may return either a callable or a string representing a string method 🤦‍♂️

ToDo

  • Detect multiple inflections (for the five above, we can at best detect two inflections: one of first three, one of last two)
  • See if order matters (for the five above, it should not) - it does
  • Support more inflections, like PoS tag changes
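A sketch of what multi-inflection detection could return: a list of modifications instead of a single one, with case and number detected independently. The heuristics below are deliberately naive stand-ins (real detection should compare lemmas, as the current implementation does):

```python
def detect_inflections(source, target):
    """Return every case/number change needed to turn `source` into `target`.
    Naive heuristics only; real detection should compare spaCy lemmas."""
    infl = []
    # Number change: crude trailing-s test, a stand-in for lemma comparison.
    if source.lower().rstrip('s') == target.lower().rstrip('s'):
        if source.lower().endswith('s') and not target.lower().endswith('s'):
            infl.append('singular')
        elif target.lower().endswith('s') and not source.lower().endswith('s'):
            infl.append('plural')
    # Case change, detected independently of the number change.
    if target.islower() and not source.islower():
        infl.append('lower')
    elif target.isupper() and not source.isupper():
        infl.append('upper')
    elif target.istitle() and not source.istitle():
        infl.append('capitalize')
    return infl

# 'Actors' -> 'actor' is both a singularization and a lowercasing.
print(detect_inflections('Actors', 'actor'))
```

Returning a uniform list of named operations would also fix the API inconsistency noted above, since the caller no longer has to distinguish string methods from NLG callables.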

Cloning nuggets with different FormHandler arguments

Nuggets should have a .clone() method, which copies a nugget but assigns a different set of FormHandler arguments. This would help auto-generate new nuggets.

E.g., in the context of the actors dataset:

fh_args = {'_sort': ['-rating']}
doc = nlp('James Stewart is the actor who has the highest rating.')
nugget = nlg.templatize(doc, fh_args, df)

new_nugget = nugget.clone({'_sort': ['rating']})  # note the change in sort order
new_nugget.render(df)
# Katharine Hepburn is the actress who has the highest rating.

CAUTION: This will only work correctly if all tokens have been properly templatized. Note that the token "highest", left un-templatized, stays as it is even when the sort order changes. In this case, the token will have to be made sensitive to the sort order (see #38). But in general, any token left un-templatized may be affected.

Auto-captions for gramexcharts

Use the text generation framework:

chart data structure -> analysis metric -> calibration -> entities -> template

to generate captions for the following types of charts:

  • bar charts (compare a variable to a fixed value - min, max, mean, median, percentile, etc.)
  • scatter plots (infer a relationship between two variables)
  • donut / pie charts (part of whole comparisons)

Conditional phrases

In the variable settings modal, find a way to support simple if-else logic. Consider the following sentences:

A randomly generated number is positive.
A randomly generated number is negative.

This should be templatized as:

{% import random %}
{% set n = random.choice([-1, 1]) * random.random() %}
A randomly generated number is {% if n > 0 %}positive{% else %}negative{% end %}.

Templatize comparative adjectives with respect to sort order.

In context of the actors dataset,

doc = nlp('James Stewart has the highest rating.')
fh_args = {'_sort': ['-rating']}
nugget = templatize(doc, fh_args, df)
print(nugget)
{% set fh_args = {"_sort": ["-rating"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[-2]).lower() }} with the highest rating.

The adjective highest should be templatized in the following manner:

  1. Find its lemma and inflected form and put them in a variable.
  2. Tie the lemma to the sort order.
  3. Automatically insert the antonym of the lemma if the sort order changes.

Variable template insertion should rely on spacy token indices, not str.replace

From the README example,

the auto-gen template turns out to be:

{% set fh_args = {"_by": ["species"], "_c": ["sepal_width|avg"], "_sort": ["sepal_width|avg"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
The {{ df["{{ fh_args['_by'][0] }}"].iloc[0] }} {{ fh_args['_by'][0] }} has the least average sepal_width.

The "virginica" token turns out to be a nested template - the correct value is {{ df["species"].iloc[0] }}. But "species" itself is another variable (the very next word) with value {{ fh_args['_by'][0] }}, and therefore gets re-templatized. This is because templates are added to the source text with str.replace here.

They should instead be replaced by changing spacy tokens and forming new spacy docs.
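A minimal sketch of index-based replacement on a plain token list (a real fix would rebuild a spaCy Doc, as suggested above). Because each substitution targets a token position rather than a substring, a template inserted for one token can never be re-matched and re-templatized by a later replacement:

```python
def replace_by_index(tokens, replacements):
    """Substitute templates by token position instead of str.replace,
    so inserted template text is never itself re-templatized."""
    out = list(tokens)
    for index, template in replacements.items():
        out[index] = template
    return ' '.join(out)

# The README example: both "virginica" and "species" are variables.
tokens = ['The', 'virginica', 'species', 'has', 'the', 'least',
          'average', 'sepal_width', '.']
replacements = {
    1: '{{ df["species"].iloc[0] }}',
    2: "{{ fh_args['_by'][0] }}",
}
print(replace_by_index(tokens, replacements))
```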

ValueError while importing spacy.matcher

ValueError: spacy.strings.StringStore size changed, may indicate binary incompatibility. Expected 112 from C header, got 88 from PyObject

Above error is thrown while importing spacy Matcher
from spacy.matcher import PhraseMatcher

spaCy version: 2.2
Platform: MacBook

Resolution:
Downgrade to spacy 2.1
pip install spacy==2.1

Verbosity for the auto-templatization process

Add a verbosity parameter to nlg.search.templatize. It could have the following levels:

0: ignore all warnings
1 (default): raise user warnings
2: log all NEs and noun-chunk matches
3: log inflections

For every level > 1, whatever is logged should be in addition to the previous level. i.e. at level 3, NEs and NP chunks are also logged in addition to inflections.
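One way to implement the cumulative levels is to map them onto Python's standard logging thresholds (a sketch; the verbosity parameter does not exist in nlg.search.templatize yet):

```python
import logging

logger = logging.getLogger('nlg.search')

def set_verbosity(level):
    """0: silence; 1: warnings; 2: + NE / noun-chunk matches;
    3: + inflections. Each level includes everything below it, which
    falls out naturally from logging's threshold semantics."""
    thresholds = {0: logging.ERROR, 1: logging.WARNING,
                  2: logging.INFO, 3: logging.DEBUG}
    logger.setLevel(thresholds[level])

set_verbosity(2)
# NE / noun-chunk matches would be logged with logger.info(...),
# inflections with logger.debug(...).
```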

Reorder Templates in each Narrative

We need to have a way to reorder templates in a Narrative. One approach is to provide UP / DOWN arrow next to each template. This will allow reordering templates.

Shorter strings break templatization

In the context of the actors dataset, if any name is passed (e.g.: "Humphrey Bogart") to the templatize function:

~/src/nlg/nlg/search.py in templatize(text, args, df)
    510 {{ df["species"].iloc[1] }} and {{ df["species"].iloc[-1] }}.
    511     """
--> 512     dfix, clean_text, infl = _search(text, args, df)
    513     return narrative.Nugget(clean_text, dfix, infl, args)
    514 

~/src/nlg/nlg/search.py in _search(text, args, df, copy)
    441     dfix.update(search_args(dfs.ents, args))
    442     dfix.clean()
--> 443     inflections = grammar.find_inflections(dfix, args, df)
    444     _infl = {}
    445     for token, funcs in inflections.items():

~/src/nlg/nlg/grammar.py in find_inflections(search, fh_args, df)
    256         rendered = Template('{{{{ {} }}}}'.format(tmpl)).generate(
    257             df=df, fh_args=fh_args).decode('utf8')
--> 258         if rendered != token.text:
    259             x = nlp(rendered)[0]
    260             infl = _token_inflections(x, token)

AttributeError: 'tuple' object has no attribute 'text'

Column names are not always recognized

Some column names, even if they are nouns, do not get picked up in input sentences. Run a reverse search of input spacy documents, once the default dataframe search is complete.

Tighter integration with spaCy

Too many functions deal partially with strings and partially with spaCy docs. Strings should only be the entry point for the public API - everything else should work almost entirely with spacy.{docs, spans, tokens}

Named entities vs noun chunks

nlg.utils.ner wrongly clubs named entities and nouns together. This is a problem because NEs undergo a literal search, whereas nouns should undergo a lemmatized search.

E.g. "Inception is a good movie."
Here the word "movie", if it appears in an inflected form in the dataframe, will not be correctly templatized because NER clubs it with NEs like "Inception".
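A sketch of the separation on plain token data; in real code the entity spans would come from doc.ents and the parts of speech from token.pos_, and this helper would replace the clubbing in nlg.utils.ner:

```python
def split_entities(tokens, ent_ranges):
    """Separate named-entity spans from remaining noun tokens, so each
    group can be searched differently (literal vs lemmatized).
    `tokens`: (text, pos) pairs; `ent_ranges`: (start, end) token spans."""
    in_ent = set()
    for start, end in ent_ranges:
        in_ent.update(range(start, end))
    ents = [' '.join(tokens[i][0] for i in range(s, e))
            for s, e in ent_ranges]
    nouns = [text for i, (text, pos) in enumerate(tokens)
             if pos == 'NOUN' and i not in in_ent]
    return ents, nouns

# "Inception is a good movie." with one NE span covering token 0.
tokens = [('Inception', 'PROPN'), ('is', 'AUX'), ('a', 'DET'),
          ('good', 'ADJ'), ('movie', 'NOUN'), ('.', 'PUNCT')]
print(split_entities(tokens, [(0, 1)]))
```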

Embed code should have reference to data

The Embed code snippet should have conditional logic:

  • Check for a FormHandler URL to retrieve data
  • If the FormHandler URL is NOT present, show static text from the Narrative (fall back to the original text used while creating the narrative)


Error while using pickle with nuggets

I created a few nuggets, defined a dictionary for them, and tried to dump it using pickle, but it raises an error.

nugget_map={}
nugget_map['nugget1']=nugget1
nugget_map['nugget2']=nugget2
with open('nugget_map.pickle', 'wb') as handle:
    pickle.dump(nugget_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

PicklingError Traceback (most recent call last)
in
1 data_train=pd.read_csv('data_new.csv')
----> 2 nugget_map=train_data(data_train)

in train_data(train)
85 #nugget_map = pd.DataFrame.from_dict(nugget_map)
86 with open('nugget_map.pickle', 'wb') as handle:
---> 87 pickle.dump(nugget_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
88 return nugget_map

PicklingError: Can't pickle <function at 0x7f30354f98c8>: attribute lookup on nlg.narrative failed

Can you please check this, or share a workaround for dumping the nugget map?
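The likely cause is that nuggets hold dynamically created functions (e.g. inflection callables) that pickle cannot look up by importable name, which is the same reason a lambda cannot be pickled:

```python
import pickle

# pickle stores functions by reference (module + qualified name); a lambda,
# or any function created at runtime as nuggets hold for inflections, has
# no importable name, so dumping it fails.
try:
    pickle.dumps(lambda x: x)
except Exception as exc:
    print('not picklable:', type(exc).__name__)
```

As a workaround until pickling is supported, persist each nugget as a JSON config on disk and rebuild the objects on load with Nugget.from_json (the method quoted in another issue in this tracker).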

Intermediate representation of FormHandler arguments

Do not tokenize / process / search FormHandler arguments as they are. Leading hyphens cause the tokens to get corrupted. For example:

>>> df = pd.read_csv('actors.csv')
>>> fh_args = {'_sort': ['-rating']}
>>> text = nlp('James Stewart is the actor with the highest rating.')
>>> n = templatize(text, fh_args, df)
>>> n.render(df)
b'James Stewart is the actor with the highest -rating.'

Instead, preprocess FormHandler arguments with gramex.data._filter_{sort, select, groupby}_columns to get a cleaner representation which leaves the token untouched. Use this representation in the template.
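A minimal stand-in for such a preprocessing step, covering the _sort argument only (the real code would use the gramex.data helpers named above); the point is that the sign travels as a separate boolean, so only the clean column name ever reaches the tokenizer and the template:

```python
def parse_sort_args(sort_specs):
    """Split FormHandler _sort values into (column, ascending) pairs,
    so the leading hyphen never leaks into the generated text."""
    return [(s.lstrip('-'), not s.startswith('-')) for s in sort_specs]

print(parse_sort_args(['-rating']))
print(parse_sort_args(['rating', '-votes']))
```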
