gramex-nlg's Issues

Default styling for template display

The template text displayed should come with a default style. We should be able to make text in a Narrative bold, italic, underlined, or part of a bullet list.

Sentiment analysis

Any variable in a narrative can have an optional sentiment. For example:

{% set trend = 'increased' if x > 0 else 'decreased' %}
Sales have {{ trend }}.

Here, the variable {{ trend }} should be annotated with a sentiment, which may be optionally used for generating HTML annotations on the generated template, among other things.
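A minimal sketch of what such an annotation could look like. The `annotate` helper, the `SENTIMENT_CLASSES` table, and the CSS class names are all hypothetical, not part of the current nlg API:

```python
# Hypothetical sketch: attach an optional sentiment to a rendered variable
# and emit an HTML annotation around it. None of these names exist in nlg.
SENTIMENT_CLASSES = {"positive": "text-success", "negative": "text-danger"}

def annotate(value, sentiment=None):
    """Wrap a rendered variable in a span carrying its sentiment class."""
    if sentiment is None:
        return value
    css = SENTIMENT_CLASSES.get(sentiment, "")
    return '<span class="{}">{}</span>'.format(css, value)

# e.g. after {% set trend = 'increased' if x > 0 else 'decreased' %}
trend = "increased"
sentiment = "positive" if trend == "increased" else "negative"
print("Sales have {}.".format(annotate(trend, sentiment)))
```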

Error while using templatize

from nlg.search import templatize
tmpl = templatize(text,fh_args,xdf)
print(tmpl)


TypeError Traceback (most recent call last)
/var/folders/5m/k83dt6yd701d4c79yq73kljh0000gn/T/ipykernel_18869/812605063.py in &lt;module&gt;
----> 1 tmpl = templatize(text,fh_args,xdf)
2 print(tmpl)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/search.py in templatize(text, args, df)
566 {{ df["species"].iloc[1] }} and {{ df["species"].iloc[-1] }}.
567 """
--> 568 dfix, clean_text, infl = _search(text, args, df)
569 return narrative.Nugget(clean_text, dfix, infl, args)
570

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/search.py in _search(text, args, df, copy)
491 # Do this only if needed:
492 # clean_text = utils.sanitize_text(text.text)
--> 493 args = utils.sanitize_fh_args(args, df)
494 # Is this correct?
495 dfs = DFSearch(df)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/utils.py in sanitize_fh_args(args, df)
221 res['_c'] = [c[0] for c in selected]
222 if '_sort' in args:
--> 223 sort, _ = _filter_sort_columns(args['_sort'], columns)
224 res['_sort'] = [c[0] for c in sort]
225 return res

TypeError: _filter_sort_columns() missing 1 required positional argument: 'meta'

Installation has failed: No module named builtins

Hello,
I'm getting the following error message during the installation.

pip install nlg -e .
Obtaining file:///home/xxxx/gramex-nlg
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "&lt;string&gt;", line 1, in &lt;module&gt;
  File "/home/xxxx/gramex-nlg/setup.py", line 9, in &lt;module&gt;
    import builtins
ImportError: No module named builtins

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /home/xxxx/gramex-nlg/
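The builtins module exists only in Python 3, so this traceback means setup.py was executed by a Python 2 interpreter. The fix is to run pip from a Python 3 environment; code that must run on both can use a compatibility import (the `future` package also provides a `builtins` shim on Python 2):

```python
# `builtins` exists only on Python 3; under Python 2 the equivalent module
# is named `__builtin__` (or install `future`, which provides a shim).
try:
    import builtins                     # Python 3
except ImportError:
    import __builtin__ as builtins      # Python 2 fallback

print(builtins.len("abc"))
```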

Thanks in advance for helping.

Regards

James

Unexpected result while creating token using "from_json()" function of Nugget class

Raised by @dikshagupta14

    @classmethod
    def from_json(cls, obj):
        if isinstance(obj, str):
            obj = json.loads(obj)

        text = obj.pop('text')
        obj['text'] = nlp(text)

        tokenlist = obj.pop('tokenmap')
        tokenmap = {}
        for tk in tokenlist:
            index = tk.pop('index')
            if isinstance(index, int):
                token = obj['text'][index]
            elif isinstance(index, (list, tuple)):
                start, end = index
                token = obj['text'][start:end]
            tk.pop('idx')
            tk.pop('text')
            tokenmap[token] = Variable(token, **tk)
        obj['tokenmap'] = tokenmap

        return cls(**obj)

In the above function (from_json(), a method of the nlg.narrative.Nugget class), the line token = obj['text'][index] does not behave as expected for all input texts.

Steps to reproduce.

Code to be used:

from gramex import data
from nlg.utils import load_spacy_model
nlp = load_spacy_model()
from nlg.narrative import Nugget
from nlg.narrative import Variable


def from_json(cls, obj):
        if isinstance(obj, str):
            obj = json.loads(obj)
        text = obj.pop('text')
        obj['text'] = nlp(text)

        tokenlist = obj.pop('tokenmap')
        tokenmap = {}
        for tk in tokenlist:
            index = tk.pop('index')
            if isinstance(index, int):
                print('\nPrinting the object text and the type of it: ')
                print(obj['text'])
                print(type(obj['text']))
                print('\nIndex to be used for splitting the text, ', str(index))
                token = obj['text'][index]
                print('\nPrinting the token after split through index value: ')
                print(token)
            elif isinstance(index, (list, tuple)):
                start, end = index
                token = obj['text'][start:end]
            tk.pop('idx')
            tk.pop('text')
            tokenmap[token] = Variable(token, **tk)
            
        obj['tokenmap'] = tokenmap
        return cls(**obj)

#WORKING SCENARIO INPUT:

#defining input nugget dictionary data
nugget_dict = {
    "text": "CP-899-J",
    "tokenmap": [
        {
            "text": "CP-899-J",
            "index": 0,
            "idx": 0,
            "sources": [
                {
                    "location": "cell",
                    "tmpl": 'df["ABC"].iloc[0]',
                    "type": "doc",
                    "enabled": True,
                }
            ],
            "varname": "",
            "inflections": [],
        }
    ]
}
#calling the from_json function
output_nugget = from_json(Nugget, nugget_dict)
print('\nPrinting the final output: ')
print(output_nugget)

OUTPUT obtained for WORKING scenario:
Printing the object text and the type of it:
CP-899-J
<class 'spacy.tokens.doc.Doc'>

Index to be used for splitting the text, 0

Printing the token after split through index value:
CP-899-J

Printing the final output:
{{ df["ABC"].iloc[0] }}


#NON-WORKING SCENARIO input:

#defining input nugget dictionary data
nugget_dict = {
    "text": "CP-K-J",
    "tokenmap": [
        {
            "text": "CP-K-J",
            "index": 0,
            "idx": 0,
            "sources": [
                {
                    "location": "cell",
                    "tmpl": 'df["ABC"].iloc[0]',
                    "type": "doc",
                    "enabled": True,
                }
            ],
            "varname": "",
            "inflections": [],
        }
    ]
}
#calling the from_json function
output_nugget = from_json(Nugget, nugget_dict)
print('\nPrinting the final output: ')
print(output_nugget)

OUTPUT obtained for NON-WORKING scenario:
Printing the object text and the type of it:
CP-K-J
<class 'spacy.tokens.doc.Doc'>

Index to be used for splitting the text, 0

Printing the token after split through index value:
CP

Printing the final output:
{{ df["ABC"].iloc[0] }}-K-J


Expected behavior:
The output that came in non-working scenario should have been instead like this,


Printing the object text and the type of it:
CP-K-J
<class 'spacy.tokens.doc.Doc'>

Index to be used for splitting the text, 0

Printing the token after split through index value:
CP-K-J

Printing the final output:
{{ df["ABC"].iloc[0] }}


Workaround solution

def from_json(cls, obj):
        if isinstance(obj, str):
            obj = json.loads(obj)
        text = obj.pop('text')
        obj['text'] = nlp(text)

        tokenlist = obj.pop('tokenmap')
        tokenmap = {}
        for tk in tokenlist:
            index = tk.pop('index')
            if isinstance(index, int):
                print('\nPrinting the object text and the type of it: ')
                print(obj['text'])
                print(type(obj['text']))
                print('\nIndex to be used for splitting the text, ', str(index))
                len_text = len(tk['text'])
                token = obj['text'][index]
                if not len(token) == len_text:
                    token = obj['text'][index:len_text]
                print('\nPrinting the token after split through index value: ')
                print(token)
            elif isinstance(index, (list, tuple)):
                start, end = index
                token = obj['text'][start:end]
            tk.pop('idx')
            tk.pop('text')
            tokenmap[token] = Variable(token, **tk)
            
        obj['tokenmap'] = tokenmap
        return cls(**obj)

I have added a check to make sure that the length of the token taken equals the length of the surface text recorded for that token element.

                len_text = len(tk['text'])
                token = obj['text'][index]
                if not len(token) == len_text:
                    token = obj['text'][index:len_text]

After adding this, the output comes as expected for all inputs.
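For context, the root cause is that obj['text'][index] indexes spaCy tokens, not characters: as the output above shows, spaCy keeps "CP-899-J" as a single token but splits "CP-K-J" so that token 0 is just "CP" (here assumed to split as ["CP", "-", "K", "-", "J"]). A pure-Python analogue of the widening check, with a list of strings standing in for the spaCy Doc; note that the workaround above slices the Doc with a character length, which happens to work for these inputs but mixes token indices with character counts, whereas widening token by token is unit-safe:

```python
def resolve_token(tokens, index, expected_text):
    """Return tokens[index], widening to a multi-token span when the single
    token is shorter than the surface text recorded in the nugget config."""
    token = tokens[index]
    if len(token) == len(expected_text):
        return token
    joined = token
    end = index
    # Keep appending subsequent tokens until the recorded text is covered.
    while len(joined) < len(expected_text) and end + 1 < len(tokens):
        end += 1
        joined += tokens[end]
    return joined

print(resolve_token(["CP", "-", "K", "-", "J"], 0, "CP-K-J"))
```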

Not Working

NLG is not working. I am not getting the expected result even when I replicate the same example: I get "James" in the result, and when I change the sort on rating to descending it still gives me the same result.
[Python Script]

pip install nlg doesn't work

When setting up a fresh environment, installation through pip doesn't work.

Reproduce:
pip install nlg

ERROR: Could not find a version that satisfies the requirement nlg (from versions: none)
ERROR: No matching distribution found for nlg

Nugget serialization / deserialization

The nlg.narrative.Nugget class does not support the serialization required for the nlg.webapp.process_text function.

The deserialization also needs to support creation of nugget objects from nugget configs stored on disk.
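A sketch of the serialization contract this implies: a nugget config must survive a JSON round trip to disk. The config shape below mirrors the from_json() input quoted in another issue in this tracker; a matching to_json() would need to emit exactly these fields:

```python
import json

# Hypothetical serialized nugget config, mirroring the from_json() input
# quoted elsewhere in this tracker. A to_json() counterpart would have to
# reduce a live Nugget (spaCy doc, tokenmap, sources) back to this form.
config = {
    "text": "CP-899-J",
    "tokenmap": [{
        "text": "CP-899-J", "index": 0, "idx": 0,
        "sources": [{"location": "cell", "tmpl": 'df["ABC"].iloc[0]',
                     "type": "doc", "enabled": True}],
        "varname": "", "inflections": [],
    }],
}

# Round trip through disk-friendly JSON: everything the pair must preserve.
restored = json.loads(json.dumps(config))
print(restored == config)
```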

Module import error

I am getting an error on this statement - from nlg.search import templatize
Error is: OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
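This is a spaCy setup issue rather than an nlg bug: the en_core_web_sm model must be downloaded once per environment with `python -m spacy download en_core_web_sm`. A guarded loader sketch (the helper name is mine, not part of nlg):

```python
import subprocess
import sys

def ensure_spacy_model(name="en_core_web_sm"):
    """Load a spaCy model, downloading it on first use if it is missing."""
    try:
        import spacy
    except ImportError:
        raise RuntimeError("spaCy itself is missing: pip install spacy")
    try:
        return spacy.load(name)
    except OSError:
        # Equivalent to: python -m spacy download en_core_web_sm
        subprocess.check_call([sys.executable, "-m", "spacy", "download", name])
        return spacy.load(name)
```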

Templatize output post-processing

Summary
Templatize output provides a dictionary of tokens with the corresponding list of templates. We expect the output to be a full template compatible with Tornado.

Steps to reproduce
run a templatize function with valid text, fh_args, and data frame

What is the current bug behavior?
Currently, intermediate output is provided which needs post-processing

What is the expected correct behavior?
A tornado template as an output

Possible fixes
A post-processing functionality encapsulated inside the templatize function.
/label ~bug

Better support for detecting inflections

Currently, an inflection is defined as one of the five following modifications to a word:

  • uppercase
  • lowercase
  • capitalization
  • singularization
  • pluralization.

The detection mechanism works by comparing the lemmas of two words and trying to find which of the above are applied to the first argument to convert it into the second argument. This only detects one inflection at a time, whereas there may be more. For example:

# Say `df` is the actors dataset
# (https://github.com/gramener/gramex-nlg/blob/dev/nlg/tests/data/actors.csv)
from nlg.utils import load_spacy_model
from nlg.grammar import _token_inflections

nlp = load_spacy_model()
text = "James Stewart is the highest rated actor."
doc = nlp(text)
X = nlp('Actors')[0]  # This is df['category'].iloc[0]
Y = doc[-2]
print(_token_inflections(X, Y))
# <function singular>

whereas it should indicate the fact that the inflection is both a singularization and lowercasing.

Also, the API is inconsistent. The first three modifications are Python string methods, but the other two are NLG functions. So the detector may return either a callable or a string representing a string method 🤦‍♂️

ToDo

  • Detect multiple inflections (for the five above, we can at best detect two inflections: one of first three, one of last two)
  • See if order matters (for the five above, it should not) - it does
  • Support more inflections, like PoS tag changes
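A sketch of what multi-inflection detection could return: a list of modifications instead of a single one, with case and number detected independently. The heuristics below are deliberately naive stand-ins (real detection should compare lemmas, as the current implementation does):

```python
def detect_inflections(source, target):
    """Return every case/number change needed to turn `source` into `target`.
    Naive heuristics only; real detection should compare spaCy lemmas."""
    infl = []
    # Number change: crude trailing-s test, a stand-in for lemma comparison.
    if source.lower().rstrip('s') == target.lower().rstrip('s'):
        if source.lower().endswith('s') and not target.lower().endswith('s'):
            infl.append('singular')
        elif target.lower().endswith('s') and not source.lower().endswith('s'):
            infl.append('plural')
    # Case change, detected independently of the number change.
    if target.islower() and not source.islower():
        infl.append('lower')
    elif target.isupper() and not source.isupper():
        infl.append('upper')
    elif target.istitle() and not source.istitle():
        infl.append('capitalize')
    return infl

# 'Actors' -> 'actor' is both a singularization and a lowercasing.
print(detect_inflections('Actors', 'actor'))
```

Returning a uniform list of named operations would also fix the API inconsistency noted above, since the caller no longer has to distinguish string methods from NLG callables.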

Cloning nuggets with different FormHandler arguments

Nuggets should have a .clone() method, which copies a nugget but assigns a different set of FormHandler arguments. This would help auto-generate new nuggets.

E.g., in the context of the actors dataset:

fh_args = {'_sort': ['-rating']}
doc = nlp('James Stewart is the actor who has the highest rating.')
nugget = nlg.templatize(doc, fh_args, df)

new_nugget = nugget.clone({'_sort': ['rating']})  # note the change in sort order
new_nugget.render(df)
# Katharine Hepburn is the actress who has the highest rating.

CAUTION: This will only work correctly if all tokens have been properly templatized. Note that the token "highest", left un-templatized, stays as it is even when the sort order changes. In this case, the token will have to be made sensitive to the sort order (see #38). But in general, any token left un-templatized may be affected.

Auto-captions for gramexcharts

Use the text generation framework:

chart data structure -> analysis metric -> calibration -> entities -> template

to generate captions for the following types of charts:

  • bar charts (compare a variable to a fixed value - min, max, mean, median, percentile, etc.)
  • scatter plots (infer a relationship between two variables)
  • donut / pie charts (part of whole comparisons)

Conditional phrases

In the variable settings modal, find a way to support simple if-else logic. Consider the following sentences:

A randomly generated number is positive.
A randomly generated number is negative.

This should be templatized as:

{% import random %}
{% set n = random.choice([-1, 1]) * random.random() %}
A randomly generated number is {% if n > 0 %}positive{% else %}negative{% end %}.

Templatize comparative adjectives with respect to sort order.

In context of the actors dataset,

doc = nlp('James Stewart has the highest rating.')
fh_args = {'_sort': ['-rating']}
nugget = templatize(doc, fh_args, df)
print(nugget)
{% set fh_args = {"_sort": ["-rating"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[-2]).lower() }} with the highest rating.

The adjective highest should be templatized in the following manner:

  1. Find its lemma and inflected form and put them in a variable.
  2. Tie the lemma to the sort order.
  3. Automatically insert the antonym of the lemma if the sort order changes.

Variable template insertion should rely on spacy token indices, not str.replace

From the README example,

the auto-gen template turns out to be:

{% set fh_args = {"_by": ["species"], "_c": ["sepal_width|avg"], "_sort": ["sepal_width|avg"]}  %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
The {{ df["{{ fh_args['_by'][0] }}"].iloc[0] }} {{ fh_args['_by'][0] }} has the least average sepal_width.

The "virginica" token turns out to be a nested template - the correct value is {{ df["species"].iloc[0] }}. But "species" itself is another variable (the very next word) with value {{ fh_args['_by'][0] }}, and therefore gets re-templatized. This is because templates are added to the source text with str.replace here.

They should instead be replaced by changing spacy tokens and forming new spacy docs.
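A minimal sketch of index-based replacement on a plain token list (a real fix would rebuild a spaCy Doc, as suggested above). Because each substitution targets a token position rather than a substring, a template inserted for one token can never be re-matched and re-templatized by a later replacement:

```python
def replace_by_index(tokens, replacements):
    """Substitute templates by token position instead of str.replace,
    so inserted template text is never itself re-templatized."""
    out = list(tokens)
    for index, template in replacements.items():
        out[index] = template
    return ' '.join(out)

# The README example: both "virginica" and "species" are variables.
tokens = ['The', 'virginica', 'species', 'has', 'the', 'least',
          'average', 'sepal_width', '.']
replacements = {
    1: '{{ df["species"].iloc[0] }}',
    2: "{{ fh_args['_by'][0] }}",
}
print(replace_by_index(tokens, replacements))
```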

ValueError while importing spacy.matcher

ValueError: spacy.strings.StringStore size changed, may indicate binary incompatibility. Expected 112 from C header, got 88 from PyObject

Above error is thrown while importing spacy Matcher
from spacy.matcher import PhraseMatcher

spaCy version: 2.2
Platform: MacBook

Resolution:
Downgrade to spacy 2.1
pip install spacy==2.1

Verbosity for the auto-templatization process

Add a verbosity parameter to nlg.search.templatize. It could have the following levels:

0: ignore all warnings
1 (default): raise user warnings
2: log all NEs and noun-chunk matches
3: log inflections

For every level > 1, whatever is logged should be in addition to the previous level. i.e. at level 3, NEs and NP chunks are also logged in addition to inflections.
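One way to implement the cumulative levels is to map them onto Python's standard logging thresholds (a sketch; the verbosity parameter does not exist in nlg.search.templatize yet):

```python
import logging

logger = logging.getLogger('nlg.search')

def set_verbosity(level):
    """0: silence; 1: warnings; 2: + NE / noun-chunk matches;
    3: + inflections. Each level includes everything below it, which
    falls out naturally from logging's threshold semantics."""
    thresholds = {0: logging.ERROR, 1: logging.WARNING,
                  2: logging.INFO, 3: logging.DEBUG}
    logger.setLevel(thresholds[level])

set_verbosity(2)
# NE / noun-chunk matches would be logged with logger.info(...),
# inflections with logger.debug(...).
```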

Reorder Templates in each Narrative

We need to have a way to reorder templates in a Narrative. One approach is to provide UP / DOWN arrow next to each template. This will allow reordering templates.

Shorter strings break templatization

In the context of the actors dataset, if any name is passed (e.g.: "Humphrey Bogart") to the templatize function:

~/src/nlg/nlg/search.py in templatize(text, args, df)
    510 {{ df["species"].iloc[1] }} and {{ df["species"].iloc[-1] }}.
    511     """
--> 512     dfix, clean_text, infl = _search(text, args, df)
    513     return narrative.Nugget(clean_text, dfix, infl, args)
    514 

~/src/nlg/nlg/search.py in _search(text, args, df, copy)
    441     dfix.update(search_args(dfs.ents, args))
    442     dfix.clean()
--> 443     inflections = grammar.find_inflections(dfix, args, df)
    444     _infl = {}
    445     for token, funcs in inflections.items():

~/src/nlg/nlg/grammar.py in find_inflections(search, fh_args, df)
    256         rendered = Template('{{{{ {} }}}}'.format(tmpl)).generate(
    257             df=df, fh_args=fh_args).decode('utf8')
--> 258         if rendered != token.text:
    259             x = nlp(rendered)[0]
    260             infl = _token_inflections(x, token)

AttributeError: 'tuple' object has no attribute 'text'

Column names are not always recognized

Some column names, even if they are nouns, do not get picked up in input sentences. Run a reverse search of input spacy documents, once the default dataframe search is complete.

Tighter integration with spaCy

Too many functions deal partially with strings and partially with spaCy docs. Strings should only be the entry point for the public API - everything else should work almost entirely with spacy.{docs, spans, tokens}

Named entities vs noun chunks

nlg.utils.ner wrongly clubs named entities and nouns together. This is a problem because NEs undergo a literal search, whereas nouns should undergo a lemmatized search.

E.g. "Inception is a good movie."
Here the word "movie", if it appears in an inflected form in the dataframe, will not be correctly templatized because NER clubs it with NEs like "Inception".
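A sketch of the separation on plain token data; in real code the entity spans would come from doc.ents and the parts of speech from token.pos_, and this helper would replace the clubbing in nlg.utils.ner:

```python
def split_entities(tokens, ent_ranges):
    """Separate named-entity spans from remaining noun tokens, so each
    group can be searched differently (literal vs lemmatized).
    `tokens`: (text, pos) pairs; `ent_ranges`: (start, end) token spans."""
    in_ent = set()
    for start, end in ent_ranges:
        in_ent.update(range(start, end))
    ents = [' '.join(tokens[i][0] for i in range(s, e))
            for s, e in ent_ranges]
    nouns = [text for i, (text, pos) in enumerate(tokens)
             if pos == 'NOUN' and i not in in_ent]
    return ents, nouns

# "Inception is a good movie." with one NE span covering token 0.
tokens = [('Inception', 'PROPN'), ('is', 'AUX'), ('a', 'DET'),
          ('good', 'ADJ'), ('movie', 'NOUN'), ('.', 'PUNCT')]
print(split_entities(tokens, [(0, 1)]))
```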

Embed code should have reference to data

The Embed code snippet should have conditional logic:

  • Check for a FormHandler URL to retrieve data
  • If the FormHandler URL is NOT present, show static text from the Narrative (fall back to the original text used while creating the narrative)


Error while using pickle with nuggets

I created a few nuggets, defined a dictionary for them, and tried to dump it using pickle, but it raises an error.

nugget_map={}
nugget_map['nugget1']=nugget1
nugget_map['nugget2']=nugget2
with open('nugget_map.pickle', 'wb') as handle:
    pickle.dump(nugget_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

PicklingError Traceback (most recent call last)
in
1 data_train=pd.read_csv('data_new.csv')
----> 2 nugget_map=train_data(data_train)

in train_data(train)
85 #nugget_map = pd.DataFrame.from_dict(nugget_map)
86 with open('nugget_map.pickle', 'wb') as handle:
---> 87 pickle.dump(nugget_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
88 return nugget_map

PicklingError: Can't pickle <function at 0x7f30354f98c8>: attribute lookup on nlg.narrative failed

Can you please check this, or share a workaround for dumping the nugget map?
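The likely cause is that nuggets hold dynamically created functions (e.g. inflection callables) that pickle cannot look up by importable name, which is the same reason a lambda cannot be pickled:

```python
import pickle

# pickle stores functions by reference (module + qualified name); a lambda,
# or any function created at runtime as nuggets hold for inflections, has
# no importable name, so dumping it fails.
try:
    pickle.dumps(lambda x: x)
except Exception as exc:
    print('not picklable:', type(exc).__name__)
```

As a workaround until pickling is supported, persist each nugget as a JSON config on disk and rebuild the objects on load with Nugget.from_json (the method quoted in another issue in this tracker).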

Intermediate representation of FormHandler arguments

Do not tokenize / process / search FormHandler arguments as they are. Leading hyphens cause the tokens to get corrupted. For example:

>>> df = pd.read_csv('actors.csv')
>>> fh_args = {'_sort': ['-rating']}
>>> text = nlp('James Stewart is the actor with the highest rating.')
>>> n = templatize(text, fh_args, df)
>>> n.render(df)
b'James Stewart is the actor with the highest -rating.'

Instead, preprocess FormHandler arguments with gramex.data._filter_{sort, select, groupby}_columns to get a cleaner representation which leaves the token untouched. Use this representation in the template.
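A minimal stand-in for such a preprocessing step, covering the _sort argument only (the real code would use the gramex.data helpers named above); the point is that the sign travels as a separate boolean, so only the clean column name ever reaches the tokenizer and the template:

```python
def parse_sort_args(sort_specs):
    """Split FormHandler _sort values into (column, ascending) pairs,
    so the leading hyphen never leaks into the generated text."""
    return [(s.lstrip('-'), not s.startswith('-')) for s in sort_specs]

print(parse_sort_args(['-rating']))
print(parse_sort_args(['rating', '-votes']))
```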
