gramener / gramex-nlg
Natural Language Generation for Gramex applications.
License: Other
The deletion action should erase the narrative from the cache as well, not just remove it from the screen.
When templatize misses part of the text that should have been identified as a keyword, we would like the ability to mark it manually afterwards.
Check whether Label Studio can be used to annotate tokens and spans.
The template text should be displayed with a default style, and we should be able to bold, italicize, underline, and bullet-list text in a Narrative.
Any variable in a narrative can have an optional sentiment. For example:
{% set trend = 'increased' if x > 0 else 'decreased' %}
Sales have {{ trend }}.
Here, the variable {{ trend }} should be annotated with a sentiment, which may optionally be used for generating HTML annotations on the generated template, among other things.
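As a sketch of how such an annotation could be consumed, consider the helper below. Both the sentiment-to-class mapping and the annotate function are hypothetical; nothing like them exists in NLG today.

```python
# Hypothetical sketch: a sentiment annotation on a variable drives an HTML
# class on the rendered value. Neither the mapping nor `annotate` is part
# of the NLG library.
SENTIMENT_CLASS = {'positive': 'text-success', 'negative': 'text-danger'}

def annotate(value, sentiment=None):
    """Wrap a rendered variable in a span carrying its sentiment class."""
    if sentiment is None:
        return value
    return '<span class="{}">{}</span>'.format(SENTIMENT_CLASS[sentiment], value)
```

With trend annotated as positive, annotate('increased', 'positive') would render the word inside a success-styled span, while unannotated variables pass through unchanged.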
from nlg.search import templatize
tmpl = templatize(text,fh_args,xdf)
print(tmpl)
TypeError Traceback (most recent call last)
/var/folders/5m/k83dt6yd701d4c79yq73kljh0000gn/T/ipykernel_18869/812605063.py in
----> 1 tmpl = templatize(text,fh_args,xdf)
2 print(tmpl)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/search.py in templatize(text, args, df)
566 {{ df["species"].iloc[1] }} and {{ df["species"].iloc[-1] }}.
567 """
--> 568 dfix, clean_text, infl = _search(text, args, df)
569 return narrative.Nugget(clean_text, dfix, infl, args)
570
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/search.py in _search(text, args, df, copy)
491 # Do this only if needed:
492 # clean_text = utils.sanitize_text(text.text)
--> 493 args = utils.sanitize_fh_args(args, df)
494 # Is this correct?
495 dfs = DFSearch(df)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nlg/utils.py in sanitize_fh_args(args, df)
221 res['_c'] = [c[0] for c in selected]
222 if '_sort' in args:
--> 223 sort, _ = _filter_sort_columns(args['_sort'], columns)
224 res['_sort'] = [c[0] for c in sort]
225 return res
TypeError: _filter_sort_columns() missing 1 required positional argument: 'meta'
The link for the keyword "Source" at line 30 of README.md is broken.
Link: https://github.com/gramener/gramex/blob/master/gramex/apps/guide/nlg/gramex.yaml
When NLG is imported into another Gramex app (the "host" app), allow, along with uploaded datasets, any FormHandler URLs present in the host app as sources of data.
Hello,
I'm getting the following error message during the installation.
pip install nlg -e .
Obtaining file:///home/xxxx/gramex-nlg
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/home/xxxx/gramex-nlg/setup.py", line 9, in
import builtins
ImportError: No module named builtins
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /home/xxxx/gramex-nlg/
Thanks in advance for helping.
Regards
James
Raised by @dikshagupta14
@classmethod
def from_json(cls, obj):
    if isinstance(obj, str):
        obj = json.loads(obj)
    text = obj.pop('text')
    obj['text'] = nlp(text)
    tokenlist = obj.pop('tokenmap')
    tokenmap = {}
    for tk in tokenlist:
        index = tk.pop('index')
        if isinstance(index, int):
            token = obj['text'][index]
        elif isinstance(index, (list, tuple)):
            start, end = index
            token = obj['text'][start:end]
        tk.pop('idx')
        tk.pop('text')
        tokenmap[token] = Variable(token, **tk)
    obj['tokenmap'] = tokenmap
    return cls(**obj)
In the above function, from_json(), a method of the nlg.narrative.Nugget class, the line token = obj['text'][index] does not behave as expected for all input texts.
Steps to reproduce.
Code to be used:
from gramex import data
from nlg.utils import load_spacy_model
nlp = load_spacy_model()
from nlg.narrative import Nugget
from nlg.narrative import Variable
def from_json(cls, obj):
    if isinstance(obj, str):
        obj = json.loads(obj)
    text = obj.pop('text')
    obj['text'] = nlp(text)
    tokenlist = obj.pop('tokenmap')
    tokenmap = {}
    for tk in tokenlist:
        index = tk.pop('index')
        if isinstance(index, int):
            print('\nPrinting the object text and the type of it: ')
            print(obj['text'])
            print(type(obj['text']))
            print('\nIndex to be used for splitting the text, ', str(index))
            token = obj['text'][index]
            print('\nPrinting the token after split through index value: ')
            print(token)
        elif isinstance(index, (list, tuple)):
            start, end = index
            token = obj['text'][start:end]
        tk.pop('idx')
        tk.pop('text')
        tokenmap[token] = Variable(token, **tk)
    obj['tokenmap'] = tokenmap
    return cls(**obj)
#WORKING SCENARIO INPUT:
#defining input nugget dictionary data
nugget_dict = {
    "text": "CP-899-J",
    "tokenmap": [
        {
            "text": "CP-899-J",
            "index": 0,
            "idx": 0,
            "sources": [
                {
                    "location": "cell",
                    "tmpl": 'df["ABC"].iloc[0]',
                    "type": "doc",
                    "enabled": True,
                }
            ],
            "varname": "",
            "inflections": [],
        }
    ]
}
#calling the from_json function
output_nugget = from_json(Nugget, nugget_dict)
print('\nPrinting the final output: ')
print(output_nugget)
OUTPUT obtained for WORKING scenario:
Printing the object text and the type of it:
CP-899-J
<class 'spacy.tokens.doc.Doc'>
Index to be used for splitting the text, 0
Printing the token after split through index value:
CP-899-J
Printing the final output:
{{ df["ABC"].iloc[0] }}
#NON-WORKING SCENARIO input:
#defining input nugget dictionary data
nugget_dict = {
    "text": "CP-K-J",
    "tokenmap": [
        {
            "text": "CP-K-J",
            "index": 0,
            "idx": 0,
            "sources": [
                {
                    "location": "cell",
                    "tmpl": 'df["ABC"].iloc[0]',
                    "type": "doc",
                    "enabled": True,
                }
            ],
            "varname": "",
            "inflections": [],
        }
    ]
}
#calling the from_json function
output_nugget = from_json(Nugget, nugget_dict)
print('\nPrinting the final output: ')
print(output_nugget)
OUTPUT obtained for NON-WORKING scenario:
Printing the object text and the type of it:
CP-K-J
<class 'spacy.tokens.doc.Doc'>
Index to be used for splitting the text, 0
Printing the token after split through index value:
CP
Printing the final output:
{{ df["ABC"].iloc[0] }}-K-J
Expected behavior:
The output that came in non-working scenario should have been instead like this,
Printing the object text and the type of it:
CP-K-J
<class 'spacy.tokens.doc.Doc'>
Index to be used for splitting the text, 0
Printing the token after split through index value:
CP-K-J
Printing the final output:
{{ df["ABC"].iloc[0] }}
Workaround solution
def from_json(cls, obj):
    if isinstance(obj, str):
        obj = json.loads(obj)
    text = obj.pop('text')
    obj['text'] = nlp(text)
    tokenlist = obj.pop('tokenmap')
    tokenmap = {}
    for tk in tokenlist:
        index = tk.pop('index')
        if isinstance(index, int):
            print('\nPrinting the object text and the type of it: ')
            print(obj['text'])
            print(type(obj['text']))
            print('\nIndex to be used for splitting the text, ', str(index))
            len_text = len(tk['text'])
            token = obj['text'][index]
            if not len(token) == len_text:
                token = obj['text'][index:len_text]
            print('\nPrinting the token after split through index value: ')
            print(token)
        elif isinstance(index, (list, tuple)):
            start, end = index
            token = obj['text'][start:end]
        tk.pop('idx')
        tk.pop('text')
        tokenmap[token] = Variable(token, **tk)
    obj['tokenmap'] = tokenmap
    return cls(**obj)
I have added a check, to make sure the length of token taken is equal to the text length for that token element.
len_text = len(tk['text'])
token = obj['text'][index]
if not len(token) == len_text:
    token = obj['text'][index:len_text]
After adding this, the output comes as expected for all inputs.
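The underlying cause is likely that the tokenizer may split "CP-K-J" into several tokens while "CP-899-J" stays whole, so a single token index no longer covers the original text. A spaCy-free sketch of a more robust lookup (illustrative only, not the library's code) widens the match until it covers the stored token text:

```python
def resolve_token(tokens, index, expected_text):
    """Return the token at `index`, widening to a joined span if the
    tokenizer split the original text into several pieces.
    `tokens` is a plain list of token strings, standing in for a spaCy Doc."""
    span = [tokens[index]]
    for t in tokens[index + 1:]:
        if ''.join(span) == expected_text:
            break
        span.append(t)
    return ''.join(span)
```

For example, resolve_token(['CP', '-', 'K', '-', 'J'], 0, 'CP-K-J') recovers the full 'CP-K-J', while an unsplit 'CP-899-J' passes through untouched.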
When the last nugget is deleted from the Saved Narratives, the text displayed in the Preview box does NOT clear (it remains).
The behavior should be the same for Saved narratives and unsaved narratives.
When deleting nuggets from the frontend, the entire list disappears from the display; the correct list reappears only after the page is refreshed.
I would like to use the same fh_args for different dataframes.
For the template to be compatible with other dataframes, I would like to specify fh_args by column index, like this:
fh_args = {'_sort': [df.columns[1]]}
NLG is not working: I do not get the expected result even when I replicate the example exactly. I get "James" in the result, and when I change the sort on rating to descending, I still get the same result.
[Python Script]
When setting up a fresh environment, installation through pip doesn't work.
Reproduce:
pip install nlg
ERROR: Could not find a version that satisfies the requirement nlg (from versions: none)
ERROR: No matching distribution found for nlg
The nlg.narrative.Nugget class does not support the serialization required by the nlg.webapp.process_text function.
Deserialization also needs to support creating nugget objects from nugget configs stored on disk.
Summary
Templatize output provides a dictionary of tokens with the corresponding list of templates. We expect the output to be a full template compatible with Tornado.
Steps to reproduce
Run the templatize function with valid text, fh_args, and a dataframe.
What is the current bug behavior?
Currently, intermediate output is provided, which needs post-processing.
What is the expected correct behavior?
A Tornado template as the output.
Possible fixes
A post-processing functionality encapsulated inside the templatize function.
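A sketch of what such encapsulated post-processing could look like. The token-to-templates dictionary shape is assumed from the description above, and to_tornado_template is a made-up name, not the library's actual API:

```python
def to_tornado_template(text, token_templates):
    """Collapse a token -> candidate-templates mapping into a single
    Tornado template by substituting each token's first candidate.
    `token_templates` is assumed to look like the templatize output
    described above: {token_text: [template_expr, ...]}."""
    for token, templates in token_templates.items():
        text = text.replace(token, '{{ %s }}' % templates[0])
    return text
```

For instance, to_tornado_template('James Stewart has the highest rating.', {'James Stewart': ['df["name"].iloc[0]']}) would yield '{{ df["name"].iloc[0] }} has the highest rating.'.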
/label ~bug
Currently, an inflection is defined as one of the five following modifications to a word:
The detection mechanism works by comparing the lemmas of two words and trying to find which of the above are applied to the first argument to convert it into the second argument. This only detects one inflection at a time, whereas there may be more. For example:
# Say `df` is the actors dataset
# (https://github.com/gramener/gramex-nlg/blob/dev/nlg/tests/data/actors.csv)
from nlg.utils import load_spacy_model
from nlg.grammar import _token_inflections
nlp = load_spacy_model()
text = "James Stewart is the highest rated actor."
doc = nlp(text)
X = nlp('Actors')[0] # This is df['category'].iloc[0]
Y = doc[-2]
print(_token_inflections(X, Y))
# <function singular>
whereas it should indicate that the inflection is both a singularization and a lowercasing.
Also, the API is inconsistent: the first three modifications are Python string methods, but the other two are NLG functions, so the detector may return either a callable or a string naming a string method 🤦‍♂️
ToDo
Nuggets should have a .clone() method, which copies a nugget but assigns a different set of FormHandler arguments. This would help auto-generate new nuggets.
E.g., in the context of the actors dataset:
fh_args = {'_sort': ['-rating']}
doc = nlp('James Stewart is the actor who has the highest rating.')
nugget = nlg.templatize(doc, fh_args, df)
new_nugget = nugget.clone({'_sort': ['rating']}) # note the change in sort order
new_nugget.render(df)
# Katharine Hepburn is the actress who has the highest rating.
CAUTION: This will only work correctly if all tokens have been properly templatized. Note that the token "highest", left un-templatized, stays as it is even when the sort order changes. In this case, the token will have to be made sensitive to the sorting order (see #38). But in general, any token left un-templatized may be affected.
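A minimal stand-in sketch of the proposed method follows. The Nugget class below is a dummy used only for illustration, and the fh_args attribute name is an assumption, not the library's internal name:

```python
import copy

class Nugget:  # stand-in for nlg.narrative.Nugget, illustration only
    def __init__(self, text, fh_args):
        self.text = text
        self.fh_args = fh_args  # assumed attribute name

    def clone(self, fh_args):
        """Deep-copy this nugget, but attach a different set of
        FormHandler arguments, leaving the original untouched."""
        new = copy.deepcopy(self)
        new.fh_args = fh_args
        return new
```

Cloning with {'_sort': ['rating']} then gives an independent nugget whose render would follow the new sort order, while the original keeps {'_sort': ['-rating']}.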
Use the text generation framework:
chart data structure -> analysis metric -> calibration -> entities -> template
to generate captions for the following types of charts:
In the variable settings modal, find a way to support simple if-else logic. Consider the following sentences:
A randomly generated number is positive.
A randomly generated number is negative.
This should be templatized as:
{% import random %}
{% set n = random.choice([-1, 1]) * random.random() %}
A randomly generated number is {% if n > 0 %}positive{% else %}negative{% end %}.
In the context of the actors dataset,
doc = nlp('James Stewart has the highest rating.')
fh_args = {'_sort': ['-rating']}
nugget = templatize(doc, fh_args, df)
print(nugget)
{% set fh_args = {"_sort": ["-rating"]} %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
{{ df["name"].iloc[0] }} is the {{ G.singular(df["category"].iloc[-2]).lower() }} with the highest rating.
The adjective "highest" should be templatized in the following manner:
Feedback from @mankoven
From the README example, the auto-generated template turns out to be:
{% set fh_args = {"_by": ["species"], "_c": ["sepal_width|avg"], "_sort": ["sepal_width|avg"]} %}
{% set df = U.gfilter(orgdf, fh_args.copy()) %}
{% set fh_args = U.sanitize_fh_args(fh_args, orgdf) %}
{# Do not edit above this line. #}
The {{ df["{{ fh_args['_by'][0] }}"].iloc[0] }} {{ fh_args['_by'][0] }} has the least average sepal_width.
The "virginica" token turns out to be a nested template - the correct value is {{ df["species"].iloc[0] }}. But "species" itself is another variable (the very next word) with value {{ fh_args['_by'][0] }}, and therefore gets re-templatized. This is because templates are added to the source text with str.replace here.
They should instead be replaced by changing spaCy tokens and forming new spaCy docs.
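The bug can be reproduced in pure Python, independent of NLG: replacing tokens sequentially with str.replace lets a later replacement rewrite text inside an earlier template.

```python
text = 'The virginica species has the least average sepal_width.'
# First, the cell value "virginica" becomes a template:
text = text.replace('virginica', '{{ df["species"].iloc[0] }}')
# Then the column name "species" is templatized -- including the copy
# that now sits INSIDE the first template's df["species"] expression:
text = text.replace('species', "{{ fh_args['_by'][0] }}")
```

After the second replace, text contains the nested df["{{ fh_args['_by'][0] }}"] seen in the README example above, because plain string replacement cannot tell a word in the sentence apart from the same word inside an already-inserted template.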
ValueError: spacy.strings.StringStore size changed, may indicate binary incompatibility. Expected 112 from C header, got 88 from PyObject
The above error is thrown while importing the spaCy matcher:
from spacy.matcher import PhraseMatcher
spaCy version: 2.2
Platform: MacBook
Resolution:
Downgrade to spacy 2.1
pip install spacy==2.1
While creating nuggets using IMDB rating data, none of the nugget text is being autodetected.
Add a verbosity parameter to nlg.search.templatize. It could have the following levels:
0: ignore all warnings
1 (default): raise user warnings
2: log all NEs and noun-chunk matches
3: log inflections
For every level > 1, whatever is logged is in addition to the previous level, i.e. at level 3, NEs and NP chunks are also logged in addition to inflections.
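The cumulative semantics can be sketched like this. The report helper and its argument names are made up for illustration; this is not the library's API:

```python
def report(verbose, user_warnings, matches, inflections):
    """Each verbosity level logs everything the previous level does, plus more.
    Returns the list of messages that would be emitted at this level."""
    messages = []
    if verbose >= 1:
        messages += user_warnings
    if verbose >= 2:
        messages += matches       # NEs and noun-chunk matches
    if verbose >= 3:
        messages += inflections
    return messages
```

So report(0, ...) stays silent, level 1 emits only warnings, and level 3 emits warnings, matches, and inflections together.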
We need a way to reorder templates in a Narrative. One approach is to provide UP / DOWN arrows next to each template; this would allow reordering.
In the context of the actors dataset, if any name (e.g. "Humphrey Bogart") is passed to the templatize function:
~/src/nlg/nlg/search.py in templatize(text, args, df)
510 {{ df["species"].iloc[1] }} and {{ df["species"].iloc[-1] }}.
511 """
--> 512 dfix, clean_text, infl = _search(text, args, df)
513 return narrative.Nugget(clean_text, dfix, infl, args)
514
~/src/nlg/nlg/search.py in _search(text, args, df, copy)
441 dfix.update(search_args(dfs.ents, args))
442 dfix.clean()
--> 443 inflections = grammar.find_inflections(dfix, args, df)
444 _infl = {}
445 for token, funcs in inflections.items():
~/src/nlg/nlg/grammar.py in find_inflections(search, fh_args, df)
256 rendered = Template('{{{{ {} }}}}'.format(tmpl)).generate(
257 df=df, fh_args=fh_args).decode('utf8')
--> 258 if rendered != token.text:
259 x = nlp(rendered)[0]
260 infl = _token_inflections(x, token)
AttributeError: 'tuple' object has no attribute 'text'
Some column names, even when they are nouns, do not get picked up in input sentences. Run a reverse search of the input spaCy documents once the default dataframe search is complete.
Too many functions deal partly with strings and partly with spaCy docs. Strings should only be the entry point for the public API; everything else should work almost entirely with spacy.{docs, spans, tokens}.
Datasets may have hyphenated column names or names separated by underscores, like the iris dataset.
Initialize the spaCy tokenizer with modified infix rules to handle this.
nlg.utils.ner wrongly clubs named entities and nouns together. This is a problem because NEs undergo a literal search, whereas nouns should undergo a lemmatized search. E.g., in "Inception is a good movie.", the word "movie", if it appears in an inflected form in the dataframe, will not be correctly templatized because NER clubs it with NEs like "Inception".
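The intended distinction can be sketched as follows. This is a spaCy-free illustration: the naive suffix-stripping stands in for real lemmatization, and found_in is a made-up helper, not the library's code:

```python
def found_in(token, values, is_named_entity):
    """NEs get a literal search; common nouns get a lemma-level search.
    The rstrip('s') lemmatizer is deliberately naive, for illustration."""
    if is_named_entity:
        return token in values
    lemma = token.lower().rstrip('s')
    return any(lemma == v.lower().rstrip('s') for v in values)
```

Here 'movies' matches a dataframe column containing 'movie' via the lemma path, while 'Inception' only matches the exact string, which is the behavior the issue asks for.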
We would like functionality to identify derived quantities from the dataframe, for example the length of the dataframe.
I created a few nuggets, put them in a dictionary, and tried to dump the dictionary using pickle, but it raises an error.
nugget_map = {}
nugget_map['nugget1'] = nugget1
nugget_map['nugget2'] = nugget2
with open('nugget_map.pickle', 'wb') as handle:
    pickle.dump(nugget_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
PicklingError Traceback (most recent call last)
in
1 data_train=pd.read_csv('data_new.csv')
----> 2 nugget_map=train_data(data_train)
in train_data(train)
85 #nugget_map = pd.DataFrame.from_dict(nugget_map)
86 with open('nugget_map.pickle', 'wb') as handle:
---> 87 pickle.dump(nugget_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
88 return nugget_map
PicklingError: Can't pickle <function at 0x7f30354f98c8>: attribute lookup on nlg.narrative failed
Can you please check this, or share a workaround for dumping the nugget map?
Do not tokenize / process / search FormHandler arguments as they are: leading hyphens cause the tokens to get corrupted. For example:
>>> df = pd.read_csv('actors.csv')
>>> fh_args = {'_sort': ['-rating']}
>>> text = nlp('James Stewart is the actor with the highest rating.')
>>> n = templatize(text, fh_args, df)
>>> n.render(df)
b'James Stewart is the actor with the highest -rating.'
Instead, preprocess FormHandler arguments with gramex.data._filter_{sort, select, groupby}_columns to get a cleaner representation that leaves the token untouched. Use this representation in the template.
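For the sort case, the cleaner representation could look like the sketch below. The real fix would call gramex.data's filter helpers; parse_sort is a hand-rolled stand-in shown only to illustrate the shape of the result:

```python
def parse_sort(spec):
    """Split a FormHandler sort spec like '-rating' into (column, ascending),
    so only the bare column name ever reaches the NLP tokenizer."""
    ascending = not spec.startswith('-')
    return spec.lstrip('-'), ascending
```

With this, {'_sort': ['-rating']} becomes [('rating', False)], the template can refer to the clean column name 'rating', and the rendered text no longer leaks the '-rating' token.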