vanna-ai / vanna
Chat with your SQL database. Accurate Text-to-SQL Generation via LLMs using RAG.
Home Page: https://vanna.ai/docs/
License: MIT License
The chart generated by vn.ask() cannot be replotted if it is not to the user's liking, but it should be possible. Otherwise, users have to reshape the resulting df and do the plotting themselves, which diminishes the value Vanna provides in this instance.
Can we create mermaid charts that do flow diagrams for how a SQL statement gets executed and the different entities involved?
vn.get_training_data() should return:
- id
- type (question-sql, ddl, documentation)
- data
Is there a way to get a confidence score for how likely the generated SQL is to be correct, or whether there are enough similar queries to give Vanna sufficient context to generate the SQL? Maybe this could be calculated via embedding distances.
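One way to sketch the embedding-distance idea: score a new question by its similarity to the nearest training questions. This is a hypothetical heuristic, not an existing Vanna API; confidence_score and cosine_similarity are illustrative names.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def confidence_score(question_embedding, training_embeddings, k=3):
    """Rough confidence: mean similarity to the k nearest training questions.

    Near 1.0 means Vanna has seen very similar questions; near 0 means the
    question is far from anything in the training set.
    """
    sims = sorted(
        (cosine_similarity(question_embedding, t) for t in training_embeddings),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)
```

A calibration step (mapping raw similarity to an actual probability of correctness) would still be needed before showing this to users.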
"Model" is a lot easier for people to understand.
Can we print out some key stats for a dataset, like name, description, training questions, asked questions, successful questions, visibility, users, admins, first question date, last question date, etc.?
We should include who has permission to perform each action (e.g., admin or anyone), especially for the dataset/write functions.
Add a parameter to vn.generate_plotly_code so that the user can specify whether they want a line chart vs. a bar chart, etc.
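A minimal sketch of how that parameter could be threaded into the prompt that produces the Plotly code. This is hypothetical, not the current signature; in the real function the returned prompt would be sent to the LLM.

```python
def generate_plotly_code(question, df_metadata, chart_type=None):
    """Hypothetical sketch: let the user request a specific chart type.

    chart_type: e.g. "line", "bar", "scatter"; None keeps today's behavior
    of letting the model pick.
    """
    prompt = (
        f"Generate Plotly code to visualize the answer to: {question}\n"
        f"Available columns: {df_metadata}"
    )
    if chart_type is not None:
        # Explicit instruction overrides the model's default chart choice.
        prompt += f"\nThe user explicitly wants a {chart_type} chart."
    return prompt
```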
We need documentation for contributors on how to develop, test, etc. Please use flake8 for linting.
Could we have a delete_dataset function?
Right now vn.ask() prints the table in markdown format when you run it in a notebook. We should get it to display using the native display widget.
Should we make a vn.use_df function that loads data into sqlite and connects to it so that you can run Vanna on dataframes that you might have brought in via CSV or some other method?
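A minimal sketch of what vn.use_df could do, assuming pandas: load the DataFrame into an in-memory SQLite database and hand back a query function. The names here are illustrative, not an existing API.

```python
import sqlite3

import pandas as pd

def use_df(df, table_name="df"):
    """Load a DataFrame into in-memory SQLite and return a run_sql function,
    so Vanna can query CSV/DataFrame data like any other database."""
    conn = sqlite3.connect(":memory:")
    df.to_sql(table_name, conn, index=False, if_exists="replace")

    def run_sql(sql):
        # Results come back as a DataFrame, matching the other connectors.
        return pd.read_sql_query(sql, conn)

    return run_sql
```

One design question: whether the SQLite connection should be kept open for the session or recreated per call; the sketch keeps it open via the closure.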
Running vn.train() returns True regardless of whether the SQL that is trained is correct or not. It also does not show whether the question being trained already exists; if it does, what does vn.train() do? This raises the following issues:
- True is not informative: it gives the impression that the trained SQL is correct, but it might not be. I was able to train on erroneous SQL queries and it returned True as well. Should False be returned in that case?
- What happens when vn.train() is run with an existing question?
Can Vanna automatically get the last X historical SQL queries from data warehouses that support this functionality, like Snowflake and BigQuery, if provided the connection?
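For Snowflake, the query history is exposed through the ACCOUNT_USAGE.QUERY_HISTORY view (reading it requires an appropriately privileged role, and rows can lag by up to a few hours). A sketch of building the fetch query; the helper name is hypothetical:

```python
def query_history_sql(limit=100):
    """Build the SQL to pull recent successful SELECT statements from
    Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view. The resulting query_text
    rows could then be fed into training."""
    return (
        "SELECT query_text\n"
        "FROM snowflake.account_usage.query_history\n"
        "WHERE execution_status = 'SUCCESS'\n"
        "  AND query_type = 'SELECT'\n"
        "ORDER BY start_time DESC\n"
        f"LIMIT {int(limit)}"
    )
```

BigQuery has a comparable INFORMATION_SCHEMA jobs view, so the same pattern could apply there with a different query.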
Right now vn.generate_questions only references DDL.
In order to get the Postgres connector to work on a Mac, I had to do:
pip install psycopg2-binary
I think we should consider pg8000 to avoid compatibility issues.
Can we implement a bootstrapping one-line "agent"? For example:
conn = snowflake connection
vn.set_dataset('dataset')
vn.bootstrap()
where bootstrap does the following:
On the Code Reference site, it should be vn.get_models() and vn.get_model(), as shown in the screengrab.
vn.train should take in:
- question: str or None
- sql: str or None
- ddl: str or None
- documentation: str or None
- json_file: str or None
- sql_file: str or None
All parameter defaults should be None, and if the user passes in any invalid combination of parameters it should raise an exception.
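A sketch of that signature with combination checking. The set of valid combinations below is an assumption for illustration (question+sql as the classic training pair; the other inputs standing alone), not a confirmed spec:

```python
def train(question=None, sql=None, ddl=None, documentation=None,
          json_file=None, sql_file=None):
    """Hypothetical sketch: all parameters default to None, and invalid
    combinations raise instead of returning True/False."""
    params = {"question": question, "sql": sql, "ddl": ddl,
              "documentation": documentation, "json_file": json_file,
              "sql_file": sql_file}
    provided = {k for k, v in params.items() if v is not None}
    # Assumed valid combinations; a bare question (no sql) is ambiguous.
    valid_combos = [
        {"question", "sql"}, {"sql"}, {"ddl"},
        {"documentation"}, {"json_file"}, {"sql_file"},
    ]
    if provided not in valid_combos:
        raise ValueError(f"Invalid parameter combination: {sorted(provided)}")
    # The real function would store the training data; return it here
    # so the sketch is observable.
    return {k: params[k] for k in provided}
```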
Instead of returning True/False, output nothing when status.success is true; otherwise raise an exception with status.message.
The ability to send in
and have Vanna automatically train against a dataset. For 2, we would need to auto-generate the questions as well.
This function will take in a question and use the training data as a reference to answer questions about the data, instead of returning SQL.
In order to avoid confusion and also to make the dataset name URL-safe, on input of the dataset name we should:
-
There should be a deterministic mapping of the input dataset string to the actual dataset name so that users can do vn.set_dataset('my WEirD dataset name!') and it will still work
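A minimal sketch of such a deterministic mapping (the normalization rules here, lowercase plus hyphen-collapsing, are an assumption; the function name is illustrative):

```python
import re

def normalize_dataset_name(name):
    """Deterministic, URL-safe mapping: lowercase the input, collapse runs
    of non-alphanumeric characters into single hyphens, trim edge hyphens.

    Because the mapping is a pure function of the input, calling
    set_dataset('my WEirD dataset name!') always resolves to the same
    stored name.
    """
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
```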
Can we have Vanna generate documentation for tables and columns automatically? For example:
vn.generate_docs(entity='table', name='<tablename>')
would generate a docstring for a particular table, and
vn.generate_docs(entity='column', name='<columnname>')
would generate a docstring for a particular column.
Perhaps there could also be a flag on the table call to generate docs for the columns within that table automatically?
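A sketch of the prompt-building side of such a function. Everything here is hypothetical (the signature, the columns flag behavior); the real implementation would send the prompt to the LLM and store the result as documentation training data:

```python
def generate_docs(entity, name, columns=None):
    """Hypothetical sketch: build the LLM prompt that would produce a
    docstring for a table or column."""
    if entity not in ("table", "column"):
        raise ValueError("entity must be 'table' or 'column'")
    prompt = f"Write concise documentation for the {entity} named {name}."
    if entity == "table" and columns:
        # Flag-style behavior: also document each column in the table.
        prompt += " Also document these columns: " + ", ".join(columns)
    return prompt
```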
Add GH action that:
nbconvert
The required ENV variables should be fed from GH secrets into the action's context (find their values in Slack):
VANNA_API_KEY=xxx
VANNA_MODEL=xxx
SNOWFLAKE_ACCOUNT=xxx
SNOWFLAKE_USERNAME=xxx
SNOWFLAKE_PASSWORD=xxx
SNOWFLAKE_DATABASE=xxx
We'll likely have to begin using setuptools for this
Instead of manually adding the DDL using store_ddl(), can Vanna automatically get the DDL directly from the database if provided the connection string, and add each table separately? At least for Snowflake, BigQuery, and Postgres?
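For Postgres, one way to sketch this is reconstructing a minimal CREATE TABLE from information_schema.columns (Snowflake has GET_DDL() built in, which would be simpler there). The helper names are hypothetical, and real DDL has constraints, defaults, and keys that this ignores:

```python
# Query to pull column metadata for one table (Postgres information_schema;
# %s is the psycopg2/pg8000 parameter placeholder).
COLUMNS_SQL = (
    "SELECT column_name, data_type FROM information_schema.columns "
    "WHERE table_name = %s ORDER BY ordinal_position"
)

def build_ddl(table, columns):
    """columns: list of (name, data_type) rows as returned by COLUMNS_SQL.

    Returns a minimal CREATE TABLE statement suitable for training, one
    table at a time.
    """
    body = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} ({body})"
```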
If one query has an error, the subsequent queries to postgres will fail. We either need to open a new connection for each query or we need to do a rollback on error.
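The rollback option could look like this. sqlite3 is used as a stand-in so the sketch is self-contained; with psycopg2 the conn.rollback() pattern is the same (Postgres aborts the whole transaction after an error until a rollback clears it):

```python
def run_sql_safely(conn, sql):
    """Run a query; on error, roll back so the connection stays usable for
    subsequent queries instead of staying in an aborted transaction."""
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    except Exception:
        conn.rollback()  # clear the failed transaction state
        raise
```

Opening a fresh connection per query would also work but costs a round trip each time, so rollback-on-error is likely the cheaper fix.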
When we do vn.generate_sql, we pull from the cache if the SQL already exists. However, if the SQL is flagged or otherwise not in the training set, we should bypass the cache.
Our tests are on the server repo -- they need to be migrated to this repo