taoyds / spider
Scripts and baselines for Spider: Yale complex and cross-domain semantic parsing and text-to-SQL challenge
Home Page: https://yale-lily.github.io/spider
The following prediction is labeled as wrong:
predict: select Citizenship from singer where Birth_Year > 'terminal' intersect select Citizenship from singer where Birth_Year < 'terminal'
gold standard: SELECT Citizenship FROM singer WHERE Birth_Year < 1945 INTERSECT SELECT Citizenship FROM singer WHERE Birth_Year > 1955
But isn't it correct? I used evaluation.py and got this result. Could you double check evaluation.py? Thanks for your time!
Hi Tao,
Could you tell me when you plan to open the submission process?
Thanks a lot!
Zecheng
Steps to find the wrong .sql file:
Open database/wta_1/wta_1.sql. The line
CREATE TABLE players(
is garbled in the file as
CRloser_rank_pointsEATE TABLE players(
This is the origin of #52.
The wta_1.sqlite file should also be updated.
I've just run:
../evaluation.py --gold dev_gold.sql --pred dev_gold.sql --db database --table tables.json --etype all
Expected output: everything has 1.000
Actual output:
easy medium hard extra all
count 248 446 174 166 1034
===================== EXECUTION ACCURACY =====================
execution 1.000 0.996 1.000 1.000 0.998
====================== EXACT MATCHING ACCURACY =====================
exact match 1.000 1.000 1.000 1.000 1.000
---------------------PARTIAL MATCHING ACCURACY----------------------
select 1.000 1.000 1.000 1.000 1.000
select(no AGG) 1.000 1.000 1.000 1.000 1.000
where 1.000 1.000 1.000 1.000 1.000
where(no OP) 1.000 1.000 1.000 1.000 1.000
group(no Having) 1.000 1.000 1.000 1.000 1.000
group 1.000 1.000 1.000 1.000 1.000
order 1.000 1.000 1.000 1.000 1.000
and/or 1.000 1.000 1.000 1.000 1.000
IUEN 0.000 0.000 1.000 1.000 1.000
keywords 1.000 1.000 1.000 1.000 1.000
---------------------- PARTIAL MATCHING RECALL ----------------------
select 1.000 1.000 1.000 1.000 1.000
select(no AGG) 1.000 1.000 1.000 1.000 1.000
where 1.000 1.000 1.000 1.000 1.000
where(no OP) 1.000 1.000 1.000 1.000 1.000
group(no Having) 1.000 1.000 1.000 1.000 1.000
group 1.000 1.000 1.000 1.000 1.000
order 1.000 1.000 1.000 1.000 1.000
and/or 1.000 1.000 1.000 1.000 1.000
IUEN 0.000 0.000 1.000 1.000 1.000
keywords 1.000 1.000 1.000 1.000 1.000
---------------------- PARTIAL MATCHING F1 --------------------------
select 1.000 1.000 1.000 1.000 1.000
select(no AGG) 1.000 1.000 1.000 1.000 1.000
where 1.000 1.000 1.000 1.000 1.000
where(no OP) 1.000 1.000 1.000 1.000 1.000
group(no Having) 1.000 1.000 1.000 1.000 1.000
group 1.000 1.000 1.000 1.000 1.000
order 1.000 1.000 1.000 1.000 1.000
and/or 1.000 1.000 1.000 1.000 1.000
IUEN 1.000 1.000 1.000 1.000 1.000
keywords 1.000 1.000 1.000 1.000 1.000
I am okay with the 0.000 entries, but not with the execution accuracy of the medium section being 0.996.
Both are implemented as:
if count == total:
    return 1
return 0
Hi Tao,
I'm confused about some GROUP BY labels, shown below:
1. Show the names of singers that have more than one song.
SELECT T1.Name FROM singer AS T1 JOIN song AS T2 ON T1.Singer_ID = T2.Singer_ID GROUP BY T1.Name HAVING COUNT(*) > 1
2. What are the names of the singers that have more than one songs?
SELECT T1.Name FROM singer AS T1 JOIN song AS T2 ON T1.Singer_ID = T2.Singer_ID GROUP BY T1.Name HAVING COUNT(*) > 1
3. find the name of employee who was awarded the most times in the evaluation.
SELECT t1.name FROM employee AS t1 JOIN evaluation AS t2 ON t1.Employee_ID = t2.Employee_ID GROUP BY t2.Employee_ID ORDER BY count(*) DESC LIMIT 1
4. Which employee received the most awards in evaluations? Give me the employee name.
SELECT t1.name FROM employee AS t1 JOIN evaluation AS t2 ON t1.Employee_ID = t2.Employee_ID GROUP BY t2.Employee_ID ORDER BY count(*) DESC LIMIT 1
Could you explain why 1 and 2 group by the name but 3 and 4 group by the primary key?
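For concreteness, here is a toy sqlite3 illustration (with made-up data, not from the Spider databases) of when the two groupings differ: two distinct rows can share a name, so grouping by the name column merges them while grouping by the primary key keeps them apart.

```python
import sqlite3

# Toy data (hypothetical): two different employees share the same name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee(Employee_ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE evaluation(Employee_ID INTEGER);
    INSERT INTO employee VALUES (1, 'Kim'), (2, 'Kim');
    INSERT INTO evaluation VALUES (1), (1), (2);
""")

# Grouping by name merges the two distinct employees into one group.
by_name = conn.execute(
    "SELECT t1.name, count(*) FROM employee AS t1 "
    "JOIN evaluation AS t2 ON t1.Employee_ID = t2.Employee_ID "
    "GROUP BY t1.name").fetchall()

# Grouping by the primary key keeps them separate.
by_key = conn.execute(
    "SELECT t1.name, count(*) FROM employee AS t1 "
    "JOIN evaluation AS t2 ON t1.Employee_ID = t2.Employee_ID "
    "GROUP BY t2.Employee_ID").fetchall()
```

When names are unique, as in the singer examples, the two forms coincide; the primary-key grouping is the one that is always safe.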
Thanks a lot,
Zecheng
I was trying to retrain the SQLNet model on train_spider.json. After it loads all the data and glove.txt, it gives an error at line 135 of utils.py: "IndexError: too many indices for array".
print ' Loss = %s'%epoch_train(model, optimizer, BATCH_SIZE, sql_data, table_data, schemas, TRAIN_ENTRY)
File "/content/drive/My Drive/textsql/sqlnet/scripts/utils.py", line 135, in epoch_train
cum_loss += loss.data.cpu().numpy()[0]*(ed - st)
IndexError: too many indices for array
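For what it's worth, this IndexError typically means `loss.data.cpu().numpy()` now returns a 0-d array that can no longer be indexed with `[0]`. A small NumPy-only sketch of the symptom and the usual workaround (the variable names and batch boundaries are hypothetical):

```python
import numpy as np

# A 0-d array, as newer NumPy/PyTorch versions return for a scalar loss
loss_value = np.array(0.25, dtype=np.float32)

# The old pattern fails: a 0-d array has no index 0
try:
    _ = loss_value[0]
except IndexError:
    pass  # "too many indices for array"

# Workaround: extract the scalar with .item() instead of [0]
st, ed = 0, 32  # hypothetical batch boundaries
cum_loss = 0.0
cum_loss += loss_value.item() * (ed - st)
```

In newer PyTorch the idiomatic fix is `loss.item()` directly on the tensor, skipping the `.data.cpu().numpy()` chain entirely.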
For example, for "How many actors have appeared in each musical?", why is it GROUP BY "T1.Musical_ID" instead of "T2.Name"?
SELECT T2.Name , COUNT(*) FROM actor AS T1 JOIN musical AS T2 ON T1.Musical_ID = T2.Musical_ID GROUP BY T1.Musical_ID
Hi Tao,
In the released dataset, queries with more than two "table_units" are not parsed correctly.
For example:
In dev.json line 7431.
"query": "SELECT count(*) FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid = T2.stuid JOIN pets AS T3 ON T2.petid = T3.petid WHERE T1.sex = 'F' AND T3.pettype = 'dog'"
The 'from' part of the above query is parsed to:
"from": {"conds": [ [ false, 2, [ 0, [ 0, 1, false ], null ], [ 0, 9, false ], null ] ], "table_units": [ [ "table_unit", 0 ], [ "table_unit", 1 ] ] }
Meanwhile, the script named parse_sql_one.py would give the correct parse:
'from': {'table_units': [('table_unit', 0), ('table_unit', 1), ('table_unit', 2)], 'conds': [(False, 2, (0, (0, 1, False), None), (0, 9, False), None), 'and', (False, 2, (0, (0, 10, False), None), (0, 11, False), None)]}
I wonder if you have done the baseline experiments using the released dataset.
Thanks,
Yibo
Hi Tao,
I found some strange SQL labels involving GROUP BY, shown below.
show all document id , name and the number of paragraph in each document.
SELECT T1.document_id , T2.document_name , count(*) FROM Paragraphs AS T1 JOIN Documents AS T2 ON T1.document_id = T2.document_id GROUP BY T1.document_id
According to the SQL grammar, it should be labeled as:
SELECT T1.document_id , T2.document_name , count(*) FROM Paragraphs AS T1 JOIN Documents AS T2 ON T1.document_id = T2.document_id GROUP BY T1.document_id, T2.document_name
Thanks,
Zecheng
Respected Tao,
I am new to opening an issue on Github as well as in conversion from text to SQL.
I have a few questions regarding the code in Spider. I read the technical paper mentioned at "https://github.com/taoyds/spider" and understood that Spider is basically a dataset that can be used to train a model using deep neural networks.
Suppose my domain is completely different; can I use this code "https://github.com/taoyds/spider"?
I tried running the evaluation.py code, but I am getting AssertionError: Unknown evaluation method in the line- assert etype in ["all", "exec", "match"], "Unknown evaluation method"
In train_spider.json, there is one example with the question What is the description of the type of the company who concluded its contracts most recently?
and the query SELECT T1.company_name FROM Third_Party_Companies AS T1 JOIN Maintenance_Contracts AS T2 ON T1.company_id = T2.maintenance_contract_company_id JOIN Ref_Company_Types AS T3 ON T1.company_type_code = T3.company_type_code ORDER BY T2.contract_end_date DESC LIMIT 1.
But the column name 'company_type_code' is not in the 'assets_maintenance' database. Please check whether there is something wrong with this example. I think the column name should be company_type.
baselines/sqlnet/
python train.py --dataset data/
Hello, it seems that a number of the provided sqlite3 files are completely unpopulated. Is this by design? It doesn't seem to be based on the train/dev/test split.
Example empty datasets:
geo
car_1
wta_1
Hi Tao,
predict:
SELECT T2.name , count(T2.stadium_id) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T2.name
gold:
SELECT T2.name , count(*) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id concert_singer
1. Exact match reports wrong because: (1) the GROUP BY columns differ; (2) count(*) and count(T2.stadium_id) differ. This may not be a bug, since they are not exactly the same.
2. eval_exec_match() returns false because count(T2.stadium_id) and count(*) differ. But in fact, they produce the same result.
3. If I change count(T2.stadium_id) to count(*) in the prediction, eval_exec_match() still returns false, because the two SQL queries return rows in a different order. They are nevertheless the same result.
I think this is a bug.
One more thing:
predict:
SELECT T2.name , avg(T2.stadium_id), count(T2.stadium_id) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T2.name
gold:
SELECT T2.name , max(T2.stadium_id), count(T2.stadium_id) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id concert_singer
This pair of SQL queries gets True from eval_exec_match(), but they actually fetch different data. This problem can happen when two identical columns appear in the SELECT clause.
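To make the row-order point concrete, here is a toy sketch (not the actual evaluation.py logic) that compares two queries' results as multisets of rows, so row order no longer matters; the schema and data below are made up:

```python
import sqlite3
from collections import Counter

# Hypothetical mini-database in the spirit of concert_singer
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stadium(stadium_id INTEGER, name TEXT);
    CREATE TABLE concert(concert_id INTEGER, stadium_id INTEGER);
    INSERT INTO stadium VALUES (1, 'A'), (2, 'B');
    INSERT INTO concert VALUES (10, 1), (11, 1), (12, 2);
""")

def rows_equal_unordered(sql_a, sql_b):
    a = conn.execute(sql_a).fetchall()
    b = conn.execute(sql_b).fetchall()
    return Counter(a) == Counter(b)  # multiset comparison: order-insensitive

same = rows_equal_unordered(
    "SELECT T2.name, count(T2.stadium_id) FROM concert AS T1 "
    "JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T2.name",
    "SELECT T2.name, count(*) FROM concert AS T1 "
    "JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id",
)
```

With a comparison like this, point 3 above would evaluate as a match regardless of the order in which SQLite returns the groups.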
Hi Tao,
I found an incorrect parse example, shown below. The orderBy field is empty and limit is None, even though the query ends with ORDER BY count(*) DESC LIMIT 1.
SQL : {u'orderBy': [], u'from': {u'table_units': [[u'table_unit', 1], [u'table_unit', 2]], u'conds': [[False, 2, [0, [0, 5, False], None], [0, 13, False], None]]}, u'union': None, u'except': None, u'having': [[False, 4, [0, [0, 16, False], None], 2010.0, None]], u'limit': None, u'intersect': None, u'where': [], u'groupBy': [[0, 13, False]], u'select': [False, [[0, [0, [0, 8, False], None]]]]}
query : SELECT T1.Area FROM APPELLATIONS AS T1 JOIN WINE AS T2 ON T1.Appelation = T2.Appelation GROUP BY T2.Appelation HAVING T2.year < 2010 ORDER BY count(*) DESC LIMIT 1
Thanks,
Zecheng
Hi Tao,
I evaluated the first example in gold_example.txt and pred_example.txt.
I want to know why the exact match result comes out to be 1.
The examples are:
gold: SELECT count(*) FROM singer|concert_singer
pred: select count(*) from stadium
The command I used is:
python evaluation.py --gold ./evaluation_examples/gold_small.txt --pred ./evaluation_examples/pred_small.txt --etype match --db ./database/ --table tables.json
Would you please give an explanation about this?
Best,
Bo Zheng
Hi,
Again, thanks for your great job in building such a dataset for nl2sql.
During evaluation, I found inconsistent behavior on the FROM clause.
1. SELECT T2.company_name FROM movie AS T1 JOIN culture_company AS T2 ON T1.movie_id = T2.movie_id WHERE T1.year = 1999
2. SELECT T1.company_name FROM culture_company AS T1 JOIN movie AS T2 ON T1.movie_id = T2.movie_id WHERE T2.year = 1999
3. SELECT T2.company_name FROM movie AS T1 JOIN culture_company AS T2 WHERE T1.year = 1999
4. SELECT DISTINCT T2.company_name FROM movie AS T1 JOIN culture_company AS T2 WHERE T1.year = 1999
The four SQL queries above are treated as equivalent by the published evaluation.py.
However, when it comes to FROM Clause in subquery (query in WHERE Clause), inconsistency occurs.
1. SELECT mID , avg(stars) FROM Rating WHERE mID NOT IN (SELECT T1.mID FROM Rating AS T1 JOIN Reviewer AS T2 ON T1.rID = T2.rID WHERE T2.name = "Brittany Harris") GROUP BY mID
2. SELECT mID , avg(stars) FROM Rating WHERE mID NOT IN (SELECT T2.mID FROM Reviewer AS T1 JOIN Rating AS T2 ON T1.rID = T2.rID WHERE T1.name = "Brittany Harris") GROUP BY mID
3. SELECT mID , avg(stars) FROM Rating WHERE mID NOT IN (SELECT T1.mID FROM Rating AS T1 JOIN Reviewer AS T2 WHERE T2.name = "Brittany Harris") GROUP BY mID
4. SELECT mID , avg(stars) FROM Rating WHERE mID NOT IN (SELECT DISTINCT T1.mID FROM Rating AS T1 JOIN Reviewer AS T2 ON T1.rID = T2.rID WHERE T2.name = "Brittany Harris") GROUP BY mID
These four SQL queries are reported as inequivalent.
I am quite confused about this inconsistency. Is there any special concern about the FROM clause in subqueries?
The following are details for reproducing the inconsistency.
db_id: culture_company
Question: What are all company names that have a corresponding movie directed in the year 1999?
Query: SELECT T1.Company_name FROM culture_company AS T1 JOIN movie AS T2 WHERE T2.Year = 1
db_id: movie_1
Question: Find the average rating star for each movie that are not reviewed by Brittany Harris.
Query: SELECT mID , avg(stars) FROM Rating WHERE mID NOT IN (SELECT T1.mID FROM Rating AS T1 JOIN Reviewer AS T2 ON T1.rID = T2.rID WHERE T2.name = "Brittany Harris") GROUP BY mID
Best,
Jasper
I want to utilize table content (i.e., the cells of a column) in my method. How can I get it?
Thank you.
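One simple approach, since each Spider database ships as a .sqlite file, is to query it directly with the standard library. A minimal sketch (the paths and names below are hypothetical):

```python
import sqlite3

def column_values(db_path, table, column):
    """Return the distinct cell values of one column in a .sqlite database."""
    conn = sqlite3.connect(db_path)
    try:
        # Table/column names cannot be bound as parameters; quote them instead.
        cur = conn.execute('SELECT DISTINCT "{}" FROM "{}"'.format(column, table))
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```

For example, something like `column_values("database/concert_singer/concert_singer.sqlite", "singer", "Name")` would list the singer names, assuming that database file and schema.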
Hi, Matthias,
Did you use Cosette before? What evaluation accuracy could this tool get? It's possible to replace execution accuracy with its results if it is reliable.
Best,
Tao
==== previous issue response ====
We wrote a script to parse SQL queries into smaller clauses/components and then evaluate each clause separately (as a set, so order doesn't matter). SQL queries are equivalent if all their clauses are the same. You can find more details on this page: https://github.com/taoyds/spider/tree/master/evaluation_examples .
The tool you mention looks pretty cool. Unfortunately, we didn't know about it when we were working on this project. We would definitely like to check it out for a future release, and it would be great if anyone could contribute. That said, our evaluation script works well on the current Spider dataset.
Thanks!
Originally posted by @taoyds in #1 (comment)
We are afraid that hard SQL, such as multi-table JOINs, is the limiting factor for industrial application.
Thank you very much.
Can we test our model in the example split experiment?
Dear Tao Yu,
Thanks for the amazing dataset!
Do you have data on how many queries in Spider use the "LIKE" operator? And how many use "PARTITION" (window functions like rank, row_number, etc.)?
Best Regards,
Hans
In this dataset, the table corresponding to each question is given.
But in real industrial applications, we may have 100+ tables for one new question.
Thank you!
Hi
In the evaluation.py
in line 495:
scores[level]['exec'] = 0
should be:
scores[level]['exec'] = 0.
and in line 549:
should add a line:
scores["all"]['exec'] += 1
These bugs give wrong and weird results when evaluating the exec accuracy. Could you fix them? Thanks a lot!
Also, I noticed that even when I copy the gold query exactly as the prediction, the exec accuracy is not 100. Could you check and find the errors?
easy medium hard extra all
count 250 440 174 170 1034
===================== EXECUTION ACCURACY =====================
execution 1.000 0.995 1.000 1.000 0.998
====================== EXACT MATCHING ACCURACY =====================
exact match 1.000 1.000 1.000 1.000 1.000
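If I read the report above correctly, the root cause is Python 2 integer division: initializing the accumulator with the int literal 0 makes the later count/total division truncate. A toy sketch of why the float literal matters (the numbers are illustrative, not from the real run):

```python
# Illustrative counts (made up, not the real Spider numbers)
total, n_correct = 1034, 1032

# Buggy pattern: an int accumulator. Under Python 2 (which evaluation.py
# targets), int / int truncates, so 1032 / 1034 evaluates to 0 there.
score = 0
score += n_correct

# Fixed pattern: start from the float literal 0. (or cast explicitly),
# so the final division is float division on both Python 2 and 3.
score_f = 0.
score_f += n_correct
accuracy = score_f / total
```

This is why the one-character change from `0` to `0.` on line 495 matters for the exec scores.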
Dear Tao Yu,
Thanks for your data!
Well, since you provide the script to process an SQL query string into a formatted JSON object, I wonder if you also have a standardized script to reverse that process?
That is, given a parsed sql json file like follows, do you have the script to convert it back into a sql query string "SELECT avg(LifeExpectancy) FROM country WHERE Name NOT IN (SELECT T1.Name FROM country AS T1 JOIN countrylanguage AS T2 ON T1.Code = T2.CountryCode WHERE T2.Language = \"English\" AND T2.IsOfficial = \"T\")
"?
"sql": { "except": null, "from": { "conds": [], "table_units": [ ... }, "groupBy": [], "having": [], "intersect": null, "limit": null, "orderBy": [], "select": [ ... ], ... }
Or, do you have an evaluation script that compares two SQL queries based on their parsed JSON files?
Thanks in advance.
Best Regards,
Yi Zhao
May I ask when the result is grouped by the primary key, and when by the selected column name? From the training data, it seems hard to find a pattern. That may be the reason why the GROUP BY clause (no HAVING) has low accuracy.
Hello! When I tried to run evaluation.py, it terminated half-way, throwing the following error:
Traceback (most recent call last):
File "evaluation.py", line 866, in
evaluate(gold, pred, db_dir, etype, kmaps)
File "evaluation.py", line 546, in evaluate
exec_score = eval_exec_match(db, p_str, g_str, p_sql, g_sql)
File "evaluation.py", line 626, in eval_exec_match
q_res = cursor.fetchall()
sqlite3.OperationalError: Could not decode to UTF-8 column 'last_name' with text 'Treyes Albarrac?N'
What could be the problem? Thank you in advance!
Related info: python==2.7.17 (installed via pyenv)
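This error usually means the database contains text that is not valid UTF-8. A common workaround (a sketch, not a patch to evaluation.py) is to install a lenient text_factory on the sqlite3 connection before fetching:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simulate a TEXT value containing invalid UTF-8 (0xE9 is a bare Latin-1 byte)
query = "SELECT CAST(x'41E942' AS TEXT)"

# Decode leniently instead of letting fetchall() raise OperationalError
conn.text_factory = lambda b: b.decode("utf-8", errors="replace")
rows = conn.execute(query).fetchall()
```

With the default text_factory, the same fetch raises sqlite3.OperationalError ("Could not decode to UTF-8 ..."). Using `errors="ignore"`, or `conn.text_factory = bytes`, are alternatives; re-encoding the offending database rows is the cleaner long-term fix.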
Hi,
this corpus looks great and I hope it will encourage people to do interesting research!
I'm wondering if you considered an evaluation based on proving equivalence of SQL queries, which can be done automatically by Cosette?
Best,
Matthias
Hi Tao,
I found that some foreign key relations are missing in schema.json. For example, in the SQL below, city joins with countrylanguage, but there is no foreign key between city.CountryCode and countrylanguage.CountryCode in schema.json. Please help double-check, thanks.
SELECT T1.Name , T1.Population FROM city AS T1 JOIN countrylanguage AS T2 ON T1.CountryCode = T2.CountryCode WHERE T2.Language = "English" ORDER BY T1.Population DESC LIMIT 1
For many of the apartment_rentals SQL queries, the questions follow the pattern "What is the booking start and end date...", but the SQL queries have two SELECT booking_start_date projections instead of SELECT booking_start_date, booking_end_date.
This competition benchmark is rapidly improving, to the point of being useful to many underserved communities.
I am trying to make a Python package with a web front end that users can deploy to free tiers of Colab, Azure Notebooks, Amazon SageMaker, or similar.
I would be grateful for ideas on how to encapsulate dependencies and cycle through models. (I managed to get one of the models working on Google Colab, but it was a traumatic experience.)
Thank you!
There are composite key sets in some tables according to the *.sql files. For example, in the college_2 database, the 6th line in the file TextBookExample.sql says "primary key (building, room_number)", which indicates that building and room_number form a composite key for this table. However, in tables.json, building is annotated as the only primary key of the table. I notice this issue also exists in some other databases.
I ran the preprocess_train_dev_data.py script as directed in the README, but got a training set that is incompatible with the training pipeline: it is missing gt_col fields for every item and has string labels instead of numbers. Is this a bug on my end, or is the preprocessing script different from the one that produced the released datasets?
How does one add your data augmentation queries to the pipeline? do you just create a training json file full of the fake queries and then feed it to the preprocessor and train along with the real queries? The paper is unclear about what configuration achieved your best result.
Hi,
I want to train the model on the Spider dataset, but it's a little bit slow, so I want to train on a GPU.
Can anyone tell me how to use a GPU instead of the CPU for training TypeSQL (train.py)?
Thank you.
Hi Tao,
I see there is a nl2code folder inside baselines folder.
Q1: What is the performance of this baseline?
Q2: Since I'm not familiar with JS, I am stuck at the preprocessing scripts (currently at parser-ast.js) and cannot run the code. Would you provide a runnable version?
Thanks!
Yibo
In the leaderboard, SyntaxSQLNet's exact-matching dev and test accuracy is stated as 18.9 and 19.7, respectively. But in the paper, the "best" (boldfaced) dev and test accuracy is 27.2% and 24.8%, respectively, and those 18.9% and 19.7% figures are from the ablation study where data augmentation was omitted in training.
This doesn't seem to be explained in the paper or on the website. Why is the accuracy achieved with data augmentation not used in the leaderboard? Should other models also refrain from this data augmentation technique in order to compare appropriately against SyntaxSQLNet?
Hi Tao,
The table schema (tables.json) is missing a lot of "foreign_keys" pairs. For example: table "paper" has primary key "paper_id", and table "paper key phrase" also has column "paper_id", but there is no "foreign_keys" relation between the two columns.
Similar cases: customer id:: [contacts, customer], product id::[store product, product]
During evaluation, this causes extra errors.
Thanks,
Yan
May I ask where I should find the test.json file for the baseline seq2seq_attention_copy? Or what content should be in test.json? I have been confused.
I came across a bug in the evaluation script which causes some query predictions with wrong JOIN conditions to be evaluated as correct.
The bug can be reproduced using the following example:
plist = [("SELECT shop.Name FROM hiring JOIN shop ON hiring.Employee_ID = hiring.Shop_ID GROUP BY hiring.Shop_ID ORDER BY COUNT(*) DESC LIMIT 1",
"employee_hire_evaluation")]
glist = [("SELECT shop.Name FROM shop JOIN hiring ON shop.Shop_ID = hiring.Shop_ID GROUP BY hiring.Shop_ID ORDER BY COUNT(*) DESC LIMIT 1",
"employee_hire_evaluation")]
Here the two SQL queries have different JOIN conditions, but the script will determine them as exact match.
easy medium hard extra all
count 0 0 0 1 1
====================== EXACT MATCHING ACCURACY =====================
exact match 0.000 0.000 0.000 1.000 1.000
---------------------PARTIAL MATCHING ACCURACY----------------------
select 0.000 0.000 0.000 1.000 1.000
select(no AGG) 0.000 0.000 0.000 1.000 1.000
where 0.000 0.000 0.000 0.000 0.000
where(no OP) 0.000 0.000 0.000 0.000 0.000
group(no Having) 0.000 0.000 0.000 1.000 1.000
group 0.000 0.000 0.000 1.000 1.000
order 0.000 0.000 0.000 1.000 1.000
and/or 0.000 0.000 0.000 1.000 1.000
IUEN 0.000 0.000 0.000 0.000 0.000
keywords 0.000 0.000 0.000 1.000 1.000
I dug into the evaluation code -- the SQL parser implemented in process_sql.py produces different parses for the two SQL queries, so the error is likely in the evaluation logic.
Parse of from clause of the first query:
"from": {
"table_units": [
[
"table_unit",
"__hiring__"
],
[
"table_unit",
"__shop__"
]
],
"conds": [
[
false,
2,
[
0,
[
0,
"__hiring.employee_id__",
false
],
null
],
[
0,
"__hiring.shop_id__",
false
],
null
]
]
}
Parse of from clause of the second query:
"from": {
"table_units": [
[
"table_unit",
"__shop__"
],
[
"table_unit",
"__hiring__"
]
],
"conds": [
[
false,
2,
[
0,
[
0,
"__shop.shop_id__",
false
],
null
],
[
0,
"__hiring.shop_id__",
false
],
null
]
]
}
Hi,
Even though our group spent a lot of time and effort on creating the Spider dataset, there definitely exist some annotation errors. We would appreciate your input if you report your findings here. Our group will try our best to correct the errors in our next release.
Thanks for your interest!
Best,
Tao
Command:
../evaluation.py --gold train_gold.sql --pred train_gold.sql --db database --table tables.json --etype all
Output:
Traceback (most recent call last):
File "../evaluation.py", line 866, in <module>
evaluate(gold, pred, db_dir, etype, kmaps)
File "../evaluation.py", line 505, in evaluate
g_sql = get_sql(schema, g_str)
File "/home/nikolai.korolev/workspace/spider/process_sql.py", line 552, in get_sql
_, sql = parse_sql(tokens, 0, tables_with_alias, schema)
File "/home/nikolai.korolev/workspace/spider/process_sql.py", line 504, in parse_sql
from_end_idx, table_units, conds, default_tables = parse_from(tokens, start_idx, tables_with_alias, schema)
File "/home/nikolai.korolev/workspace/spider/process_sql.py", line 389, in parse_from
idx, table_unit, table_name = parse_table_unit(tokens, idx, tables_with_alias, schema)
File "/home/nikolai.korolev/workspace/spider/process_sql.py", line 259, in parse_table_unit
key = tables_with_alias[tokens[idx]]
KeyError: 'ref_company_types'
For the question below, why is it labeled as the first SQL instead of the second one?
What are the names and ids of artists with 3 or more albums, listed in alphabetical order?
SELECT T2.Name , T1.ArtistId FROM ALBUM AS T1 JOIN ARTIST AS T2 ON T1.ArtistId = T2.ArtistID GROUP BY T1.ArtistId HAVING COUNT(*) >= 3 ORDER BY T2.Name
SELECT T2.Name , T2.ArtistId FROM ALBUM AS T1 JOIN ARTIST AS T2 ON T1.ArtistId = T2.ArtistID GROUP BY T1.ArtistId HAVING COUNT(*) >= 3 ORDER BY T2.Name
Thanks :)
Hi, Tao
These DBs (twitter_1, chinook_1, world_1, small_bank_1, flight_4, icfp_1, epinions_1, voter_1, company_1) are missing the "schema.sql" file, and these DBs (eta_1, flight_2, car_1, wine_1, inn_1, student_1, formula_1) have empty .sql files.
Do all DBs have "schema.sql" files for the test set?
Hi,
Thanks for your great job in building such a dataset for nl2sql.
I am now conducting a study of the characteristics of the SQL in it, and I find some parsed results of the FROM clause confusing.
Here is an example. The FROM clause in the query contains two JOIN operations with three tables involved, but the parsed table_units contain only two of the tables.
{
  "db_id": "department_management",
  "query": "SELECT DISTINCT T1.creation FROM department AS T1 JOIN management AS T2 ON T1.department_id = T2.department_id JOIN head AS T3 ON T2.head_id = T3.head_id WHERE T3.born_state = 'Alabama'",
  "sql": {
    "except": null,
    "from": {
      "conds": [[false, 2, [0, [0, 1, false], null], [0, 11, false], null]],
      "table_units": [["table_unit", 0], ["table_unit", 2]]
    },
    ...
  }
}
Based on my further inspection, it may be caused by the order of JOIN and ON clauses.
Here is a clear and correct example.
{
  "db_id": "bike_1",
  "query": "SELECT count(*) FROM station AS T1 JOIN trip AS T2 JOIN station AS T3 JOIN trip AS T4 ON T1.id = T2.start_station_id AND T2.id = T4.id AND T3.id = T4.end_station_id WHERE T1.city = \"Mountain View\" AND T3.city = \"Palo Alto\"",
  "sql": {
    "except": null,
    "from": {
      "conds": [
        [false, 2, [0, [0, 1, false], null], [0, 16, false], null],
        "and",
        [false, 2, [0, [0, 12, false], null], [0, 12, false], null],
        "and",
        [false, 2, [0, [0, 1, false], null], [0, 19, false], null]
      ],
      "table_units": [["table_unit", 0], ["table_unit", 2], ["table_unit", 0], ["table_unit", 2]]
    },
    ...
  }
},
Please let me know if I have misunderstood something about the parsed results.
Many thanks!
Is there a function to convert the SQL parse trees in the data release back to a SQL query?
Hi,
Can you please share any tips for converting the SQL (and WHERE clauses) of WikiSQL dataset queries to the Spider format?
Thanks,
Anshu