Comments (17)
Ad 1. ArrowStreamLoader, which is a return value of QueryArrowStream, has a function JSONData() [][]*string. Maybe you can decide based on these values? If not, we can add an additional function returning some info about the types.
The problem there is that it doesn't correctly handle the case where the JSON data is chunked. It returns the JSON data from the initial response, and it's still difficult to tell when you need to call JSONData vs. using the Arrow interface, since that depends entirely on the query that was sent in, which the caller doesn't know. This inconsistency, where metadata queries can't return Arrow when requested, is problematic because there's no way for callers like ADBC to know whether the query supported Arrow or not.
Ad 2. Not yet, but we are planning to add this feature to the remaining drivers, probably later this year. Customers ask for it.
You will likely run into this problem again when you do that, as customers are almost certainly going to dislike the inconsistency. Personally, my preferred solution here would be to add support for returning Arrow for these metadata queries. But for now, at minimum, something should be added to ArrowStreamLoader to explicitly denote whether or not a consumer needs to use JSONData and whether they should expect JSON data in the batches.
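To make that request concrete, here is a minimal sketch of what such an addition could look like. All of the names below (resultFormat, ResultFormat, fakeLoader, describe) are invented for illustration; nothing like this exists in gosnowflake today:

```go
package main

import "fmt"

// Hypothetical sketch: an extension to gosnowflake's ArrowStreamLoader
// that would let a consumer discover, up front, whether the result
// batches carry Arrow IPC data or raw JSON rows. All names here are
// invented for illustration; they are not part of gosnowflake.
type resultFormat int

const (
    formatArrow resultFormat = iota
    formatJSON
)

// arrowStreamLoaderExt is what a consumer like ADBC would ideally see.
type arrowStreamLoaderExt interface {
    JSONData() [][]*string
    // ResultFormat reports whether batches contain Arrow or JSON,
    // so callers no longer have to guess from the query text.
    ResultFormat() resultFormat
}

// fakeLoader stands in for the driver's real loader in this sketch.
type fakeLoader struct {
    rows   [][]*string
    format resultFormat
}

func (f *fakeLoader) JSONData() [][]*string      { return f.rows }
func (f *fakeLoader) ResultFormat() resultFormat { return f.format }

func describe(l arrowStreamLoaderExt) string {
    if l.ResultFormat() == formatJSON {
        return fmt.Sprintf("JSON result, %d inline rows", len(l.JSONData()))
    }
    return "Arrow result"
}

func main() {
    show := &fakeLoader{rows: make([][]*string, 3), format: formatJSON}
    fmt.Println(describe(show)) // JSON result, 3 inline rows
}
```

With something of this shape, a caller could branch on ResultFormat() instead of inferring the format from the query it happened to send.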
from gosnowflake.
hi, and thank you for submitting this issue, i'll take a look sometime early next week. If the issue is very urgent or impactful, as suggested during issue creation: please create a case with Snowflake Support instead.
as a quick preliminary analysis:
- i'm already quite certain we can rule this out, but to be absolutely sure the issue is not caused by a theoretical connectivity issue to the cloud storage from which the query results are coming, can you please run SELECT SYSTEM$ALLOWLIST() (or ALLOWLIST_PRIVATELINK() if that's your use case), and make sure that all endpoints returned in the STAGE category are in fact reachable from the Windows host on which you have the problem? Alternatively, you can test this by running something like SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER LIMIT 400000 with gosnowflake and see if the result is successfully returned (as Arrow, that is, since this is just testing the connectivity and not Arrow vs. JSON resultsets and their size)
- if you have runnable reproduction code which, when executed, exhibits the problem you're seeing, that would be truly great and appreciated, and would likely speed up the debugging effort a lot
- took a quick look at the linked ADBC issue, and user zeroshade might be right here:
[..] snowflake doesn't return the results of SHOW TABLES using Arrow, it returns it as JSON that we have to parse and manipulate into Arrow. This is likely an issue that needs to be addressed on the server side of Snowflake
if the issue is indeed proven to be on the server side, then it's independent of the gosnowflake (or any other) driver, and it might take a longer period to fix. But I don't want to go there yet; instead, I want to understand the issue better when I have a chance early next week.
did a test on Debian 11 to see how gosnowflake handles a large resultset for SHOW TABLES, because it was very weird that no one else had reported such an issue (so far) for such basic functionality. If there's any issue in the JSON resultset parsing, it should be visible regardless of OS, but to be sure I'll try to test this on Windows too.
edit: the same script (below) works the same on Windows 10 as well, using nothing but the vanilla gosnowflake 1.9.0
Reproduction:
- created a new account on the same deployment as your account in the example, and 3000 tables with one int column
- used the following simple script with the vanilla gosnowflake 1.9.0, without ADBC:
# cat gosnowflake1110.go
package main

import (
    "database/sql"
    "flag"
    "fmt"
    "log"

    sf "github.com/snowflakedb/gosnowflake"
)

func main() {
    if !flag.Parsed() {
        flag.Parse()
    }
    cfg, err := sf.GetConfigFromEnv([]*sf.ConfigParam{
        {Name: "Account", EnvName: "SNOWFLAKE_TEST_ACCOUNT", FailOnMissing: true},
        {Name: "Role", EnvName: "SNOWFLAKE_TEST_ROLE", FailOnMissing: true},
        {Name: "User", EnvName: "SNOWFLAKE_TEST_USER", FailOnMissing: true},
        {Name: "Password", EnvName: "SNOWFLAKE_TEST_PASSWORD", FailOnMissing: true},
        {Name: "Warehouse", EnvName: "SNOWFLAKE_TEST_WAREHOUSE", FailOnMissing: true},
        {Name: "Database", EnvName: "SNOWFLAKE_TEST_DATABASE", FailOnMissing: true},
        {Name: "Schema", EnvName: "SNOWFLAKE_TEST_SCHEMA", FailOnMissing: true},
    })
    if err != nil {
        log.Fatalf("failed to create Config, err: %v", err)
    }
    //cfg.Tracing = "DEBUG"
    dsn, err := sf.DSN(cfg)
    if err != nil {
        log.Fatalf("failed to create DSN from Config: %v, err: %v", cfg, err)
    }
    db, err := sql.Open("snowflake", dsn)
    if err != nil {
        log.Fatalf("failed to connect. %v, err: %v", dsn, err)
    }
    defer db.Close()
    query := "SHOW TABLES LIMIT 2999;"
    rows, err := db.Query(query) // no cancel is allowed
    if err != nil {
        log.Fatalf("failed to run a query. %v, err: %v", query, err)
    }
    defer rows.Close()
    var v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23 sql.NullString
    for rows.Next() {
        err := rows.Scan(&v1, &v2, &v3, &v4, &v5, &v6, &v7, &v8, &v9, &v10, &v11, &v12, &v13, &v14, &v15, &v16, &v17, &v18, &v19, &v20, &v21, &v22, &v23)
        if err != nil {
            log.Fatalf("failed to get result. err: %v", err)
        }
        fmt.Println(v2)
    }
    if rows.Err() != nil {
        fmt.Printf("ERROR: %v\n", rows.Err())
        return
    }
    fmt.Printf("Congrats! You have successfully run %v with Snowflake DB!\n", query)
}
Result: expected tables are returned:
# go run gosnowflake1110.go
{TESTTABLE1 true}
{TESTTABLE10 true}
{TESTTABLE100 true}
{TESTTABLE1000 true}
{TESTTABLE1001 true}
{TESTTABLE1002 true}
{TESTTABLE1003 true}
..
{TESTTABLE997 true}
{TESTTABLE998 true}
Congrats! You have successfully run SHOW TABLES LIMIT 2999; with Snowflake DB!
when adding cfg.Tracing = "DEBUG" to see verbose logs, it's visible that the resultset now uses cloud storage (since it's a fairly big query result), indeed in json format:
time="2024-04-22T08:28:11Z" level=debug msg="chunks: 1, total bytes: 427543" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:11Z" level=debug msg="add chunk to channel ChunksChan: 1, URL: https://eevofusfcb1stg.blob.core.windows.net/results/<queryid>%2Fmain%2Fdata_1?s..&rsce=gzip, RowCount: 2381, UncompressedSize: 427543, ChunkResultFormat: json" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
which chunk, as the console output suggests, is successfully downloaded and parsed
time="2024-04-22T08:28:12Z" level=debug msg="response returned chunk: 1 for URL: https://eevofusfcb1stg.blob.core.windows.net/results/<queryid>d_0%2Fmain%2Fdata_1?..&rsce=gzip" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=debug msg="decoded 2381 rows w/ 427543 bytes in 18.907967ms (chunk 1)" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=debug msg="ready: chunk 1" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=info msg="no more download" func="gosnowflake.(*defaultLogger).Info" file="log.go:216"
As a next step, could you please:
- use gosnowflake without ADBC to see if the issue still reproduces for you, maybe even with the above simple repro script
- kindly provide the runnable reproduction (plus, if necessary, the environment setup) which, when run, leads to the issue you're seeing
Asking this because for now, using only the Snowflake Go Driver, SHOW TABLES does not seem to have any problems, even with a big resultset / large number of tables.
I can't run it as is, I get the error:
2024/04/22 11:35:08 failed to get result. err: sql: expected 22 destination arguments in Scan, not 23
But I can run it if I drop v23.
But, that call is also just a JSON call with no Arrow context to it. I will try to get a closer re-creation.
I can't run it as is, I get the error:
2024/04/22 11:35:08 failed to get result. err: sql: expected 22 destination arguments in Scan, not 23
But I can run it if I drop v23.
apparently you then have one column less in your response for some reason; maybe we're on different versions, or there are some settings that differ between our accounts.
Being able to run it without the last column means it is possible to retrieve the 'big' resultset for SHOW TABLES in your account when using the gosnowflake library without ADBC. Indeed it's a JSON resultset, and that's what we wanted to see.
It would be very much appreciated if you could please provide a repro; as you can see, the issue does not reproduce with a gosnowflake-only test, at least for now, but I would like to see how it breaks for you.
not quite the repro, but this is closer to the scenario:
package main

import (
    "context"
    "database/sql"
    "flag"
    "fmt"
    "log"

    "github.com/apache/arrow/go/v15/arrow/memory"
    "github.com/snowflakedb/gosnowflake"
    sf "github.com/snowflakedb/gosnowflake"
)

func main() {
    if !flag.Parsed() {
        flag.Parse()
    }
    cfg, err := sf.GetConfigFromEnv([]*sf.ConfigParam{
        {Name: "Account", EnvName: "SNOWFLAKE_TEST_ACCOUNT", FailOnMissing: true},
        {Name: "Role", EnvName: "SNOWFLAKE_TEST_ROLE", FailOnMissing: true},
        {Name: "User", EnvName: "SNOWFLAKE_TEST_USER", FailOnMissing: true},
        {Name: "Password", EnvName: "SNOWFLAKE_TEST_PASSWORD", FailOnMissing: true},
        {Name: "Warehouse", EnvName: "SNOWFLAKE_TEST_WAREHOUSE", FailOnMissing: true},
        {Name: "Database", EnvName: "SNOWFLAKE_TEST_DATABASE", FailOnMissing: true},
        {Name: "Schema", EnvName: "SNOWFLAKE_TEST_SCHEMA", FailOnMissing: true},
    })
    if err != nil {
        log.Fatalf("failed to create Config, err: %v", err)
    }
    //cfg.Tracing = "DEBUG"
    dsn, err := sf.DSN(cfg)
    if err != nil {
        log.Fatalf("failed to create DSN from Config: %v, err: %v", cfg, err)
    }
    db, err := sql.Open("snowflake", dsn)
    if err != nil {
        log.Fatalf("failed to connect. %v, err: %v", dsn, err)
    }
    defer db.Close()
    ctx := context.Background()
    ctx = gosnowflake.WithArrowAllocator(
        gosnowflake.WithArrowBatches(ctx), memory.DefaultAllocator)
    query := "SHOW TABLES LIMIT 2999;"
    rows, err := db.QueryContext(ctx, query)
    if err != nil {
        log.Fatalf("failed to run a query. %v, err: %v", query, err)
        return
    }
    defer rows.Close()
    var v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22 sql.NullString
    for rows.Next() {
        err := rows.Scan(&v1, &v2, &v3, &v4, &v5, &v6, &v7, &v8, &v9, &v10, &v11, &v12, &v13, &v14, &v15, &v16, &v17, &v18, &v19, &v20, &v21, &v22)
        if err != nil {
            log.Fatalf("failed to get result. err: %v", err)
        }
        fmt.Println(v2)
    }
    if rows.Err() != nil {
        fmt.Printf("ERROR: %v\n", rows.Err())
        return
    }
    fmt.Printf("Congrats! You have successfully run %v with Snowflake DB!\n", query)
}
results in the error:
panic: close of nil channel
goroutine 1 [running]:
github.com/snowflakedb/gosnowflake.(*snowflakeChunkDownloader).next(0xc000002300)
C:/Users/davidcoe/go/pkg/mod/github.com/snowflakedb/[email protected]/chunk_downloader.go:221 +0x20b
github.com/snowflakedb/gosnowflake.(*snowflakeRows).Next(0xc000508e70, {0xc00047c160, 0x16, 0x16})
C:/Users/davidcoe/go/pkg/mod/github.com/snowflakedb/[email protected]/rows.go:178 +0xb9
database/sql.(*Rows).nextLocked(0xc00079cd80)
C:/Program Files/Go/src/database/sql/sql.go:3019 +0x1f4
database/sql.(*Rows).Next.func1()
C:/Program Files/Go/src/database/sql/sql.go:2994 +0x2e
database/sql.withLock({0x1a8a840, 0xc00079cdb8}, 0xc0000d7690)
C:/Program Files/Go/src/database/sql/sql.go:3502 +0xc2
database/sql.(*Rows).Next(0xc00079cd80)
C:/Program Files/Go/src/database/sql/sql.go:2993 +0xf7
main.main()
c:/dev/test/AdbcTestDefinitions/Snowflake/SHOW_ISSUE/Repro-Show.go:58 +0x1409
thank you for the repro and details!
ctx = gosnowflake.WithArrowAllocator(
    gosnowflake.WithArrowBatches(ctx), memory.DefaultAllocator)
..
rows, err := db.QueryContext(ctx, query)
will not work with SHOW queries (and some other queries which return a JSON resultset, like ls @stagename). This is documented in the Supported Data Types section of the driver documentation:
[..] Limitations:
- For some queries Snowflake may decide to return data in JSON format (examples: SHOW PARAMETERS or ls @stage). You cannot use JSON with Arrow batches context.
You seem to be hitting this expected behaviour.
edited for a possible way forward:
- keep using WithArrowBatches, but then GetArrowBatches needs to be used to get the results, instead of fetching them the 'regular' way. Example (again from the documentation):
batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
- or don't force Arrow on the JSON batches, and get the rows as you're trying to:
rows, err := db.QueryContext(ctx, query)
The problem is that the call isn't consistent. When the call has the lower number of results, it runs correctly, and the JSON can be parsed.
Line 651 in 088150c
I think I know what's happening.
If you have enough rows in your output, snowflake will chunk the data, returning the first chunk with the initial response and requiring you to fetch the remaining results as separate requests. It looks like snowflakeArrowStreamChunkDownloader isn't handling this scenario properly in order to report it to the consumer.
In the ADBC code that @davidhcoe linked to, we check the number of batches returned by GetBatches, because the downloader will just skip the first batch if it was solely JSON. But since there are subsequent expected batches, albeit as JSON data, it still reports the extra batches, and ArrowStreamBatch's assumption of receiving Arrow data is no longer valid for those batches.
For consistency, I think the ideal scenario here would be for gosnowflake's ArrowStreamBatch to be updated to simply check whether the batch is JSON rather than Arrow, and just convert the JSON to Arrow like we do in the ADBC code that was linked. This would remove the burden on consumers of having to check whether they actually got Arrow data back based on the query, and would seamlessly provide the Arrow data if that's what the consumer wanted.
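To illustrate what such a conversion entails, without pulling in the Arrow library here, the sketch below shows only the row-to-columnar transposition step that converting JSONData() output into Arrow arrays would start with. The column type is invented for this sketch; real code would feed the values and validity mask into Arrow array builders (and would also need per-type value parsing, which is exactly the effort discussed below):

```go
package main

import "fmt"

// Simplified sketch of the row-to-columnar step a JSON-to-Arrow
// conversion must perform: gosnowflake's JSONData() yields rows as
// [][]*string (nil = SQL NULL), while Arrow builders want columns
// plus a validity mask. The column type here is illustrative only.
type column struct {
    values []string // value for valid slots, "" for nulls
    valid  []bool   // validity mask, as in Arrow
}

func rowsToColumns(rows [][]*string, ncols int) []column {
    cols := make([]column, ncols)
    for _, row := range rows {
        for i := 0; i < ncols; i++ {
            var cell *string
            if i < len(row) {
                cell = row[i]
            }
            if cell == nil {
                cols[i].values = append(cols[i].values, "")
                cols[i].valid = append(cols[i].valid, false)
            } else {
                cols[i].values = append(cols[i].values, *cell)
                cols[i].valid = append(cols[i].valid, true)
            }
        }
    }
    return cols
}

func main() {
    a, c := "TESTTABLE1", "TESTTABLE2"
    rows := [][]*string{{&a, nil}, {&c, nil}}
    cols := rowsToColumns(rows, 2)
    fmt.Println(cols[0].values, cols[1].valid) // [TESTTABLE1 TESTTABLE2] [false false]
}
```

The transposition itself is easy; the hard part, as the next comment points out, is converting each string cell to the correct Arrow type for every Snowflake type.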
Hi @zeroshade! My input from a driver developer's perspective:
- QueryArrowStream should be used only for Arrow requests. It shouldn't be used if JSON is returned (JSON is set in the session as the default response data format, we ask for metadata like SHOW queries, we use multistatements that don't support Arrow).
- We don't want to convert JSON to Arrow in the drivers. Currently drivers don't create Arrow on their own (we only reuse existing Arrow arrays). This has some important drawbacks:
a) We would have to copy the backend logic for conversion to Arrow of all types. This is a huge effort considering all the, sometimes complicated, types (like timestamps or the upcoming structured types).
b) We would have to follow all the new types that are introduced every now and then and keep the drivers updated.
c) We would have to think about the points above in all the drivers, of which there are now 7.
- Since ADBC just forwards query strings to gosnowflake, there's no way for us to know whether or not a query is a metadata "SHOW" query etc. Which means that, at minimum, QueryArrowStream is going to need something added to it which will allow indicating that the results are JSON and not Arrow. The current attempt I made, just checking the length of the batches slice returned from GetBatches, isn't sufficient, since JSON can return chunked batches too (which is the cause of the initial problem).
- Do all the drivers currently allow exporting Arrow data? It was my understanding that this driver and the Python connector were the only ones that make Arrow externally accessible.
@zeroshade
Ad 1. ArrowStreamLoader, which is a return value of QueryArrowStream, has a function JSONData() [][]*string. Maybe you can decide based on these values? If not, we can add an additional function returning some info about the types.
Ad 2. Not yet, but we are planning to add this feature to the remaining drivers, probably later this year. Customers ask for it.
FWIW: In general, converting a result set from JSON format to Arrow format is a nontrivial operation for a client or driver library to have to perform. It requires dependencies which can be heavy and which can behave subtly differently depending on which language implementation you're using. Client-side conversion also creates a greater risk of inconsistency, because the logic for converting Snowflake types to Arrow types can be complicated. +1 to Matt's suggestion to always convert to Arrow on the server side whenever possible. This is better for keeping connectors lightweight and for ensuring consistency across connectors.
@ianmcook thanks for your input! However, the problem on the backend is (to the best of my knowledge) similar to the problem in the drivers: there is one component that prepares the actual table data (and this component can serialise Arrow) and another one which returns metadata (and this one probably can't, but I'm not sure).
@zeroshade I analysed the code a bit and I'm wondering about one thing. Currently the process looks like this:
- Call query.
- Check if JSONData is empty -> it means that everything should be available in GetBatches.
- Call GetBatches and iterate over each batch, downloading the batch content using GetStream. This function returns just bytes.
- On the ADBC side - read the bytes into Arrow records.
What do you think about the following approach:
- Call query.
- Check if JSONData is empty -> if it is not, it means the batches should also be JSON.
- Same as above.
- On the ADBC side - get JSONData, and deserialise the remaining batches' bytes as JSON instead of Arrow?
What do you think?
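If I understand the proposal, the consumer-side decision rule would be roughly the following sketch. The helper name is mine, not a gosnowflake API:

```go
package main

import "fmt"

// Sketch of the decision rule proposed above: if the initial response
// carried any JSON rows, assume every subsequent chunk is JSON too,
// and route it to a JSON decoder instead of the Arrow IPC reader.
// batchesAreJSON is illustrative; gosnowflake exposes no such helper.
func batchesAreJSON(jsonData [][]*string) bool {
    return len(jsonData) > 0
}

func main() {
    v := "x"
    fmt.Println(batchesAreJSON(nil))               // false: expect Arrow IPC batches
    fmt.Println(batchesAreJSON([][]*string{{&v}})) // true: expect JSON chunks
}
```

The appeal of this rule is that it needs no new driver API; the drawback, raised in the next comment, is that the consumer then has to parse the non-standard JSON chunk format itself.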
On the ADBC side - get JSONData, and deserialise the remaining batches' bytes as JSON instead of Arrow? What do you think?
I was toying with this a bit, and I can do it, but it is a terrible solution for the general case, mostly because the chunks aren't actually valid JSON. They are a series of JSON arrays separated by commas:
["a", "b", "c", "d"], ["e", "f", "g", "h"], ....
with no top-level structure, as the JSON spec requires. Looking at the gosnowflake code, there's an entire section devoted to custom-parsing this JSON, because you can't just deserialize it as normal JSON.
It also means that if snowflake ever changes how it formats the JSON for the large chunks, it'll break us, as we'll be relying on an undocumented, non-standard way of representing the JSON data. While this is likely a viable workaround for this situation, I don't think it is a viable long-term solution.
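For what it's worth, the trick that makes such a chunk decodable as ordinary JSON is to wrap it in brackets so it becomes one valid top-level array. A minimal sketch, assuming the chunk format shown above (and relying on that undocumented format, which is exactly why it is fragile):

```go
package main

import (
    "encoding/json"
    "fmt"
)

// decodeChunk turns a Snowflake JSON result chunk, which is a bare
// comma-separated sequence of JSON arrays like
//   ["a","b"],["c",null]
// into rows of nullable strings by wrapping it in brackets so that it
// becomes one valid top-level JSON array. This depends on the current,
// undocumented chunk format and would break if that format changed.
func decodeChunk(chunk []byte) ([][]*string, error) {
    wrapped := make([]byte, 0, len(chunk)+2)
    wrapped = append(wrapped, '[')
    wrapped = append(wrapped, chunk...)
    wrapped = append(wrapped, ']')

    var rows [][]*string
    if err := json.Unmarshal(wrapped, &rows); err != nil {
        return nil, err
    }
    return rows, nil
}

func main() {
    rows, err := decodeChunk([]byte(`["a","b","c","d"],["e",null,"g","h"]`))
    if err != nil {
        panic(err)
    }
    fmt.Println(len(rows), *rows[0][0], rows[1][1] == nil) // 2 a true
}
```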
Sorry for the late reply, I was away last week.
I'm glad that you implemented a workaround and that it works! I fully agree that this is not a good long-term solution, but with this workaround in place we can plan proper changes on the backend (I believe so). We will discuss it internally, and when I have some info, I'll let you know.
One last thought: as filthy and ugly as the current format is, it is still not going to be changed in any foreseeable future, in my opinion, so your workaround is safe for now ;)