Comments (17)

zeroshade commented on August 16, 2024

Ad 1. ArrowStreamLoader, which is a return value of QueryArrowStream, has a function JSONData() [][]*string. Maybe you can decide based on these values? If not, we can add an additional function returning some info about the types.

The problem there is that it doesn't correctly handle the case where the JSON data is chunked. It returns the JSON data from the initial response, and it's still difficult to tell when you need to call JSONData vs. using the Arrow interface, since that is entirely dependent on the query that was sent, which the caller doesn't know. This inconsistency, where metadata queries can't return Arrow when requested, is problematic because there's no way for callers like ADBC to know whether the query supported Arrow or not.

Ad 2. Not yet, but we are planning to add this feature to the remaining drivers, probably later this year. Customers ask for it.

You will likely run into this problem again when you do that, as customers are almost certainly going to dislike the inconsistency. Personally, I think the preferred solution here would be to add support for returning Arrow for these metadata queries.

But for now, at minimum, something should be added to ArrowStreamLoader to explicitly denote whether a consumer needs to use JSONData and whether they should expect JSON data from the batches.
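To illustrate, here is a purely hypothetical sketch of what such an addition could look like; gosnowflake has no ResultFormat method today, and the names resultFormat, formatArrow, formatJSON, and consume are invented for this sketch. It only shows the kind of signal a consumer such as ADBC would need in order to branch without guessing from the query text.

package main

// Hypothetical sketch only: gosnowflake does not currently expose anything like this.
type resultFormat int

const (
	formatArrow resultFormat = iota // batches carry Arrow IPC streams
	formatJSON                      // batches carry chunked JSON rows
)

// arrowStreamLoader mirrors the rough shape of gosnowflake's ArrowStreamLoader,
// with the hypothetical ResultFormat method added.
type arrowStreamLoader interface {
	JSONData() [][]*string
	ResultFormat() resultFormat // proposed addition, not a real API
}

func consume(loader arrowStreamLoader) {
	switch loader.ResultFormat() {
	case formatArrow:
		// read Arrow record batches via GetBatches()/GetStream()
	case formatJSON:
		// fall back to JSONData() plus JSON-encoded chunks
	}
}

func main() {}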

sfc-gh-dszmolka commented on August 16, 2024

Hi and thank you for submitting this issue, I'll take a look sometime early next week. If the issue is very urgent or impactful, as suggested during issue creation, please create a case with Snowflake Support instead.

as a quick preliminary analysis:

  1. I'm already quite certain we can rule this out, but to be absolutely sure the issue is not caused by a connectivity problem to the cloud storage the query results are coming from, can you please run SELECT SYSTEM$ALLOWLIST() (or ALLOWLIST_PRIVATELINK() if that's your use case) and make sure that all endpoints returned in the STAGE category are in fact reachable from the Windows host where you have the problem?
    Alternatively, you can test this by running something like SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER LIMIT 400000 with gosnowflake and seeing whether the result is returned successfully (as Arrow, that is, since this just tests connectivity rather than Arrow vs. JSON result sets and their size).

  2. If you have runnable reproduction code which, when executed, exhibits the problem you're seeing, that would be truly great and appreciated, and would likely speed up the debugging effort a lot.

  3. I took a quick look at the linked ADBC issue, and user zeroshade might be right here:

[..]snowflake doesn't return the results of SHOW TABLES using Arrow, it returns it as JSON that we have to parse and manipulate into Arrow. This is likely an issue that needs to be addressed on the server side of Snowflake

If the issue is indeed proven to be on the server side, then it's independent of the gosnowflake (or any other) driver and it might take longer to fix.
But I don't want to go there yet; instead I'd like to understand the issue better when I have a chance early next week.

sfc-gh-dszmolka commented on August 16, 2024

Did a test on Debian 11 to see how gosnowflake handles a large result set for SHOW TABLES, because it seemed very odd that no one else has reported such an issue (so far) for such basic functionality.

If there's any issue in the JSON result set parsing, it should be visible regardless of OS, but to be sure I'll try to test this on Windows too.
Edit: the same script (below) works on Windows 10 as well, using nothing but vanilla gosnowflake 1.9.0.

Reproduction:

  1. created a new account on the same deployment as the account in your example, and created 3000 tables, each with one INT column
  2. used the following simple script with vanilla gosnowflake 1.9.0, without ADBC:
# cat gosnowflake1110.go 
package main

import (
	"database/sql"
	"flag"
	"fmt"
	"log"

	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	if !flag.Parsed() {
		flag.Parse()
	}

	cfg, err := sf.GetConfigFromEnv([]*sf.ConfigParam{
		{Name: "Account", EnvName: "SNOWFLAKE_TEST_ACCOUNT", FailOnMissing: true},
		{Name: "Role", EnvName: "SNOWFLAKE_TEST_ROLE", FailOnMissing: true},
		{Name: "User", EnvName: "SNOWFLAKE_TEST_USER", FailOnMissing: true},
		{Name: "Password", EnvName: "SNOWFLAKE_TEST_PASSWORD", FailOnMissing: true},
		{Name: "Warehouse", EnvName: "SNOWFLAKE_TEST_WAREHOUSE", FailOnMissing: true},
		{Name: "Database", EnvName: "SNOWFLAKE_TEST_DATABASE", FailOnMissing: true},
		{Name: "Schema", EnvName: "SNOWFLAKE_TEST_SCHEMA", FailOnMissing: true},
	})
	if err != nil {
		log.Fatalf("failed to create Config, err: %v", err)
	}
	//cfg.Tracing = "DEBUG"
	dsn, err := sf.DSN(cfg)
	if err != nil {
		log.Fatalf("failed to create DSN from Config: %v, err: %v", cfg, err)
	}

	db, err := sql.Open("snowflake", dsn)
	if err != nil {
		log.Fatalf("failed to connect. %v, err: %v", dsn, err)
	}
	defer db.Close()
	query := "SHOW TABLES LIMIT 2999;"
	rows, err := db.Query(query) // no cancel is allowed
	if err != nil {
		log.Fatalf("failed to run a query. %v, err: %v", query, err)
	}
	defer rows.Close()
	var v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23 sql.NullString
	for rows.Next() {
		err := rows.Scan(&v1, &v2, &v3, &v4, &v5, &v6, &v7, &v8, &v9, &v10, &v11, &v12, &v13, &v14, &v15, &v16, &v17, &v18, &v19, &v20, &v21, &v22, &v23)
		if err != nil {
			log.Fatalf("failed to get result. err: %v", err)
		}
		fmt.Println(v2)
	}
	if rows.Err() != nil {
		fmt.Printf("ERROR: %v\n", rows.Err())
		return
	}
	fmt.Printf("Congrats! You have successfully run %v with Snowflake DB!\n", query)
}

Result: expected tables are returned:

# go run gosnowflake1110.go
{TESTTABLE1 true}
{TESTTABLE10 true}
{TESTTABLE100 true}
{TESTTABLE1000 true}
{TESTTABLE1001 true}
{TESTTABLE1002 true}
{TESTTABLE1003 true}
..
{TESTTABLE997 true}
{TESTTABLE998 true}
Congrats! You have successfully run SHOW TABLES LIMIT 2999; with Snowflake DB!

When adding cfg.Tracing = "DEBUG" to see verbose logs, it's visible that the result set now uses cloud storage (since it's a fairly big query result), indeed in JSON format:

time="2024-04-22T08:28:11Z" level=debug msg="chunks: 1, total bytes: 427543" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:11Z" level=debug msg="add chunk to channel ChunksChan: 1, URL: https://eevofusfcb1stg.blob.core.windows.net/results/<queryid>%2Fmain%2Fdata_1?s..&rsce=gzip, RowCount: 2381, UncompressedSize: 427543, ChunkResultFormat: json" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"

That chunk, as the console output confirms, is successfully downloaded and parsed:

time="2024-04-22T08:28:12Z" level=debug msg="response returned chunk: 1 for URL: https://eevofusfcb1stg.blob.core.windows.net/results/<queryid>d_0%2Fmain%2Fdata_1?..&rsce=gzip" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=debug msg="decoded 2381 rows w/ 427543 bytes in 18.907967ms (chunk 1)" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=debug msg="ready: chunk 1" func="gosnowflake.(*defaultLogger).Debugf" file="log.go:148"
time="2024-04-22T08:28:12Z" level=info msg="no more download" func="gosnowflake.(*defaultLogger).Info" file="log.go:216"

As a next step, could you please:

  1. use gosnowflake without ADBC to see if the issue still reproduces for you, maybe even with the above simple repro script
  2. kindly provide a runnable reproduction (plus, if necessary, the environment setup) which, when run, leads to the issue you're seeing

I'm asking because so far, using only the Snowflake Go Driver, SHOW TABLES does not seem to have any problems, even with a big result set / large number of tables.

davidhcoe commented on August 16, 2024

I can't run it as is, I get the error:

2024/04/22 11:35:08 failed to get result. err: sql: expected 22 destination arguments in Scan, not 23

But I can run it if I drop v23.

davidhcoe commented on August 16, 2024

But that call is also just a JSON call with no Arrow context to it. I will try to get a closer reproduction.

sfc-gh-dszmolka commented on August 16, 2024

I can't run it as is, I get the error:

2024/04/22 11:35:08 failed to get result. err: sql: expected 22 destination arguments in Scan, not 23

But I can run it if I drop v23.

Apparently you have one column fewer in your response for some reason; maybe we're on different versions or some settings differ between our accounts.

Being able to run it without the last column means it is possible to retrieve the 'big' result set for SHOW TABLES in your account when using the gosnowflake library without ADBC.
Indeed it's a JSON result set, and that's what we wanted to see.

It would be very much appreciated if you could please provide a repro; as you can see, the issue does not reproduce with a gosnowflake-only test - at least for now - but I would like to see how it breaks for you.

davidhcoe commented on August 16, 2024

not quite the repro, but this is closer to the scenario:

package main

import (
	"context"
	"database/sql"
	"flag"
	"fmt"
	"log"

	"github.com/apache/arrow/go/v15/arrow/memory"
	"github.com/snowflakedb/gosnowflake"
	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	if !flag.Parsed() {
		flag.Parse()
	}

	cfg, err := sf.GetConfigFromEnv([]*sf.ConfigParam{
		{Name: "Account", EnvName: "SNOWFLAKE_TEST_ACCOUNT", FailOnMissing: true},
		{Name: "Role", EnvName: "SNOWFLAKE_TEST_ROLE", FailOnMissing: true},
		{Name: "User", EnvName: "SNOWFLAKE_TEST_USER", FailOnMissing: true},
		{Name: "Password", EnvName: "SNOWFLAKE_TEST_PASSWORD", FailOnMissing: true},
		{Name: "Warehouse", EnvName: "SNOWFLAKE_TEST_WAREHOUSE", FailOnMissing: true},
		{Name: "Database", EnvName: "SNOWFLAKE_TEST_DATABASE", FailOnMissing: true},
		{Name: "Schema", EnvName: "SNOWFLAKE_TEST_SCHEMA", FailOnMissing: true},
	})
	if err != nil {
		log.Fatalf("failed to create Config, err: %v", err)
	}

	//cfg.Tracing = "DEBUG"
	dsn, err := sf.DSN(cfg)
	if err != nil {
		log.Fatalf("failed to create DSN from Config: %v, err: %v", cfg, err)
	}

	db, err := sql.Open("snowflake", dsn)
	if err != nil {
		log.Fatalf("failed to connect. %v, err: %v", dsn, err)
	}
	defer db.Close()

	ctx := context.Background()
	ctx = gosnowflake.WithArrowAllocator(
		gosnowflake.WithArrowBatches(ctx), memory.DefaultAllocator)

	query := "SHOW TABLES LIMIT 2999;"
	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		log.Fatalf("failed to run a query. %v, err: %v", query, err)
		return
	}
	defer rows.Close()
	var v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22 sql.NullString

	for rows.Next() {
		err := rows.Scan(&v1, &v2, &v3, &v4, &v5, &v6, &v7, &v8, &v9, &v10, &v11, &v12, &v13, &v14, &v15, &v16, &v17, &v18, &v19, &v20, &v21, &v22)
		if err != nil {
			log.Fatalf("failed to get result. err: %v", err)
		}
		fmt.Println(v2)
	}
	if rows.Err() != nil {
		fmt.Printf("ERROR: %v\n", rows.Err())
		return
	}
	fmt.Printf("Congrats! You have successfully run %v with Snowflake DB!\n", query)
}

results in the error:

panic: close of nil channel

goroutine 1 [running]:
github.com/snowflakedb/gosnowflake.(*snowflakeChunkDownloader).next(0xc000002300)
	C:/Users/davidcoe/go/pkg/mod/github.com/snowflakedb/gosnowflake@v1.9.0/chunk_downloader.go:221 +0x20b
github.com/snowflakedb/gosnowflake.(*snowflakeRows).Next(0xc000508e70, {0xc00047c160, 0x16, 0x16})
	C:/Users/davidcoe/go/pkg/mod/github.com/snowflakedb/gosnowflake@v1.9.0/rows.go:178 +0xb9
database/sql.(*Rows).nextLocked(0xc00079cd80)
	C:/Program Files/Go/src/database/sql/sql.go:3019 +0x1f4
database/sql.(*Rows).Next.func1()
	C:/Program Files/Go/src/database/sql/sql.go:2994 +0x2e
database/sql.withLock({0x1a8a840, 0xc00079cdb8}, 0xc0000d7690)
	C:/Program Files/Go/src/database/sql/sql.go:3502 +0xc2
database/sql.(*Rows).Next(0xc00079cd80)
	C:/Program Files/Go/src/database/sql/sql.go:2993 +0xf7
main.main()
	c:/dev/test/AdbcTestDefinitions/Snowflake/SHOW_ISSUE/Repro-Show.go:58 +0x1409

sfc-gh-dszmolka commented on August 16, 2024

thank you for the repro and details!

ctx = gosnowflake.WithArrowAllocator(
		gosnowflake.WithArrowBatches(ctx), memory.DefaultAllocator)
..
rows, err := db.QueryContext(ctx, query)

will not work with SHOW queries (and some other queries which return a JSON result set, like ls @stagename).

This is documented in the Supported Data Types section of the driver documentation:

[..]Limitations:

  1. For some queries Snowflake may decide to return data in JSON format (examples: SHOW PARAMETERS or ls @stage). You cannot use JSON with Arrow batches context.

You seem to be hitting this expected behaviour.

Edited to add possible ways forward:

  1. use WithArrowBatches, but then GetArrowBatches needs to be used to get the results, instead of retrieving them the 'regular' way. Example (again from the documentation; a fuller sketch follows this list):
batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
  2. or don't force Arrow on the JSON batches and get the rows the way you're trying to:
rows, err := db.QueryContext(ctx, query)
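For reference, a minimal, self-contained sketch of option 1, modelled on the pattern in the gosnowflake documentation. The DSN placeholder and the query are illustrative only; the conn.Raw unwrapping, WithArrowBatches/WithArrowAllocator, SnowflakeRows.GetArrowBatches, and ArrowBatch.Fetch calls reflect my reading of the driver's public API and should be verified against the gosnowflake version in use. Note it only works for queries the server answers in Arrow; SHOW queries would still hit the JSON limitation quoted above.

package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"log"

	"github.com/apache/arrow/go/v15/arrow/memory"
	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	db, err := sql.Open("snowflake", "<dsn>") // assumes a valid DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Request Arrow batches and set the allocator on the context.
	ctx := sf.WithArrowAllocator(sf.WithArrowBatches(context.Background()), memory.DefaultAllocator)

	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// database/sql hides the driver rows, so unwrap via Raw to reach
	// gosnowflake's SnowflakeRows and its GetArrowBatches method.
	err = conn.Raw(func(x any) error {
		rows, err := x.(driver.QueryerContext).QueryContext(ctx, "SELECT 1", nil)
		if err != nil {
			return err
		}
		defer rows.Close()

		batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
		if err != nil {
			return err
		}
		for _, batch := range batches {
			records, err := batch.Fetch() // downloads and decodes one chunk
			if err != nil {
				return err
			}
			for _, rec := range *records {
				log.Printf("got %d rows x %d cols", rec.NumRows(), rec.NumCols())
				rec.Release()
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}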

davidhcoe commented on August 16, 2024

The problem is that the call isn't consistent. When the call returns a smaller number of rows, it runs correctly and the JSON can be parsed.

The code comment // Some queries from Snowflake do not return Arrow data regardless indicates that JSONData should be populated, and that is parsed in https://github.com/apache/arrow-adbc/blob/93dda626daf52a15673cc08cdd1ef69b85601ecc/go/adbc/driver/snowflake/record_reader.go#L363. Tagging @zeroshade in case he has some more thoughts here.

zeroshade commented on August 16, 2024

I think I know what's happening.

If you have enough rows in your output, Snowflake will chunk the data, returning the first chunk with the initial response and requiring you to fetch the remaining results as separate requests. It looks like snowflakeArrowStreamChunkDownloader isn't handling this scenario properly in a way that lets it report this to the consumer.

In the ADBC code that @davidhcoe linked to, we check the number of batches returned by GetBatches, because the downloader will simply skip the first batch if it was solely JSON. But since there are subsequent expected batches, albeit as JSON data, it still reports the extra batches, and ArrowStreamBatch's assumption of receiving Arrow data is no longer valid for those batches.

For consistency, I think the ideal scenario here would be for gosnowflake's ArrowStreamBatch to be updated to simply check whether the batch is JSON rather than Arrow, and just convert the JSON to Arrow like we do in the ADBC code that was linked. This would remove the burden on consumers of having to check, based on the query, whether or not they actually got Arrow data back, and would seamlessly provide Arrow data if that's what the consumer asked for.

sfc-gh-pfus commented on August 16, 2024

Hi @zeroshade! My input from a driver developer's perspective:

  1. QueryArrowStream should be used only for Arrow requests. It shouldn't be used if JSON is returned (JSON is set in the session as the default response data format, we ask for metadata via SHOW queries, or we use multi-statements, which don't support Arrow).
  2. We don't want to convert JSON to Arrow in the drivers. Currently the drivers don't create Arrow on their own (we only reuse existing Arrow arrays). Doing the conversion would have some important drawbacks:
    a) We would have to copy the backend's conversion-to-Arrow logic for all types. This is a huge effort, bearing in mind all the (sometimes complicated) types, like timestamps or the upcoming structured types.
    b) We would have to follow all the new types that are introduced every now and then and keep the drivers updated.
    c) We would have to think about the points above in all the drivers, of which there are now seven.

zeroshade commented on August 16, 2024

@sfc-gh-pfus

  1. Since ADBC just forwards query strings to gosnowflake, there's no way for us to know whether a query is a metadata "SHOW" query, etc. This means that, at minimum, QueryArrowStream is going to need something added to it which allows indicating that the results are JSON and not Arrow. The current attempt I made, just checking the length of the batches slice returned from GetBatches, isn't sufficient, since JSON results can come back as chunked batches too (which is the cause of the initial problem).
  2. Do all the drivers currently allow exporting Arrow data? It was my understanding that this driver and the Python connector were the only ones that make Arrow externally accessible.

sfc-gh-pfus commented on August 16, 2024

@zeroshade
Ad 1. ArrowStreamLoader, which is a return value of QueryArrowStream, has a function JSONData() [][]*string. Maybe you can decide based on these values? If not, we can add an additional function returning some info about the types.
Ad 2. Not yet, but we are planning to add this feature to the remaining drivers, probably later this year. Customers ask for it.

ianmcook commented on August 16, 2024

FWIW: In general, converting a result set from JSON format to Arrow format is a nontrivial operation for a client or driver library to have to perform. It requires using dependencies which can be heavy and can behave subtly differently depending on what language implementation you're using. Client-side conversion also creates greater risk of inconsistency because the logic for converting Snowflake types to Arrow types can be complicated. +1 to Matt's suggestion to always convert to Arrow on the server side whenever possible. This is better for keeping connectors lightweight and for ensuring consistency across connectors.

sfc-gh-pfus commented on August 16, 2024

@ianmcook thanks for your input! However, the problem on the backend is (to the best of my knowledge) similar to the problem in the drivers - there is one component that prepares the actual table data (and this component can serialise Arrow) and another one which returns metadata (and this one probably can't, but I'm not sure).

@zeroshade I analysed the code a bit and I'm wondering about one thing. Currently the process looks like this:

  1. Call the query.
  2. Check if JSONData is empty -> if it is, everything should be available via GetBatches.
  3. Call GetBatches and iterate over each batch, downloading the batch content using GetStream. This function returns just bytes.
  4. On the ADBC side, read the bytes into Arrow records.

What do you think about this approach instead (a consumer-side sketch of the flow follows below)?

  1. Call the query.
  2. Check if JSONData is empty -> if it is not, it means the remaining batches should also be JSON.
  3. Same as above.
  4. On the ADBC side, get JSONData and deserialise the remaining batches' bytes as JSON instead of Arrow.

What do you think?
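To make the flow above concrete, here is a minimal consumer-side sketch. The streamLoader/streamBatch interfaces only approximate the shape of gosnowflake's ArrowStreamLoader/ArrowStreamBatch as used from ADBC's record_reader.go (the real interfaces have more methods and slightly different signatures), so treat this as an illustration rather than the driver's actual API.

package main

import (
	"bytes"
	"context"
	"fmt"
	"io"

	"github.com/apache/arrow/go/v15/arrow/ipc"
)

// streamBatch and streamLoader approximate the shape of gosnowflake's
// ArrowStreamBatch / ArrowStreamLoader; the real interfaces have more methods.
type streamBatch interface {
	GetStream(ctx context.Context) (io.ReadCloser, error)
}

type streamLoader interface {
	JSONData() [][]*string
	GetBatches() ([]streamBatch, error)
}

func consume(ctx context.Context, loader streamLoader) error {
	// Step 2: if JSONData is non-empty, the server answered in JSON.
	// Per the discussion above, any additional chunked batches are then
	// also JSON, not Arrow, so decoding them as IPC would fail.
	if len(loader.JSONData()) > 0 {
		fmt.Println("JSON result: rows are in JSONData(); extra chunks are JSON too")
		return nil
	}

	// Steps 3-4: otherwise each batch is an Arrow IPC stream of record batches.
	batches, err := loader.GetBatches()
	if err != nil {
		return err
	}
	for _, b := range batches {
		rc, err := b.GetStream(ctx)
		if err != nil {
			return err
		}
		raw, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return err
		}
		rdr, err := ipc.NewReader(bytes.NewReader(raw))
		if err != nil {
			return err
		}
		for rdr.Next() {
			rec := rdr.Record()
			fmt.Println("arrow record:", rec.NumRows(), "rows")
		}
		rdr.Release()
	}
	return nil
}

func main() {}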

zeroshade commented on August 16, 2024

On ADBC side - get JSONData and remaining batches bytes deserialise to JSON instead of arrow?
What do you think?

I was toying with this a bit, and I can do it, but it's a terrible solution for the general case, mostly because the chunks aren't actually valid JSON. They are a series of JSON arrays separated by commas:

["a", "b", "c", "d"], ["e", "f", "g", "h"], ....

There is no top-level structure, as the JSON spec would require. Looking at the gosnowflake code, there's an entire section devoted to custom-parsing this JSON, because you can't just deserialize it as normal JSON.

It also means that if Snowflake ever changes how it formats the JSON for the large chunks, it will break us, as we would be relying on an undocumented, non-standard way of representing the JSON data. While this is likely a viable workaround for this situation, I don't think it is a viable long-term solution.
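For concreteness, the workaround under discussion amounts to wrapping a chunk in brackets so it becomes a valid JSON array of arrays. A minimal sketch, assuming chunks look exactly like the example above (the real ADBC/gosnowflake handling is more careful, and this relies on the undocumented chunk format staying as it is):

package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// decodeChunk turns one raw Snowflake JSON result chunk - a comma-separated
// run of JSON arrays with no enclosing brackets - into rows of nullable
// strings by wrapping it in [ ... ] first.
func decodeChunk(raw []byte) ([][]*string, error) {
	wrapped := make([]byte, 0, len(raw)+2)
	wrapped = append(wrapped, '[')
	wrapped = append(wrapped, raw...)
	wrapped = append(wrapped, ']')

	var rows [][]*string // *string so that SQL NULLs decode to nil
	if err := json.Unmarshal(wrapped, &rows); err != nil {
		return nil, err
	}
	return rows, nil
}

func main() {
	chunk := []byte(`["a", "b", "c", "d"], ["e", "f", "g", "h"]`)
	rows, err := decodeChunk(chunk)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows:", len(rows), "first cell:", *rows[0][0])
}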

sfc-gh-pfus commented on August 16, 2024

Sorry for the late reply, I was out the previous week.
I'm glad that you implemented a workaround and that it works! I fully agree that this is not a good long-term solution, but with this workaround in place we can plan proper changes on the backend (I believe so). We will discuss it internally, and I'll let you know when I have some information.
One last thought - as filthy and ugly as the current format is, in my opinion it's still not going to change in any foreseeable future, so your workaround is safe for now ;)
