
Comments (23)

shivsood commented on May 9, 2024

A flag "schemaCheckEnabled" is now supported. When set to false, connector with skip the strict schema checks.

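A minimal sketch of a write that uses the flag (connection details and the table name are placeholders, not values from this thread):

  df.write \
      .format("com.microsoft.sqlserver.jdbc.spark") \
      .mode("append") \
      .option("url", url) \
      .option("dbtable", "dbo.MyTable") \
      .option("user", user) \
      .option("password", password) \
      .option("schemaCheckEnabled", "false") \
      .save()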

garawalid commented on May 9, 2024

I absolutely agree with @sdebruyn & @utsavjha
This is not a safe solution! It would be better to have a proper fix.


utsavjha commented on May 9, 2024

@shivsood Is that the official fix for this problem? That you simply stop raising the exception when the check fails? What happens when there is an incompatibility while writing?


sdebruyn commented on May 9, 2024

@shivsood that is not a good solution.
The schema should carry the same nullability as the database. Just disabling the check is not a solution; it's an easy workaround for you.


jbaker-dstl commented on May 9, 2024

The workaround I've been using is to explicitly set fields to nullable/not-nullable in the schema, for example:

  import org.apache.spark.sql.types._

  val schema = StructType(
    StructField("uid", StringType, false) ::
    StructField("date", DateType, true) ::
    StructField("type", StringType, true) ::  // Explicitly make nullable, due to issue #5
    StructField("outcome", IntegerType, true) :: Nil)

  val dfWithCorrectSchema = spark.createDataFrame(df.rdd, schema)

It is tedious though, especially for larger tables, so I'd love this bug to be fixed. Or if someone has a better workaround, I'd be willing to give it a go!


annaleighsmith commented on May 9, 2024

Are you ever going to provide an actual fix for this?


mokabiru commented on May 9, 2024

Duplicate of the same issue: #2


pmadusud commented on May 9, 2024

@mokabiru Did you get the issue even with the same source/target table for reading and writing?


mokabiru commented on May 9, 2024

@pmadusud - I used different tables for read vs. write, but the data type of the column is the same across both tables. For more info, can you paste your table schema and highlight the column that's causing the error?


pmadusud commented on May 9, 2024

@mokabiru The first column of the table is a NOT NULL column; however, when I read the data, it automatically becomes a nullable column. When I try to save, the mismatch throws the error. The way I worked around this is by updating the schema of the dataframe using the schema of the dataframe I get from reading the SQL table.

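A minimal sketch of that workaround, assuming targetDf was read from the SQL table being written to and df holds the data to write (connection options are placeholders):

  # Reuse the schema (including nullability) read from the SQL table
  fixedDf = spark.createDataFrame(df.rdd, targetDf.schema)

  fixedDf.write \
      .format("com.microsoft.sqlserver.jdbc.spark") \
      .mode("append") \
      .option("url", url) \
      .option("dbtable", table) \
      .option("user", user) \
      .option("password", password) \
      .save()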

shivsood commented on May 9, 2024

@pmadusud looks like reading the data made it into a nullable column. How did you read that data?


pmadusud commented on May 9, 2024

@shivsood It is using spark.read... something similar to the below,
  jdbcDF = spark.read \
      .format("com.microsoft.sqlserver.jdbc.spark") \
      .option("url", url) \
      .option("query", query) \
      .load()


mokabiru commented on May 9, 2024

@shivsood It is using spark.read... something similar to the below,
  jdbcDF = spark.read \
      .format("com.microsoft.sqlserver.jdbc.spark") \
      .option("url", url) \
      .option("query", query) \
      .load()

The issue occurs even when using the "dbtable" option.


shivsood commented on May 9, 2024

@pmadusud @mokabiru Read is delegated to the generic JDBC implementation here, so you should see the same issue when using the default JDBC connector. Can you please check and confirm? Replace the format with "jdbc" and see if that read also makes it a nullable column.

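For example, the same read with the built-in JDBC source, to compare the inferred nullability (url and query are placeholders):

  jdbcDF = spark.read \
      .format("jdbc") \
      .option("url", url) \
      .option("query", query) \
      .load()

  jdbcDF.printSchema()  # does the NOT NULL column still come back as nullable = true?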

pmadusud commented on May 9, 2024

@shivsood Yes, both jdbc and com.microsoft.sqlserver.jdbc.spark return the columns as nullable even though the column is not nullable. However, when writing the data back to SQL, jdbc doesn't complain if there are datatype or nullable-column differences, but com.microsoft.sqlserver.jdbc.spark errors out...


nnani commented on May 9, 2024

Has anyone got this working?
I tried using this library for writing data, but it seems this bug is not allowing it.
Any workaround?
For now, I've just commented out the assertCondition for this check and am trying to get around it that way.


ankitbko commented on May 9, 2024

@shivsood Moreover, when Spark writes to Parquet it automatically converts all columns to nullable. This is mentioned in the Spark docs: https://spark.apache.org/docs/2.4.5/sql-data-sources-parquet.html

When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

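A quick way to observe this behavior, assuming df has at least one non-nullable column (the output path is a placeholder):

  df.write.mode("overwrite").parquet("/tmp/nullability_check")
  spark.read.parquet("/tmp/nullability_check").printSchema()  # every column comes back as nullable = true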

ankitbko commented on May 9, 2024

Is the null checking absolutely necessary? Is it possible to have it behind a flag?

On a separate note, the error message could be improved a lot. Instead of printing ${prefix} column nullable configurations at column index ${i}, it would help if the column name were displayed instead of the index. That would save the user from having to run the cell again, capture the dataframe's schema, and count the columns to find the root cause. The same goes for a few other messages I saw in that file.

cc: @rajmera3

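In the meantime, one way to map the index from the error message back to a column of the dataframe being written (here called df; the index 0 is only an example value):

  idx = 0  # column index reported in the error message
  field = df.schema.fields[idx]
  print(field.name, field.dataType, field.nullable)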

gbrueckl commented on May 9, 2024

this is the workaround I am currently using:

  print("Getting schema of target table '{}' ...".format(targetTableName))
  targetDf = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", targetTableName) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .load().filter("1 = 0")
  
  print("Getting soruce data: \n{}".format(sourceSqlQuery))
  writeDf = spark.sql(sourceSqlQuery)
   
  if targetDf.columns != writeDf.columns:
    writeDf = writeDf.select(targetDf.columns)

  if targetDf.schema != writeDf.schema:
    writeDf = spark.createDataFrame(writeDf.rdd, targetDf.schema)
  
  print("Writing results to MS SQL ...")
  writeDf.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode(writeMode) \
    .option("url", jdbcUrl) \
    .option("dbtable", targetTableName) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .save()

This fixes the issue with nullable columns and also only selects columns that actually exist in the target table.

However, it would be great if the check on nullable columns could be turned off, especially as the issue exists in both directions:
even if the Spark DF column is NOT NULLABLE and the SQL column is NULLABLE, the .write() will fail.

UPDATE:
OK, the workaround only works for the case where the dataframe column is NOT NULL but the SQL Server column would allow NULL.

However, there is then another bug where spark.read with this provider does not return the correct column metadata!


shivsood commented on May 9, 2024

The following will need to go behind a flag to disable the null check in matchSchema:

  assertCondition(dfCols(dfFieldIndex).nullable == tableCols(i).nullable,
    s"${prefix} column nullable configurations at column index ${i}")


shivsood commented on May 9, 2024

Added a flag that can be used to disable strict schema checks. PR #50


MrWhiteABEX commented on May 9, 2024

An alternative workaround is to use a projection:

import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when, lit}

df.select(
  new Column(AssertNotNull(col("NullableCol").expr)).as("NonNullableCol"),  // nullable to non-nullable
  when(col("NonNullableCol").isNotNull, col("NonNullableCol")).otherwise(lit(null)).as("NullableCol")  // the other way around
)


Praxkv1 commented on May 9, 2024

I get the exact same error as mentioned above. I am copying 50+ tables from one SQL Server to another SQL Server using pyspark. What I observed is that the nullability of the column is already lost while reading the data into the dataframe: I have a column which is NOT NULL in my source db, but when I read the data into a dataframe this column is marked as nullable when I do a df.printSchema() :(

I have created the Spark dataframe using the connector. The schema of the dataframe, in terms of nullable columns etc., is different from the source table, and this causes the error while trying to save the data into a similar SQL table.

java.sql.SQLException: Spark Dataframe and SQL Server table have differing column nullable configurations at column index 0

