Comments (23)
A flag "schemaCheckEnabled" is now supported. When set to false, connector with skip the strict schema checks.
I absolutely agree with @sdebruyn & @utsavjha
This is not a safe solution! It would be better to have a proper fix.
@shivsood Is that the official fix for this problem? That you simply skip the check instead of fixing it? What happens when there is an incompatibility while writing?
@shivsood that is not a good solution
The schema should have nullability as in the database. Just disabling the check is not a solution, it's an easy workaround for you.
The workaround I've been using is to explicitly set fields to nullable/not-nullable in the schema, for example:
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("uid", StringType, false) ::
  StructField("date", DateType, true) ::
  StructField("type", StringType, true) :: // Explicitly make nullable, due to issue #5
  StructField("outcome", IntegerType, true) :: Nil)
val dfWithCorrectSchema = spark.createDataFrame(df.rdd, schema)
It is tedious though, especially for larger tables, so I'd love this bug to be fixed. Or if someone has a better workaround, I'd be willing to give it a go!
Are you ever going to provide an actual fix for this?
Duplicate of the same issue: #2
@mokabiru Did you get the issue even with the same source/target table for reading and writing?
@pmadusud - I used different tables for read vs write, but the data type of the column is the same across both tables. For more info, can you paste your table schema and highlight the column that's causing the error?
@mokabiru The first column of the table is a NOT NULL column; however, when I read the data, it automatically becomes a nullable column. When I try to save, the mismatch throws the error. The way I worked around this is by updating the schema of the dataframe using the dataframe I get from reading the SQL table.
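A rough sketch of that workaround (url and table name are placeholders; a fuller version appears further down the thread):
# Capture the schema Spark infers when reading the SQL table
target_schema = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", "dbo.my_table") \
    .load() \
    .schema
# Rebuild the dataframe with that schema so nullability matches
df_fixed = spark.createDataFrame(df.rdd, target_schema)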
@pmadusud looks like reading the data made it into a nullable column. How did you read that data?
@shivsood It is using spark.read, something similar to the below:
jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("query", query) \
    .load()
The issue also occurs when using the "dbtable" option.
@pmadusud @mokabiru Read is delegated to the generic JDBC implementation here, so you should see the same issue with the default JDBC connector. Can you please check and confirm? Replace the format with "jdbc" and see if that read also makes it a nullable column.
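For reference, the comparison being asked for might look like this (url and query are placeholders):
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .load()
# Inspect the reported nullability of each column
jdbcDF.printSchema()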
@shivsood Yes, both jdbc and com.microsoft.sqlserver.jdbc.spark return the columns as nullable even though the column is not nullable. However, while writing the data back to SQL, jdbc doesn't complain if there are datatype or nullability differences, but com.microsoft.sqlserver.jdbc.spark errors out...
Has anyone got this working?
I tried using this library for writing data, but this bug seems to prevent it.
Any workaround?
For now, I have just commented out the assertCondition for this check to get around it.
@shivsood Moreover, when Spark writes to Parquet it automatically converts all columns to nullable. This is mentioned in the Spark docs: https://spark.apache.org/docs/2.4.5/sql-data-sources-parquet.html
"When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons."
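A quick sketch that demonstrates this documented behaviour (the output path is a placeholder):
from pyspark.sql.types import StructType, StructField, StringType

# A dataframe with an explicitly non-nullable column
schema = StructType([StructField("uid", StringType(), nullable=False)])
df = spark.createDataFrame([("a",)], schema)
df.printSchema()  # uid: nullable = false

# After a Parquet round trip, the column comes back nullable
df.write.mode("overwrite").parquet("/tmp/nullability_demo")
spark.read.parquet("/tmp/nullability_demo").printSchema()  # uid: nullable = true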
Is the null checking absolutely necessary? Is it possible to put it behind a flag?
On a separate note, the error message can be improved a lot. Instead of printing "${prefix} column nullable configurations at column index ${i}", it would help if the column name were displayed instead of the index. This would save the user from rerunning the cell to capture the dataframe's schema and counting the columns to find the root cause. The same goes for a few other messages I saw in that file.
cc: @rajmera3
This is the workaround I am currently using:
print("Getting schema of target table '{}' ...".format(targetTableName))
# Read zero rows from the target table just to capture its schema
targetDf = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", targetTableName) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .load().filter("1 = 0")

print("Getting source data: \n{}".format(sourceSqlQuery))
writeDf = spark.sql(sourceSqlQuery)

# Keep only the columns that exist in the target table, in the target's order
if targetDf.columns != writeDf.columns:
    writeDf = writeDf.select(targetDf.columns)
# Rebuild the dataframe with the target's schema so nullability matches
if targetDf.schema != writeDf.schema:
    writeDf = spark.createDataFrame(writeDf.rdd, targetDf.schema)

print("Writing results to MS SQL ...")
writeDf.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode(writeMode) \
    .option("url", jdbcUrl) \
    .option("dbtable", targetTableName) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .save()
This fixes the issue with nullable columns and also only selects columns that actually exist in the target table. However, it would be great if the check on nullable columns could be turned off, especially as the issue exists in both directions: even if the Spark dataframe column is NOT NULLABLE and the SQL column is NULLABLE, the .write() will fail.
UPDATE:
OK, the workaround only works for the case where the dataframe column is NOT NULL but the SQL server would allow NULL. However, there is then another bug where spark.read with this provider does not return the correct column metadata!
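One way to see that mismatch (a sketch; the INFORMATION_SCHEMA query is illustrative and assumes targetTableName is a bare table name) is to compare the connector's inferred schema with what the server itself reports:
# Ask SQL Server directly which columns are nullable, then compare the
# result with targetDf.printSchema() from the snippet above
meta = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .option("query", "SELECT COLUMN_NAME, IS_NULLABLE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{}'".format(targetTableName)) \
    .load()
meta.show()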
The following will need to go behind the flag to disable the null check in matchSchema:
assertCondition(dfCols(dfFieldIndex).nullable == tableCols(i).nullable,
  s"${prefix} column nullable configurations at column index ${i}")
Added a flag that can be used to disable strict schema checks. PR #50
An alternative workaround is to use a projection:
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

df.select(
  new Column(AssertNotNull(col("NullableCol").expr)).as("NonNullableCol"), // nullable to non-nullable
  when(col("NonNullableCol").isNotNull, col("NonNullableCol")).otherwise(lit(null)).as("NullableCol") // the other way around
)
I get the exact same error as mentioned above. I am copying 50+ tables from one SQL server to another using PySpark. What I observed is that the nullability of the column is already lost while reading the data into the dataframe. I have a column which is NOT NULL in my source db, but when I read the data into a dataframe, the column is marked as nullable when I do a df.printSchema() :(
I have created the Spark dataframe using the connector. The schema of the dataframe, in terms of nullable columns etc., is different from the source table's, and this causes the error when trying to save the data into a similar SQL table:
java.sql.SQLException: Spark Dataframe and SQL Server table have differing column nullable configurations at column index 0