Comments (23)
A flag "schemaCheckEnabled" is now supported. When set to false, connector with skip the strict schema checks.
I absolutely agree with @sdebruyn & @utsavjha
This is not a safe solution! It would be better to have a proper fix.
@shivsood Is that the official fix for this problem? That you simply skip the check instead of fixing it? What happens when there is an incompatibility while writing?
@shivsood that is not a good solution
The schema should have nullability as in the database. Just disabling the check is not a solution, it's an easy workaround for you.
The workaround I've been using is to explicitly set fields to nullable/not-nullable in the schema, for example:
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("uid", StringType, false) ::
  StructField("date", DateType, true) ::
  StructField("type", StringType, true) :: // Explicitly make nullable, due to issue #5
  StructField("outcome", IntegerType, true) :: Nil)
val dfWithCorrectSchema = spark.createDataFrame(df.rdd, schema)
It is tedious though, especially for larger tables, so I'd love this bug to be fixed. Or if someone has a better workaround, I'd be willing to give it a go!
Are you ever going to provide an actual fix for this?
Duplicate of the same issue: #2
@mokabiru Did you get the issue even with the same source/target table for reading and writing?
@pmadusud - I used different tables for read vs write, but the data type of the column is the same across both tables. For more info, can you paste your table schema and highlight the column that's causing the error?
@mokabiru The first column of the table is a NOT NULL column; however, when I read the data, it automatically becomes a nullable column. When I try to save, the mismatch throws the error. The way I worked around this is by updating the schema of the dataframe using the dataframe I get from reading the SQL table.
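A rough sketch of that workaround (url and table name are placeholders; a fuller version appears further down the thread):
# Capture the schema Spark infers when reading the SQL table
target_schema = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", "dbo.my_table") \
    .load() \
    .schema
# Rebuild the dataframe with that schema so nullability matches
df_fixed = spark.createDataFrame(df.rdd, target_schema)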
@pmadusud looks like reading the data made it into a nullable column. How did you read that data?
@shivsood It is using spark.read, something similar to the below:
jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("query", query) \
    .load()
The issue also occurs when using the "dbtable" option.
@pmadusud @mokabiru Read is delegated to the generic JDBC implementation here, so you should see the same issue with the default JDBC connector. Can you please check and confirm? Replace the format with "jdbc" and see if that read also makes it a nullable column.
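For reference, the comparison being asked for might look like this (url and query are placeholders):
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .load()
# Inspect the reported nullability of each column
jdbcDF.printSchema()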
@shivsood Yes, both jdbc and com.microsoft.sqlserver.jdbc.spark return the columns as nullable even though the column is not nullable. However, while writing the data back to SQL, jdbc doesn't complain if there are datatype or nullability differences, but com.microsoft.sqlserver.jdbc.spark errors out...
Has anyone got this working?
I tried using this library for writing data, but this bug seems to prevent it.
Any workaround?
For now, I have just commented out the assertCondition for this check to get around it.
@shivsood Moreover, when Spark writes to Parquet it automatically converts all columns to nullable. This is mentioned in the Spark docs: https://spark.apache.org/docs/2.4.5/sql-data-sources-parquet.html
"When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons."
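A quick sketch that demonstrates this documented behaviour (the output path is a placeholder):
from pyspark.sql.types import StructType, StructField, StringType

# A dataframe with an explicitly non-nullable column
schema = StructType([StructField("uid", StringType(), nullable=False)])
df = spark.createDataFrame([("a",)], schema)
df.printSchema()  # uid: nullable = false

# After a Parquet round trip, the column comes back nullable
df.write.mode("overwrite").parquet("/tmp/nullability_demo")
spark.read.parquet("/tmp/nullability_demo").printSchema()  # uid: nullable = true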
Is the null checking absolutely necessary? Is it possible to put it behind a flag?
On a separate note, the error message can be improved a lot. Instead of printing "${prefix} column nullable configurations at column index ${i}", it would help if the column name were displayed instead of the index. This would save the user from rerunning the cell to capture the dataframe's schema and counting the columns to find the root cause. The same goes for a few other messages I saw in that file.
cc: @rajmera3
This is the workaround I am currently using:
print("Getting schema of target table '{}' ...".format(targetTableName))
# Read zero rows from the target table just to capture its schema
targetDf = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("dbtable", targetTableName) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .load().filter("1 = 0")

print("Getting source data: \n{}".format(sourceSqlQuery))
writeDf = spark.sql(sourceSqlQuery)

# Keep only the columns that exist in the target table, in the target's order
if targetDf.columns != writeDf.columns:
    writeDf = writeDf.select(targetDf.columns)
# Rebuild the dataframe with the target's schema so nullability matches
if targetDf.schema != writeDf.schema:
    writeDf = spark.createDataFrame(writeDf.rdd, targetDf.schema)

print("Writing results to MS SQL ...")
writeDf.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode(writeMode) \
    .option("url", jdbcUrl) \
    .option("dbtable", targetTableName) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .save()
This fixes the issue with nullable columns and also only selects columns that actually exist in the target table. However, it would be great if the check on nullable columns could be turned off, especially as the issue exists in both directions: even if the Spark dataframe column is NOT NULLABLE and the SQL column is NULLABLE, the .write() will fail.
UPDATE:
OK, the workaround only works for the case where the dataframe column is NOT NULL but the SQL server would allow NULL. However, there is then another bug where spark.read with this provider does not return the correct column metadata!
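One way to see that mismatch (a sketch; the INFORMATION_SCHEMA query is illustrative and assumes targetTableName is a bare table name) is to compare the connector's inferred schema with what the server itself reports:
# Ask SQL Server directly which columns are nullable, then compare the
# result with targetDf.printSchema() from the snippet above
meta = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", jdbcUrl) \
    .option("user", sqlUsername) \
    .option("password", sqlPassword) \
    .option("query", "SELECT COLUMN_NAME, IS_NULLABLE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{}'".format(targetTableName)) \
    .load()
meta.show()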
The following will need to go behind the flag to disable the null check in matchSchema:
assertCondition(dfCols(dfFieldIndex).nullable == tableCols(i).nullable,
  s"${prefix} column nullable configurations at column index ${i}")
Added a flag that can be used to disable strict schema checks. PR #50
An alternative workaround is to use a projection:
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

df.select(
  new Column(AssertNotNull(col("NullableCol").expr)).as("NonNullableCol"), // nullable to non-nullable
  when(col("NonNullableCol").isNotNull, col("NonNullableCol")).otherwise(lit(null)).as("NullableCol") // the other way around
)
I get the exact same error as mentioned above. I am copying 50+ tables from one SQL server to another using PySpark. What I observed is that the nullability of the column is already lost while reading the data into the dataframe. I have a column which is NOT NULL in my source db, but when I read the data into a dataframe, the column is marked as nullable when I do a df.printSchema() :(
I have created the Spark dataframe using the connector. The schema of the dataframe, in terms of nullable columns etc., is different from the source table's, and this causes the error when trying to save the data into a similar SQL table:
java.sql.SQLException: Spark Dataframe and SQL Server table have differing column nullable configurations at column index 0