Comments (17)
Hello @SteffenMangold, this is something we are looking into at the moment. It is a little surprising that it behaves the way it does. We are currently comparing the effective POMs for differences that we suspect may be the cause. We will try to get a fix out in the next version of the connector.
from azure-kusto-spark.
Hello @makism,
Right! Adding to your point:
We have a fair guess at why this is happening. When the lib is imported from Maven, the dependent jars are exploded on the Spark classpath (Spark UI -> Classpath libs), whereas when we upload the jar file, it is a single user jar.
From what we have narrowed down so far, this looks like a classloader issue where some conflicting jars get loaded in the wrong way (unfortunately, there are quite a few such jars to go through and debug to find the cause; we're at it). The prime suspects are the azure-storage and Jackson libs.
The strategy we are trying is to shade such conflicting jars, and to add some tests by publishing to staging Maven feeds (E2E tests pass when we upload fat jars and test!).
Hopefully we can get this out to maven mid next week or so!
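For anyone who wants to check for this kind of conflict themselves, here is a minimal sketch (not the connector team's tooling) that lists class files present in more than one jar, for example the connector fat jar versus jars already on the cluster. The function name `find_duplicate_classes` is made up for illustration, and it assumes you have copied the jars of interest to local paths.

```python
import sys
import zipfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_classes(jar_paths):
    """Return {class entry: [jar names]} for .class entries found in 2+ jars."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as zf:
            # set() guards against an entry being listed twice in one jar
            for name in set(zf.namelist()):
                if name.endswith(".class"):
                    owners[name].append(Path(jar).name)
    return {cls: jars for cls, jars in owners.items() if len(jars) > 1}


if __name__ == "__main__":
    # Usage: python find_dupes.py first.jar second.jar ...
    for cls, jars in sorted(find_duplicate_classes(sys.argv[1:]).items()):
        print(cls, "->", ", ".join(jars))
```

A duplicate entry is only a hint: it shows where two jars ship the same class, not which copy the classloader actually picks at runtime.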
from azure-kusto-spark.
Hi @makism, @letmaik, @lgo-solytic, @SteffenMangold
This is fixed in the latest revision. Please give it a try and let us know.
from azure-kusto-spark.
Hello @letmaik
Thanks for reporting the issue. That is strange, as the same jars make it to Maven. We'll have a look at it.
Quick question in the meantime: would you be able to share the executor log4j-active, stdout, and stderr logs from the Spark UI?
If you could, that would be a great start for us to troubleshoot.
cc: @asaharn
from azure-kusto-spark.
OMG I had that problem too 😵💫
With kusto-spark_3.0_2.12-5.0.2-jar-with-dependencies.jar it indeed worked instantly.
from azure-kusto-spark.
hey there, I also stumbled upon exactly the same issue, and the proposed solution works.
unfortunately I don't have access to the logs anymore, except that the following warning was continuously logged:
HangingTaskDetector: Task XXXX is probably not making progress because its metrics
the other thing I noticed using the jar from gh is that the logs contain responses (probably from ADX) saying that the blobs are getting sealed; i think that never showed up in the old runs.
hopefully these are relevant :-)
from azure-kusto-spark.
@makism, thanks for the hint. At the moment, we're slowly wading through some traces. There are no differences between the jars except some license files and entries in the manifest, and these usually should not cause an issue. We are tracing/profiling further to see what causes the unresponsiveness with the Maven artifact. It sure sounds weird; will keep you posted on the progress.
from azure-kusto-spark.
@ag-ramachandran, could it be that the issue actually stems from another package, like kusto-data or kusto-ingest (also installed on Azure DBW)? 🤔
from azure-kusto-spark.
@ag-ramachandran Can we expect that fix to come up anytime soon? Working around that makes our deployment pipeline more complicated.
from azure-kusto-spark.
@lgo-solytic, right now, fixing this requires large refactors of the code.
Reason: many libraries come bundled with Databricks (https://learn.microsoft.com/en-us/azure/databricks/release-notes/runtime/13.3lts#installed-java-and-scala-libraries-scala-212-cluster-version). The same goes for the storage libraries that come with Spark (storage versions, HDFS libs, etc.).
Resolving these conflicts will take time; the target is to get an update out in the November timeframe. The additional challenge with Databricks is that custom Maven feeds are currently not permitted (at least not directly), so tests in that area are challenging for lib/connector devs as well.
Right now we're trying our best to get around these. Will update the issue as soon as we have a fix.
from azure-kusto-spark.
@ag-ramachandran Thanks! Glad that you found a way to fix the issue. Can you let me know if the Maven package is being tested in CI?
from azure-kusto-spark.
@letmaik, it was published to JitPack, imported to Databricks as a custom jar from the repo, and tested.
Ref: https://jitpack.io/#Azure/azure-kusto-spark
After publishing to OSS Sonatype, it was tested from staging as well. Coverage and code tests run from CI today.
Long answer & some more details
The latest tests were done from this commit hash : 2165bdc
Build log of this is here https://jitpack.io/com/github/Azure/azure-kusto-spark/2165bdcce3/build.log
This results in the following artefacts :
✅ Build artifacts:
com.github.Azure.azure-kusto-spark:kusto-spark_3.0_2.12:2165bdcce3
com.github.Azure.azure-kusto-spark:connector-samples:2165bdcce3
com.github.Azure.azure-kusto-spark:azure-kusto-spark:2165bdcce --> Is imported to Databricks and tested. This is the only way to test in an E2E fashion
from azure-kusto-spark.
> Is imported to Databricks and tested. This is the only way to test in an E2E fashion
I have been running pyspark in local mode, which is easily testable in GitHub Actions, for example:
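from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("test")
    # hadoop-azure is resolved from Maven Central
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.6")
    # temporary workaround: load the connector from a local fat jar
    .config("spark.jars", "kusto-spark_3.0_2.12-5.0.2-jar-with-dependencies.jar")
    .getOrCreate()
)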
So, going forward, I would instead consume azure-kusto-spark from Maven Central using "spark.jars.packages" in the above. I will test it soon, but before I switch back to Maven Central I want to make sure that each release has automated tests that would catch these problems. The CI in this repo seems to only build the package, not test it.
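To make the switch concrete, here is a rough sketch of how the two setups differ in terms of Spark config. The helper names `connector_conf` and `apply_conf` are made up for illustration, and the Maven coordinate is passed in by the caller: the group ID `com.microsoft.azure.kusto` in the usage comment is my assumption, and the version is deliberately left as a placeholder.

```python
# hadoop-azure coordinate taken from the local-mode snippet in this comment
HADOOP_AZURE = "org.apache.hadoop:hadoop-azure:3.3.6"


def connector_conf(source, value):
    """Build Spark config entries for loading the Kusto connector.

    source="maven": value is a Maven coordinate, resolved with its dependencies.
    source="jar":   value is a path to a local fat jar (the current workaround).
    """
    if source == "maven":
        # spark.jars.packages takes a comma-separated list of coordinates
        return {"spark.jars.packages": f"{HADOOP_AZURE},{value}"}
    if source == "jar":
        return {"spark.jars.packages": HADOOP_AZURE, "spark.jars": value}
    raise ValueError(f"unknown source: {source!r}")


def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, val in conf.items():
        builder = builder.config(key, val)
    return builder


# Usage (requires pyspark; coordinate is an assumption, version left open):
#   from pyspark.sql import SparkSession
#   conf = connector_conf("maven", "com.microsoft.azure.kusto:kusto-spark_3.0_2.12:<version>")
#   spark = apply_conf(SparkSession.builder.appName("test"), conf).getOrCreate()
```

Keeping the choice in one function means a CI job can run the same test once per source, which is the kind of check that would catch a Maven-only failure.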
from azure-kusto-spark.
@letmaik, that is a good idea. The challenge is the transitive dependencies that are brought in, and the implicit Spark sessions that are created on some platforms.
To make sure things work from Maven coordinates and not just from local uber jars, the best approach we found was to use a local Sonatype Nexus/Maven feed on ADO, which we could test privately (using notebooks we have for scenarios covering various options, reads, and writes).
The code itself has CI as a prerequisite. It runs Spark in local master mode and uses the mvn verify phase, which runs all the tests. The additional tests we did were specifically for the reported issue with bundled jars vs. transitively downloaded deps.
from azure-kusto-spark.
@letmaik, were you able to try this? Did it work?
from azure-kusto-spark.
@ag-ramachandran Yes, it all works fine. Thanks for fixing this! :)
from azure-kusto-spark.
Thanks @letmaik , closing the issue now
from azure-kusto-spark.