
cdm-azure-data-services-integration's Introduction

page_type: sample
languages: csharp, python, ruby, powershell
products: azure
description: Tutorial and sample code for integrating Power BI dataflows and Azure Data Services using Common Data Model folders in Azure Data Lake.

OBSOLETE

The technology used in this tutorial is now obsolete.

For information on using Azure Data Factory mapping data flows to read and write CDM entity data, see this blog post, which describes the overall solution, with links to an article describing how CDM support uses inline datasets, and an article providing details of the source and sink properties.

For information on the new Spark CDM Connector for use in Azure Databricks and Synapse to read and write CDM entity data, see https://github.com/Azure/spark-cdm-connector
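As a rough orientation only, the sketch below shows how an entity read typically looks with the Spark CDM Connector. The option names (`storage`, `manifestPath`, `entity`) and the placeholder account/manifest/entity values are assumptions based on that connector's own documentation, not part of this tutorial; check the linked repository for the current API before relying on them.

```python
# Minimal sketch of a Spark CDM Connector read (assumed API; verify against the
# spark-cdm-connector repository). All values below are placeholders.
df = (spark.read.format("com.microsoft.cdm")
      .option("storage", "mystorage.dfs.core.windows.net")                # ADLS Gen2 account (assumed option name)
      .option("manifestPath", "container/WideWorldImporters-Sales/default.manifest.cdm.json")  # assumed path
      .option("entity", "Sales Orders")                                    # placeholder entity name
      .load())

df.show(10)
```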


CDM folders and Azure Data Services integration

Tutorial and sample code for integrating Power BI dataflows and Azure Data Services using Common Data Model (CDM) folders in Azure Data Lake Storage Gen2. For more information on the scenario, see this blog post.

Features

The tutorial walks through the use of CDM folders in a modern data warehouse scenario. In it you will:

  • Configure your Power BI account to save Power BI dataflows as CDM folders in ADLS Gen2;
  • Create a Power BI dataflow by ingesting order data from the Wide World Importers sample database and save it as a CDM folder;
  • Use an Azure Databricks notebook that prepares and cleanses the data in the CDM folder, and then writes the updated data to a new CDM folder in ADLS Gen2;
  • Use Azure Machine Learning to train and publish a model using data from the CDM folder;
  • Use an Azure Data Factory pipeline to load data from the CDM folder into staging tables in Azure SQL Data Warehouse and then invoke stored procedures that transform the data into a dimensional model;
  • Use Azure Data Factory to orchestrate the overall process and monitor execution.

Each step leverages the metadata contained in the CDM folder to make the task simpler to accomplish.

The provided samples include code, libraries, and Azure resource templates that you can use with CDM folders you create from your own data.
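For a quick sense of how the Databricks library in these samples is invoked, the sketch below assembles the read/write pattern that appears in the issue reports further down this page (options such as `cdmModel`, `entity`, `appId`, `appKey`, `tenantId`, `cdmFolder`, and `cdmModelName`). The variables are placeholders, and this is an illustrative sketch rather than a tested end-to-end example; follow the tutorial for the exact setup.

```python
# Sketch based on the read/write snippets in the issues below.
# inputLocation, outputLocation, appID, appKey, tenantID, and cdmModelName are
# placeholders you would define from your own CDM folder and service principal.
salesOrderDf = (spark.read.format("com.microsoft.cdm")
                .option("cdmModel", inputLocation)      # path to the source model.json
                .option("entity", "Sales Orders")
                .option("appId", appID)
                .option("appKey", appKey)
                .option("tenantId", tenantID)
                .load())

(salesOrderDf.write.format("com.microsoft.cdm")
 .option("entity", "Sales Orders")
 .option("appId", appID)
 .option("appKey", appKey)
 .option("tenantId", tenantID)
 .option("cdmFolder", outputLocation)                   # destination CDM folder
 .option("cdmModelName", cdmModelName)
 .save())
```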

IMPORTANT: The sample code is provided as-is, with no warranties, and is intended for learning purposes only.

Getting Started

See the tutorial for prerequisites and installation details.

License

The sample code and tutorials in this project are licensed under the MIT license. See the LICENSE file for more details.

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

cdm-azure-data-services-integration's People

Contributors

anumjs, billgib, drwill-ms, euangardenms, microsoftopensource, mmn599, msftgits, mspreshah, sharonlo101, supernova-eng, theresapalmer


cdm-azure-data-services-integration's Issues

java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.cdm

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Set up the cluster using the "5.2 (includes Apache Spark 2.4.0, Scala 2.11)" runtime version.
Imported the latest version of the library.
Tried to run the following command in my notebook:
```python
partiesDf = (spark.read.format("com.microsoft.cdm")
             .option("cdmModel", inputLocation)
             .option("entity", "IN_PARTIES")
             .option("appId", appID)
             .option("appKey", appKey)
             .option("tenantId", tenantID)
             .load())
```

Any log messages given by the failure

java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.cdm. Please find packages at http://spark.apache.org/third-party-projects.html

Expected/desired behavior

Expect the read to succeed.

OS and Version?

Windows 10

Versions

N/A

Mention any other details that might be useful

N/A


Thanks! We'll be in touch soon.

Databricks/Datalake - Writing simple dataset to CDM gives: Write job aborted

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

  • run a simple select from any table
  • write the result to a data-lake empty folder

Code sample:
```scala
val consolidatedSessions = spark.sql("SELECT * FROM sessions limit 100")

(consolidatedSessions.write.format("com.microsoft.cdm")
  .option("entity", "Sessions")
  .option("appId", clientId)
  .option("appKey", secret)
  .option("tenantId", tenantId)
  .option("cdmFolder", cdmDataLakeFolder)
  .option("cdmModelName", cdmModelName)
  .save())
```

After this exception, the job has created only a snapshot file, without writing the model.json.

Any log messages given by the failure

org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:183)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:114)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:281)
at lineb7a43bd5cad8496aa5f3fefa36dd780933.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-2525265643147671:10)
at lineb7a43bd5cad8496aa5f3fefa36dd780933.$read$$iw$$iw$$iw$$iw$$iw.(command-2525265643147671:62)
at lineb7a43bd5cad8496aa5f3fefa36dd780933.$read$$iw$$iw$$iw$$iw.(command-2525265643147671:64)
at lineb7a43bd5cad8496aa5f3fefa36dd780933.$read$$iw$$iw$$iw.(command-2525265643147671:66)

Expected/desired behavior

All the data is saved properly in the data lake.

OS and Version

5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)

error"java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.Integer" in read-write-demo-wide-world-importers.py

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

In the section "Write all entities to output CDM folder", it went wrong on:

```python
(salesOrderDf.write.format("com.microsoft.cdm")
 .option("entity", "Sales Orders")
 .option("appId", appID)
 .option("appKey", appKey)
 .option("tenantId", tenantID)
 .option("cdmFolder", outputLocation)
 .option("cdmModelName", cdmModelName)
 .save())
```

Any log messages given by the failure

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-2033808373542819> in <module>()
      5                    .option("tenantId", tenantID)
      6                    .option("cdmFolder", outputLocation)
----> 7                    .option("cdmModelName", cdmModelName)
      8                    .save())

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    734             self.format(format)
    735         if path is None:
--> 736             self._jwrite.save()
    737         else:
    738             self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o1985.save.
: org.apache.spark.SparkException: Writing job aborted.
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:183)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:114)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:281)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 21, 10.139.64.5, executor 2): java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
	at org.apache.spark.scheduler.Task.run(Task.scala:112)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Expected/desired behavior

OS and Version?

Windows 10.

Versions


Databricks sample problem

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

In Azure Databricks, after reading a CDM entity into a DataFrame, 'display(df)' never finishes.

Getting a failure in the notebook - "org.apache.http.client.HttpResponseException: The specified path does not exist."

Could be either of the below, or a user error:

  • bug report -> please search issues before submitting
  • feature request
  • documentation issue or request
  • regression (a behavior that used to work and stopped in a new release)

### Minimal steps to reproduce
Configured the secrets per the tutorial. Triple-checked that I have the tenantID correct from AAD.
Run the whole notebook. Get the error.

I also attempted to hard-code the tenantID and got the same error.

### Any log messages given by the failure
```python
salesOrderDf = (spark.read.format("com.microsoft.cdm")
                .option("cdmModel", inputLocation)
                .option("entity", "Sales Orders")
                .option("appId", appID)
                .option("appKey", appKey)
                .option("tenantId", tenantID)   # ERROR highlights this line
                .load())
```

org.apache.http.client.HttpResponseException: The specified path does not exist.
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-1583637544527603> in <module>()
      4                           .option("appId", appID)
      5                           .option("appKey", appKey)
----> 6                           .option("tenantId", tenantID)
      7                           .load())

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    170             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    171         else:
--> 172             return self._df(self._jreader.load())
    173 
    174     @since(1.4)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

### Expected/desired behavior
I expect the Notebook to execute correctly.

### OS and Version?
macOS, but everything is done through the browser and Azure.

### Versions
Not sure what is sought here. I configured the cluster using the versions in the article.

### Mention any other details that might be useful
None that I can think of.

Jupyter: CdmModel from JSON ValueError: 'date' is not a valid DataType

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run cdm-customer-classification-demo ipython notebook file in Jupyter

Get to the step where you initialize the model:

```python
import CdmModel
import json

model_endpoint = "https://cdmstoragepowerbi.dfs.core.windows.net/powerbi/WideWorldImporters/WideWorldImporters-Sales/model.json"

aad_token = generate_aad_token()
model_json = read_from_adls(endpoint=model_endpoint, auth=aad_token).json()  # error is raised here

model = CdmModel.Model.fromJson(model_json)
```

Any log messages given by the failure

ValueError Traceback (most recent call last)
ValueError: 'date' is not a valid DataType

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
3 model_json = read_from_adls(endpoint = model_endpoint, auth = aad_token).json()
4
----> 5 model = CdmModel.Model.fromJson(model_json)

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
370 elif not isinstance(value, dict):
371 value = json.load(value)
--> 372 return super(Model, cls).fromJson(value)
373
374 def toJson(self):

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
61 element = value.pop(entry.name, result)
62 if element != result:
---> 63 setattr(result, entry.name, cls.__getCtor(entry.cls)(element))
64 result.customProperties = value
65 return result

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
105 ctor = cls.itemType.fromJson
106 for item in value:
--> 107 super(ObjectCollection, result).append(ctor(item))
108 return result
109

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
45 def fromJson(cls, value):
46 actualClass = PolymorphicMeta.classes[cls][value["$type"]]
---> 47 return super(Polymorphic, actualClass).fromJson(value)
48
49 class Base(object):

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
61 element = value.pop(entry.name, result)
62 if element != result:
---> 63 setattr(result, entry.name, cls.__getCtor(entry.cls)(element))
64 result.customProperties = value
65 return result

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
105 ctor = cls.itemType.fromJson
106 for item in value:
--> 107 super(ObjectCollection, result).append(ctor(item))
108 return result
109

~\OneDrive\Desktop\cdm-azure-data-services-integration-master\cdm-azure-data-services-integration-master\AzureMachineLearning\CdmModel.py in fromJson(cls, value)
61 element = value.pop(entry.name, result)
62 if element != result:
---> 63 setattr(result, entry.name, cls.__getCtor(entry.cls)(element))
64 result.customProperties = value
65 return result

c:\users\mattb\appdata\local\programs\python\python37\lib\enum.py in call(cls, value, names, module, qualname, type, start)
308 """
309 if names is None: # simple value lookup
--> 310 return cls.new(cls, value)
311 # otherwise, functional API: we're creating a new Enum type
312 return cls.create(value, names, module=module, qualname=qualname, type=type, start=start)

c:\users\mattb\appdata\local\programs\python\python37\lib\enum.py in new(cls, value)
562 )
563 exc.context = ve_exc
--> 564 raise exc
565
566 def generate_next_value(name, start, count, last_values):

c:\users\mattb\appdata\local\programs\python\python37\lib\enum.py in new(cls, value)
546 try:
547 exc = None
--> 548 result = cls.missing(value)
549 except Exception as e:
550 exc = e

c:\users\mattb\appdata\local\programs\python\python37\lib\enum.py in missing(cls, value)
575 @classmethod
576 def missing(cls, value):
--> 577 raise ValueError("%r is not a valid %s" % (value, cls.name))
578
579 def repr(self):

ValueError: 'date' is not a valid DataType

Expected/desired behavior

This should load the Model.json file without issue

OS and Version?

Windows 10

Versions

Python 3.6

Mention any other details that might be useful


Thanks! We'll be in touch soon.

Saving new files in CDM via Databricks, save seems to overwrite the old files

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Saving an arbitrary Spark DataFrame to a CDM folder:

```python
(spark_df.write.format("com.microsoft.cdm")
 .option("entity", CDMentity)
 .option("appId", appID)
 .option("appKey", appKey)
 .option("tenantId", tenantID)
 .option("cdmFolder", outputLocation)
 .option("cdmModelName", CDMmodelName)
 .save())
```

Any log messages given by the failure

The CSV file in the snapshot folder has been overwritten.

Expected/desired behavior

Tried passing a file name in save, but without success.


Reading from Dynamics 365 using ADF using the API

Hi guys,
It looks like my earlier issue (no. 272) got deleted, so here it is again.
Can we create a linked service to Dynamics like we can for resources, databases, etc.?
This is how it's done to talk to blob storage:

Create Azure Data Factory using .NET SDK - Azure Data Factory | Microsoft Docs

Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");

LinkedServiceResource storageLinkedService = new LinkedServiceResource(
new AzureStorageLinkedService
{
ConnectionString = new SecureString(
"DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
";AccountKey=" + storageKey)
}
);
client.LinkedServices.CreateOrUpdate(
resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(
storageLinkedService, client.SerializationSettings));

From here
https://www.c-sharpcorner.com/article/create-azure-data-factory-and-pipeline-using-net-sdk/

I would like to build a generic solution that would read all the parent-child relations in a set of SQL tables and populate them in Dynamics.
Also, could you provide a sample that reads from Dynamics 365?

Best Regards,
Alistair

Jupyter model_json error

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Tried to follow the steps in the Jupyter Notebook walkthrough. Got to where I had to read in the model endpoint. In my case this was:

```python
model_endpoint = "https://cdmstoragepowerbi.blob.core.windows.net/powerbi/WideWorldImporters%2FWideWorldImporters-Sales%2Fmodel.json"
```

When I then tried to initialize the model_json variable in this code:

```python
aad_token = generate_aad_token()
model_json = read_from_adls(endpoint=model_endpoint, auth=aad_token).json()  # fails here
model = CdmModel.Model.fromJson(model_json)
```

it failed and gave the following error. Is it because I gave it the wrong file path for the endpoint?

Any log messages given by the failure

Expected/desired behavior

JSONDecodeError Traceback (most recent call last)
in
      1 #model_endpoint = "https<>WWI-Sales/model.json"
      2 aad_token = generate_aad_token()
----> 3 model_json = read_from_adls(endpoint = model_endpoint, auth = aad_token).json()
      4 model = CdmModel.Model.fromJson(model_json)

c:\users\mattb\appdata\local\programs\python\python37\lib\site-packages\requests\models.py in json(self, **kwargs)
888 try:
889 return complexjson.loads(
--> 890 self.content.decode(encoding), **kwargs
891 )
892 except UnicodeDecodeError:

c:\users\mattb\appdata\local\programs\python\python37\lib\json_init_.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
346 parse_int is None and parse_float is None and
347 parse_constant is None and object_pairs_hook is None and not kw):
--> 348 return _default_decoder.decode(s)
349 if cls is None:
350 cls = JSONDecoder

c:\users\mattb\appdata\local\programs\python\python37\lib\json\decoder.py in decode(self, s, _w)
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
339 if end != len(s):

c:\users\mattb\appdata\local\programs\python\python37\lib\json\decoder.py in raw_decode(self, s, idx)
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

OS and Version?

Windows 10

Versions

Python kernel used was 3.6.

Mention any other details that might be useful


Thanks! We'll be in touch soon.

java.util.NoSuchElementException: No value found for &#39;date&#39;

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run read-write-demo-wide-world-importers.py

Try to create the DataFrame for

"Sales Orders" and/or "Sales Customers"

Any log messages given by the failure

Py4JJavaError Traceback (most recent call last)
in
4 .option("appId", appID)
5 .option("appKey", appKey)
----> 6 .option("tenantId", tenantID)
7 .load())

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
171 else:
--> 172 return self._df(self._jreader.load())
173
174 @SInCE(1.4)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o285.load.
: java.util.NoSuchElementException: No value found for 'date'
at scala.Enumeration.withName(Enumeration.scala:124)
at com.microsoft.cdm.utils.CDMModel$$anonfun$schema$1.apply(CDMModel.scala:34)
at com.microsoft.cdm.utils.CDMModel$$anonfun$schema$1.apply(CDMModel.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at com.microsoft.cdm.utils.CDMModel.schema(CDMModel.scala:35)
at com.microsoft.cdm.read.CDMDataSourceReader.readSchema(CDMDataSourceReader.scala:28)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:175)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:290)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)

Expected/desired behavior

The data gets parsed correctly and the notebook runs without errors.

OS and Version?

Windows 10

Versions

DBR 5.5 Conda Beta Spark 2.4.3, Scala 2.11

Mention any other details that might be useful

I tried changing the date-time columns in the entity datasets to just date because I thought it might be a translation issue from the CDM model, but refreshing after doing that didn't change anything. I'm at a complete loss as to why this is happening, and why it only affects those two DataFrames.

Every other DataFrame in the query runs perfectly. It's literally just those two entities, "Sales Orders" and/or "Sales Customers".

Does anybody know what the fix is?


Thanks! We'll be in touch soon.

java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

I am cross-posting here from the repository https://github.com/Azure/spark-cdm

See issue Azure/spark-cdm#3

The issue is related to the new 2.4 spark-cdm branch, which allows the use of Spark 2.4.
It seems we still have an issue when calling a function like display(foo):

java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)

Unable to parse the date

This issue is for a: (mark with an x)

- [x ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run read-write-demo-wide-world-importers.py

Any log messages given by the failure

Every step starting with display(salesBuyingGroupsDf) fails with a date parsing error.

java.text.ParseException: Unable to parse the date: 01/01/2013 00:00:00
at org.apache.commons.lang.time.DateUtils.parseDateWithLeniency(DateUtils.java:359)
at org.apache.commons.lang.time.DateUtils.parseDate(DateUtils.java:285)
at com.microsoft.cdm.utils.DataConverter$$anonfun$6.apply(DataConverter.scala:43)
at com.microsoft.cdm.utils.DataConverter$$anonfun$6.apply(DataConverter.scala:43)
at com.microsoft.cdm.read.CDMDataReader$$anonfun$1.apply(CDMDataReader.scala:54)
at com.microsoft.cdm.read.CDMDataReader$$anonfun$1.apply(CDMDataReader.scala:48)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.microsoft.cdm.read.CDMDataReader.get(CDMDataReader.scala:48)
at com.microsoft.cdm.read.CDMDataReader.get(CDMDataReader.scala:19)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.next(DataSourceRDD.scala:59)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Expected/desired behavior

The data gets parsed correctly and the notebook runs without errors.

OS and Version?

Windows 10

Versions

Spark 2.4.3, Scala 2.11

Mention any other details that might be useful

The type of the columns causing the problem is DateTime in the CDM schema. When reading the data within Databricks, the columns are assigned the Date type (without the time part) even though Timestamp would be more appropriate.

I tried setting the schema on read myself (not allowed because of CDM).
I tried setting the DateFormat and TimestampFormat settings on read (no effect).
I also tried converting the columns to Timestamp or String. Basically, anything I try to do with the DataFrame results in the same error.

Links in Tutorial PDF not working

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Links in Tutorial PDF not working

Minimal steps to reproduce

Links in Tutorial PDF not working

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

Python Integration does not support manifest file format

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Cannot load manifest.cdm.json file types

Any log messages given by the failure

Expected/desired behavior

Use both model.json and manifest.json schema definitions

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

Azure Explorer and access management

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

While creating permissions for the Power BI service, Storage Explorer is not able to recognize the object IDs.

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

cdm-customer-classification-demo doesn't work

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run the cdm-customer-classification-demo.ipynb notebook.

Also, there are incorrect references in the notebook. You have:

```python
sales_invoice_line_df = read_from_adls_with_cdm_format(model.entities["Sales InvoiceLines"], "cdm")
sales_invoice_df = read_from_adls_with_cdm_format(model.entities["Sales Invoices"], "cdm")
```

when it should read:

```python
sales_invoice_line_df = read_from_adls_with_cdm_format(model.entities["Sales OrderLines"], "cdm")
sales_invoice_df = read_from_adls_with_cdm_format(model.entities["Sales Orders"], "cdm")
```

Any log messages given by the failure


TypeError Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
9 #sales_invoice_line_df = read_from_adls_with_cdm_format(model.entities["Sales OrderLines"], "cdm")
10 #sales_invoice_line_df
---> 11 sales_invoice_df = read_from_adls_with_cdm_format(model.entities["Sales Orders"], "cdm")
12 #sales_invoice_df
13

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

The display(newSalesBuyingGroupsDf) method on databricks notebook runs forever

The display(newSalesBuyingGroupsDf) method on the Databricks notebook runs forever.

The previous read:

```python
salesBuyingGroupsDf = (spark.read.format("com.microsoft.cdm")
                       .option("cdmModel", inputLocation)
                       .option("entity", "Sales BuyingGroups")
                       .option("appId", appID)
                       .option("appKey", appKey)
                       .option("tenantId", tenantID)
                       .load())
```

does return the metadata from the model.json file:

BuyingGroupID: long
BuyingGroupName: string
LastEditedBy: long
ValidFrom: date
ValidTo: date

But display for the DataFrame does not work. I have also verified that the data lake folder has the files.
