
Azure Data Explorer Microhack (Preview)

The Kusto product group and the Microsoft Global Black Belt team are pleased to present this challenge-based, collaboration-driven, discover-by-doing learning experience. The Microhack is divided into three parts to give participants enough time to understand the key concepts of Azure Data Explorer effectively.

Scenario

Contoso is a supply chain logistics company that runs a fleet of ships, trucks, and cargo planes to transport and deliver goods around the world. Some of the world’s largest enterprises rely on Contoso’s logistics capabilities to deliver goods to their end customers. Contoso has invested in connecting its fleet with sensors that measure temperature, pressure, humidity, tilt, shock, and light exposure. These sensors emit telemetry data every minute, property data whenever there is a change in a device property, and command data whenever a new command is executed.

Contoso is looking for a data storage and analytics solution that offers out-of-the-box integration with Azure IoT services such as IoT Hub and Event Hubs, and that can also read data from storage accounts. Contoso is developing a SaaS application that will allow its customers to track, trace, and monitor their shipments. Contoso wants to offer out-of-the-box visualizations with interactive capabilities so its customers can drill in and out of the data. Customers will be able to view and analyze the last 6 months of data, and Contoso will retain every customer’s data for up to 1 year. Contoso also wants visualizations to load blazingly fast. This MicroHack walks through the steps of designing, creating, and configuring Azure Data Explorer clusters with these requirements in mind. Once the cluster is deployed, the MicroHack covers ingesting data into ADX databases and tables using various integration methods, such as One-Click ingestion.

Pre-requisites

  • An Azure subscription
  • (Not applicable for proctor-led events) Deploy an IoT Central application, create simulated devices, and create data exports to Event Hubs and Storage Accounts (use this guide to create this infrastructure). For proctor-led events, this infrastructure has been pre-created for you; the proctor will provide connection strings or SAS tokens at the appropriate stage of the hack.
  • Authorization to create an Azure Data Explorer cluster or Synapse Data Explorer Pool

Overview - The microhack architecture

The following architecture has been deployed for you, except the ADX cluster and its integration with other Azure services. IoT Central acts as the source of telemetry generated by Contoso’s sensors installed on its fleet of trucks, vessels, and airplanes. Telemetry data is streamed on a continuous basis to the Event Hub. Device logs, device property changes and commands executed on the devices are stored in a Storage Account as blobs.

Screen capture 1

Deployment Instructions

You can deploy the aforementioned architecture using the steps mentioned below:

On the Azure Cloud Shell run the following commands to deploy the solution:

  1. Log in to Azure:
az login

Note: You must complete this step, or the script will fail when connecting to IoT Central.

  2. If you have more than one subscription, select the appropriate one:
az account set --subscription "<your-subscription>"

  3. Get the latest version of the repository:
git clone https://github.com/MSUSSolutionAccelerators/ADX-IoT-Analytics-Solution-Accelerator.git

Optionally, you can update the iotanalyticsLogistics.parameters.json file to personalize your deployment.

  4. Deploy the solution:
cd ADXIoTAnalytics
. ./deploy.sh

  5. Choose option 2 from the deployment options provided.

What is Azure Data Explorer and when is it a good fit?

Azure Data Explorer is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real time. The Azure Data Explorer toolbox gives you an end-to-end solution for data ingestion, query, visualization, and management.

By analyzing structured, semi-structured, and unstructured data across time series, and by using Machine Learning, Azure Data Explorer makes it simple to extract key insights, spot patterns and trends, and create forecasting models. Azure Data Explorer is scalable, secure, robust, and enterprise-ready, and is useful for log analytics, time series analytics, IoT, and general-purpose exploratory analytics.

Azure Data Explorer capabilities are extended by other services built on its powerful query language, including Azure Monitor Logs, Application Insights, Time Series Insights, and Microsoft Defender for Endpoint.

Generally speaking, when you interact with Azure Data Explorer, you're going to go through the following workflow (ADX Microhacks will cover all these steps):

  1. Create an ADX cluster: To use Azure Data Explorer you first create a cluster. An Azure Data Explorer cluster is the most basic unit.
  2. Create a database: Each cluster contains one or more databases. Each Azure Data Explorer cluster can hold up to 10,000 databases, and each database can hold up to 10,000 tables.
  3. Ingest data: Load data into database tables so that you can run queries against it. Azure Data Explorer supports several ingestion methods.
  4. Query data: Azure Data Explorer uses the Kusto Query Language, which is an expressive, intuitive, and highly productive query language. It offers a smooth transition from simple one-liners to complex data processing scripts, and supports querying structured, semi-structured, and unstructured (text search) data. Use the web application to run, review, and share queries and results. You can also send queries programmatically (using an SDK) or to a REST API endpoint.
  5. Visualize results: Use different visual displays of your data in the native Azure Data Explorer Dashboards. You can also display your results using connectors to some of the leading visualization services, such as Power BI and Grafana.

Microhack 1: Cluster Creation and Data Ingestion (Preview)

This Microhack is organized into the following challenges:

  • Challenge 1: Create an ADX cluster
  • Challenge 2: Create integration with Azure services (Event Hub and Storage Account)
  • Challenge 3: Starting with the basics of KQL

Each challenge has a set of tasks that need to be completed in order to move on to the next challenge. It is advisable to complete the challenges and tasks in the prescribed order.

Challenge 1: Create an ADX cluster

To use Azure Data Explorer (ADX), you first have to create an ADX cluster, and create one or more databases in that cluster. Each database has tables. Then you can ingest data into a database so that you can run queries against it.

In this challenge, you will design an ADX-based architecture and create an ADX cluster and database. In addition, you will become familiar with two tools that enable you to connect to Azure Data Explorer and run queries.

Expected Learning Outcomes:

  • Deploy ADX cluster from Azure Portal
  • Use ADX clients such as Kusto Web Explorer and Kusto Explorer
  • Learn the initial configuration options of the cluster at creation time

Task 1: Create an ADX cluster resource

Sign in to the Azure portal and select the + Create a resource button in the upper-left corner of the portal’s main page.

  • Search for Azure Data Explorer. Under Azure Data Explorer, select Create.

  • Fill out the basic cluster details with the following information.

Screen capture 1

  • Subscription: Use your own subscription
  • Resource Group: It's recommended to create a new resource group for the microhack's resources. Call it: -microhack-RG
  • Cluster name: Must be unique for each participant. Call it: microhackadx (cluster name must begin with a letter and contain lowercase alphanumeric characters.)
  • Region: France Central
  • Enable performance update (EngineV3): keep the default (enabled)
  • Compute specification: For a production system, select the specification that best meets your needs (storage optimized or compute optimized). For this Microhack we can use the Dev (No SLA) SKU.
    With various compute SKU options to choose from, you can optimize costs for the performance and hot-cache requirements for your scenario. If you need the most optimal performance for a high query volume, the ideal SKU should be compute-optimized. If you need to query large volumes of data with relatively lower query load, the storage-optimized SKU can help reduce costs and still provide excellent performance. You can read more about ADX’s SKU types here.
  • Availability zones: keep the default. Move to the next tab (“Scale”). Choose how to scale your resource. Select the “Optimized autoscale” option. It’s always recommended to use this option. Optimized Autoscale is a built-in feature that helps clusters perform their best when demand changes. Optimized Autoscale enables your cluster to be performant and cost effective by adding and removing instances based on demand. For this microhack keep the default values (Minimum instance count == 2, Maximum instance count == 3)

You can keep all the other configurations with the default values. Select Review + create to review your cluster details, then select Create to provision the cluster. Creating an ADX cluster typically takes 10-15 minutes.

When the deployment is complete, select Go to resource. You will be redirected to the ADX cluster resource page. At the top of the Overview page, you can see the basic details of the cluster, such as the subscription, the state (Running), and the URI.

Task 2: Create a Database
  • You're now ready for the second step in the process: database creation.

  • On the Overview tab, select Create database. Alternatively, you can go to the “Databases” blade.

    Screen capture 1

  • Fill out the form with the following information.

  • Admin: Default selected. The admin field is disabled; new admins can be added after database creation.
  • Database name: TelemetryDatabase. The database name must be unique within the cluster.
  • Retention period: 365. The time span (in days) for which it's guaranteed that the data is kept available to query, measured from the time the data is ingested. This is the longer-term (reliable storage) retention.
  • Cache period: 31. The time span (in days) for which to keep frequently queried data available in the SSD storage or RAM of the cluster’s VMs, rather than only in longer-term storage. Azure Data Explorer stores all its ingested data in reliable storage (most commonly Azure Blob Storage), away from its actual processing (Azure Compute) nodes. To speed up queries on that data, Azure Data Explorer caches it, or parts of it, on its processing nodes in SSD or even RAM. The best query performance is achieved when all ingested data is cached. Sometimes, certain data doesn't justify the cost of keeping it "warm" in local SSD storage. For example, many teams consider rarely accessed older log records to be of lesser importance and prefer reduced performance when querying this data rather than paying to keep it warm all the time. Increasing the cache policy means more VMs are required to store data on their SSD/RAM, and for an Azure Data Explorer cluster, compute cost (VMs) is the most significant part of cluster cost compared to storage and networking.
  • Select Create to create the database. Creation typically takes less than a minute. When the process is complete, you're back on the cluster Overview blade. You can see the database that you created on the Databases blade.

    Screen capture 1
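The retention and cache periods you set here correspond to database-level policies. If you prefer commands over the portal, a minimal sketch of roughly equivalent management commands (assuming the database name above) would be:

// Keep data available to query for 365 days (soft-delete period)
.alter database TelemetryDatabase policy retention '{"SoftDeletePeriod": "365.00:00:00", "Recoverability": "Enabled"}'

// Keep the most recent 31 days of data in the hot cache (SSD/RAM)
.alter database TelemetryDatabase policy caching hot = 31d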

Task 3: Write your first Kusto Query Language (KQL) query

What is a Kusto query? Azure Data Explorer provides a web experience that enables you to connect to your Azure Data Explorer clusters and write and run Kusto Query Language queries. The web experience is available in the Azure portal and as a stand-alone web application, the Azure Data Explorer Web UI, which we will use later.

A Kusto query is a read-only request to process data and return results. The request is stated in plain text that's easy to read. A Kusto query has one or more query statements and returns data in a tabular or graph format. In the next challenges, we'll ingest data into the cluster and then learn the most important concepts in KQL and write interesting queries.

In this task, you will write a few basic queries to get an understanding of the environment. To start, go to the “Query” blade. In this example, you'll use the Azure Data Explorer web interface as a query editor. (Kusto Query Language can also be used in Azure Monitor Logs, Azure Sentinel, and other services that are built on top of Azure Data Explorer.)

Screen capture 1

We can see our cluster and the database that we created. To run KQL queries, we must select the database that the query will run on (the scope). To select the database, just click on the database name. Now you can write a simple KQL query: print ("hello world"), and hit the “Run” button. The query will be executed and its result can be seen in the result grid at the bottom of the page.
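For reference, here is that first query on its own line, ready to paste into the query editor, along with a control command you can try to list the (currently empty) tables in the database:

print ("hello world")

// List the tables in the selected database
.show tables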

Screen capture 1

You can also download Kusto Explorer, the desktop client to run queries and benefit from some advanced features available in the client.

Task 4: Enable Diagnostic logs

Azure Monitor diagnostic logs provide monitoring data about the operation of Azure resources. ADX uses diagnostic logs for insights on ingestion, commands, queries, and table usage. You can export operation logs to Azure Storage, Event Hub, or Log Analytics. Diagnostic logs are disabled by default. To enable them, go to your cluster page in the portal and, under Monitoring, select Diagnostic settings.

Screen capture 1

Select Add diagnostic setting. In the Diagnostic settings window, enter a diagnostic setting name of your preference. Select all the log categories and metrics (SucceededIngestion, FailedIngestion, IngestionBatching, Command, Query, TableUsageStatistics, TableDetails, and Journal). For this microhack, set the destination to a Log Analytics workspace and select your own workspace or create a new one. Save the new diagnostic settings.

Challenge 2: Create integration with Azure services (Event Hub and Storage Account)

Data ingestion to ADX is the process used to load data records from one or more sources into a table in your ADX cluster. Once ingested, the data becomes available for query.

ADX supports several ingestion methods. These methods include ingestion tools, connectors and plugins, managed pipelines, programmatic ingestion using SDKs, and direct access to ingestion.

Expected Learning Outcomes:

  • Create continuous ingestion from Azure Event Hub (a managed pipeline)
  • Create one-time ingestion from Azure Blob Storage to your ADX cluster.

Task 1: Use the “One-click” UI (user interface) to create a data connection to Event Hub

For the best user experience, we will use the Azure Data Explorer Web UI (also known as Kusto Web Explorer, or KWE). To open it, click “Open in Web UI”, or just go to Kusto Web Explorer.

Screen capture 1

The web UI opens. Messages in JSON format are being sent to your event hub. This is what a sample message looks like:

{
  "messageProperties": {
    "iothub-creation-time-utc": "2021-12-22T11:16:58.668Z"
  },
  "enrichments": {},
  "applicationId": "1b2f5f29-a78b-4012-bf31-2016473cadf6",
  "deviceId": "13n5b9yiael",
  "messageSource": "telemetry",
  "telemetry": {
    "Status": "Online",
    "BatteryLife": 52,
    "Light": 62602.66864777621,
    "Tilt": 44.70275959833819,
    "Shock": -6.381560743718394,
    "ActiveTags": 164,
    "Location": {
      "lon": -68.4585,
      "lat": 40.9633,
      "alt": 1069.4617
    },
    "TransportationMode": "Train",
    "LostTags": 5,
    "Temp": 13.178830710467722,
    "Humidity": 91.17280445807984,
    "Pressure": 1033.3527307505506,
    "TotalTags": 187
  },
  "schema": "default@v1",
  "enqueuedTime": "2021-12-22T11:16:58.753Z",
  "templateId": "dtmi:ltifbs50b:mecybcwqm"
}

KWE lets us easily connect to Azure Event Hub and build a table whose schema is based on sample event data. Go to the “Data” blade. This capability is called “One-click ingestion” and it allows you to quickly ingest data.

Task 2: Configure the Event hub data connection

The ’Ingest new data’ wizard opens.

Destination tab: The Cluster and Database fields are auto-populated. Select the ADX cluster and the Database that you created in challenge 1. We haven’t created a table, so use the “Create new table” option. In order to make it easier to query, we recommend using a table name without hyphens, '-'.
The table will be named LogisticsTelemetry.

Source tab: Set the Source type to “Event Hub”, and specify the event hub details:

  • Subscription: Your event hub's subscription

  • Event hub namespace: Your event hub's namespace

  • Event hub: Your event hub name

  • Data connection name: Set a name for your data connection. We used ‘Database1-adx-microhack-eh’. The data connection connects the ADX database to the Event Hub (or to a storage account through Event Grid notifications).

  • Consumer group: you can use the default one

  • Compression: None

  • Event system properties: leave empty. For this Microhack, we are not going to use them. (System properties store metadata set by the Event Hubs service at the time the event is enqueued. ADX can embed selected properties into new columns in your destination table.)

    Click on “Next: Schema”

    Schema tab:

    Sample data will be read from the Event Hub and you will see a data preview. The default format is TXT, but our event hub sends JSON data. Change the Data format field to JSON. Keep the nested levels at 1. As you can see, ADX inferred the column names and data types from the JSON data. Among the types you can find GUID, string, datetime, and dynamic.
    You can think of dynamic column as a JSON-like type. The dynamic data type can take on any value of other scalar data types like: bool, datetime, guid, int, long, real, string, timespan, an array of dynamic values, and a property bag (dictionary) that maps unique string values to dynamic values.
    Although the dynamic type appears JSON-like, it can hold values that the JSON model does not represent because they don't exist in JSON (e.g., long, real, datetime, timespan, and guid).

The 'Nested levels' field expands levels of nested data in dynamic-type columns into separate columns. Although the raw event’s JSON has two levels of nesting, for this microhack we will use 1 level and later see how to leverage ADX’s powerful update policy capability to break these dynamic columns out into separate columns.

This is an example of the telemetry JSON (which is part of a bigger JSON document that is sent from the event hub):

{"Location":{"alt":"252.71910000000003","lon":-93.2176,"lat":41.7911},"LostTags":4,"Light":"49115.368835522917","Temp":"32.93780098864795","TotalTags":186,"Status":"Offline","TransportationMode":"Land","BatteryLife":-5,"Tilt":"-52.64112596209344","Humidity":"74.018336518734131","Shock":"6.696957328744805","Pressure":"603.69265616418761","ActiveTags":170}

Screen capture 1

If you set the Nested levels to 2, ADX will “break” the JSON into independent fields.

Screen capture 1

For this microhack, we want to learn how ADX deals with dynamic fields, so we will keep the Nested levels as 1.

Screen capture 1

Open the command viewer. You can see the control commands that were automatically generated. In contrast to Kusto queries, control commands are requests to Kusto to process or modify data or metadata. Control commands are distinguished from queries by the first character in the command text being a dot (.), which can't start a query. Not all control commands modify data or metadata; the large class of commands that start with .show is used to display metadata or data. For example, the .show tables command returns a list of all tables in the current database.

Review the control commands that were generated by One-Click.

Screen capture 1

You can see the .create table <table_name> command, which creates a new empty table, and the .alter table <table_name> policy ingestionbatching command, which alters the batching policy of the specified table. During the ingestion process, ADX optimizes for throughput by batching small ingress data chunks together before ingestion. Batching reduces the resources consumed by the ingestion process. The batching policy defines when to seal a batch and send it for the next stage of ingestion (as soon as the first of the following conditions is met):

  • Size (1 GB): the batch size limit was reached or exceeded
  • Count (1000 files): the batch file (blob) count limit was reached
  • Time (5 minutes): the configured batching time has expired (the maximum delay per batch; One-Click sets the batching time to 30 seconds)

The table mapping (data mapping) command: Data mappings are used during ingestion to map incoming data to columns inside tables. In our case, the incoming data arrives from the event hub in JSON format. The table mapping maps the JSON fields (by describing the path to elements in a JSON document) to our table’s columns. The desired result:

Screen capture 1
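For reference, the generated commands look roughly like the following sketch. The exact column list, batching values, and mapping are whatever the wizard inferred from your sample events, so treat these names and values as illustrative:

// Create the empty destination table
.create table LogisticsTelemetry (applicationId: guid, deviceId: string, messageSource: string, telemetry: dynamic, enqueuedTime: datetime, templateId: string)

// Seal ingestion batches after 30 seconds instead of the default 5 minutes
.alter table LogisticsTelemetry policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'

// Map incoming JSON paths to the table's columns (illustrative subset of columns)
.create table LogisticsTelemetry ingestion json mapping 'LogisticsTelemetry_mapping' '[{"column": "deviceId", "Properties": {"Path": "$.deviceId"}}, {"column": "telemetry", "Properties": {"Path": "$.telemetry"}}, {"column": "enqueuedTime", "Properties": {"Path": "$.enqueuedTime"}}]'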

Click “Next: Summary” to create the data connection.

The Azure portal opens on the Event Hubs instance page so that you can monitor the event hub’s outgoing messages. In addition, in One-Click’s “Continuous ingestion from Event Hub established” window, all steps should be marked with green check marks when the connection is established successfully. The cards below these steps give you options to explore your data with quick queries or to monitor the Event Hub connection and data.

Use the “Take 10” link (under “Quick queries”). Review the query and the data.
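The generated quick query is essentially the following (assuming the table name chosen earlier):

LogisticsTelemetry
| take 10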

Notice that the query begins with a reference to the data table. This data is piped into the first and only operator in our query (take) and returns a specific number of arbitrary rows. Run the query by either selecting the Run button above the query window or selecting Shift+Enter on the keyboard. Use the “Manage Data Connection” link (under the “Monitor” section) to go to the portal and review your data connection. The data connection is saved under the Database.

Task 3: Use the “One-click” UI (User Interfaces) to create a data connection to Azure blob storage

This time, we will ingest data from an Azure Storage account. We will ingest two datasets:
1. Logistics telemetry data. This time, the table will be named LogisticsTelemetryHistorical.
2. Data on New York City taxi rides, which will be used for Microhack 2

Go again to the “Data management” tab, and select the Ingest from blob container option under Continuous ingestion

Make sure the cluster and the Database fields are correct, and select Create new table.

In the Link to source field, paste the SAS URL of the blob storage (the proctors will provide this information). Then select one of the files as the schema-defining file (all the files in that blob container have the same schema) and click Next.

Screen capture 1

Make sure you use the JSON Data format

Screen capture 1

Wait for the ingestion to complete. For production scenarios, you could use Azure Event Grid for continuous blob ingestion; the Event Grid option under Continuous ingestion creates the Event Grid resource for that. We won't use this option in this Microhack.

Verify that data was ingested to the table

LogisticsTelemetryHistorical
| count 

Repeat the above steps for ingesting data from the New York City container.

Relevant docs for this challenge:

Challenge 3: Starting with the basics of KQL

In this challenge you’ll write queries in Kusto Query Language (KQL) to explore and gain insights from your data.

Expected Learning Outcomes:

  • Know how to write queries with KQL.
  • Use KQL to explore data by using the most common operators.

What is a Kusto query? A Kusto query is a read-only request to process data and return results. The request is stated in plain text that's easy to read, author, and automate. A Kusto query has one or more query statements and returns data in a tabular or graph format.

What is a tabular statement? The most common kind of query statement is a tabular expression statement. Both its input and its output consist of tables or tabular datasets.

Tabular statements contain zero or more operators. Each operator starts with a tabular input and returns a tabular output. Operators are sequenced by a pipe (|). Data flows, or is piped, from one operator to the next. The data is filtered or manipulated at each step and then fed into the following step.

It's like a funnel, where you start out with an entire data table. Each time the data passes through another operator, it's filtered, rearranged, or summarized. Because the piping of information from one operator to another is sequential, the query's operator order is important. At the end of the funnel, you're left with a refined output. Let's look at an example query:

LogisticsTelemetryHistorical
| where enqueuedTime > ago(7d) 
| where messageSource == "telemetry"
| count 

This query has a single tabular expression statement. The statement begins with a reference to the table LogisticsTelemetryHistorical and contains the operators where and count. Each operator is separated by a pipe. The data rows from the source table are filtered by the value of the enqueuedTime column and then by the value of the messageSource column. In the last line, the query returns a table with a single column and a single row that contains the count of the remaining rows.

References:

Task 1: Basic KQL queries - explore the data

In this task, you will see some KQL examples. For this task, we will use the table LogisticsTelemetry, which obtains data from the event hub.
Execute the queries and view the results. KQL queries can be used to filter data and return specific information. Now, you'll learn how to choose specific rows of the data. The where operator filters results that satisfy a certain condition.

LogisticsTelemetry
| where deviceId startswith "x"
| take 10

Similarly, you can filter where the time of an event occurred more than a certain number of years/days/minutes ago. For example, run the following query, where 2m means 2 minutes:

LogisticsTelemetry
| where enqueuedTime > ago(2m)
| take 10 

Find out how many records are in the table

LogisticsTelemetry
| summarize count() // or: count

Find out how many records have enqueuedTime within the last 10 minutes.

LogisticsTelemetry
| where enqueuedTime > ago(10m)
| summarize count()

Find out how many records have deviceId that startswith "x"

LogisticsTelemetry
| where deviceId startswith "x"
| summarize count()

Find out how many records have deviceId that startswith "x", per device ID (aggregate by device ID)

LogisticsTelemetry
| where deviceId startswith "x"
| summarize count() by deviceId

Find out how many records have deviceId that startswith "x", per device ID (aggregated by device ID), and render a piechart

LogisticsTelemetry
| where deviceId startswith "x"
| summarize count() by deviceId
| render piechart 

KQL makes it simple to access fields in JSON and treat them like an independent column:

LogisticsTelemetry
// | where enqueuedTime > ago(10d) 
| extend h = telemetry.Humidity
| summarize avg(toint(h)) by bin(enqueuedTime, 1h)
| render timechart 

For the following tasks, we will use the table LogisticsTelemetryHistorical.

Task 2: Explore the table and columns

Write a query to get the schema of the table.

Expected result:

getschema operator

Task 3: Keep the columns of your interest

Write a query to get only specific desired columns: deviceId, enqueuedTime, Temp. Take arbitrary 10 records.

Expected result:

project-away operator - Azure Data Explorer | Microsoft Docs

Project operator - Azure Data Explorer | Microsoft Docs

Task 4: Filter the output

Write a query to get only specific desired columns: deviceId, enqueuedTime, Temp. Take arbitrary 10 records from the past 90 days.

Hint 1: “ago”
Hint 2: In case you see 0 records, remember that operators are sequenced by a pipe (|). Data is piped, from one operator to the next. The data is filtered or manipulated at each step and then fed into the following step. By using the ‘Take’ operator, there is no guarantee which records are returned

where operator in Kusto Query Language - Azure Data Explorer | Microsoft Docs
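One possible shape for the query, shown as a hedged sketch (it assumes your ingested records fall within the last 90 days):

LogisticsTelemetryHistorical
| where enqueuedTime > ago(90d)
| project deviceId, enqueuedTime, Temp
| take 10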

Task 5: Sorting the results

Write a query to get the 5 records with the highest temperature. Write another query to get the 5 records with the lowest temperature.

sort operator - Azure Data Explorer | Microsoft Docs

Task 6: Reorder, rename, add columns

Write a query to convert Fahrenheit temperatures to Celsius. For readability, show the Fahrenheit and Celsius temperatures as the two left-most columns. You can use the following formula: C = (F - 32) * (5.0/9.0).
Take 5 random records from the past week. Hint 1: the 'project' operator provides a lot more features. Hint 2: We used 5.0 and 9.0, rather than 5 and 9, to ensure these numbers are treated as the 'real' data type (double-precision floating-point) rather than 'long' (a signed 64-bit integer).

Expected result:

extend operator project-rename operator project-reorder operator

Task 7: Total number of records

Write a query to find out how many records are in the table.

count operator - Azure Data Explorer | Microsoft Docs

Task 8: Aggregations and string operations

Write a query to find out how many records have deviceId starting with 'x'.
Write another query to find out how many records have deviceId starting with 'x', per device ID (aggregated by deviceId).
Expected result for the second query:

String operators - Azure Data Explorer | Microsoft Docs

summarize operator - Azure Data Explorer | Microsoft Docs

Task 9: Render a chart

Write a query to find out how many records have deviceId starting with "x", per device ID (aggregated by device ID), and render a piechart.

Expected result:

render operator - Azure Data Explorer | Microsoft Docs

Task 10: Create bins and visualize time series

Write a query to show a timechart of the number of records over time. Use 10-minute bins (buckets). Each point on the timechart represents the number of records in that bucket.

Expected result:

bin() - Azure Data Explorer | Microsoft Docs

Task 11: Aggregations with time series visualizations

Write a query to show a timechart of the average temperature over time. Use 30-minute bins (buckets). Each point on the timechart represents the average temperature in that 30-minute period. Hint: summarize avg(Temp) by bin(enqueuedTime, 30m)

Expected result:

summarize operator
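One possible solution, building directly on the hint above:

LogisticsTelemetryHistorical
| summarize avg(Temp) by bin(enqueuedTime, 30m)
| render timechart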

Microhack 2: Data exploration and visualization with KQL (Preview)

This Microhack is organized into the following 3 challenges:

  • Challenge 5: Explore and transform data
  • Challenge 6: Advanced KQL operators
  • Challenge 7: Visualisation

Each challenge has a set of tasks that need to be completed in order to move on to the next challenge. It is advisable to complete the challenges and tasks in the prescribed order.

Challenge 5: Explore and transform data

Expected Learning Outcomes:

  • Create an update policy to transform the data at ingestion time

For the next task, we will use the LogisticsTelemetry table (which obtains data from the Event Hub).

Task 1: Create an update policy

By taking 10 records, we can see that the telemetry column has a JSON structure. In this task, we will use an update policy to manipulate the raw data in the LogisticsTelemetry table (the source table) and transform the JSON data into separate columns, which will be ingested into a new table that we’ll create (the target table).
An update policy is like an internal ETL. It can help you manipulate or enrich the data as it gets ingested into the source table (for example, extracting JSON into separate columns, creating a new calculated column, joining the new records with a static dimension table that is already in your database, and so on). For these cases, using an update policy is a very common and powerful practice.
Each time records get ingested into the source table, the update policy's query (which we'll define in the update policy) runs on them (and only on the newly ingested records; other existing records in the source table aren’t visible to the update policy when it runs), and the results of the query are appended to the target table.
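For reference, attaching an update policy to a target table uses a command of roughly the following shape. The table and function names here are illustrative placeholders, not the solution to the task below:

// Illustrative only: run TransformFunction() over newly ingested SourceTable records
// and append the output to TargetTable
.alter table TargetTable policy update
@'[{"IsEnabled": true, "Source": "SourceTable", "Query": "TransformFunction()", "IsTransactional": false, "PropagateIngestionProperties": false}]'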
We want to create a new table, with a calculated column (we will call it: NumOfTagsCalculated) that contains the following value: telemetry.TotalTags + telemetry.TotalTags - telemetry.LostTags.

The schema of the new (destination) table would be:

  ( deviceId:string, enqueuedTime:datetime, NumOfTagsCalculated:long, Temp:real)

Screen capture 1

Example (note that the order of the keys may be different):

{
  "BatteryLife": 73,
  "Light": "70720.236143472212",
  "Tilt": "18.608539012789223",
  "Humidity": "60.178854625386215",
  "Shock": "-4.6141182265359628",
  "Pressure": "529.61165751122712",
  "ActiveTags": 165,
  "TransportationMode": "Ocean",
  "Status": "Online",
  "LostTags": 9,
  "Temp": "7.5054504554744765",
  "TotalTags": 185,
  "Location": {
      "alt": 1361.0469,
      "lon": -107.7473,
      "lat": 36.0845
  }
}

Build the target table

.create table LogisticsTelemetryManipulated  (deviceId:string, enqueuedTime:datetime, NumOfTagsCalculated:long, Temp:real) 

Create a function for the update policy

.create-or-alter function ManipulateLogisticsTelemetryData()
{ 
     <Complete the query>
} 

Create the update policy

     <Complete the command>

Make sure the data is transformed correctly in the destination table

    LogisticsTelemetryManipulated
   | take 10

Relevant docs for this challenge:

Challenge 6: Going more advanced with KQL

Task 1: Declaring variables

Use a 'let' statement to create a list of the 10 device Ids which have the highest Shock. Then, use this list in a following query to find the average temperature of these 10 devices.

You can use the 'let' statement to set a variable name equal to an expression or a function. let statements are useful for:

  • Breaking up a complex expression into multiple parts, each represented by a variable.
  • Defining constants outside of the query body for readability.
  • Defining a variable once and using it multiple times within a query.

Hint 1: in operator - Azure Data Explorer | Microsoft Docs

Hint 2: let - Azure Data Explorer | Microsoft Docs

Hint 3: Remember to include a ";" at the end of your let statement.
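A hedged sketch of one possible approach (it interprets "highest Shock" as the highest maximum shock reading per device; adjust if you read the task differently):

let topShockDevices =
    LogisticsTelemetryHistorical
    | summarize maxShock = max(Shock) by deviceId
    | top 10 by maxShock desc
    | project deviceId;
LogisticsTelemetryHistorical
| where deviceId in (topShockDevices)
| summarize avg(Temp)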

Task 2: Add more fields to your timechart

Write a query to show a timechart of the number of records, by TransportationMode. Use 10 minute bins.

Expected result:

Task 3: Some geo-mapping

Write a query to show on map the locations (based on the longitude and latitude) of 10 devices with the highest temperature from the last 90 days.
Hint 1: 'top' operator
Hint 2: render scatterchart with (kind = map)

Once the map is displayed, you can click on the locations. Note that in order to show more details in the balloon, you need to change the render phrase to include 'series='.

render operator with scatter chart

Expected result:

Task 4: Range

Range is a tabular operator: it generates a single-column table of values, whose values are start, start + step, ..., up to and including stop. Run the following query and review the results:

range MyNumbers from 1 to 8 step 2

Range also works with dates: range LastWeek from ago(7d) to now() step 1d

We will use the range operator as part of the time series creation in the next tasks.

Machine learning with Kusto and time series analysis

Many use cases apply machine learning algorithms to derive insights from telemetry data. Often, these algorithms require a strictly structured dataset as their input. The raw log data usually doesn't match the required structure and size. We will see how we can use the make-series operator to create well-curated data (time series).

Then, we can use built-in functions like series_decompose_anomalies. Anomalies/outliers will be detected by the Kusto service and highlighted as red dots on the time series chart.

Time series: What is a time series? A time series is a collection of observations of well-defined data items obtained through repeated measurements over time and listed in time order. Most commonly, the data points are consistently measured at equally spaced intervals. For example, measuring the temperature of the room each minute of the day would comprise a time series. Data collected irregularly is not a time series.

What is time series analysis?

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting, for example, is the use of a model to predict future values based on previously observed values.

What is time series decomposition? Time series decomposition involves thinking of a series as a combination of 4 components:

  • trends (increasing or decreasing value in the series)
  • seasonality (repeating short-term cycle in the series)
  • baseline (the predicted value of the series, which is the sum of seasonal and trend components)
  • noise (the residual random variation in the series)

We can use built-in functions that use time series decomposition to forecast future metric values and/or detect anomalous values.

This is what a time series looks like:

Screen capture 1

Why should you use series instead of the summarize operator?

The summarize operator does not add "null bins" (rows for time-bin values for which there's no corresponding row in the table), so it's a good idea to "pad" the table with those bins. Advanced built-in ML capabilities like anomaly detection need the data points to be consistently measured at equally spaced intervals. The make-series operator can create such a "complete" series.

Task 4: Anomaly detection

Write a query to create an anomaly chart of the average shock.

For this task, we will provide more instructions:

To generate these series, start with:

let min_t = (toscalar(LogisticsTelemetryHistorical | summarize min(enqueuedTime)));
let max_t = (toscalar(LogisticsTelemetryHistorical | summarize max(enqueuedTime)));
let step_interval = 10m;
LogisticsTelemetryHistorical
| make-series avg_shock_series=avg(Shock) on (enqueuedTime) from (min_t) to (max_t) step step_interval 

Now, we will use this avg_shock_series and run series_decompose_anomalies. This built-in function takes an expression containing a series (dynamic numerical array) as input, and extracts anomalous points with scores.

| extend anomalies_flags = series_decompose_anomalies(avg_shock_series, 1) 
| render anomalychart  with(anomalycolumns=anomalies_flags, title='avg shock anomalies') 

The anomalies/outliers can be clearly spotted in the 'anomalies_flags' points.
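Putting the pieces above together, the complete query is:

let min_t = (toscalar(LogisticsTelemetryHistorical | summarize min(enqueuedTime)));
let max_t = (toscalar(LogisticsTelemetryHistorical | summarize max(enqueuedTime)));
let step_interval = 10m;
LogisticsTelemetryHistorical
| make-series avg_shock_series=avg(Shock) on (enqueuedTime) from (min_t) to (max_t) step step_interval
| extend anomalies_flags = series_decompose_anomalies(avg_shock_series, 1)
| render anomalychart with(anomalycolumns=anomalies_flags, title='avg shock anomalies')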

make-series
ADX Anomaly Detection

Expected result:



For the next tasks, we will use the NYC taxi data.

If the proctor hasn't provided the data set, use this Azure Open Dataset on NYC Taxi Rides to ingest this data into your ADX cluster.

Task 5: Get familiar with the new table and create a piechart

Write some queries to get familiar with this table. Once familiar, write a query to create a piechart of the payment types. Use 'tostring' to convert payment_type to a string before rendering the piechart.

Expected result:
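One possible sketch, assuming the table was ingested with the name taxi and the payment code column is named payment_type (adjust the names to match your ingestion):

taxi
| summarize rides = count() by payment = tostring(payment_type)
| render piechart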

Task 6: Datetime operations

Write a query to create a columnchart which will show the number of rides for each day of the week, across the entire data set. You can use 1, 2, ..., 7 to denote Sunday through Saturday.

dayofweek() - Azure Data Explorer | Microsoft Docs

Expected result:

Task 7: Multiple series on the same timechart

Write a query to find out if the tip amount correlates with the number of passengers in the taxi between 1 July 2021 and 31 July 2021. Restrict the number of passengers to a maximum of 4.

Expected result:

Task 8: Detect anomalies in the tip amount

Write a query to draw an anomaly chart for the tip amount in the month of July 2021.
Hint 1: make-series for the average tip amount, with 1 h steps
Hint 2: Use series_decompose_anomalies with this series and parameter of 5 (sensitivity level)

Expected result:

Task 9: External data

The externaldata operator returns a table whose schema is defined in the query itself, and whose data is read directly from an external storage artifact, such as a blob in Azure Blob Storage, a file in Azure Data Lake Storage, or even a file in a GitHub repository. Since the data is not ingested into ADX, it cannot be indexed, compressed, or stored in the hot cache. For best performance, we recommend that data be ingested. External data can, however, be used in sporadic cases where you do not want to ingest the data.
Take a look at this csv file: https://raw.githubusercontent.com/Azure/azure-kusto-microhack/main/assets/ExternalData/payment_type_lookup.csv. The file represents the mapping between the numeric code of the payment type and its description.
Here is how we can use KQL to handle this external data:

let payment_type_lookup_data = (externaldata (code:string,description:string)
[ @"https://raw.githubusercontent.com/Azure/azure-kusto-microhack/main/assets/ExternalData/payment_type_lookup.csv" ]
with(format="csv", ignoreFirstRecord=true))
| project tolong(code), description;
payment_type_lookup_data

externaldata operator - Azure Data Explorer | Microsoft Docs

Task 10: Let's join the party

The taxi rides table has a Payment_type field. This is a numeric code signifying how the passenger paid for the trip. Use the payment_type_lookup to join the payment code with its description. Use a leftouter join to merge the rows of the two tables into a new table by matching values of the payment code column.

Render a time chart of the number of records, per payment type over time, with 1 day bins, based on data between 2021-07-01 and 2021-07-31. What is the most common method of payment for rides? Credit cards or cash? What does it look like over time?

Expected result:

Task 11: Forecasting

Create a timechart that will show:

  • The number of rides during July 2021
  • A forecast of the number of rides for the first week of August, based on July 2021 (use the series_decompose_forecast function).

Hint: Start your query with:

let min_t = datetime(2021-07-01);
let max_t = datetime(2021-08-07); // Note that there is no data in the first week of August. We will forecast the data for this week.
taxi
| where tpep_dropoff_datetime between (min_t .. max_t) 
  • Make a series of the number of rides, on tpep_pickup_datetime between these dates. Use steps of 30 minutes.
  • Use series_decompose_forecast with this series as the first parameter and '24*7' as the second parameter. (The second parameter is an integer specifying the number of points at the end of the series to predict (forecast); these points are excluded from the learning (regression) process. We will use '24*7' additional data points in order to forecast a week forward.)
  • Once a series is created, you can render a timechart.

Expected result:

Challenge 7: Visualisation

Task 1: Prepare interactive dashboards with ADX Dashboard

Using the Dashboard feature of Azure Data Explorer, build a dashboard using the outputs of any 5 queries (on the LogisticsTelemetryHistorical table) that you created in the previous challenges, with the following improvements:

  • Add a filter to the dashboard so that the user can choose the timespan
  • Add a filter to the dashboard so that the user can choose the transportation mode

Include filters for the dashboard so that the queries do not need to be modified if the user wants to analyze the charts with different values of a parameter. For example, users would like to analyze the charts over the last week, the last 14 days as well as the last 1 month. Users would also like to analyze the charts by different transportation modes.

Hint 1: In the query window, explore the “Share” menu.

Screen capture 1

Task 2: Prepare a management dashboard with Power BI

Visualize the outputs of Task 5 and Task 6 in Challenge 6 in Power BI using DirectQuery mode.

Hint 1: In the query window, explore the “Share” menu.

There are multiple ways to connect ADX and Power BI depending on the use case. For this Microhack, we will use the DirectQuery method. Feel free to explore other methods in the docs.

Visualize data using the Azure Data Explorer connector for Power BI | Microsoft Docs

Visualize data using a query imported into Power BI | Microsoft Docs
