Azure Data Explorer Kafka Connector – new features with version 1.x

This article is contributed. See the original author and article here.

Azure Data Explorer has a mature, open source Kafka Connect sink connector that is in use in production by multiple customers. The connector was *gold* certified in September 2020 by Confluent and is now listed in Confluent Connector Hub.  The certification involved stringent testing by Confluent for quality, standards, best practices, feature completeness, delivery guarantees and performance.  To meet the certification requirements, we added a number of new features to our connector, renamed some, and removed support for some with version 1.x.  This article details the same. 

For comprehensive information about our connector, visit our Git repo.

Improved delivery guarantees

Our connector now offers “At least once” delivery guarantees (no data loss).

What’s new

 

1. Comprehensive upfront validation

The connector validates for all required and dependent sink properties, access and grants and shuts down upfront with appropriate error messages.

 

2. Configurable behavior on error

A new sink property with possible values of fail (shut down connector task), ignore (ignore and continue processing), and log (log errors and continue processing).  

 

behavior.on.error

 

 

3. *New* configurable retries

A set of new sink properties-

 

errors.retry.max.time.ms
errors.retry.backoff.time.ms

 

errors.retry.max.time.ms: Period of time in milliseconds to retry for transient errors

errors.retry.backoff.time.ms: Period of time in milliseconds to backoff before retry for transient errors

 

4. Additional converters supported

We have added support for the following converters-

 

org.apache.kafka.connect.json.JsonConverter
io.confluent.connect.avro.AvroConverter
io.confluent.connect.json.JsonSchemaConverter

 

JsonSchemaConverter is for Json with schema.  


4. Schema registry support

The connector now supports schema registry with the property-

 

value.converter.schema.registry.url

 

We support schema registry for avro and json, protobuf is on the roadmap.

 

5. Schema with message

The connector now supports messages with schema with the following sink property that accepts a boolean-

 

value.converter.schemas.enable

 


6. Kafka Connect deadletterqueue

The optional sink properties below can be provided for Kafka Connect to redirect messages that failed deserialization-

 

errors.deadletterqueue.bootstrap.servers
errors.deadletterqueue.topic.name
errors.deadletterqueue.security.protocol
errors.deadletterqueue.sasl.mechanism
errors.deadletterqueue.sasl.jaas.config
errors.deadletterqueue.sasl.kerberos.service.name

 

The last 4 sink properties are required for secure clusters and*kerberos* property is specific to kerberized Kafka clusters.

 

7. Miscellaneous deadletterqueue

The optional sink properties below can be provided for the connector to redirect messages that failed besides deserialization, e.g. transient failures beyond retry period –

 

misc.deadletterqueue.bootstrap.servers
misc.deadletterqueue.topic.name
misc.deadletterqueue.security.protocol
misc.deadletterqueue.sasl.mechanism
misc.deadletterqueue.sasl.jaas.config
misc.deadletterqueue.sasl.kerberos.service.name

 

The last 4 sink properties are required for secure clusters and*kerberos* property is specific to kerberized Kafka clusters.

 

What’s renamed

For the purpose of standardization and consistency, we have renamed some properties-

Previous name New name
kusto.auth.authority aad.auth.authority
kusto.auth.appid aad.auth.appid
kusto.auth.appkey aad.auth.appkey
kusto.tables.topics_mapping kusto.tables.topics.mapping
kusto.sink.tempdir tempdir.path
kusto.sink.flush_interval_ms flush.interval.ms
kusto.sink.flush_size
flush.size.bytes

 

What’s removed

The following properties are no longer supported with version 1.x-

Property name Rationale
kusto.auth.username Customers typically only use application IDs/service principals; We therefore removed support for user principals
kusto.auth.password Customers typically only use application IDs/service principals; We therefore removed support for user principals
eventDataCompression We now compress all data shipped to Azure Data Explorer, to improve performance

Connector documentation

Visit out Git repo

Connector download 

Visit out Git repo release page

Confluent Connector Hub

https://www.confluent.io/hub/microsoftcorporation/kafka-sink-azure-kusto

Need support?

Ran into a bug? Need hep getting started?  Have a feature request?  Open an issue.

Want to contribute?

Read our guidelines and get involved.  We welcome community contribution.

SQL Server IAM page

SQL Server IAM page

This article is contributed. See the original author and article here.

Hi Guys, welcome to SQL Server Allocation Series, hope you enjoy the post1 and post2.

Today, I’m going to show you the detail of IAM page.

An Index Allocation Map (IAM) page maps the extents in a 4-GB part of a database file used by an allocation unit. An allocation unit is one of three types:

  • IN_ROW_DATA
    Holds a partition of a heap or index.
  • LOB_DATA
    Holds large object (LOB) data types, such as XML, VARBINARY(max), and VARCHAR(max).
  • ROW_OVERFLOW_DATA
    Holds variable length data stored in VARCHAR, NVARCHAR, VARBINARY, or SQL_VARIANT columns that exceed the 8,060 byte row size limit.

Each partition of a heap or index contains at least an IN_ROW_DATA allocation unit. It may also contain a LOB_DATA or ROW_OVERFLOW_DATA allocation unit, depending on the heap or index schema.

An IAM page covers a 4-GB range in a file and is the same coverage as a GAM or SGAM page. If the allocation unit contains extents from more than one file, or more than one 4-GB range of a file, there will be multiple IAM pages linked in an IAM chain. Therefore, each allocation unit has at least one IAM page for each file on which it has extents. There may also be more than one IAM page on a file, if the range of the extents on the file allocated to the allocation unit exceeds the range that a single IAM page can record.

Liwei_0-1599248408060.png

 

Liwei_1-1599248408062.png

 

IAM pages linked in a chain per allocation unit An IAM page has a header that indicates the starting extent of the range of extents mapped by the IAM page. The IAM page also has a large bitmap in which each bit represents one extent. The first bit in the map represents the first extent in the range, the second bit represents the second extent, and so on. If a bit is 0, the extent it represents is not allocated to the allocation unit owning the IAM. If the bit is 1, the extent it represents is allocated to the allocation unit owning the IAM page.

When the SQL Server Database Engine has to insert a new row and no space is available in the current page, it uses the IAM and PFS pages to find a page to allocate, or, for a heap or a Text/Image page, a page with sufficient space to hold the row. The SQL Server Database Engine uses the IAM pages to find the extents allocated to the allocation unit. For each extent, the SQL Server Database Engine searches the PFS pages to see if there is a page that can be used. Each IAM and PFS page covers lots of data pages, so there are few IAM and PFS pages in a database. This means that the IAM and PFS pages are generally in memory in the SQL Server buffer pool, so they can be searched quickly. For indexes, the insertion point of a new row is set by the index key, but when a new page is needed, the previously described process occurs.

 

 

In post 1, I created a table with 8 pages, all these 8 pages are in mixed extent. I’m going to insert more rows into the table to increase the pages. As you expected, these new rows/pages will be allocated to uniform extents.

I’m going to walk you through how SQL Server save the info in IAM pages.

Data manipulation:

T-SQL:

insert heaptable1 values(15,REPLICATE(‘o’,8000))     –page 9

insert heaptable1 values(16,REPLICATE(‘p’,8000))     –page 10

insert heaptable1 values(17,REPLICATE(‘q’,8000))     –page 11

insert heaptable1 values(18,REPLICATE(‘r’,8000))     –page 12

insert heaptable1 values(19,REPLICATE(‘s’,8000))     –page 13

insert heaptable1 values(20,REPLICATE(‘t’,8000))     –page 14

insert heaptable1 values(21,REPLICATE(‘u’,8000))     –page 15

insert heaptable1 values(22,REPLICATE(‘v’,8000))     –page 16

insert heaptable1 values(23,REPLICATE(‘w’,8000))     –page 17

 

The table heaptable1 has 17 pages in four extents, two mixed extents and two uniform extents.

select allocated_page_file_id as [FileID],allocated_page_page_id as [PageID],page_type_desc,extent_page_id/8 as ExtentID,

is_mixed_page_allocation,extent_page_id as [First Page in Extent],extent_page_id+7 as [LastPage in Extent],is_allocated From 

sys.dm_db_database_page_allocations(db_id(),object_id(‘dbo.heaptable1′),null,null,’detailed’)  order by allocated_page_page_id

Liwei_2-1599248408068.png

 

Page 246 is the IAM page.

Page 245,247,328,329,330,331,332 and 333 are in mixed extent.

Starting from page (1:352), all the pages are allocated in uniform extent.

Page(1:352)~(1:359) are in extent 44,

Page(1:360)~(1:367) are in extent 45.

 

Let me run DBCC PAGE of IAM page to show you how the info recorded.

Liwei_3-1599248408078.png

 

The first 8 pages are listed under ‘Single page allocations’, which means they are in mixed extent.

(1:352)~(1:360) are in uniform extent, the result matches the DMF sys.dm_db_database_page_allocations

(1:352) is the first page of extent 44

(1:360) is the first page of extent 45

 

Let’s dive into the IAM page with parameter 1:

Liwei_4-1599248408088.png

 

 

1.There are two slots in IAM page,

  • The first slot(slot 0) are for pages in Mixed extent.
  • The second slot(slot 1) are for pages in Uniform extent.

2.Let me interpret the slot 0.

1)Each string in rectangle stands for one page, the format is PageId:FileId.

2)The first 8 characters are page id, the last four part is for Fileid.

Here is the interpretation table:

Original string->

PId:FileId->

FId:PId->

FId:PId->

FId:PId(Decimal)

f50000000100

f5000000:0100

0100: f5000000

0001:000000f5

1:245

f70000000100

f7000000:0100

0100:f7000000

0001:000000f7

1:247

480100000100

48010000:0100

0100:48010000

0001:00000148

1:328

490100000100

49010000:0100

0100:49010000

0001:00000149

1:329

4a0100000100

4a010000:0100

0100:4a010000

0001:0000014a

1:330

4b0100000100

4b010000:0100

0100:4b010000

0001:0000014b

1:331

4c0100000100

4c010000:0100

0100:4c010000

0001:0000014c

1:332

4d0100000100

4d010000:0100

0100:4d010000

0001:0000014d

1:333

3.I have demonstrate how the IAM page stores the first 8 pages in mixed extent, now let’s move on to the uniform extent.

 

4.The uniform extents info is stored in the second slot(slot 1).

Liwei_5-1599248408098.png

 

 

1)’0000381f’ is reserved, we don’t need to worry about it.

2)The string I cared about is ‘00000000 0030’

Following table interprets the string.

Liwei_6-1599248408099.png

 

3)Each bit in the table stands for one extent.

4)0 means the extent is not occupied/allocated by this table of the IAM.

5)1 means the extent is occupied/allocated by this table of IAM, and it’s uniform extent.

6)Here is more detail

Liwei_7-1599248408102.png

 

Which exactly as same as the result we got from DBCC Page with parameter 3 before.

Liwei_8-1599248408110.png

 

 

You may download the dbtest20200904.zip and give it a try.

Video Tutorial: Install Application Integration – Application Deployment Part 17

This article is contributed. See the original author and article here.

Hello everyone, here is part 17 of a series focusing on Application Deployment in Configuration Manager.  This series is recorded by @Steve Rachui, a Microsoft principal premier field engineer.

This session focuses on the Install Application option for rapidly deploying applications to device(s).  The tutorial walks through what this component is, how to configure it and demonstrates its use.

 

 

Posts in the series

Go straight to the playlist

Excel MVP Blog Roundup

This article is contributed. See the original author and article here.

This week, we bring you recent articles by Excel MVPs Mourad Louha, Ismael Romero, and Frederic Le Guen.

 

Datentypen aus Power BI in Excel abrufen (Get data types from Power BI in Excel)

Excel MVP Mourad Louha describes his first experience with Power BI datatypes in Excel and how easy it is to get custom datatypes into Excel – starting with an introduction to the geography datatype and then the steps to create custom datatype from an existing Power BI dataset.

 

Power Query: Repetir y numerar filas como n de N (Power Query: Repeat and number rows as n of N)

Excel MVP Ismael Romero describes using Power Query to obtain an enumerated list in ‘n of N’ form for each original record.

 

Le secret du raccourci Ctrl + D (The Secret of Ctrl-D)

Excel MVP Frederic Le Guen reveals the secret of Ctrl-D when working with graphics objects in Excel.

 

Find this useful or want to know more about a specific Excel topic? Leave a comment below.

Just Enough Azure Data Explorer for Architects

This article is contributed. See the original author and article here.

JUST ENOUGH AZURE DATA EXPLORER FOR CLOUD ARCHITECTS


This article provides a 
high-level overview of all the key capabilities in Azure Data Explorer, positioning, and resources, and is intended to demystify Azure Data Explorer for cloud architects. 


WHAT IS AN ANALYTICAL DATABASE?
 


Analytical databases are purpose-built and optimized to query
 and run advanced analytics on large volumes of data with extremely low response times.  Modern analytical databases are generally distributed, scalable, fault-tolerant, and feature columnar, compressed formats, and intelligently use in-memory and disk caching technologies.  In contrasttransactional databases are optimized for single digit millisecond point reads by key or intra-partition range querysupport random CRUD and ACID semantics, and more. 


WHAT IS AZURE DATA EXPLORER?
 


Azure Data Explorer
, code name, Kusto, is a low latency, big data, Microsoft proprietary, append-only analytical database.  It is feature and functionality rich as a platform – with durable storage, query engine, search engine, proprietary Kusto Query Language, support for ingestion of data in disparate formats from disparate sources, in batch and streaming, with advanced analytics/data science and geospatial analytical capabilities out of the box, server side programmability and support for visualization, scheduling, orchestration and automation.   

 

Azure Data Exploreis mature and heavily used internally at Microsoft and powers many core Azure offerings such as Azure Monitor, Azure Sentinel, Azure Time Series Insights and more.  It is fully managed, scalable, secure, robust, and enterprise ready and is a popular platform for log analytics, time series analytics, IoT, and general-purpose exploratory analytics.  In a subsequent section, we cover, what Azure Data Explorer is not. 


USE CASES FOR AZURE DATA EXPLORER
 

  • Azure Data Explorer is a big data solution, ideal for- 
  • Analytics on IoT telemetry 
  • Analytics on all manners of time series data  
  • Analytics on all manners of logs 
  • Analytics on clickstream and any other form of live sources 
  • Text search 
  • Geo-spatial analytics 
  • Advanced analytics 
  • Exploratory environment for data scientists and business/data analysts 

 
Reference architecture is covered further in this article, after the value proposition.

 

VALUE PROPOSITION AND SALIENT FEATURES 

1. LOW LATENCY QUERY ENGINE 


Azure Data Explorer leverages 
innovative, and contemporary compressed columnar and row stores along with a hierarchical cache paradigm, featuring configurable hot and cold caches backed by memory and local disk, with data persistency on Azure Storage. With this architecture, terabytes and petabytes of data can be queried, and results returned within milliseconds to seconds.  Refer the white paper for more implementation details. 

 

For applications where performance of analytical queries is critical, Azure Data Explorer is a great fit as a backend analytical database. 


2. DISTRIBUTED BY DESIGN
 


Azure Data Explorer is a 
big data, cluster computing system comprised of engine nodes that serve out queries and data management service nodes that perform/orchestrate a variety of data related activities including ingestion and shard management.  A few key capabilities of distributed systems are partitioning, consistency, replication, scalability and fault tolerance. 

 

  • Data persisted in Azure Data Explorer is durably backed by Azure Storage that offers replication out of the box, locally within an Azure Data Center, zonally within an Azure Region.   
  • An Azure Data Explorer cluster can be scaled up vertically or down as workload resource requirements change.  Azure Data Explorer clusters can be scaled out horizontally and scaled in, manually, or with auto-scale. 
  • From a partitioning perspective, Azure Data Explorer natively partitions by ingestion time with a proprietary formula to prevent hot spotting, but also offers a user configurable partitioning based on data attributes/columns.  Out of the box, the service offers options of strong and weak consistency, with strong consistency as the default.  All data ingested into Azure Data Explorer is automatically indexed, allowing fast lookup queries 

 
For analytical applications where scale and reliability are a must, Azure Data Explorer is a great fit. 


3. RICH INTEGRATION AND CONNECTOR
 ECOSYSTEM 


Azure Data Explorer has a rich connector eco-system for batch and near real-time ingestion
, with Microsoft first party services as well as open source distributed systems 
  

It supports open source integration systems and aggregatorconnector frameworks, like Apache Kafka’s Kafka Connect and ELK stack’s Logstash.  It supports Apache Spark for both read and persist, opening up numerous integration possibilities with open source big data systems, whether transactional or analytical.  
  

From a streaming perspective, Azure Data Explorer supports Azure IoT, Azure Event Hub and Apache Kafka.  For bulk ingestion or event-driven ingestion, Azure Data Explorer integrates with Azure Storage, with a queued ingestion functionality, and with Azure Event Grid for configurable event driven ingestion. 
 

From an ingestion format perspective, Azure Data Explorer supports a number of formats from CSV and JSON to binary formats like Avro, Parquet and ORC, with support for multiple compression codecs. 
 

It is a common pattern today to curate all enterprise information assets in a data lake. Azure Data Explorer supports the external table paradigm with an integration with Azure Data Lake Store.  It also features continuous ingest and continuous export capabilities to Azure Data Lake Store. 

 

Collectively, with the connector eco-system, support for disparate ingestion formats and compression, whether in batch or streaming modeswith support for business workflows, scheduling, and orchestration, Azure Data Explorer is enterprise-ready from an integration perspective. 


4. QUERY LANGUAGE
 AND SERVER-SIDE PROGRAMMABILITY 


Azure Data Explorer features a proprietary Kusto Query Language
 (KQL)that is expressive, and intuitive.  KQL supports querying structured, semi-structured, and unstructured(text search) data, all the typical operators of a database query language, typical aggregation and sorting capability, relational query grammar with joins, and hints and more, cross-cluster and cross database queries, and is feature rich from a parsing (json, XML etc) perspective.  It supports geospatial analytics and advanced analytics as well. 

 

Azure Data Explorer supports server-side stored functions, continuous ingest and continuous export to Azure Data Lake store.  It also supports ingestion time transformations on the server side with update policiesand precomputed scheduled aggregates with materialized views (preview). 

 

Azure Data Explorer with Kusto Query Language capabilities, and its support for ingesting disparate formats, in batch or near real time, allows enterprises to gain insights instantly from data, in its raw form, in previously unthought of ways and proactively, reactively respond.  With the server-side capabilities detailed above, Azure Data Explorer supports building common automated analytical application features.  


5. ADVANCED ANALYTICS
 


Azure Data Explorer offers time series capabilities, including a large set of functions from basic element-wise operation (adding, subtracting time series) via filtering, regression, seasonality detection up to anomaly detection and forecasting. Time series functions are optimized for processing thousands of time series in seconds. 
  Azure Data Explorer also offers clustering plugins for pattern detection that are very powerful for diagnosis of anomalies, and root cause analysis.  You can extend Azure Data Explorer capabilities by embedding python code in KQL query, leveraging Python open source eco-system for ML, statistics, signal processing and a lot more.  Using inline Python you can do in-place machine learning model training, leveraging Azure Data Explorer compute, against data stored within ADX, or train your model anywhere, export it to ADX and use ADX solely for scoring. You can also query ADX from Jupyter/Azure notebooks by using KqlMagic extension from within the notebook.  To note, ADX supports ONNX models. 

 

With its support for data science languages and libraries, training and scoringboth in batch and near real time, Azure Data Explorer makes a compelling modern advanced analytics platform. 

 
6. VISUALIZATION 


Azure Data Explorer offers visualization out of the box with its Web/Desktop based 
integrated development environment.  It also offers native dashboarding out of the box, with support for a variety of charts, and direct query support.  Further, it has native integration with Power BI with support for predicate and projection pushdown and native connectors for Grafana, Kibana and Redash (now Databricks), and ODBC support for Tableau, Sisense, Qlik and more.  With KQL used in Jupyter notebooks along with Python, Python visualization libraries can be leveraged as well. 

 

An analytical database platform is incomplete without a complementary and strong visualization story.  With its own native dashboarding, charting, and support for popular BI/dashboarding ISV solutions, Azure Data Explorer makes it possible to build visualization rich analytical solutions. 

 

7. AUTOMATED INFORMATION LIFECYCLE MANAGEMENT 


Azure Data Explorer offers configurable time duration based hot cache policy with automated eviction upon expiration. It offers configurable time to live/retention policy with automated removal of data upon expiration.
 

 

Historical data can be persisted in Azure Data Lake Store, with the “External Table” feature to query via KQL.  Azure Data Lake Store offers its own information lifecycle management with tiers, automated retention and archival. 


8. SECURITY
 


Azure Data Explorer offers enterprise grade security and has adoption in security critical industry domains such as healthcare and finance, and governments. 


PERIMETER PROTECTION
 

Azure Data Explorer is a Vnet injectable service and offers perimeter protection solution with network isolation, with network security group rules and firewall for inbound and outbound access control respectively.  It also supports service endpoint to storage to bypass the public internet and optimally route over the Azure backbone.   

 

AUTHENTICATION 

Azure Data Explorer supports Azure Active Directory (AAD) authentication out of the box.   

 

AUTHORIZATION 

Azure Data Explorer supports access control based on AAD security groups, with Role Based Access Control at a cluster, database, and table row level granularity.  

 
DATA MASKING 

Azure Data Explorer supports configurable data masking.   

 

ENCRYPTION 

Azure Data Explorer supports it supports encryption over the wire with TLS 1.2 by default, and AES256 bit “at rest” encryption with Microsoft Managed keys (default) or “Bring Your Own Keys” at its durable storage layer (cold cache), with user configurable “at rest” encryption for its hot cache (disks).  Intracluster shuffle is encrypted with IPSec.   

 

AUDITING 

All operations are audited and persisted.   

 

COMPLIANCE CERTIFICATIONS & GDPR 

It is rich from a compliance certification perspective and has delete support for GDPR.  Information in Azure Data Explorer can be catalogued in Azure Data Catalog to fulfill governance requirements.  Azure Data Explorer audit/monitoring data can be ingested into Azure Sentinel or any other SIEM of choice. 

 

9. WORKLOAD ISOLATION AND COST CENTER CHARGEABILITY 


Azure Data Explorer supports centralizing ingestion and data engineering in a (leader) cluster and in-place querying federated to follower clusters made possible by Azure Data Share.  
This simplifies cost center chargeability, and sizing query clusters, preventing resource contention between data engineering and querying workloads with the separate cluster per workload model. 

 

10. DEVELOPER AND OPERATIONS FRIENDLY 


Azure Data Explorer has SDKs in C#, Java, Python, Node
, R and Go, and a REST interface for management and data operations.  From an IDE perspective it offers a Web user interface and a Desktop IDE with built-in powerful intellisense, visualization and query-sharing/collaboration.     

 

Azure Data Explorer offers logging, monitoring, diagnostics and alerting with its Azure Monitor integration.  Patching of infrastructure, health monitoring etc are fully managed by the service.  Provisioning and deployment automation is simplified through support for Azure DevOps.  Due to its managed nature, it does not require a database administrator. 

 

The capabilities detailed above make the service both developer and operations friendly. 

 

11. COST OPTIMIZATION 


Azure Data Explorer offers 
reserved instance pricing, which is cost optimization with term commitment model.  It offers information lifecycle management by automatically evicting expired data from hot cache, and deleting expired data in cold cache as a cost optimization.  It offers configurable auto-scale out and scale-in capability to right-size a cluster based on workload in execution.  Azure Advisor recommendations are available for optimization of the cluster, including rightsizing. Azure Data Explorer offers a developer SKU and the ability to pause clusters when not in use as further cost optimizations.   Review our 

 

12. AVAILABILITY AND DISASTER RECOVERY 


Azure Data Explorer is built natively highly available from the ground up, 
leverages the natively resilient Azure storage for data durability, supports zonal redundancy, and offers an SLA of 99.9% with configurable options for disaster recovery.  Azure Data Explorer’s underlying services are provisioned highly available across compute nodes, load balanced as required and are self-healing.   

 

Azure Data Explorer is available in all Azure regions and is one of the foundational services deployed after infrastructure and identity in a region. 

 

For analytical applications that need to be resilient to all manners of failures, Azure Data Explorer is a great fit.  Learn more about high availability and disaster recovery in Azure Data Explorer here.

 

13. AUTOMATION, SCHEDULING & ORCHESTRATION 


Azure Data Explorer supports Azure Data Factory, Azure Logic Apps, and 
Power Automate (previously Microsoft Flow) for data movement/composing business workflows, for scheduling and orchestration.   

 

REFERENCE ARCHITECTURE 

RESOURCES