Microsoft Archives - Page 597 of 1143 - Dr. Ware Technology Services

HPC Performance and Scalability Results with Azure HBv3 VMs

by Contributed | Mar 15, 2021 | Technology

This article is contributed. See the original author and article here.

Article contributed by Jithin Jose, Jon Shelley, and Evan Burness

Azure HBv3 Virtual Machines for High-Performance Computing (HPC) featuring new AMD EPYC 7003 “Milan” processors are now generally available. This blog provides in-depth technical information about these new VMs. Below, based on testing across CFD, FEA, and quantum chemistry workloads, we report that HBv3 VMs are:

2.6x faster on small-scale HPC workloads (e.g. 16-core comparison, HBv3 v. H16mr))

17% faster for medium-scale HPC workloads (1 HBv3 VM v. 1 HBv2 VM)

12-18% faster for large-scale HPC workloads (2 – 16 VMs, HBv3 v. HBv2)

23-89% faster for very large HPC workloads (> 64 VMs)

Capable of scaling MPI HPC workloads to nearly 300 VMs and ~33,000 CPU cores

HBv3 VMs – VM Size Details & Technical Overview

HBv3 VMs are available in the following sizes:

VM Size	CPU cores	Memory (GB)	Memory per Core (GB)	L3 Cache (MB)	NVMe SSD	InfiniBand RDMA network
Standard_HB120-16rs_v3	16	448 GB	28 GB	480 MB	2 x 960 GB	200 Gbps
Standard_HB120-32rs_v3	32	448 GB	14 GB	480 MB	2 x 960 GB	200 Gbps
Standard_HB120-64rs_v3	64	448 GB	7 GB	480 MB	2 x 960 GB	200 Gbps
Standard_HB120-96rs_v3	96	448 GB	4.67 GB	480 MB	2 x 960 GB	200 Gbps
Standard_HB120rs_v3	120	448 GB	3.75 GB	480 MB	2 x 960 GB	200 Gbps

These VMs share much in common with HBv2 VMs, with two key exceptions being the CPUs and local SSDs. Full specifications include:

Up to 120 AMD EPYC 7V13 CPU cores (EPYC 7003 series, “Milan”)

2.45 GHz Base clock / 3.675 GHz Boost clock

Up to 32 MB L3 cache per core (double-wide L3 compared to 7002 series, “Rome”)

448 GB RAM

340 GB/s of Memory Bandwidth (STREAM TRIAD)

200 Gbps HDR InfiniBand (SRIOV), Mellanox ConnectX-6 NIC with Adaptive Routing

2 x 900 GB NVMe SSD (3.5 GB/s (reads) and 1.5 GB/s (writes) per SSD, large block IO)

HBv3 VMs also differ in the follow ways at the BIOS level and subsequently the VM level, as well:

BIOS setting	HBv2	HBv3
NPS (nodes per socket)	NPS=4	NPS=2
L3 as NUMA	Enabled	Disabled
NUMA domains within OS	30	4

Microbenchmarks

Below are initial performance characterizations using a variety of configurations on both microbenchmarks as well as commonly used HPC applications for which the HB family of VMs is optimized for.

MPI Latency (us)

OSU Benchmarks (5.7) – osu_latency with MPI = HPC-X, Intel MPI, MVAPICH2, OpenMPI

Message Size (bytes)	HPC-X (2.7.4)	Intel MPI (2021)	MVAPICH2 (2.3.5)	OpenMPI (4.0.5)
0	1.62	1.69	1.73	1.63
1	1.62	1.69	1.75	1.63
2	1.62	1.69	1.75	1.63
4	1.62	1.7	1.75	1.64
8	1.63	1.69	1.75	1.63
16	1.63	1.7	1.79	1.64
32	1.78	1.83	1.79	1.79
64	1.73	1.8	1.81	1.74
128	1.86	1.91	1.95	1.84
256	2.4	2.45	2.48	2.37
512	2.47	2.54	2.52	2.46
1024	2.58	2.63	2.63	2.55
2048	2.79	2.83	2.8	2.76
4096	3.52	3.54	3.55	3.52

MPI Bandwidth (MB/s)

OSU Benchmarks (5.7) – osu_bw with MPI = HPC-X, Intel MPI, MVAPICH2, OpenMPI

Message Size (bytes)	HPC-X (2.7.4)	Intel MPI (2021)	MVAPICH2 (2.3.5)	OpenMPI (4.0.5)
4096	8612.8	7825.14	6762.06	8525.96
8192	12590.63	11948.18	9889.92	12583.98
16384	11264.74	11149.76	13331.45	11273.22
32768	16767.63	16667.68	17865.53	16736.85
65536	19037.64	19081.4	20444.14	18260.97
131072	20766.15	20804.23	21247.24	20717.68
262144	21430.66	21426.68	21690.97	21456.29
524288	21104.32	21627.51	21912.17	21805.95
1048576	21985.8	21999.75	23089.32	21981.16
2097152	23110.75	23946.97	23252.35	22425.09
4194304	24666.74	24666.72	24654.43	24068.25

Application Performance – Small to Large Scale

Category: Small scale (1 node), license-bound HPC jobs

App: ANSYS Mechanical 21.1

Domain: Finite Element Analysis (FEA)

Model: Power Supply Module (V19cg-1)

Configuration Details: We used the 16-core VM version of HBv3, in order to match the per-core licensing required to support this workload on our last VM size built specifically to support high-performance at low core counts (H16/H16r/H16mr VMs based on high-frequency Xeon E5 2667 v3, “Haswell”, baseclock 3.2 GHz with Turbo frequencies of 3.6 GHz). This ensures that for customers running such a workload for which the total cost of solution is dominated by software licensing costs that performance and performance/$ gains on infrastructure are not offset or exceeded by having to pay for more per-core software licenses as well. Thus, the objective for this customer scenario is to see if HBv3 VMs with EPYC 7003 series processors can provide a performance uplift at an identical (or reduced) core count on a single VM.

		HBv3 (16-core VM)		H16mr (16-core VM)
VMs	Cores	Solver Performance (GFLOPS)	Elapsed Time (Solver Time+ IO)	Solver Performance (GFLOPS)	Elapsed Time (Solver Time+ IO)
1	1	3.5	1035	1.4	2411
1	2	6.2	647	2.9	1327
1	4	16.3	454	6.2	909
1	8	27.6	327	9.3	547
1	16	42.9	190	15.3	400

Figure 1: ANSYS Mechanical absolute solver performance comparison with incremental software licensed CPU cores on HBv3 and H16mr VMs

Figure 2: ANSYS Mechanical Speedup from HBv3 (16-core VM size) v. H16mr VM from 1-16 licensed CPU cores

Conclusions: Azure HBv3 VMs provide very large improvements for small, low core-count customer workloads for which software licensing is the dominant factor in a customer’s total cost of solution. Testing with the V19cg-1 benchmark running on ANSYS Mechanical shows performance speedups of 3x ad 2.6x, respectively, when running the workload at 8 and 16 cores. This addresses customer desire for improved HPC performance while keeping software licenses constant.

In addition, we observe a user can reduce licensing usage by 4x, as just 4 cores of a HBv3 VM delivers still slightly higher performance than using all 16 cores of a H16mr VM. This addresses customer desire to for lower overall total cost of solution.

Category: Medium scale HPC jobs (1 large, modern node size)

App: Siemens Star-CCM+ 15.04.008

Domain: Computational fluid dynamics (CFD)

Model: LeMans 100M Coupled Solver

Configuration Details: We used the 120-core HBv3 VM size in order to match it to the 120-core HBv2 size. HBv2, with its EPYC 7002 series (Rome) CPU cores and 340 GB/sec of memory bandwidth, is already the public cloud’s highest performing and most scalable platform for single and multi-node CFD workloads. Thus, it is important we evaluate what enhancements to CFD performance HBv3 with EPYC 7003 series (Milan) brings. A single VM test at 120 cores is also important because many Azure customer HPC workloads run at this scale, thus this comparison is highly relevant to production workload scenarios.

Figure 3: Star-CCM+ absolute performance in solver elapsed time for 20 iterations, Azure HBv3 and HBv2 VMs

Figure 4: Star-CCM+ Speedup in relative performance, 1 HBv3 v. 1 HBv2 VM

Conclusions: In this test, Azure HBv3 VMs provide a 17% performance uplift for a medium-sized HPC workload such as the CFD benchmark 100m cell Le Mans coupled solver case from Siemens for use with Star-CCM+. These results provide a reasonably good view of the performance uplift for a non-MPI HPC workloads on a 1 VM basis for well-parallelized applications. The 17% gap corresponds closely with the 19% improvement in instructions per clock of the Zen3 core in EPYC 7003 series as compared to the Zen 2 core in EPYC 7002 series found in Azure HBv2 VMs.

Of note, the EPYC 7002-series CPUs in HBv2 still provide exceptionally good performance for this model, and aside from HBv3 VMs remain the fastest and most scalable VMs on the public cloud for HPC workloads. Siemens, itself, recommends as of Q4 2020 AMD EPYC 7002-series (Rome) over Intel Xeon for best performance and performance/$. Thus, both HBv2 and HBv3 VMs recommend exceptionally good performance and value options for Azure HPC customers.

Finally, one difference we call out is that our testing on HBv2 VMs occurred with CentOS 7.7 whereas our testing with HBv3 VMs occurred with CentOS 8.1. Both images feature the same HPC-centric tunings, but it is worth follow up investigation to determine if OS differences contribute to the performance delta measured here. Also, the HBv2 VM performance was taken with 116 cores utilized (out of 120) because it produced the best performance. On HBv3 VMs, using all 120 cores produced the best performance.

Category: Large scale HPC jobs (2 – 16 modern nodes, or ~2,000 CPU cores/job)

App: Siemens Star-CCM+ 15.04.008

Domain: Computational fluid dynamics (CFD)

Model: LeMans 100M Coupled Solver

Configuration Details: We again use the 120-core HBv3 VM size in order to match it to the 120-core HBv2 size. HBv2, with its EPYC 7002 series (Rome) CPU cores and 340 GB/sec of memory bandwidth, is already the public cloud’s highest performing and most scalable platform for single and multi-node CFD workloads. Thus, it is important we evaluate what enhancements to CFD performance HBv3 with EPYC 7003 series (Milan) brings.

Large multi–node (or, perhaps more appropriately in a public cloud context, “multi-VM”) performance up to ~2,000 processor cores is important because many Azure customers run MPI workloads at this scale, or would like to in search of faster time to solution or higher model fidelity. For both HBv2 and HBv2 VMs, we found using 116 cores out of 120 in the VM produced the best performance, and thus this setting was used for the scaling exercise. We also used Adaptive Routing in both cases, which can be employed by customers following the steps here. As mentioned above, CentOS 7.7 was used for HBv2 benchmarking, while CentOS 8.1 is used for HBv3 benchmarking.

Figure 5: Star-CCM+ absolute performance in solver elapsed time for 20 iterations, 2 – 16 VMs on Azure HBv3 and HBv2 VMs

Figure 6: Star-CCM+ relative performance, 2 – 16 VMs on Azure HBv3 and HBv2 VMs

Conclusions: In this test, Azure HBv3 VMs provide a 12-18% performance uplift for large HPC workloads such as the CFD benchmark 100m cell Le Mans coupled solver case from Siemens for use with Star-CCM+ across a scale range of two to sixteen VMs (up to ~2,000 CPU cores). These results provide a reasonably good view of the performance uplift for a non-MPI HPC workloads on a 1 VM basis for well-parallelized applications. The 12-17% gap corresponds somewhat closely with the 19% improvement in instructions per clock of the Zen3 core in EPYC 7003 series as compared to the Zen 2 core in EPYC 7002 series found in Azure HBv2 VMs.

Significant Boosts at Very Large Scale MPI Jobs

Category: Very large scale HPC jobs (64 – 128 nodes, or ~4,000 to ~16,000 CPU cores/job)

App: OpenFOAM v1912, CP2K (latest stable), Star-CCM+ 15.04.088

Domain: Computational fluid dynamics (CFD), Quantum Chemistry

Model: 28m motorbike (OpenFOAM), H20-DFT-LS (CP2K), and Le Mans 100m Coupled Solver

Very large-scale multi–node (or, perhaps more appropriately in a public cloud context, “multi-VM”) performance up to ~16,000 processor cores is important because some Azure customers run MPI workloads at these kinds of scale, or would like to in search of faster time to solution or higher model fidelity.

For OpenFOAM, we tested a variety of configurations and found that the best performance settings in terms of processes per node varied from one scaling step to another. Thus, we have posted the best for each below. In other words, we have plotted the “best foot forward” for each of HBv2 and HBv3 VMs.

For CP2K and Star-CCM+, we found using 116 out of 120 processor cores per VM produced the best performance, and thus we are using this setting for this scaling exercise.

We used Adaptive Routing in for all cases, which can be employed by customers following the steps here.

Figure 7: OpenFOAM, CP2K, and Star-CCM+ relative performance at scale v. HBv2 VMs

Conclusions: Across several widely used HPC applications, a common pattern observed is that as scaling increases, the performance difference between HBv3 VMs featuring AMD EPYC 7003 series processors and HBv2 VMs featuring AMD EPYC 7002 series processors increases substantially and often suddenly.

In Star-CCM+, the 12-18% performance lead for HBv3 observed between 1-16 VMs grows to 23% at 128 VMs (14,848 cores)

In CP2K, a 10-15% performance lead for HBv3 observed between 1-16 VMs grows to 43% at 128 VMs (14,848 cores)

In OpenFOAM, a 12-18% lead for HBv2 observed between 1-16 VMs grows to a nearly 90% at 64 VMs (4,096 cores)

This is a most unique phenomenon and one that whose repeatability across several applications bodes very well for the EPYC 7003 series processor for very large scaling MPI workloads. To understand the uniqueness of what we observe here, consider that HBv2 are HBv3 VMs are identical in the following ways:

Up to 120 processor cores (both AVX2 capable)

~330-340 GB/s memory bandwidth (STREAM TRIAD)

480 MB L3 cache per VM

Mellanox HDR 200 Gb InfiniBand (1 NIC per VM) with common network design

It is worth noting that HBv3 VMs *can* run at a ~200-250 MHz higher frequency (~3,000-3100 MHz on HBv3 v. ~2,820 MHz for HBv2) when all (or nearly all) cores are loaded with these applications. However, this advantage is workload dependent and, even if present in the cases benchmarked above, would not come close to accounting for the widening performance gaps we have measured.

The L3 cache architecture of Milan and the Zen3 core, however, are a key difference that appears to be having a very positive affect on these workloads. While the total L3 cache per server (and per VM) is the same it is divided up far less at the hardware level. A “Rome” L3 cache boundary is every 4 cores and is 16 MB in size. A “Milan” L3 cache boundary is every 8 cores and is 32 MB in size. In other words, a dual-socket Rome server is, physically, 32 blocks each with 4 cores and 16 MB L3, whereas a Milan dual-socket server is, physically, half as many blocks (16) with 2x as many cores and 2x as much L3 (8 and 16 MB, respectively). This significantly decreases the probability of cache misses which in turns means much higher effective memory bandwidth for the workload in question.

The Azure HPC team will be following up on this discovery with additional benchmarking and profiling. In the meantime, it appears EPYC 7003 series delivers some of its largest differentiation v. its logical predecessor, Rome, for supercomputing-class MPI workloads.

Application Performance – Extreme Scale

Category: Extreme scale HPC jobs ( > 20,000 cores/job)

App: Siemens Star-CCM+ 15.04.008

Domain: Computational fluid dynamics (CFD)

Model: LeMans 100M Coupled Solver

Configuration Details: We again used the 120-core HBv3 VM size for this scaling examination, this time testing the ability of HBv3 VMs to scale to levels reserved for some of the largest supercomputers. Extreme-scale performance evaluations are critical proof points for Azure’s most demanding HPC customers such as those performing time-critical weather modeling, geophysical re-simulation, and advanced research into effective disease treatments. Here, we once more tested Star-CCM+ ver. 15.04.088 with CentOS 8.1, Adaptive Routing, and HPC-X MPI ver. 2.7.4. We performed the scaling exercise using 116 out of 120 cores available to the VM due to this configuration providing the best performance.

Figure 8: Star-CCM+ relative performance at scale from 1 – 288 VMs on Azure HBv3

Conclusions: In this test, Azure HBv3 VMs demonstrate speedup with scale from 1 to 288 VMs (116 to 33,408 CPU cores). Performance is linear or super linear up to 64 VMs 7,424 cores. This means HPC customers on Azure can scale realize time to solution improvements that directly correspond to the amount of HBv3 infrastructure they provision, which due to the speedup results in no additional total cost of the job. Beyond 64 VMs, the amount of work per process comes too small and scaling efficiency inevitably declines. Still, at 288 VMs, we still observe scaling efficiency of 75% and job speedup of more than 215x.

Community Update – Introducing the Microsoft Azure Data Community Advisory Board

by Contributed | Mar 15, 2021 | Technology

This article is contributed. See the original author and article here.

Earlier this year, Microsoft launched our Azure Data Community initiative, with assets such as a landing page for the community, a support network for local Community Groups, and Microsoft Teams subscriptions for Community Group leaders. We are following through on our commitment that community should be “Community-Owned, Microsoft-Empowered”.

To drive this initiative, we’ve chosen various recognized Community experts from all over the globe to act as an advisory board to Microsoft on community needs. The current members of the Azure Data Advisory Board are Annette Allen, Steve Jones, Wolfgang Strasser, Tillmann Eitelberg, Randolph West, Kevin Kline, Gaston Cruz, Pio Balistoy and Monica Rathbun.

These technical professionals are well known for their commitment to the community, getting things done, and being problem solvers. They are speakers, leaders, organizers, advocates and represent user groups and conferences of all shapes and sizes. We’re asking them to advise Microsoft on what the data community needs and how we can help.

Since these folks can’t do this job alone, they’ll be tasked with selecting and establishing a larger, diverse advisory committee to collaborate. Together they’ll decide the guide lines for the board and committee, such as who should be on the advisory board, when to step down from the role, how a replacement is selected, how often they meet, and other general logistics.

But it isn’t all just advising. There’s work to do. These folks will be among the first people and groups onboarded to the Community Teams Tenant. They will be pairing up with other groups to help them onboard and to speed up the roll out, just as other user groups will be asked to do the same. This is a community driven effort. And Microsoft is listening. Make sure you reach out to them with ideas, especially if you are a local Community Group leader.

This group isn’t advising the Azure Data Community – that’s something the community itself should decide on. We will be holding an Azure Data Community group leaders meeting soon on the Teams channel to discuss community and share best practices with each other. We’re looking forward to seeing the amazing things you do and how you help each other learn and grow.

The March 12th Weekly Roundup is Posted!

by Contributed | Mar 15, 2021 | Technology

This article is contributed. See the original author and article here.

Pssst! You may notice the Round Up looks different – we’re rolling out a new, concise way to show you what’s been going on in the Tech Community week by week.

Top 10 Blogs & Conversations this Week:

Released: March 2021 Exchange Server Security Updates

What’s New in Microsoft Teams | Microsoft Ignite 2021

Microsoft 365 apps say farewell to Internet Explorer 11 and Windows 10 sunsets Microsoft Edge Legacy

Enhanced performance, designed for simplicity – the new Outlook for Mac

Why can’t I see the Microsoft Teams Meeting add-in for Outlook?

Microsoft Teams is now available on Linux

March 2021 Exchange Server Security Updates for older Cumulative Updates of Exchange Server

Windows 10, version 1909 delivery options

Startup Boost FAQ

Released: December 2020 Quarterly Exchange Updates

Catch up on all blogs here!

Important Events:

· Mar 16^th – Microsoft App Assure

Exploring Purview’s REST API with Python

by Contributed | Mar 15, 2021 | Technology

This article is contributed. See the original author and article here.

In this article – we use common Python techniques to explore several Azure Purview built-in capabilities that are available through Purview Studio by taking advantage of the REST API.

In particular – the article is split up into 2 sections:

Column asset classifications: We explore a Python script executed in Azure Synapse with some reusable functions to programmatically access Purview’s Atlas REST API.

Purview Insights: The various levels of insights Purview generates is presented in the form of a report. We explore how to use Python to extract this data, and create a custom dashboard in Power BI.

The idea is, the exploratory techniques presented here can be extrapolated to come up with highly creative solutions! The full Python Scripts used to setup both pipelines are linked as GitHub Gists per section (highlighted in blue)

The first version of this article appeared on my blog here. The techniques outlined there have been improved in this version, and extended to form the Power BI reporting capabilities we’ll explore below.

Overview

Once Azure Purview is deployed and initially set up (i.e. Data Sources registered, Assets scanned and Classifications applied), there’s two high level approaches to interacting with the managed service:

Purview Studio: i.e. the visual interface (great overview here)

Purview REST API: i.e. the programmatic interface – based on Apache Atlas V2 REST API (Swagger definition available here). There are also other REST API’s Purview makes available to us under-the-hood while we navigate the web experience (such as api/browse, graphql), as we’ll see shortly.

So for example, every time we click around in Purview Studio, the Front-End Web App (i.e. what we see) is really systematically calling the REST APIs under-the-hood using the authenticated user’s OAuth token – you can press F12 in your Web Browser to examine how these API calls are structured:

Purview Studio & Purview REST API talking to each other

Amongst many other things, we can leverage this capability to programmatically access the valuable metadata made available through Purview’s features – such as Asset Scans and Classifications, to enable downstream Data Engineering capabilities/processes (that would otherwise be extremely cumbersome to automate and implement in a scalable manner).

In simple words, the idea is:

Purview gives us intelligent metadata: “These are the files on your Data Lake and the underlying columns that contain sensitive data”

And in return, we can say to Synapse: “Go ahead and use Spark Pools to drop, obfuscate, anonymize or otherwise deal with those columns”

The first step to unlock this is to programmatically enable Synapse Spark Pools to retrieve the metadata from Purview (in this case, the schema and sensitivity labels). This is the first scenario we are going to implement in this post.

Accessing Purview Asset list, Asset Schema and Column level Classifications in a Spark Dataframe

Once we have the metadata available in Spark as a Dataframe, there’s a whole world of existing/up-and-coming Data Engineering capabilities we can leverage to allow us to take action on this metadata, such as dynamically obfuscating columns containing PII (using libraries such as Microsoft Presidio, which comes with pip install, and can also be deployed to a Kubernetes Cluster).

If you’ve never heard of Microsoft Presidio, check out this live demo website here – it’s brilliant.

Similarly, we can use similar techniques to create Custom Power BI dashboards on top of Purview’s Insights data – using Python in Power BI to populate a dataset and create a Custom Report on the same data:

Creating a Custom Dashboard using Data from Purview Insights

This is the second scenario we are going to implement in this post. Let’s get started.

Pre-Requisites & Environment Setup

Purview and Data Estate setup

Our data state will be composed of an Azure Data Lake storage account which in turns will be scanned by Purview to index and classify the data. To quickly populate a Data Lake and set up Purview scans, follow this five-part tutorial series that comes with a helpful starter kit (zip file) to generate a variety of random datasets with PII and non-PII data.

App Registration/Service Principal

Once Purview is set up to scan the Data Lake, the next step is to create a Service Principal that will allow us to access the REST API – steps here.

To localize the Python script to point to our Purview implementation, we make note of the:

clientId

tenantId

clientSecret

Libraries used

Here’s the contents of the requirements.txt file used to install libraries on the Synapse Spark Pool:

jmespath
pyapacheatlas

JMESPath: link

JMESPath is a nifty library that makes it a breeze to navigate (query, filter etc.) complex JSON data in Python. We use it to comfortably parse and iterate through the JSON payloads received from the Purview REST API.

PyApacheAtlas: link

PyApacheAtlas is a Python library in development by a fellow Microsoft CSA – Will Johnson. Essentially, the library presents a programmatic wrapper around the Atlas API, accessible through a Client object.

For an excellent deep-dive on the internals of the Purview REST API, check out this tutorial from him:

We use a combination of the Purview REST API (via JMESPath), and the Client object to achieve getting the full list of Assets from Purview, and iterating to populate the corresponding metadata per Asset.

Scenario 1: Column asset classifications

GitHub gist: full script

Here’s a summarized view of what we achieve:

Extracting metadata from Purview with Synapse Spark Pools using Python

Let’s look at the relevant components from the script.

The first function azuread_auth is straightforward and not Purview specific – it simply allows us to authenticate to Azure AD using our Service Principal and the Resource URL we want to navigate (in this case, Purview: https://purview.azure.net:(

def azuread_auth(tenant_id: str, client_id: str, client_secret: str, resource_url: str):
    """
    Authenticates Service Principal to the provided Resource URL, and returns the OAuth Access Token
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    payload= f'grant_type=client_credentials&client_id={client_id}&client_secret={client_secret}&resource={resource_url}'
    headers = {
    'Content-Type': 'application/x-www-form-urlencoded'
    }
    response = requests.request("POST", url, headers=headers, data=payload)
    access_token = json.loads(response.text)['access_token']
    return access_token

We’re going to be passing around the access_token returned above every time we make a call to Purview’s REST API.

Next, we leverage PyApacheAtlas to return a client using purview_auth:

def purview_auth(tenant_id: str, client_id: str, client_secret: str, data_catalog_name: str):
    """
    Authenticates to Atlas Endpoint and returns a client object
    """
    oauth = ServicePrincipalAuthentication(
        tenant_id = tenant_id,
        client_id = client_id,
        client_secret = client_secret
    )
    client = PurviewClient(
        account_name = data_catalog_name,
        authentication = oauth
    )
    return client

Once we have our proof of authentication (access_token and client) – we’re ready to programmatically access the Purview REST API.

We use get_all_adls_assets to recursively retrieve all scanned assets from our Data Lake from the Purview REST API.

Note: this function intentionally traverses the tree structure until only assets remain (i.e. no folders are returned, only files).

The function below applies the simple recursion techniques I outlined in this article against our Data Lake and Purview API to retrieve asset names and schemas.

While this is fine for exploration, due diligence (i.e. implementing a more optimal, piecemeal approach) should be applied for Production implementations on a case-by-case basis to avoid long-running jobs.

The API parameter used to determine whether we hit the end is isLeaf:

def get_all_adls_assets(path: str, data_catalog_name: str, azuread_access_token: str, max_depth=1):
    """
    Retrieves all scanned assets for the specified ADLS Storage Account Container.

    Note: this function intentionally recursively traverses until only assets remain (i.e. no folders are returned, only files).
    """

    # List all files in path
    url = f"https://{data_catalog_name}.catalog.purview.azure.com/api/browse"

    headers = {
            'Authorization': f'Bearer {azuread_access_token}',
            'Content-Type': 'application/json'
            }

    payload="""{"limit": 100,
                "offset": null,
                "path": "%s"
                }""" % (path)

    response = requests.request("POST", url, headers=headers, data=payload)

    li = json.loads(response.text)

    # Return all files
    for x in jmespath.search("value", li):
        if jmespath.search("isLeaf", x):
            yield x

    # If the max_depth has not been reached, start
    # listing files and folders in subdirectories
    if max_depth > 1:
        for x in jmespath.search("value", li):
            if jmespath.search("isLeaf", x):
                continue
            for y in get_all_adls_assets(jmespath.search("path", x), data_catalog_name, azuread_access_token, max_depth - 1):
                yield y

    # If max_depth has been reached,
    # return the folders
    else:
        for x in jmespath.search("value", li):
            if jmespath.search("!isLeaf", x):
                yield x

Note a couple points regarding this function:

We can further expand the implementation by abstracting away the data source and making source_type into a parameter (i.e. besides ADLS, we can query metadata about other sources supported on Purview – e.g. SQL DB, Cosmos DB etc.).

We’ll just need to deal with curating the payload on a case-by-case basis, but the basic premise remains the same.

Note the limit: 100 parameter is there because I didn’t want to deal with API Pagination logic (the demo Data Lake is small).

This parameter can be increased for larger implementations up until we hit the upper limit defined by the API – at which point we need to implement pagination best practices into our script logic (no different than other Azure/non-Azure APIs).

For deeper folder structures, max_depth can be increased as desired

Once we have a list of all our assets, we can iterate through the list and retrieve the Schema and Classification from Purview inline:

files_df['schema'] = files_df.apply(lambda row: get_adls_asset_schema(assets_all, row['asset'], atlas_client), axis=1)

Where we use the client object we defined earlier to call get_adls_asset_schema:

def get_adls_asset_schema(assets_all: list, asset: str, purview_client: str):
    """
    Returns the asset schema and classifications from Purview
    """
    # Filter response for our asset of interest
    assets_list = list(filter(lambda i: i['name'] == asset, assets_all))

    # Find the guid for the asset to retrieve the tabular_schema or attachedSchema (based on the asset type)
    match_id = ""
    for entry in assets_list:
        # Retrieve the asset definition from the Atlas Client
        response = purview_client.get_entity(entry['id'])

        # API response is different based on the asset
        if asset.split('.', 1)[-1] == "json":
            filtered_response = jmespath.search("entities[?source=='DataScan'].[relationshipAttributes.attachedSchema[0].guid]", response)
        else:
            filtered_response = jmespath.search("entities[?source=='DataScan'].[relationshipAttributes.tabular_schema.guid]", response)

        # Update match_id if source is DataScan
        if filtered_response:
            match_id = filtered_response[0][0]

    # Retrieve the schema based on the guid match
    response = purview_client.get_entity(match_id)
    asset_schema = jmespath.search("[referredEntities.*.[attributes.name, classifications[0].[typeName][0]]]", response)[0]

    return asset_schema

Note a couple takeaways from here:

JMESPath is awesome

The Atlas API response is slightly different based on the filetype (e.g. json vs csv), hence we deal with it case-by-case.

This makes sense, since json technically has attachedSchema (i.e. Schema that comes as a part of the object itself), whereas csv is of type tabular_schema (i.e. Schema that Purview had to infer)

Finally, once the functions are done calling the API, we can call a display(files_df) on our DataFrame to get back the final output:

Final Output

Note: files_df is a Pandas DataFrame, but we can easily convert to Spark with files_df = spark.createDataFrame(files_df).

Shouldn’t make a difference for our purposes since the DataFrame is small.

Scenario 2: Purview Insights

GitHub gist: full script

Our goal is to create this Power BI Report – which provides us with the same data that Purview Studio makes visually available to us. The idea here is for us to be able to leverage the ability of Power BI to create Custom Reports:

Demo Power BI Report generated from Purview Insights data

We simply specify the script as a Python Data source – where the script is structured such that it queries Purview’s APIs to produce Pandas Dataframes:

Using the Python Script as a Power BI Data Source

Note: We acknowledge that this method of data extraction is experimental in nature, and is definitely not suitable for ingesting a large amount of data into Power BI.

In our case, since the Insights data is pre-computed by the Purview Engine already, this serves the end goal of creating simple Custom Reports (i.e. our Python script doesn’t have to work very hard to extract this data).

Finally, we can refresh this Power BI report as needed, to ingest the latest data points from Purview:

Refreshing the Power BI dataset, which executes the underlying Python query

Wrap Up

We explored how to call the Purview REST API with Python to programmatically obtain Purview Asset Metadata – i.e. Schema and Classifications into Synapse as a DataFrame. We also looked at how we can apply the same techniques to ingest data from Purview Insights, and create custom Power BI dashboards with ease.

Use Spark (Scala) to write data from ADLS to Synapse Dedicated Pool

by Contributed | Mar 15, 2021 | Technology

This article is contributed. See the original author and article here.

In this article, I would be talking about how can we write data from ADLS to Azure Synapse dedicated pool using AAD . We will be looking at direct sample code that can help us achieve that.

1. First step would be to import the libraries for Synapse connector. This is an optional statement.

2. Next step is to initialize variable to create/read data frames

Note : Above step can also be written in below format :

//val df = spark.read.csv(“abfss://synapse@mukund.dfs.core.windows.net/100SalesRecords.csv”)

3. Next step would be to use the write api in below format :

Execute the cell and you will be able to see the new table with data popped up:

Observation in Driver log with this exercise:

We find external data source, file format and external table created as well as dropped during this automated process.

More information about other options for dedicated pool and server less related read/write API’s in SPARK can be found out on this page.

« Older Entries

Next Entries »

HPC Performance and Scalability Results with Azure HBv3 VMs

Community Update – Introducing the Microsoft Azure Data Community Advisory Board

The March 12th Weekly Roundup is Posted!

· Mar 16^th – Microsoft App Assure

Exploring Purview’s REST API with Python

Use Spark (Scala) to write data from ADLS to Synapse Dedicated Pool

Recent Posts

Recent Comments

Archives

Categories

Meta

We look forward to meeting you

HPC Performance and Scalability Results with Azure HBv3 VMs

Community Update – Introducing the Microsoft Azure Data Community Advisory Board

The March 12th Weekly Roundup is Posted!

· Mar 16th – Microsoft App Assure

Exploring Purview’s REST API with Python

Use Spark (Scala) to write data from ADLS to Synapse Dedicated Pool

Recent Posts

Recent Comments

Archives

Categories

Meta

We look forward to meeting you

· Mar 16^th – Microsoft App Assure