This article is contributed. See the original author and article here.
On June 1, 2023, Microsoft Defender for IoT moved to site-based licensing for organizations looking to protect their operational technology (OT) environments. The previous Azure consumption model for this solution will no longer be available for purchase by new customers. Existing customers can choose to transition to site-based licensing or remain on the consumption model.
In today’s digital transformation, operational technology (OT) has become an important part of various industries, from power plants and manufacturing facilities to transportation systems and healthcare institutions. While OT systems play an essential role in smoothly operating critical infrastructure, adversaries often target vulnerabilities in these interconnected systems causing severe business and operational disruption, financial losses, reputational damage, and more. Microsoft Defender for IoT helps organizations reduce these risks by enabling security teams to identify and remediate vulnerable OT systems in their environment – limiting exposure to threats like ransomware and targeted malware attacks.
To help organizations evolve their defenses against the growing attacks on OT environments, we are thrilled to announce site-based licensing for Defender for IoT. This new model brings increased price predictability and flexibility to organizations with sites that vary in size by offering a tiered approach based on the maximum number of OT devices to be protected per site. With this solution, organizations can easily determine and manage the cost of securing their OT systems. We believe that by introducing site-based licensing, we are making it more convenient than ever for organizations to empower security teams with the tools needed to manage and protect their operational technology.
Note: A site is a physical location (facility, campus, office building, hospital, rig, etc.).
How site-based licensing works
Organizations that want to secure their OT environments with Defender for IoT can now purchase annual licenses with standard pricing based on the maximum number of OT devices they wish to protect at each individual site. Prices are flat rates for each site size and are not prorated based on the number of devices. Site sizes are determined by the maximum number of devices per site.
Note: Defender for IoT site entitlement is licensed annually with standard pricing respective to each site tier.
For example, suppose an organization wants to secure all OT devices with Defender for IoT across three of its sites, where site one has 90 OT devices, site two has 700 devices, and site three has 25 devices. The organization would have to buy an Extra-Small license for site one, a Large license for site two, and another Extra-Small license for site three.
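The tier selection in the example above can be sketched as a small shell function. This is a minimal illustration, not an official pricing tool: the function name `site_tier` is hypothetical, and the device ceilings per tier (100, 250, 500, 1000, 5000) are assumptions. The text here only confirms that 25 and 90 devices map to Extra-Small, 700 devices to Large, and that sites above 5000 devices need a sales contact.

```shell
# Map an OT device count to a site tier.
# ASSUMPTION: the ceilings 100/250/500/1000/5000 are illustrative values;
# check the current Defender for IoT pricing page before relying on them.
site_tier() {
  # $1: maximum number of OT devices to protect at the site
  if   [ "$1" -le 100 ];  then echo "Extra-Small"
  elif [ "$1" -le 250 ];  then echo "Small"
  elif [ "$1" -le 500 ];  then echo "Medium"
  elif [ "$1" -le 1000 ]; then echo "Large"
  elif [ "$1" -le 5000 ]; then echo "Extra-Large"
  else echo "Contact sales"
  fi
}

# The three sites from the example: 90, 700, and 25 devices.
for devices in 90 700 25; do
  printf '%s devices -> %s\n' "$devices" "$(site_tier "$devices")"
done
```

Because prices are flat per tier, a site with 25 devices and a site with 90 devices land on the same license in this sketch.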
Note: For scenarios where an organization wants to secure over 5000 OT devices at a single site, we ask that they contact their Microsoft sales representative.
Let us know what you think
We are excited to provide organizations with a more convenient way to consume Defender for IoT in a manner that is flexible enough to accommodate varying site sizes, while also being predictably priced. If you have any feedback, please feel free to let us know in the comments below.
This update to Process Monitor, a utility for observing real-time file system, Registry, and process or thread activity, improves handling of incomplete Procmon Log files (.pml), and restores “Copy All” functionality in the Event Properties window.
Data security has become one of the most critical security issues companies face, exacerbated by outdated approaches to data security and a fragmented solution landscape that can be expensive, hard to manage—and often ineffective. Microsoft Purview provides a comprehensive and holistic data security solution that helps customers secure their data, across clouds, apps and devices, by focusing on three key areas: discovering and protecting data, managing insider risk, and preventing data loss. When used together, customers can benefit from a cloud-based solution that helps secure all their data, on-premises or in the cloud, in emails, and in apps. These campaigns provide engaging content and insights to customers on achieving integrated data security that helps them protect data, manage insider risk, and prevent data loss, all while improving efficiency and saving costs.
Launch either of these partner-ready campaigns and go to market quicker to drive customer engagement and leads for Microsoft Data Security solutions and your services.
We continue to expand the Azure Marketplace ecosystem. For this volume, 168 new offers successfully met the onboarding criteria and went live. See details of the new offers below:
Get it now in our marketplace
Aribot: AI-Based Automated Threat Modeling: Automated Threat Modeling from Aristiun B.V. employs AI to expose security threats in application environments. Developers can use it to automatically create traceable security requirements across the lifecycle and auto-map the requirements to compliance frameworks.
CIS Oracle Linux 9 Benchmark L1: This offer from the Center for Internet Security (CIS) provides an image of Oracle Linux 9 that’s hardened according to a CIS Benchmark. CIS Benchmarks are vendor-agnostic consensus-based security configuration guides.
CKAN Secured and Supported by HOSSTED: This offer from HOSSTED provides CKAN on a Microsoft Azure virtual machine. CKAN is an open-source data management system that powers hundreds of data portals worldwide. This installation includes a support package from HOSSTED.
Connected Care – Healthcare Workflow Automation Platform: Delivered through Microsoft Azure, Konica Minolta Connected Care securely processes protected health information from diverse input sources (such as faxes, emails, and scans) and converts it into structured data.
Credivera Exchange: Credivera Exchange is a workforce management and digital identity platform that optimizes personal privacy and trust through verifiable credentials secured in a digital wallet. Reduce risk, liability, and uncertainty with Credivera.
Databook Strategic Relationship Management Platform: Databook gives sales professionals access to data-driven insights, helping enterprise go-to-market teams develop strategic relationships with customers. Databook reveals why companies are ready to buy, which business outcomes they’re seeking, and when deals are most likely to close.
Delinea Secret Server (Privileged Access Management): Secret Server, part of Delinea’s privileged access management and endpoint security offerings, integrates with Microsoft Sentinel to give organizations deep insight into privileged account usage so they can meet compliance mandates and detect internal network threats.
DNS Fetcher: DNS Fetcher is an online tool that enables system administrators, network engineers, and others to quickly and easily check the DNS information and records for a given domain name.
Docker on Ubuntu 23.04: This offer from Ntegral provides Docker on Ubuntu 23.04. Docker is a platform that developers and system administrators use to build, run, and share applications with containers.
Encrypted Conversational Portals: DropVault’s encrypted conversational portals let you securely share conversations and documents with customers or employees. Use DropVault with Azure storage options or with your existing business storage.
Hazelcast Secured and Supported by HOSSTED: This offer from HOSSTED provides Hazelcast on a Microsoft Azure virtual machine. Hazelcast is a distributed computation and storage platform for low-latency querying and aggregation. This installation includes a support package from HOSSTED.
Intelligent Assistant: ChatBot for Microsoft Teams: Top365’s Smart Assistant chatbot for Microsoft Teams uses AI to answer the day-to-day questions at your company, building a knowledge base from employees’ most common queries. This offer is available only in Brazilian Portuguese.
iPerf3 Server on Ubuntu 20.04 LTS: This offer from Virtual Pulse S. R. O. provides iPerf3 on Ubuntu 20.04 LTS. iPerf3 is a tool for network performance measurement and tuning. For each test, it reports the measured throughput, loss, and other parameters.
mirro.ai – Mood Analyzer: Mood Analyzer from mirro.ai mines your sales calls or support calls in order to assess a speaker’s emotions, energy level, engagement, stress, and fatigue. Mood Analyzer can quickly process all your recordings and extract audio snippets to save you time and assess staff performance.
MySQL on Ubuntu 23.04: This offer from Ntegral provides MySQL on Ubuntu 23.04. MySQL is an open-source relational database designed for application development. Ntegral’s images are up-to-date, secure, and built to work right away.
NATS Secured and Supported by HOSSTED: This offer from HOSSTED provides NATS on a Microsoft Azure virtual machine. NATS is an open-source messaging system that lets applications securely communicate across cloud, edge, or on-premises locations. This installation includes a support package from HOSSTED.
NiCE Active 365 Management Pack: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that maps out your hybrid or cloud-only deployment of Microsoft 365. Get quick insights into licensing, users, health states, and more.
NiCE AIX Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that delivers monitoring of AIX systems in your IBM Power environment. Track availability, performance, security, and more.
NiCE Db2 Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that collects detailed data from Db2 instances without impacting performance. Monitor processes, files, databases, and more.
NiCE Domino Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that monitors your HCL Domino infrastructure and the state and performance of its components. Track server response time, availability, and bottlenecks.
NiCE Linux Power Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that enables centralized performance monitoring for Linux assets in your IBM Power environment. Ensure availability and enhance efficiency in your IT infrastructure.
NiCE MongoDB Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that enables centralized health and performance monitoring for your MongoDB environment. Spot anomalies and fix them before they escalate.
NiCE Oracle Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that monitors your Oracle database and storage infrastructure and reports server problems before they affect applications and end users.
NiCE PowerHA Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that facilitates the monitoring of IBM PowerHA technology. Get detailed data from your PowerHA environment using predefined event conditions and threshold monitors.
NiCE Veritas Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that facilitates the monitoring of Veritas Cluster Server objects. It delivers alerts, failover detection, and service group monitoring.
NiCE zLinux Management Pack for Microsoft SCOM: This offer from NiCE IT Management Solutions provides a pack for Microsoft System Center Operations Manager that will automatically discover all Linux computers on your IBM Z mainframe system. It will also monitor the health of disks, processors, and adapters.
Packer 1.3.4 with Support on Ubuntu 20.04 LTS: Packer users can define and enforce infrastructure configurations using HashiCorp Configuration Language (HCL). Its simple syntax lets DevOps teams easily provision and re-provision infrastructure across multiple cloud and on-premises data centers.
Red Hat JBoss Enterprise Application Platform: No matter the type of environment, Red Hat JBoss EAP can help deliver apps faster. It provides simplified deployment and full Jakarta EE performance and features a modular architecture that starts services only as they’re required.
RISKGRID: RISKGRID is a cloud-based platform that enhances the risk assessment process through visual grids. Inherent risks, controls, and residual risks are clearly represented, and all changes have a full audit history. Track and measure progress with built-in analytics and build actionable plans.
Safe365 – Intelligent Health and Safety: Get an instant snapshot of how your business ranks in its health and safety maturity with this suite of tools from Safe365. You can then plot a roadmap to health and safety excellence with clear, actionable steps centered around the Safe365 Safety Index.
SFTP – OpenSSH FTP Server on Ubuntu 18.04 LTS: Secure File Transfer Protocol (SFTP) transfers encrypted files using the main control connection. This provides a single, efficient, secure connection passing data through the firewall, resulting in faster data transfer speeds.
SFTP – OpenSSH FTP Server on Ubuntu 20.04 LTS: This offer from Virtual Pulse S.R.O. uses SSH2 protocol encryption standards to provide a single, efficient, and secure connection. It thwarts any hacking attempts while files are being transferred and significantly boosts data transfer speeds.
Smartstore Commerce Cloud: Extend your ecommerce business model with this open-source solution that can create multilingual, multicurrency shops, enable SEO-optimized, comprehensive product catalogues, and support fast and precise searches of products and categories.
Squid Proxy Server with Authentication on Ubuntu 18.04 LTS: This offer from Virtual Pulse S. R. O. provides Squid with Ubuntu 18.04 LTS. It contains built-in variable environments with back-end authentication functions so you can restrict employees from accessing prohibited entertainment sites during working hours.
Squid Proxy Server with Authentication on Ubuntu 20.04 LTS: This offer from Virtual Pulse S. R. O. provides Squid with Ubuntu 20.04 LTS. Squid routes content requests in ways that build cache server hierarchies and optimize network throughput.
Squid Easy on Ubuntu 22.04 Minimal: This offer from Virtual Pulse S. R. O. provides Squid with Ubuntu 22.04 Minimal. Squid lets you cache your web content and improves response times while reducing network bandwidth usage.
Swoop Broker Portal: This finance matching and deal flow management portal gives brokers whole-of-market access to funders as well as grants and equity investors which allows them to close more deals in less time. Cut your business costs in one fell swoop.
Terraform 1.4.5 on Ubuntu 20.04 LTS: Terraform is a free, downloadable command line tool for providing infrastructure on any cloud provider and handling configuration, plugins, and state. Use it to specify on-premises and cloud resources in human-readable configuration files to use, version, and share.
Tyk API Gateway: Tyk API Gateway is a secure, open-source gateway for APIs and microservices. It supports REST, SOAP, GraphQL, and Kubernetes to make it easy to migrate to the cloud, adopt microservices, create new products, and grow your customer base.
Ubuntu 23.04 Minimal: This minimal Ubuntu server maintained by Cloud Infrastructure Services is designed for automated deployment at scale. It has a greatly reduced default package set so it’s smaller, boots faster, and requires fewer security updates over time.
VUSION Cloud: VUSION Cloud is a retail platform merchants and brands can use for electronic shelf label management and monitoring to increase in-store efficiency. Its resilient, elastic architecture provides high availability and on-demand capacity for provable management and updates, no matter the amount.
Zetaris – Fluid Data Vault: Through the Zetaris platform, the Fluid Data Vault toolkit enables integration of big data and data streaming sources. Use it to go directly from source systems to the data vault without having to replicate raw data across multiple storage layers.
Go further with workshops, proofs of concept, and implementations
ACP Azure Landing Zone Deployment: ACP IT Solutions will prepare a Microsoft Azure landing zone to act as the foundation of your Azure environment, managing the applications and services that are migrated. This offer is only available in Austria, Germany, and Switzerland.
AFRY Operational Data Layer: AFRY Operational Data Layer provides a solution to help retailers increase sales and free time for IT resources while providing real-time access to crucial business data through Microsoft Azure. It creates a single source of truth for price, inventory, and product data.
AI Discovery: 3-Hour Virtual Workshop: TEKenable’s artificial intelligence accelerator helps organizations solve business challenges using Microsoft Azure AI services. This discovery workshop includes an overview of AI capabilities and real-life examples of how AI can be used.
Audax Labs Product Engineering Services: Audax Labs provides innovative product engineering services for businesses, creating cutting-edge products using Microsoft Azure services like AI, IoT, Data Analytics, and Mixed Reality.
Azure Active Directory for B2C: 3-Day Workshop: Direct Experts will provide a thorough introduction to the capabilities of Microsoft Azure Active Directory for B2C, provide hands-on training, and deploy and configure it in your environment according to best practices.
Azure Active Directory: 3-Day Workshop: Direct Experts will provide a thorough introduction to the capabilities of Microsoft Azure Active Directory, provide hands-on training, and deploy and configure it in your environment according to best practices.
Azure Arc: 3-Day Workshop: Direct Experts will provide a thorough introduction to the capabilities of Microsoft Azure Arc, provide hands-on training, and deploy and configure it in your environment according to best practices.
Azure Automation and Infrastructure as Code: 3-Day Workshop: Direct Experts will provide a thorough introduction to the infrastructure as code capabilities of Microsoft Azure Automation, provide hands-on training, and deploy and configure it in your environment according to best practices.
Azure Back Up: 3-Day Workshop: Direct Experts will provide a thorough introduction to the capabilities of Microsoft Azure Backup, provide hands-on training, and deploy and configure it in your environment according to best practices.
Azure Migration with Pegasus One: Pegasus One provides end-to-end migration services of entire workloads from on-premises and other platforms to Microsoft Azure, conducting IT portfolio diagnostics to identify applications to be migrated and providing the overall total cost of ownership of migration.
Azure DevOps: 3-Day Workshop: Direct Experts will introduce you to the capabilities of Azure DevOps and then provide hands-on training to initiate your DevOps journey. Participants will also learn how to implement and manage advanced domains like Azure Active Directory and Azure Networking.
Azure File Sync Services: 3-Day Workshop: Azure File Sync is a service that allows you to cache several Azure file shares on an on-premises Windows Server or cloud VM. Direct Experts will help you explore the capabilities of Azure Files and Azure File Sync and show you how to configure them for your environment.
Azure Firewall: 3-Day Workshop: Azure Firewall is a managed, cloud-based network security service that protects your Azure Virtual Network resources. Direct Experts will help you explore the capabilities of Azure Firewall and show you how to configure it to your environment.
Azure IaaS Migration: 5-Day Proof of Concept: The experts from Elite Technology Solutions will review your current environment and create a roadmap to ensure a smooth transition of your IaaS services to Microsoft Azure. Full documentation and guided knowledge transfer will also be provided.
Azure Innovation Jumpstart – 6-Week Engagement: Neudesic will identify and accelerate digital innovation opportunities using Microsoft Azure, Microsoft Power Platform, and Azure data services. You’ll receive a transformation blueprint for Azure services directly supporting your stated business goals.
Azure Key Vault: 3-Day Workshop: In this hands-on workshop, Direct Experts will show you how to use Azure Key Vault so you can create and maintain keys that access and encrypt your cloud resources, apps, and solutions. Learn how you can enhance data protection and compliance.
Azure Kubernetes Service (AKS): 3-Day Workshop: Direct Experts will guide you to develop and deploy cloud-native apps in Microsoft Azure, datacenters, or at the edge with built-in code-to-cloud pipelines and guardrails. Get unified management and governance for on-premises, edge, and multi-cloud Kubernetes clusters.
Azure Landing Zone Foundation: 4-Week Deployment: Converge will map its processes to the Microsoft Cloud Adoption Framework and help design and create a customized landing zone environment on Microsoft Azure that aligns with your organization’s goals, compliance requirements, and scalability needs.
Azure Network Virtual Appliance (NVA): 3-Day Workshop: Direct Experts will provide an overview of how Azure Network Virtual Appliance (NVA) is used in Azure applications to enhance high availability. Participants will also learn how to implement, manage, and create a secure network boundary.
Azure OpenAI Services – Envisioning Workshop: Accelerate your generative AI knowledge with WinWire’s team of AI experts. This workshop introduces Microsoft Azure OpenAI Services and best practices to identify use cases that can help your business deliver maximum impact and ROI.
Azure OpenAI: Hands-on Training Sessions: MaibornWolff will offer a series of workshops to help you explore and leverage the power of Microsoft Azure OpenAI and its technologies like ChatGPT. You will learn how to develop generative AI applications, drive growth, improve efficiency, and gain a competitive edge.
Azure Sentinel: 3-Day Workshop: Direct Experts will introduce the core services offered by Microsoft Sentinel and will deploy it using best practices suited to your environment and business needs. Uncover sophisticated threats and respond decisively with this intelligent security information and event management solution.
Azure Virtual Desktop: 3-Day Workshop: Receive hands-on guidance from Direct Experts as you deepen your understanding of Azure Virtual Desktop and how it can be securely scaled and adapted to suit your remote work environment, budget, and business needs.
Azure VMware: 3-Node AV36 Cluster / 100 VMs – Implementation Services: Performance Technologies S.A. will implement Microsoft Azure VMware Solution so you can seamlessly migrate, extend, and run VMware workloads on Azure. This brings scalability and facilitates hybrid-cloud strategies.
Coforge FinOps – Cloud Financial Management: Focusing on cost visibility, cost control, and cost governance, Coforge will implement its FinOps service to enhance visibility and intelligence in your cloud platform. You’ll be able to lower costs and optimize resources with the right governance in place.
Coforge’s Customer 360 Solution: 6-Week Implementation: Using Microsoft Power BI and Azure services, Coforge will implement an accelerator for its Customer 360 Solution, which derives insights from multiple customer channels. Track customer satisfaction, predict churn, and manage client expectations with the help of Coforge.
Datacenter Migration to Azure: Agic Cloud will utilize Microsoft Azure and Microsoft 365 to guide and empower your IT department to manage and govern your datacenter migration process based on a defined and planned strategy. Reduce costs while increasing the performance, availability, and security of your workloads.
Disaster Recovery on Azure: 3-Day Workshop: Direct Experts will provide hands-on training and help you configure disaster recovery solutions on Azure tailored to your specific environment. Workshop participants will come away with enhanced knowledge and troubleshooting skills based on best practices.
Enterprise Integration Accelerator: 7-Week Implementation: Insight will extend your Azure landing zone and address key security, governance, cost control, and operational requirements through workshops, a knowledge transfer, infrastructure as code, and a DevOps-ready implementation framework.
ExSight: Advanced Data Analytics for Real-time Monitoring on Azure: Exist Software will implement its ExSight data analytics tool, which provides real-time monitoring of Azure systems and gives you instant visibility into any issues or anomalies. Dashboards can be customized so that they display the metrics that matter most to you.
Generative AI Adoption Framework: 4-Week Implementation: Through a workshop and a proof of concept, ENCAMINA will enable you to unleash the potential of generative AI. The framework will be aligned with your company’s business objectives and strategies. This service is available only in Spanish.
Generative AI eXplore: 2-Month Proof of Concept: In this proof of concept, iCubed will show your organization how generative AI works, explore its potential applications, and create a prototype with Microsoft Azure OpenAI Service to solve your business challenges.
Marketing Strategy and Campaign Data Analysis: 8-Week Implementation: Are you looking to boost your online presence and attract more customers? 54cuatro’s service, available only in Spanish, will create a custom analytics platform using Microsoft Azure Synapse that concentrates information in a data lake. This will improve audience targeting.
Microsoft Azure OpenAI Service: 1-Day Workshop: Discover the opportunities of generative AI, Azure OpenAI Service, Microsoft 365 Copilot, and GitHub’s Codex in this workshop from onepoint. You’ll explore use cases and leave with an accelerator kit. This offer is available only in French.
Oracle to Azure SQL: Database Migration: Mazzy Technologies will migrate your Oracle schema to PostgreSQL or Microsoft Azure SQL Database. Mazzy Technologies specializes in Azure migrations and application modernization for large enterprises and government agencies around the world.
SalzPoint Supply Chain Framework on Azure: 2-Week Implementation: SAVIC Technologies will implement SalzPoint, a supply chain management and sales team management software framework that sits on Microsoft Azure and provides real-time data for your business. The front-end interface can be customized using Microsoft Power Automate and Microsoft Power BI.
Introduction
Today there is a lot of interest in generative AI, specifically in training and inferencing large language models (e.g. OpenAI GPT-4, DALL·E 2, GitHub Copilot, Azure OpenAI Service). Training these large language models requires a lot of floating-point performance and high interconnect network bandwidth. The Azure NDm_v4 virtual machine is an ideal choice for these demanding jobs because it has 8 A100 GPUs, each with 200 Gbps of HDR InfiniBand. Kubernetes is a popular choice to deploy and manage containerized workloads on compute/GPU resources. The Azure Kubernetes Service (AKS) simplifies Kubernetes cluster deployments. We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce benchmark job is executed on the NDm_v4 AKS cluster to verify that it is deployed and configured correctly.
Procedure to deploy a NDmv4 (A100) AKS Cluster
We will deploy an AKS cluster from the Azure cloud shell using the Azure command line interface (azcli). The Azure cloud shell has azcli preinstalled, but if you prefer to work from your local workstation, instructions to install azcli are here.
Note: There are many other ways to deploy an AKS cluster (e.g. Azure Portal, ARM template, Bicep, and Terraform are also popular choices).
First we need to install the aks-preview azcli extension, to be able to deploy AKS and control AKS via azcli.
az extension add --name aks-preview
It is also necessary to register InfiniBand support, to make sure all nodes in your pool can communicate over the same InfiniBand network.
az feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService
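Feature registration is asynchronous, so it can help to confirm the state before creating node pools. The following is a minimal sketch, assuming the standard `az feature show` and `az provider register` commands; the function names and the `POLL_SECONDS` variable are hypothetical helpers, not part of the original procedure.

```shell
# Succeeds (exit 0) once the AKSInfinibandSupport feature reports "Registered".
infiniband_feature_registered() {
  state=$(az feature show \
    --name AKSInfinibandSupport \
    --namespace Microsoft.ContainerService \
    --query properties.state --output tsv)
  [ "$state" = "Registered" ]
}

# Poll until the feature is registered, then refresh the resource provider
# so the registration is picked up by the subscription.
wait_for_infiniband_feature() {
  until infiniband_feature_registered; do
    sleep "${POLL_SECONDS:-30}"   # hypothetical knob; 30 s is an arbitrary default
  done
  az provider register --namespace Microsoft.ContainerService
}
```

Calling `wait_for_infiniband_feature` after the `az feature register` command blocks until the state changes; registration may take several minutes.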
Create a resource group for the AKS cluster.
az group create --resource-group <resource-group> --location <location>
For simplicity we will use the default kubenet networking (you could also deploy AKS using CNI and choose your own VNET). In the kubenet case, AKS will deploy the VNET and subnet. A system-assigned managed identity will be used for authentication. Ubuntu is chosen for the HostOS (the default AKS version deployed was 1.25.6 and the default Ubuntu HostOS is Ubuntu 22.04).
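az aks create -g <resource-group> --node-resource-group <node-resource-group> -n <cluster-name> --enable-managed-identity --node-count 2 --generate-ssh-keys -l <location> --node-vm-size Standard_D2s_v3 --nodepool-name <system-pool-name> --os-sku Ubuntu --attach-acr <acr-name>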
Then deploy the NDmv4 AKS pool. (Initially only one NDmv4 VM, later we will scale up the AKS cluster).
Note: Make sure you have sufficient NDmv4 quota in your subscription/location.
A specific tag (SkipGPUDriverInstall=true) needs to be set to prevent the GPU driver from being installed automatically (we will use the NVIDIA GPU operator to install the GPU driver instead). Some container images can be quite large, so we use a larger OS disk size (128 GB).
az aks nodepool add --resource-group <resource-group> --cluster-name <cluster-name> --name <gpu-pool-name> --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 --node-osdisk-size 128 --os-sku Ubuntu --tags SkipGPUDriverInstall=true
Get credentials to connect and interact with the AKS Cluster.
az aks get-credentials --overwrite-existing --resource-group <resource-group> --name <cluster-name>
Check that the AKS pools are ready.
kubectl get nodes
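The readiness check can also be scripted rather than read by eye. This is a small sketch, assuming the default `kubectl get nodes` column layout in which the second column is STATUS; `ready_node_count` is a hypothetical helper name.

```shell
# Print the number of nodes whose STATUS column is exactly "Ready".
# Nodes reported as "NotReady" or "Ready,SchedulingDisabled" are not counted.
ready_node_count() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

With the two system nodes and the single NDmv4 node deployed above, `ready_node_count` would be expected to print 3 once all pools are up.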
Install the NVIDIA network and GPU operators. They will be used to install specific GPU and InfiniBand drivers (in this case OFED 5.8-1.0.1.1.2 and GPU driver 525.60.13).
RUN apt update
RUN apt-get -y install build-essential
RUN apt-get -y install infiniband-diags
RUN apt-get -y install openssh-server
RUN apt-get -y install kmod
COPY nccl-tests.sh .
RUN ./nccl-tests.sh
COPY ndv4-topo.xml .
Login to your Azure container registry, where your custom container will be stored.
az acr login -n <acr-name>
Build your container locally on an NDmv4 VM. First change to the directory containing your Dockerfile.
docker build -t <acr-name>.azurecr.io/<image-name> .
Push your local container to your Azure container registry.
docker push <acr-name>.azurecr.io/<image-name>
Run NCCL allreduce benchmark on NDmv4 AKS Cluster
The NVIDIA NCCL collective communication tests are ideal to verify that the NDmv4 AKS cluster is set up correctly for optimal performance. On 2 NDmv4 nodes (16 A100 GPUs), NCCL allreduce bus bandwidth should be ~186 GB/s.
We will use the docker container we created in the previous section and submit the NCCL allreduce benchmark using the Volcano scheduler.
Scale up the NDmv4 AKS cluster to 2 NDmv4 VMs (16 A100 GPUs).
az aks nodepool scale --resource-group <resource-group> --cluster-name <cluster-name> --name <gpu-pool-name> --node-count 2
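Once the benchmark job has finished, its log can be compared against the ~186 GB/s figure quoted above. The sketch below is an assumption-laden helper, not part of the original procedure: it assumes the usual nccl-tests table layout, in which data rows start with the message size in bytes and the second-to-last column is the in-place bus bandwidth in GB/s. The function names and the 180 GB/s threshold in the usage note are hypothetical.

```shell
# Read nccl-tests output on stdin and print the in-place bus bandwidth (GB/s)
# reported for the largest message size.
# ASSUMPTION: data rows begin with a numeric size and have at least 12 columns,
# with in-place busbw as the second-to-last column.
largest_size_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 12 {
         if ($1 + 0 >= max) { max = $1 + 0; bw = $(NF - 1) }
       }
       END { if (bw != "") print bw }'
}

# Succeeds when the measured bandwidth ($1) is at least the minimum ($2).
meets_bandwidth() {
  awk -v got="$1" -v min="$2" 'BEGIN { exit !(got + 0 >= min + 0) }'
}
```

For example, with the job output saved to a file, `meets_bandwidth "$(largest_size_busbw < nccl.log)" 180` would flag a cluster that falls well short of the expected value; the 180 GB/s cut-off is an arbitrary tolerance chosen for illustration.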
Correct deployment of NDmv4 Kubernetes pools using Azure Kubernetes Service is critical to get the expected performance. NCCL collective tests (e.g. allreduce) are excellent benchmarks to verify that the cluster is set up correctly and achieving the expected high performance of NDmv4 VMs.