Azure Marketplace new offers – Volume 162


This article is contributed. See the original author and article here.

We continue to expand the Azure Marketplace ecosystem. For this volume, 120 new offers successfully met the onboarding criteria and went live. See details of the new offers below:

Get it now in our marketplace



ActiveMQ: Apache ActiveMQ is a messaging and integration patterns server that lets multiple clients and servers communicate. This communication is managed with features such as clustering and the ability to use any database as a JMS persistence provider, in addition to virtual memory, cache, and journal persistence.



AlmaLinux 8 from OpenLogic by Perforce: This minimum profile AlmaLinux 8-based image, built by OpenLogic by Perforce, allows users to run the AlmaLinux image on Microsoft Azure, start an SSH server, and log in to customize the instance. This image includes 9x5 weekday email support delivered by a team of US-based Enterprise Linux experts.



BiiTrails Business: This Microsoft Azure-based BiiTrails blockchain service provides self-service traceability toolkits. The service is low cost, with pay-per-use SaaS pricing and offloaded IT costs, and easy to use, with permissioned access, a GUI template editor, and an admin dashboard.



CentOS 7 from OpenLogic by Perforce: This minimum profile CentOS 7-based image, built by OpenLogic by Perforce, allows users to run the CentOS 7 image on Microsoft Azure, start an SSH server, and log in to customize the instance.



CentOS 8 from OpenLogic by Perforce: This minimum profile CentOS 8-based image, built by OpenLogic by Perforce, allows users to run the CentOS 8 image on Microsoft Azure, start an SSH server, and log in to customize the instance.



CentOS Stream: This is a minimal CentOS Stream image, mainly used as a common base system on top of which other appliances can be built and tested. It contains just enough packages to run within Microsoft Azure, bring up an SSH server, and allow users to log in.



CentOS Stream Minimal: This is a minimal CentOS Stream image with an auto-extending root filesystem, mainly used as a common base system on top of which other appliances could be built and tested. Microsoft Azure Linux Agent, cloud-init, as well as the CentOS Stream security updates available at the release date are included.



EspoCRM: This solution is an open-source web application that enables users to view, enter, and analyze their organization’s relationships with customers as well as partners. It’s fast, highly configurable, and presents users with a web-based CRM platform.



Explorium External Data Platform: This solution enables organizations to automatically discover and use thousands of relevant data signals to improve predictions and machine learning model performance. It also allows for efficient integration of third-party data.



Exponent CMS: This Niles Partners image will configure and publish Exponent CMS, an open-source content management system written in PHP that helps you develop and easily manage dynamic websites without having to code pages or manage navigation.



Intelligent Document Processing: Hexaware IDP uses Microsoft Azure cognitive services for automatic classification and extraction of data from scanned PDF and handwritten documents, improving accuracy by 80 percent and lowering cost by 30 percent.



Kirby: This Niles Partners offer configures and publishes Kirby, a lightweight, file-based content management system. Kirby provides multilanguage and multistore functionality, so you won’t need plugins if you intend to set up your site for multiple countries or stores.



Mantis: Get the big picture on your team’s performance and improve workflow and efficiency by deploying this Niles Partners image of Mantis, an open-source issue-tracking system and project management solution, to Microsoft Azure.



MODX: MODX Revolution is a content management system and application framework rolled into one. Deploy it on Microsoft Azure via this Niles Partners image to gain peace of mind that your websites will be secure and easily maintained with unparalleled performance. 



Monica Server Ready with Support from Linnovate: Monica provides an easy-to-use interface to log everything you know about your contacts. Be a better friend, family member, and co-worker by having vital information like important dates or gift ideas at your fingertips.



Nagios: This Microsoft Azure image from Niles Partners allows you to configure and publish Nagios Core, a user-friendly network monitoring solution for Linux operating systems and distributions that includes service state, process state, operating system metrics, and more.



OpenDocMan: Configure and publish OpenDocMan, an open-source document management system, to Microsoft Azure via this preconfigured image from Niles Partners. OpenDocMan supports multiple file types and offers a minimalist user interface.



osCommerce: This Microsoft Azure image from Niles Partners allows you to configure and publish osCommerce, an open-source e-commerce platform and online store management solution for any website or web server that has PHP and MySQL installed.



PASOS – Paian Azure Spend and Optimization Service: PAIAN IT Solutions will monitor your Microsoft Azure consumption and give you suggestions for improving your Azure environment, allowing you to focus on your business processes, procedures, and employees.



ProcessWire: Configure and publish ProcessWire, an open-source content management system, content management framework, and web application framework, via this preconfigured and ready-to-launch Microsoft Azure virtual image from Niles Partners.



RabbitMQ: This open-source messaging broker system from Niles Partners is used for distributing notifications of change events. It is lightweight and simple to deploy and provides applications with a common platform for sending and receiving messages on Microsoft Azure.



Rocky Linux 8 from OpenLogic by Perforce: Perforce provides this preconfigured image of Rocky Linux 8 from OpenLogic. This image enables users to run the Rocky Linux image on Microsoft Azure, start a Secure Shell (SSH) server, and customize the instance.



Ruby on Rails: Niles Partners configures Ruby on Rails, a popular open-source web application framework, embedded with Ubuntu as a ready-to-launch image on Microsoft Azure. You can use this framework to build database-backed web applications ranging from simple to complex.



Spree Commerce: Easily launch, maintain, and scale your online stores across various platforms using this robust e-commerce solution, embedded with Ubuntu as a ready-to-launch image on Microsoft Azure. Spree Commerce includes Linux, Java, and Ruby on Rails.



Strapi Accelerator: This premium image designed by Ntegral and optimized for production environments is an open-source headless content management system (CMS). Strapi Accelerator is an image based on Ubuntu 20.04.2 LTS, PostgreSQL 12, Nginx, PM2, and Strapi.



Syntheticus.ai-Synthetic Data Generator: This B2B SaaS solution by Syntheticus GmbH allows you to synthetically generate data for all your AI and machine learning models. This artificially generated yet anonymized data mimics the original data and strengthens your foundation of trust while mitigating privacy risks.



WildFly: Niles Partners provides this preconfigured, ready-to-launch virtual machine image of WildFly for Microsoft Azure. WildFly is a lightweight, flexible, and fast application server that provides a full Java EE 7 stack.



Go further with partner consulting services



App Modernization with AKS: 5-Day Implementation: Modernize your business applications with this fully managed Azure Kubernetes Service (AKS) offered by Abtis GmbH. This implementation, available only in German, will help provision your first Kubernetes cluster infrastructure and introduce you to DevOps tools.



Application Modernization: 4-Week Assessment: Cegeka’s four-week assessment serves as a starting point for defining your digital strategy and delivering an actionable IT roadmap and high-level application overview for modernizing your applications using Microsoft Azure.



Azure Data Platform: 2-Week Proof of Concept: Learn how to remove existing barriers between operational data, data warehouses, and analytics while gleaning actionable insights in this offer from Techedge. Their experts will deliver a proof of concept of an integrated data platform centered on Microsoft Azure Synapse Analytics and Power BI.



Azure Development Training: 1-Day Workshop: UPPER-LINK’s training workshop will help you identify options as you set out to modernize your existing applications based on your business requirements and develop the operational bricks for your project using Microsoft Azure services and tools. This offer is available only in French.



Azure DevOps Quickstart: 3-Day Proof of Concept: This offer from Ismile Technologies will help your organization understand core concepts of the Azure DevOps platform and streamline processes so you can learn to quickly launch your first app using Microsoft Azure.



Azure FastStart Service: 5-Day Implementation: Accelerate your journey to the cloud by collaborating with CANCOM consultants who will help deliver the base configuration of a Microsoft Azure environment to your organization. Set up subscriptions, identity, and security as part of this implementation.



Azure GlidePath for Governance Workshops: Sirius’ workshop is based on Microsoft’s Azure Cloud Adoption Framework (CAF) and is geared toward helping your organization simplify the complex task of creating a governance program by establishing a best-practice approach to security, governance, and cost control.


Azure Infrastructure & Data Assessment- 10-Day.png

Azure Infrastructure & Data: 10-Day Assessment: In this assessment, MineData will provide an end-to-end analysis of your organization’s entire IT estate along with a cost overview for migrating your virtual machines, storage, and databases to Microsoft Azure.



Azure Migration Readiness: 5- to 6-Week Assessment: Experts from Ismile Technologies will help you migrate to Microsoft Azure by first determining the cloud maturity of your company. An analysis of your current infrastructure and the creation of a migration roadmap based on key metrics and dependencies will be offered as a follow-up service.



Azure Migration: 5-Day Implementation: Is your organization looking to adopt a cloud strategy? Through this implementation, the experts at Abtis will migrate your ten most important Windows and SQL Server-based workloads to Microsoft Azure. This offer is available only in German.



Azure Service Advisory: 4-Week Assessment: In this offer, experts from Entelect will help you understand the myriad offerings and technologies that make up Microsoft Azure’s evolving landscape. You will get a custom plan outlining which solutions within the Azure framework best fit your business needs.



Azure Stack Hub: 4-Week Implementation: myCloudDoor consultants will help you build and run applications in an autonomous cloud that is completely or partially disconnected from the internet. Gain flexibility and control and easily transfer your app models between Microsoft Azure and Azure Stack Hub.



Azure Synapse Analytics: 2-Hour Workshop: Get a free education on the Microsoft Azure Synapse Analytics solution from the experts at Altron Karabina so you can identify where this solution can be utilized in your organization. This workshop will help you create a reliable data foundation for your business questions.



Azure Virtual Desktop Journey: 2-Hour Briefing: In this free briefing, ACP IT Solutions will discuss the benefits of Azure Virtual Desktop, cover infrastructure and costs, identify automation opportunities, and more. This offer is available only in German.



Azure Virtual Desktop: 1-Day Quick-Start Workshop: Appsphere’s consultants will show you how Microsoft Azure Virtual Desktop works and the advantages it offers to you and your company. A comprehensive proof of concept will be provided so you can enable a secure remote desktop experience from virtually anywhere.



Azure Virtual Desktop: 4-Day Implementation: IT sure GmbH will analyze your current environment and optimize it for Microsoft Azure Virtual Desktop, enabling you to manage demanding environments such as CAD workstations. This offer is available only in German.



Azure Virtual Desktop: 4-Week Implementation: As part of their managed service, the experts from Long View Systems will help design and deploy Microsoft Azure Virtual Desktop so your organization can close the skills gap and enable a secure remote desktop experience from virtually anywhere.


Azure-Driven ML & Data Science- 4-Week Proof of Concept.png

Azure-Driven ML & Data Science: 4-Week Proof of Concept: Arinti will help guide you through your Microsoft Azure AI journey in this four-week engagement. Deliverables include a data audit report, roadmap for future AI implementations, estimate to scale the proof of concept to a production environment, and more.



Cloud Data Migration: 2-Day Workshop: Available only in German, Saracus Consulting’s workshop will teach you how to successfully migrate your on-premises database to Microsoft Azure. Learn how you can benefit from the elasticity and agility of Azure services.



Cloud Readiness: 5-Day Assessment: Looking to migrate your applications to the cloud? Cegeka’s five-day Cloud Readiness assessment will help you create a strategic roadmap to ensure you accomplish a successful migration to Microsoft Azure.



CloudTrack Governance Journey: 3-Day Workshop: Atea’s CloudTrack Governance Journey includes three workshops utilizing Microsoft Azure best practices and Atea’s experience configuring Azure environments as a proven method for successfully moving your organization to the cloud.



Costs Optimization: 2-Week Assessment: Available only in Spanish, Orion 2000’s Cost Optimization assessment includes an analysis and evaluation of your organization’s Microsoft Azure environment to help you reduce unnecessary costs and identify potential savings.


Data & Analytics Strategy- 5-Day Assessment.png

Data & Analytics Strategy: 5-Day Assessment: Obungi experts will help you use your data as a driver of success by looking at the state of your system, identifying strengths and weaknesses, and jointly developing a target landscape and roadmap based on Microsoft Azure services.



Data Architecture: Half-Day Workshop: Learn how to more effectively use your data via Microsoft Azure and this free, individually tailored consultation. Experts from Zoi TechCon will work with you to look at integration, metadata, and governance best practices for your data architecture.



Data Platform Modernization: 10-Week Implementation: This consultation with Business Integration Partners will assess your infrastructure and applications, define future scenarios, and implement a solution for data platform modernization to accelerate your digital transformation.



Data Warehouse – Synapse Analytics: 1-Day Workshop: Learn from the experts at Obungi how a modern data warehouse based on Microsoft Azure Synapse Analytics can combine traditional data warehousing with big data and data science to uncover hidden insights and make informed decisions.



DataCenter Modernization 6-Week Implementation: Get help moving to Microsoft Azure with this consulting offer from IT Quest Solutions. You will receive a technical assessment, a cost analysis, and a migration plan detailing how to move your workloads to the cloud.



DevOps Consulting: 2-Week Assessment: This consultation with RCR will improve how you produce and operate applications in Microsoft Azure DevOps through the effective execution of processes, practices, and use of tools that automate the development cycle.



External Identity Access Management: 1-Day Briefing: Avaleris will provide high-level recommendations for an optimal path toward deployment of external identity solutions that will protect your organization from threats and ease onboarding for partners and customers.



Infrastructure Provisioning: 2-Week Assessment: LTTS will help you plan automated infrastructure provisioning using Terraform or Microsoft Azure Resource Manager (ARM) templates, reducing the time it takes you to provision cloud resources from weeks to minutes.



Intelligent Spaces: 4-Week Assessment: GlobalLogic will assess your office space management system and propose an improved or new solution based on Microsoft Azure, Dynamics 365, and Power BI that will enable people to be in a safer and more comfortable environment.



Introduction to Azure Purview: 1-Day Workshop: This consultation will use practical examples to show the functions and benefits of Microsoft Azure Purview unified data governance service for the different user groups in your company. This offer is only available in German.



Linux Migration: 5-Day Implementation: Get professional migration for your Linux and open-source database workloads. Abtis will safely and quickly migrate your ten most important Linux-based and open-source-database workloads to Microsoft Azure. This offer is only available in German.



Machine Learning: 1-Day Workshop: Learn about the functionality and advantages of machine learning with AppSphere’s offering. Get an overview of the terminology and basic statistical methods and create self-learning data sets with Microsoft Azure Machine Learning Studio.



Modern Workplace Jumpstart: 1-Week Workshop: AppSphere‘s consultants will develop an IT architecture/landscape that is heavily based on Microsoft cloud solutions like Office 365 and Azure services, to meet demands for mobility, collaboration, and communication.



Oracle Migration to Azure: 2-Hour Briefing: This offering from Dimension Data will introduce you to managing and optimizing your Oracle footprint and technology costs by migrating your Oracle workloads to Microsoft Azure (Oracle on Azure) or to PostgreSQL on Azure.



Oracle/PostgreSQL Migration: 6-Week Assessment: AKVELON will perform this migration to Microsoft Azure PostgreSQL database infrastructure with an option to use state-of-the-art server solutions. Additional services to migrate the application infrastructure are available.



Quick Azure Virtual Desktop: 4-Week Implementation: Ignite’s offering consists of a structured service that will allow a customer to perform a fast standard deployment of an environment of virtualized desktops and applications running in the Microsoft Azure cloud.



SAP on Azure: 2-Week Assessment: Reply AG will assess and plan an individual migration to Microsoft Azure based on customer needs, combined with its extensive experience with SAP systems. This offer brings together best practices and your individual requirements.



StorSimple: 4-Week Implementation: Extended support for Microsoft Azure StorSimple will cease in December 2022. SoftJam will analyze your StorSimple usage pattern and identify the best IaaS/PaaS/SaaS solution to replace it while granting the same level of reliability.



Veeam Cloud Backup for Microsoft 365: 4-Week Implementation: Sentia’s offering follows recommendations and best practices from the Microsoft Cloud Adoption Framework for Azure. It implements the concept of an Azure landing zone, providing the foundation for additional Azure workloads.



Win with Analytics – Azure Synapse: 2-Week Proof of Concept: Altron Karabina will help build “one analytic view” using Microsoft Azure Synapse Analytics that can impact your business now. Get a high-level analysis and a one-page statement of success document.



Contact our partners



ADiTaaS IT Service Management



Amdocs Service and Network Automation Solution



Arimac FinTech Suite



Automated Data Harmonization



Axway Managed File Transfer



Azure Sentinel from Atos



Azure Sentinel Managed Service



BAYZYEN



bonbon shop



BusinessNow



CallMiner Eureka



CarnaLife System



Clinicone Telemedicine



CODA Footprint Cloud Appliance



Codestone Managed Services Offering



CogEra



Data Leak Prevention: 4-Hour Workshop



DeviceOn/Kiosk+ with Anomaly Detection Service



DNS Guard



Document Locator



EDGYneer



eGovern File Share Migration to Microsoft 365



Enlabeler Annotate



ESET PROTECT



ESET PROTECT for MSP



Eugenie AI



Exterro Forensic Tool Kit Single Server



Exterro FTK Central 7.4 for Microsoft Azure



Exterro FTK Enterprise for Microsoft Azure



Exterro FTK Lab for Microsoft Azure



EY Modern Finance



FortiGate Next-Generation Firewall for Azure Stack



IDEMIA Smart Connect Consumer



IDEMIA Smart Connect M2M



IFS Cloud



Infosys Modernization Suite



INTAIN Structured Finance



Kofax SignDoc Cloud



Kofax TotalAgility



mcframe SIGNAL CHAIN



Multiple Choice Quiz Corrector API



NetApp Cloud Backup



Nuboj



ogamma Visual Logger for OPC IoT Edge Module



paydash



SmartOMS v1



SOH_Sintra Omnichannel Hub



SolarWinds SQL Sentry



Tripwire IP360 (Device Profiler Ev) 9.2.1



Tripwire IP360 (VnE Manager Ev) 9.2.1



Vector Center Perception Reality Engine



Meet a recent Microsoft Learn Student Ambassador graduate: Nandita Gaur


This article is contributed. See the original author and article here.

This is the next installment of our blog series highlighting Microsoft Learn Student Ambassadors who achieved the Gold milestone and have recently graduated from university. Each blog in the series features a different student and highlights their accomplishments, their experience with the Student Ambassadors community, and what they’re up to now.


 


Today we’d like to introduce Nandita Gaur, who is from India and recently graduated from ABES Engineering College with a degree in Computer Science and Engineering.


[Photo: Nandita Gaur]


 


Responses have been edited for clarity and length. 


 


When you joined the Student Ambassador community in 2019, did you have specific goals you wanted to reach, such as a particular skill or quality?  What were they?  Did you achieve them? How has the program impacted you in general? 


 


Microsoft has always been my dream organization, so when I got to know about this amazing program by Microsoft for students, I had to join. Before joining the community, my goals were oriented towards my personal growth. I wanted to enhance my resume by learning new skills, meet new people around the globe working on various technologies, and meet people at Microsoft. Now that I have graduated from this program, I have grown so much as a person. I have achieved all my goals. In fact, a lot more than that. I am a lot more confident in my public speaking skills than I was before. I have learned about Cloud Computing, Machine Learning, and Artificial Intelligence, gained knowledge about various Microsoft products, and met various impactful personalities around the globe.


 


The Student Ambassadors community has impacted me so much. My mindset has changed. I have realized that focusing only on personal growth is not going to help you much in life. It’s all about making an impact. It’s about how many people benefit from the work you do.


 


What are the accomplishments that you’re the proudest of and why?


 


I have conducted many events, thus impacting a lot of people, but one thing that I am truly proud of is winning the Azure Developer Stories contest, a blogging contest held in April 2020 wherein we had to document a project based on Machine Learning. I didn’t really know Machine Learning before this contest, but since it was announced during lockdown, I had all the time to study. So I referred to Microsoft Learn and, based on all the knowledge I gathered from it, made a project on COVID-19 Analysis using Python. I just couldn’t believe it when the results were announced. I was declared the winner among all the Student Ambassadors of India. This boosted my confidence a lot.


 


“Nothing is tough; all it takes is some dedication.”


 


I was reluctant to start Machine Learning then because it covers a lot of mathematics, something I had tried to avoid for as long as possible. I couldn’t find good resources to study Machine Learning online, and this contest by the Student Ambassador community introduced me to a well-structured Machine Learning course on Microsoft Learn. I had no reason to procrastinate. I had all the resources. I had to start learning.


 


I am really proud of all the learning I have gathered about Machine Learning and of fighting the habit of procrastination.


 


What do you have planned after graduation? 


 


I will be working as a Support Engineer with the Microsoft India CE&S team on Dynamics 365. I also plan to keep mentoring the students of my college so that they can achieve more than they think they can.


 


If you could redo your time as a Student Ambassador, is there anything you would have done differently?


 


I could have made many more connections. Although I made a lot of friends, I was reluctant in the beginning to talk to anyone. I didn’t speak much. If I had spoken more, I would probably have had the chance to be a speaker at Microsoft Build or Ignite.


 


If you were to describe the community to a student who is interested in joining, what would you say about it to convince him or her to join?


 


It is a wonderful opportunity that grooms your personality and helps you evolve as a person. You get to meet talented people across the globe, learn various technologies with them, and make strong connections that may help you in your career. You get to know what’s going on inside Microsoft and about the Microsoft mission, its culture and values, and you build a close connection with Microsoft employees who mentor you in making projects, provide valuable career tips, and also provide you with various speaking opportunities at international conferences. You will know what’s going on around the world in the field of technology and have a clearer picture of how technology can be used to create an impact in this world.


 


What advice would you give to new Student Ambassadors?


 


Push aside your inhibitions and start talking to people. Start discussions, involve yourself in conversations, and conduct useful events that may help the students of your local community at university.


 


Just organizing events is not helpful. You have to attend sessions too. All the speakers from Microsoft are immensely talented professionals who have interesting knowledge to offer that is going to help you at every point in your career. You have a lot to take and to offer. So take full advantage of the opportunities that the Student Ambassadors team is offering you.


 


Do you have a motto in life, a guiding principle that drives you?


 


During lockdown I was very demotivated. There was a lot of negativity in the environment. To top it off, placement season for post-graduation jobs was looming. I had lost my productivity because of all the chaos around me and felt like I was making no progress in life. Luckily I landed on the song “Hall of Fame” by the Irish band The Script. It is an inspirational song that says you can do anything you set your mind to as long as you believe in yourself and try. It motivated me to get up and start working, so I made this song my guiding principle.


 


What is one random fact few people know about you?


 


I am good at palmistry. My classmates in high school consulted me, showing me their hands to learn about their future, their personality, and what could be done to improve it. Even teachers did! I enjoyed this fame but eventually realized that it does not help with anything except creating unnecessary worry about the future. When I moved to college, I kept this skill a secret. Actually, I have given up this job completely, so please don’t consult me for this (LOL).


 


Good luck to you in your journey, Nandita!

DevOps Primer (Part 1)


This article is contributed. See the original author and article here.

Get started with DevOps. Guest post by Charlie Johnstone, Curriculum &amp; Quality Leader for Computing, Film &amp; TV at New College Lanarkshire and Microsoft Learn for Educators Ambassador.



What is DevOps?


DevOps enables better communication between developers, operations, quality, and security professionals in an organisation. It is not software or hardware, and it is not just a methodology; it is so much more! It brings together the people in your teams (both developers and ops people), your products, and your processes to deliver value to your end users.


 


This blog will focus on some of the tools and services used within Azure DevOps to build, test, and deploy your projects wherever you want to deploy them, whether on-premises or in the cloud.


 


This blog will be delivered in multiple parts. In this part, following a short primer, I will discuss part of the planning process using Azure Boards.


Plan:

 



In the plan phase, the DevOps teams will specify and detail what the application will do. They may use tools like Kanban boards and Scrum for this planning.


 


Develop:


This is fairly obvious: this phase is mainly focused on coding, testing, and reviewing. Automation is important here, in the form of automated testing and continuous integration (CI). In Azure, this would be done in a Dev/Test environment.


 


Deliver:


In this phase, the application is deployed to a production environment, including the application’s infrastructure requirements. At this stage, the applications should be made available to customers, and should be scalable.


 


Operate:


Once in the production environment, the applications need monitoring to ensure high availability; if issues are found, then maintenance and troubleshooting are necessary.



Each of these phases relies on the others and, to some degree, involves each of the aforementioned roles.


 


DevOps Practices


Continuous Integration (CI) & Continuous Delivery (CD)


Continuous Integration allows developers to merge code updates into the main code branch. Every time new code is committed, automated testing takes place to ensure the stability of the main branch, with bugs identified prior to merging.


 


Continuous Delivery automatically deploys new versions of your applications into your production environment.


 


Used together as CI/CD, you benefit from automation all the way from committing new code to its deployment in your production environment; this allows incremental changes in code to be deployed quickly and safely.


 


Tools for CI/CD include Azure Pipelines and GitHub Actions.
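
If you prefer scripting over the portal, the azure-devops extension for the Azure CLI can create a pipeline from a YAML definition. A minimal sketch, assuming an existing repository with an azure-pipelines.yml at its root (the organisation, project, and repository names are placeholders, and you should double-check the flag names against your CLI version):

# Install the Azure DevOps extension for the Azure CLI (one-time setup)
az extension add --name azure-devops

# Point the CLI at your organisation and project so later commands can omit them
az devops configure --defaults \
    organization=https://dev.azure.com/my-org project="BBlog2 Project"

# Create a CI pipeline from a YAML definition kept in the repository
az pipelines create --name "BBlog2-CI" \
    --repository "BBlog2 Project" --repository-type tfsgit \
    --branch main --yml-path azure-pipelines.yml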


 


Version Control


Version control systems track the history of changes to a project’s code. In an environment where multiple developers are collaborating on a project, version control is vital. Tools like Git let development teams collaborate on writing code, handle changes happening in the same files, deal with conflicts, and roll back to previous states where necessary.
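
As a concrete illustration of that workflow, here is a typical Git feature-branch sequence (the repository URL and branch names are just examples):

# Get the project and create an isolated branch for the new work
git clone https://dev.azure.com/my-org/BBlog2/_git/BBlog2
cd BBlog2
git checkout -b feature/login-page

# Commit changes on the branch, then bring in anything that landed on main
git add .
git commit -m "Add login page"
git fetch origin
git merge origin/main        # conflicts, if any, are resolved here

# Publish the branch for review and CI
git push -u origin feature/login-page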


 


Azure Boards


This is where you can begin to manage your projects by creating your work items. Azure Boards has native support for Kanban and Scrum as well as reporting tools and customisable dashboards, and is fully scalable.


 


We are going to use the Basic process for this walkthrough; other available process types are Agile, Scrum, and CMMI.
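
The walkthrough below uses the web UI, but the same work items can also be created and updated from the command line with the azure-devops CLI extension; a small sketch (the title, assignee, and work item ID are illustrative):

# Create a work item of type "Issue" (the work item type used by the Basic process)
az boards work-item create --title "Design landing page" --type Issue

# Assign it and move it from "To Do" to "Doing" (use the ID returned above)
az boards work-item update --id 1 \
    --assigned-to someone@example.com --state "Doing"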


 


To begin your project, go to https://dev.azure.com/ and sign in. The first task, if you don’t already have one, is to create a new organisation.


 



[Screenshot: creating a new organisation]


 


After selecting “Continue”, you will have the opportunity to name your organisation and choose the region where you want your project hosted.


 


[Screenshot: naming the organisation and choosing a region]


 


Following this step we will create our new project. For the purposes of this article, I’ve named it “BBlog2 Project”, made it private, selected Git for version control, and chosen to use the Basic process for work items.


 


[Screenshot: creating the new project]


 


The next step is to create the “Boards” you will be using.


 


[Screenshot: creating the board]


 


It is worth taking a look at the screens in the welcome dialog. Once you have done this, you will see a screen similar to the one below.


 


[Screenshot: the empty board]


 


This is where we will define our work items. I have created some simple items for demonstration purposes. Having created these items, the next screen shows how simple it is to change the status of an item.


 


[Screenshot: changing a work item's status]


 


Once your project is properly underway, it is very easy to change a work item’s state from To Do, to Doing, and finally to Done. This gives you a simple visual view of where your work items are. The next two screens show all my work items, in both the Boards and Work Items tabs, but there’s still work to be done here: as you’ll see, all items are currently unassigned and no schedules have been created.


 


[Screenshot: work items on the Boards tab]


 


[Screenshot: work items on the Work Items tab]


 


For the next screen I have set the dates for the project, using the default “Sprint 1” iteration name.


 


[Screenshot: setting the dates for Sprint 1]


 


Having done some tasks slightly out of order, my next task was to create my team; I would have been better off doing this earlier. To do this, I returned to “Project Settings” (bottom left of screen) and selected the “Teams” page below.


At this stage I was the only team member of the only team.


 


[Screenshot: the Teams page in Project Settings]


 


At this stage it’s a simple process to add team members by selecting “Add” on the right of the screen and searching your Azure AD for your desired team members.


 


On completion of this process you should see a fully populated team, as below (names and emails blurred for privacy reasons).


 


[Screenshot: the fully populated team]


 


At this point, if we return to our Boards tab and select a work item, you will see (highlighted) that the items are still unassigned; clicking this area will allow you to assign the task to a member of your team.


 


[Screenshot: assigning a work item]


 


The final screen below shows that all the Work Items have now been assigned. The team members will then start to work on the items and change the state from To Do to Doing. When a task is completed, it can then be updated to Done.


 


This is a very straightforward tool to use, and as this is a getting-started guide, I have only really touched the surface of it. The next item in this series will be on Pipelines.


 


My main source for this post has been an excellent new resource https://docs.microsoft.com/en-us/learn/modules/get-started-with-devops/. For more useful information on Azure DevOps services, another great resource is https://docs.microsoft.com/en-us/azure/devops/get-started/?view=azure-devops.


 


The reason for the focus on Azure Boards is that my team is embarking on a new journey: we are beginning to teach DevOps to our first-year students. Microsoft has provided great resources which are assisting us in this endeavour.


 


For students and faculty, Microsoft offers $100 of Azure credit per year on validation of your status as a student or educator; just follow the link here for Microsoft Azure for Students.


 


This is far from the only DevOps resource offered by Microsoft. For some more introductory information for educators wishing to become involved with DevOps, a great quick read is https://docs.microsoft.com/en-us/azure/education-hub/azure-dev-tools-teaching/about-program. This provides an introduction on how to get your students started with Azure and gives you and your students the opportunity to claim your free $100 in order to study Azure, download a wealth of free software, get free licences for Azure DevOps, and get started with how computing works now and in the future.


 


My team has only been working with Azure since the beginning of 2021, initially focusing on the Fundamentals courses AZ-900 (Azure Fundamentals) and AI-900 (Azure AI Fundamentals).


 


We are adding DP-900 (Azure Data Fundamentals) and SC-900 (Security, Compliance, and Identity Fundamentals) to the courses we offer to our first-year students.


 


Our second- and third-year students are being given the opportunity to move to role-based certifications through a pilot programme for AZ-104 (Microsoft Azure Administrator) to improve their employment prospects.


 


Our experience of these courses to date has been great; the students have been very engaged, with many taking multiple courses. Our industry contacts have also taken notice, with one large organisation offering our students a month’s placement in order to develop a talent stream.


 


My recommendation for how to approach the fundamentals courses is possibly slightly unusual. Although I think the most important courses for students to study are AZ-900, to learn about cloud computing in general and the tools and services within Azure, and DP-900, because data drives everything, I would start the students’ journey with AI-900. This is a great introduction to artificial intelligence services and tools in Azure which, like the other fundamentals courses, contains excellent labs for students to complete and does not require coding skills. The reason I recommend starting with AI-900 is that it provides a great “hook”: students love this course and on completion want more. This has made our job of engaging the students in the, arguably, more difficult courses quite straightforward.


 


If you don’t feel ready to teach complete courses or have a cohort for whom it wouldn’t be appropriate, Microsoft is happy for you to use their materials in a piecemeal manner; just pick out the parts you need. My team are going to do this with local schools; our plan is to give an introduction to all the fundamentals courses already mentioned over 10 hours.


 


To get fully involved and access additional great resources, sign up either as an individual educator or as an institution to the Microsoft Learn Educator Programme.


 



Education needs to move away from just developing software for PC and on-prem environments and embrace the cloud. Services such as Azure are not the future, they are NOW! It’s time to get on board or risk your graduates being irrelevant to the modern workplace.


 


 

Performance considerations for large scale deep learning training on Azure NDv4 (A100) series


This article is contributed. See the original author and article here.

Background


The field of Artificial Intelligence is being applied to more and more application areas, such as self-driving cars, natural language processing, visual recognition, fraud detection and many more.


A subset of artificial intelligence is deep learning (DL), which is used to develop some of the more sophisticated training models, using deep neural networks (DNNs) that try to mimic the human brain. Today, some of the largest DL training models can be used to do very complex and creative tasks like writing poetry, writing code, and understanding the context of text/speech.


 


 




 


These large DL models are possible because of advancements in DL algorithms (e.g. DeepSpeed), which maximize the efficiency of GPU memory management. Traditionally, DL models were highly parallel and floating-point intensive and so performed well on GPUs. The newer, more memory-efficient algorithms made it possible to run much larger DL models, but at the expense of significantly more inter-node communication operations, specifically allreduce and alltoall collective operations.


 


Modern DL training jobs require large clusters of multi-GPU nodes with high floating-point performance connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specifically for these types of workloads. ND96asr_v4 has 8 A100 GPUs connected via NVLink 3; each A100 has access to 200 Gbps HDR InfiniBand, so a total of 1.6 Tbps of inter-node communication is possible.


We will be focusing on HPC+AI clusters built with the ND96asr_v4 virtual machine type and provide specific performance optimization recommendations to get the best performance.


 




Deep Learning hardware and software stack


The deep learning hardware and software stack is much more complicated than traditional HPC. From the hardware perspective, CPU and GPU performance is important, especially floating-point performance and the speed at which data is moved from CPU (host) to GPU (device) or GPU (device) to GPU (device). There are many popular deep learning frameworks, e.g. PyTorch, TensorFlow, Caffe, and CNTK. NCCL is one of the popular collective communication libraries for Nvidia GPUs, and low-level mathematics operations depend on the CUDA tools and libraries. We will touch on many parts of this H/W and S/W stack in this post.


 


[Figure: Deep learning hardware and software stack]


 


How to deploy an HPC+AI Cluster (using NDv4)


In this section we discuss some deployment options.


 


Which image to use


It’s recommended that you start with one of the Azure Marketplace images that support NDv4. The advantage of using one of these Marketplace images is that the GPU driver, InfiniBand drivers, CUDA, NCCL, and MPI libraries (including the rdma_sharp_plugin) are pre-installed and should be fully functional after booting up the image.



  • ubuntu-hpc 18.04  (microsoft-dsvm:ubuntu-hpc:1804:latest)

    • Ubuntu is a popular Linux OS for DL, and the most testing on NDv4 was done with version 18.04.



  • Ubuntu-hpc 20.04 (microsoft-dsvm:ubuntu-hpc:2004:latest)

    • Popular image in DL community.



  • CentOS-HPC 7.9 (OpenLogic:CentOS-HPC:7_9-gen2:latest)

    • More popular in HPC, less popular in AI.

    • NOTE: By default, the NDv4 GPU NUMA topology is not correct; you need to apply this patch.




Another option, especially if you want to customize your image, is to build your own custom image. The best place to start is the azhpc-images GitHub repository, which contains all the scripts used to build the HPC marketplace images.


You can then use Packer or Azure Image Builder to build the image and an Azure Shared Image Gallery to store, use, share, and distribute images.


 


Deployment options


In this section we will explore some options to deploy an HPC+AI NDv4 cluster.



  • Nvidia Nephele

    • Nephele is an open-source GitHub repository; the primary developers are from Nvidia. It’s based on Terraform and Ansible. It also deploys a SLURM scheduler with container support, using enroot and pyxis.

    • It’s a good and proven benchmark environment.



  • AzureML

    • This is the Azure-preferred AI platform, an Azure ML service. It can deploy a cluster as code using Batch or AKS, upload your environment, create a container, and submit your job. You can monitor and review results using the Azure Machine Learning studio GUI.

    • It may give less control over specific tuning optimizations.



  • Azure CycleCloud

    • Is an Azure dynamic provisioning and VM autoscaling service that supports many traditional HPC schedulers like PBS, SLURM, LSF etc.

    • By default, containers are not supported; if you would like SLURM to support containers, you would need to manually integrate enroot and pyxis with CycleCloud+SLURM.

    • Currently, does not support Ubuntu 20.04.



  • AzureHPC

    • Is an open-source framework that can combine many different building blocks to create complex and customized deployments in Azure.

    • It’s designed as a flexible deployment environment for prototyping, testing, and benchmarking; it’s not designed for production.

    • Does not support ubuntu (only CentOS).



  • Azure HPC on-Demand Platform (az-hop)

    • This is designed to be a complete E2E HPC-as-a-service environment; it’s deployed using Terraform and Ansible and uses CycleCloud for its dynamic provisioning and autoscaling capabilities. It also supports Open OnDemand to provide a web interface to the HPC environment.

    • Currently, only supports PBS and does not have any container support.

    • Currently, supports CentOS-HPC based images (no Ubuntu).




 




 


NDv4 tuning considerations


In this section we will look at a couple of areas that should be carefully considered to make sure your large DL training job is running optimally on NDv4.


 


GPU tuning


Here is the procedure to set the GPUs to their maximum clock rates and then reset the clock rates after your job is completed. The procedure for GPU id 0 is shown; repeat it for all GPUs (a loop version is sketched below).


 


First, get the maximum graphics and memory clock frequencies:


max_graphics_freq=$(nvidia-smi -i 0 --query-gpu=clocks.max.graphics --format=csv,noheader,nounits)
max_memory_freq=$(nvidia-smi -i 0 --query-gpu=clocks.max.mem --format=csv,noheader,nounits)
echo "max_graphics_freq=$max_graphics_freq MHz, max_memory_freq=$max_memory_freq MHz"
max_graphics_freq=1410 MHz, max_memory_freq=1215 MHz

 


Then set the GPU's application clocks to the maximum memory and graphics clock frequencies:


sudo nvidia-smi -i 0 -ac ${max_memory_freq},${max_graphics_freq}
Applications clocks set to "(MEM 1215, SM 1410)" for GPU 00000001:00:00.0
All done.

 


Finally, when the job is finished, reset the application clocks:


sudo nvidia-smi -i 0 -rac
All done.
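
To avoid repeating these steps by hand for each GPU id, the same procedure can be wrapped in a small loop over all eight GPUs, along these lines:

#!/bin/bash
# Set every GPU on the VM to its maximum application clocks (run before the job)
for i in $(seq 0 7); do
    max_graphics_freq=$(nvidia-smi -i $i --query-gpu=clocks.max.graphics --format=csv,noheader,nounits)
    max_memory_freq=$(nvidia-smi -i $i --query-gpu=clocks.max.mem --format=csv,noheader,nounits)
    sudo nvidia-smi -i $i -ac ${max_memory_freq},${max_graphics_freq}
done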

 


NCCL tuning



  • It is recommended that you use NCCL version >= 2.9.9, especially at higher parallel scales.

    • export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH (or, if necessary, LD_PRELOAD=/path/to/libnccl.so)



  • Use a specific topology file for ND96asr_v4 and set its location.

    • You can get the ND96asr_v4 topology file here.

    • export NCCL_TOPO_FILE=/path/to/topology.txt



  • Using relaxed ordering for PCI operations is a key mechanism to get maximum performance when targeting memory attached to AMD 2nd Gen EPYC CPUs.

    • export NCCL_IB_PCI_RELAXED_ORDERING=1

    • export UCX_IB_PCI_RELAXED_ORDERING=on



  • This is needed to make sure the correct topology is recognized.

    • export CUDA_DEVICE_ORDER=PCI_BUS_ID



  • Use eth0 (front-end network interface) to start up processes but use ib0 for processes to communicate.

    • export NCCL_SOCKET_IFNAME=eth0



  • It’s recommended to print NCCL debug information to verify that the correct environment variables are set and the correct plugins are used (e.g. the RDMA SHARP plugin).

    • For Initial testing and verification, to check that parameters, environmental variables, and plugins are set correctly.

      • export NCCL_DEBUG=INFO



    • Set to WARNING once you have confidence in your environment.

      • export NCCL_DEBUG=WARNING





  • Enable the NCCL RDMA SHARP plugin; it has a big impact on performance and should always be enabled. There are a couple of ways to enable the plugin.

    • source hpcx-init.sh && hpcx_load

    • LD_LIBRARY_PATH=/path/to/plugin/{libnccl-net.so,libsharp*.so}:$LD_LIBRARY_PATH  (or LD_PRELOAD)

    • Note: SHARP is currently not enabled on ND96asr_v4.

    • Check NCCL_DEBUG=INFO output to verify its loaded.

      • x8a100-0000:60522:60522 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so





  • Lowering the message size threshold that determines whether messages are broken up to use adaptive routing may improve the performance of smaller message sizes.

    • export NCCL_IB_AR_THRESHOLD=0



  • Also consider NCCL_ALGO=Ring or Tree (as an experiment or for debugging; the defaults are generally good). A consolidated sketch of these environment variables follows this list.
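
Putting the recommendations above together, a job script might export something like the following before launching (the library and topology-file paths are placeholders for your installation):

# NCCL/UCX settings recommended above for ND96asr_v4 (adjust paths to your install)
export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH
export NCCL_TOPO_FILE=/path/to/topology.txt
export NCCL_IB_PCI_RELAXED_ORDERING=1
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO          # switch to WARNING once the environment is validated
export NCCL_IB_AR_THRESHOLD=0   # optional: may help smaller message sizes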


 


MPI considerations


When MPI is used with NCCL, MPI is primarily used just to start up the processes, and NCCL is used for efficient collective communication.


You can start processes by explicitly executing mpirun or via a scheduler's MPI integration (e.g. the SLURM srun command).


If you have flexibility on the choice of MPI library, then HPCX is the preferred MPI library due to its performance and features.


 


It is required to disable Mellanox hierarchical Collectives (HCOLL) when using MPI with NCCL.


mpirun --mca coll_hcoll_enable 0   (or: export OMPI_MCA_COLL_HCOLL_ENABLE=0)
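
For example, a complete HPCX mpirun invocation for a 2-node, 8-GPU-per-node job might look like the sketch below (train.py is a placeholder script; -x forwards each environment variable to all ranks, and an existing hostfile or scheduler allocation is assumed):

# 2 nodes x 8 GPUs = 16 ranks; HCOLL disabled, NCCL env vars forwarded
mpirun -np 16 --map-by ppr:8:node \
    --mca coll_hcoll_enable 0 \
    -x LD_LIBRARY_PATH -x NCCL_TOPO_FILE \
    -x NCCL_IB_PCI_RELAXED_ORDERING -x NCCL_DEBUG \
    python train.py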

 


Process pinning optimizations


The first step is to determine the correct CPU (NUMA) to GPU topology. To see where the GPUs are located, you can use

lstopo or nvidia-smi topo -m

to get this information, or use the check application pinning tool contained in the azurehpc GitHub repo (see experimental/check_app_pinning_tool).


 


./check_app_pinning.py

Virtual Machine (Standard_ND96asr_v4) Numa topology

NumaNode id   Core ids    GPU ids
===========   =========   ========
0             ['0-23']    [2, 3]
1             ['24-47']   [0, 1]
2             ['48-71']   [6, 7]
3             ['72-95']   [4, 5]


 


We can see that two GPUs are located in each NUMA domain and that the GPU id order is not 0,1,2,3,4,5,6,7 but 3,2,1,0,7,6,5,4. To make sure all GPUs are used and running optimally, we need to make sure that two processes are mapped correctly and running in each NUMA domain. There are several ways to force the correct GPU to CPU mapping. In SLURM we can map GPU ids 0,1 to NUMA 1, GPU ids 2,3 to NUMA 0, GPU ids 4,5 to NUMA 3, and GPU ids 6,7 to NUMA 2 with the following explicit mapping, using the SLURM srun command to launch processes.


 


srun --cpu-bind=mask_cpu:ffffff000000,ffffff000000,ffffff,ffffff,ffffff000000000000000000,ffffff000000000000000000,ffffff000000000000,ffffff000000000000
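
If you need to derive such masks yourself, each one is simply the set bits for a contiguous core range; a tiny helper using bc (for arithmetic beyond 64 bits) can generate them:

# Print the hex CPU mask covering cores $1..$2 inclusive
mask () {
    echo "obase=16; (2^($2 - $1 + 1) - 1) * 2^$1" | bc
}
mask 24 47   # NUMA 1 -> FFFFFF000000
mask 72 95   # NUMA 3 -> FFFFFF000000000000000000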

 


A similar GPU to CPU mapping is possible with HPCX MPI by setting the following environment variable and mpirun arguments:


export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
--map-by ppr:2:numa   (add :pe=N if running hybrid parallel, i.e. threads in addition to processes)

Then you can use the AzureHPC check_app_pinning.py tool as your job runs to verify that processes/threads are pinned optimally.


 


I/O tuning


Two aspects of I/O need to be addressed.



  1. Scratch storage
    • This type of storage needs to be fast (high throughput and low latency); the training job needs to read data, process it, and use this storage location as scratch space while the job runs.
    • In an ideal case you would use the local SSD on each VM directly. The NDv4 has a local SSD already mounted at /mnt (2.8 TB); it also has 8 NVMe SSD devices which, when configured and mounted (see below), provide ~7 TB of capacity.
    • If you need a shared filesystem for scratch, combining all the NVMe SSDs into a parallel filesystem may be a great option from a cost and performance perspective, assuming it has sufficient capacity. One way to do this is with BeeOND; if not, there are other storage options to explore (IaaS Lustre PFS, Azure ClusterStor, and Azure NetApp Files).

  2. Checkpoint storage
    • Large DL training jobs can run for weeks depending on how many VMs are used, and just like on any HPC cluster you can have failures (e.g. InfiniBand, memory DIMM, ECC errors in GPU memory, etc.). It’s critical to have a checkpointing strategy: know the checkpoint interval (i.e. when data is checkpointed) and how much data is transferred each time, and have a storage solution in place that can satisfy the capacity and performance requirements. If Blob storage can meet the performance requirements, it’s a great option.




 


How to set up and configure the NDv4 local NVMe SSDs


The ND96asr_v4 virtual machine contains 8 NVMe SSD devices. You can combine the 8 devices into a striped RAID 0 device that can then be used to create an XFS (or ext4) filesystem and mounted. The script below can be run on all NDv4 VMs with a parallel shell (e.g. pdsh) to create a ~7 TB local scratch space (/mnt/resource_nvme).


The resulting local scratch space has a read and write I/O throughput of ~8 GB/s.


 


 

#!/bin/bash
# Combine the 8 NVMe devices into a striped (RAID 0) array, create an XFS
# filesystem on it, and mount it as local scratch space
mkdir /mnt/resource_nvme
mdadm --create /dev/md128 --level 0 --raid-devices 8 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
mkfs.xfs /dev/md128
mount /dev/md128 /mnt/resource_nvme

 


 




 




Restricting data transfer to BLOB storage using the azcopy tool


The process described here is specific to azcopy, but the same principles can be applied to any of the language-specific SDKs (e.g. the BLOB API via Python).


In this example, let’s assume that we have a single Blob storage account with an ingress limit of 20 Gbps. At each checkpoint, 8 files (one per GPU) need to be copied to the Blob storage account, and each file will be transferred with its own azcopy instance. We choose a maximum transfer speed of 2,300 Mbps per azcopy (2300 × 8 = 18,400 Mbps, under the 20,000 Mbps limit) to avoid throttling. The ND96asr_v4 has 96 vCPUs, so we give each azcopy 10 cores; each instance of azcopy gets enough cores, and other processes still have some vCPUs available.


 


export AZCOPY_CONCURRENCY_VALUE=10
azcopy cp ./file "<blob_storage_account_container_URL>" --cap-mbps 2300
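
A sketch of what this looks like for the eight checkpoint files, assuming hypothetical file names and a SAS URL for the destination container:

export AZCOPY_CONCURRENCY_VALUE=10
# One azcopy per checkpoint file, each capped so 8 x 2300 Mbps stays under the
# 20 Gbps account ingress limit
for f in checkpoint_rank{0..7}.pt; do
    azcopy cp "$f" "https://<account>.blob.core.windows.net/<container>?<SAS>" --cap-mbps 2300 &
done
wait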

 


DeepSpeed and ONNX Runtime (ORT)


The performance of large-scale DL training models built with the PyTorch framework can be significantly improved by utilizing DeepSpeed and/or ONNX Runtime. It can be straightforward to enable DeepSpeed and ONNX Runtime: import a few extra modules and replace a few lines of code with wrapper functions. If using both DeepSpeed and ONNX Runtime, it's best practice to apply ONNX Runtime first and then DeepSpeed.
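
As an illustration, a DeepSpeed run is typically driven by a small JSON config plus its bundled launcher; a minimal sketch (train.py and the config values are placeholders, not tuned recommendations, and the script is assumed to parse DeepSpeed's standard --deepspeed arguments):

# Minimal DeepSpeed config: batch size, mixed precision, ZeRO stage 1
cat > ds_config.json <<'EOF'
{
  "train_batch_size": 256,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}
EOF

# Launch train.py on all 8 local GPUs via the DeepSpeed launcher
deepspeed --num_gpus 8 train.py --deepspeed --deepspeed_config ds_config.json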


 


HPC+AI NDv4 cluster health checks


Within Azure there is automated testing to help identify unhealthy VMs. Our testing processes and procedures continue to improve, but it is still possible for an unhealthy VM to slip past our testing and be deployed. Large DL training jobs typically require many VMs collaborating and communicating with each other to complete the DL job. The more VMs deployed, the greater the chance that one of them may be unhealthy, resulting in the DL job failing or underperforming. It is recommended that before starting a large-scale DL training job you run some health checks on your cluster to verify it's performing as expected.


 


Check GPU floating-point performance


Run High Performance Linpack (HPL) on each VM; it's convenient to use the version contained in the Nvidia hpc-benchmarks container. (Note: This is a non-optimized version of HPL, so the numbers reported are ~5-7% slower than the optimized container. It will give you good node-to-node variation numbers and identify a system that is having CPU, memory, or GPU issues.)


 


 

#!/bin/bash
#SBATCH -t 00:20:00
#SBATCH --ntasks-per-node=8
#SBATCH -o logs/%x_%j.log

CONT='nvcr.io#nvidia/hpc-benchmarks:20.10-hpl'
MOUNT="/nfs2/hpl/dats/hpl-${SLURM_JOB_NUM_NODES}N.dat:/workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-${SLURM_JOB_NUM_NODES}N.dat"
echo "Running on hosts: $(echo $(scontrol show hostname))"

export NCCL_DEBUG=INFO
export OMPI_MCA_pml=ucx
export OMPI_MCA_btl=^openib,smcuda

CMD="hpl.sh --cpu-affinity 24-35:36-47:0-11:12-23:72-83:84-95:48-59:60-71 --cpu-cores-per-rank 8 --gpu-affinity 0:1:2:3:4:5:6:7 --mem-affinity 1:1:0:0:3:3:2:2  --ucx-affinity ibP257p0s0:ibP258p0s0:ibP259p0s0:ibP260p0s0:ibP261p0s0:ibP262p0s0:ibP263p0s0:ibP264p0s0 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-${SLURM_JOB_NUM_NODES}N.dat"
srun --gpus-per-node=8 --container-image="${CONT}" --container-mounts="${MOUNT}" ${CMD}

 


You should see ~95 TFLOPS (double precision) on ND96asr_v4 (which has 8 A100 GPUs).


 


Check host to device and device to host transfer bandwidth


The CUDA bandwidthTest utility is a convenient way to verify that the host-to-GPU and GPU-to-host data bandwidth speeds are good. Below is an example testing gpu id = 0; you would run a similar test for the other 7 GPU ids (a loop version is sketched below), paying close attention to which NUMA domains they belong to.


numactl --cpunodebind=1 --membind=1 ./bandwidthTest --dtoh --htod --device=0

[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: A100-SXM4-40GB
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(GB/s)
  32000000                26.1

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(GB/s)
  32000000                25.0

Result = PASS


The expected host to device and device to host transfer speed is > 20 GB/s.
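
To cover the remaining GPUs, you can loop over all eight devices, binding each test to the NUMA node that the GPU hangs off; the mapping below is the one reported by check_app_pinning.py earlier:

# NUMA node for each GPU id 0..7 on ND96asr_v4: GPUs 0,1 -> NUMA 1; 2,3 -> 0; 4,5 -> 3; 6,7 -> 2
numas=(1 1 0 0 3 3 2 2)
for gpu in $(seq 0 7); do
    n=${numas[$gpu]}
    numactl --cpunodebind=$n --membind=$n ./bandwidthTest --dtoh --htod --device=$gpu
done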


This health check and many more detailed tests to diagnose unhealthy VMs can be found in the azhpc-diagnostics GitHub repository.


 


Check the InfiniBand network and NCCL performance


Running an NCCL allreduce and/or alltoall benchmark at the scale you plan on running your deep learning training job is a great way to identify problems with the InfiniBand inter-node network or with NCCL performance.


Here is a SLURM script to run an NCCL alltoall benchmark (note: it uses the SLURM container integration with enroot+pyxis to run the Nvidia PyTorch container):


 


 

#!/bin/bash
#SBATCH -t 00:20:00
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH -o logs/%x_%j.log
export UCX_IB_PCI_RELAXED_ORDERING=on \
       UCX_TLS=rc \
       NCCL_DEBUG=INFO \
       CUDA_DEVICE_ORDER=PCI_BUS_ID \
       NCCL_IB_PCI_RELAXED_ORDERING=1 \
       NCCL_TOPO_FILE=/workspace/nccl/nccl-topology.txt

CONT="nvcr.io#nvidia/pytorch:21.05-py3"
MOUNT="/nfs2/nccl:/workspace/nccl_284,/nfs2/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.1-0.6.6.0-ubuntu18.04-x86_64:/opt/hpcx,/nfs2/nccl_2.10.3-1/nccl:/workspace/nccl"

export OMPI_MCA_pml=ucx
export OMPI_MCA_btl=^openib
export OMPI_MCA_COLL_HCOLL_ENABLE=0

srun --ntasks=$SLURM_JOB_NUM_NODES --container-image "${CONT}" \
    --container-name=nccl \
    --container-mounts="${MOUNT}" \
    --ntasks-per-node=1 \
    bash -c "apt update && apt-get install -y infiniband-diags"

srun --gpus-per-node=8 \
    --ntasks-per-node=8 \
    --container-name=nccl \
    --container-mounts "${MOUNT}" \
    bash -c 'export LD_LIBRARY_PATH=/opt/hpcx/nccl_rdma_sharp_plugin/lib:/opt/hpcx/sharp/lib:/workspace/nccl/build/lib:$LD_LIBRARY_PATH && /workspace/nccl/nccl-tests/build/alltoall_perf -b8 -f 2 -g 1 -e 8G'

 


 


Then submit the above script on, for example, 4 ND96asr_v4 VMs:

sbatch -N 4 ./nccl.slrm

Similarly, for allreduce, just change the executable to all_reduce_perf.


The following plots show the expected NCCL allreduce and alltoall performance on ND96asr_v4.


 


[Plot: NCCL allreduce performance on ND96asr_v4]


[Plot: NCCL alltoall performance on ND96asr_v4]


 


Summary


Large-scale DL models are becoming very complex and sophisticated and are being applied to many application areas. The computational and network resources needed to train these large modern DL models can be quite substantial. The Azure NDv4 series is designed specifically for these large-scale DL computational, network, and I/O requirements.


Several key performance optimization tips and tricks were discussed to help you get the best possible performance running your large deep learning model on the Azure NDv4 series.


 


Credits


I would like to acknowledge the significant contribution of my colleagues at Microsoft to this post. Jithin Jose provided the NCCL scaling performance data and was the primary contributor to the NCCL, MPI, and GPU tuning parameters; he also helped review this document. I would also like to thank Kanchan Mehrotra and Jon Shelley for reviewing this document and providing outstanding feedback.