by Contributed | Aug 29, 2021 | Technology
This article is contributed. See the original author and article here.
Get started with DevOps: guest post by Charlie Johnstone, Curriculum & Quality Leader for Computing, Film & TV at New College Lanarkshire, and Microsoft Learn for Educators Ambassador
What is DevOps?
DevOps enables better communication between developers, operations, quality and security professionals in an organisation. It is not software or hardware, and it is not just a methodology; it is much more. It brings together the people in your teams (both developers and ops people), your products and your processes to deliver value to your end users.
This blog will focus on some of the tools and services within Azure DevOps used to build, test and deploy your projects wherever you want to deploy them, whether on-premises or in the cloud.
This blog will be delivered in multiple parts. In this part, following a short primer, I will discuss part of the planning process using Azure Boards.
Plan:
In the plan phase, the DevOps teams will specify and detail what the application will do. They may use tools like Kanban boards and Scrum for this planning.
Develop:
This phase is fairly self-explanatory: it is mainly focused on coding, testing and reviewing. Automation is important here, using automated testing and continuous integration (CI). In Azure, this would be done in a Dev/Test environment.
Deliver:
In this phase, the application is deployed to a production environment, including the application’s infrastructure requirements. At this stage, the applications should be made available to customers, and should be scalable.
Operate:
Once in the production environment, the applications need monitoring to ensure high availability; if issues are found, maintenance and troubleshooting are necessary.
Each of these phases relies on the others and, to some degree, involves each of the aforementioned roles.
DevOps Practices
Continuous Integration (CI) & Continuous Delivery (CD)
Continuous Integration allows developers to merge code updates into the main code branch. Every time new code is committed, automated testing takes place to ensure the stability of the main branch, with bugs identified prior to merging.
Continuous Delivery automatically deploys new versions of your applications into your production environment.
Used together as CI/CD, you benefit from automation all the way from committing new code to its deployment in your production environment; this allows incremental changes in code to be deployed quickly and safely.
Tools for CI/CD include Azure Pipelines and GitHub Actions.
Version Control
Version control systems track the history of changes to a project’s code. In an environment where multiple developers are collaborating on a project, version control is vital. Tools like Git allow development teams to collaborate on writing code, handle changes happening in the same files, deal with conflicts and roll back to previous states where necessary.
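As a simple illustration of what this workflow looks like with Git (the branch and file names here are hypothetical, not from a real project):
# Create a feature branch, commit a change, and merge it back into main.
git checkout -b feature/login-page    # branch off to work in isolation from main
git add src/login.html                # stage the new file
git commit -m "Add login page"        # record the change
git checkout main                     # switch back to the main branch
git merge feature/login-page          # merge; any conflicts are resolved here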
Azure Boards
This is where you can begin to manage your projects by creating your work items. Azure Boards has native support for Kanban and Scrum as well as reporting tools and customisable dashboards, and is fully scalable.
We are going to use the Basic process for this walkthrough; other available process types are Agile, Scrum and CMMI.
To begin your project, go to https://dev.azure.com/ and sign in. The first task, if you don’t already have one, is to create a new organization.

After selecting “Continue”, you will have the opportunity to name your organisation and choose the region where you want your project hosted.

Following this step we will create our new project. For the purposes of this article, I’ve named it “BBlog2 Project”, made it private, selected Git for version control and chosen the Basic process for work items.
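If you prefer working from a terminal, roughly the same project can be created with the Azure DevOps CLI extension. This is only a sketch; the organisation URL is a placeholder.
# Install the Azure DevOps extension and create a private Git project using the Basic process.
az extension add --name azure-devops
az devops project create --organization https://dev.azure.com/<your-organisation> --name "BBlog2 Project" --visibility private --source-control git --process Basic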

The next step is to create the “Boards” you will be using.

It is worth taking a look at the screens in the welcome dialog. Once you have done this, you will see a screen similar to the one below.

This is where we will define our work items. I have created some simple items for demonstration purposes. Having created these items, the next screen will show how simple it is to change the status of an item.

Once your project is properly underway, it is very easy to change a work item’s state from To Do, to Doing and finally to Done. This gives you a simple visual view of where your work items are. The next two screens show all my work items in both the Boards and Work Items tabs, but there is still work to be done here: as you’ll see, all items are currently unassigned and no schedules have been created.
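For reference, work items can also be created and moved between states from the Azure DevOps CLI (a sketch; the title and work item id below are hypothetical):
# Create a work item of type "Issue" (Basic process) and later move it to "Doing".
az boards work-item create --project "BBlog2 Project" --title "Create landing page" --type Issue
az boards work-item update --id 1 --state "Doing"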


For the next screen I have set the dates for the project, using the default “Sprint 1” Iteration name.

Having done some tasks slightly out of order, my next task was to create my team; I would have been better doing this earlier. To do this, I returned to “Project Settings” (bottom left of the screen) and selected the “Teams” page below.
At this stage, I was the only member of the only team.

At this stage it’s a simple process to add team members by selecting “Add” on the right of the screen and searching your Azure AD for your desired team members.
On completion of this process you should see a fully populated team as below (names and emails blurred for privacy reasons).

At this point, if we return to the Boards tab and select a work item, you will see (highlighted) that the items are still unassigned; clicking this area will allow you to assign the task to a member of your team.

The final screen below shows that all the work items have now been assigned. The team members will then start to work on the items and change the state from To Do to Doing. When a task is completed, it can then be updated to Done.
This is a very straightforward tool to use, and as a getting-started guide I have only really touched the surface of it. The next item in this series will be on Pipelines.
My main source for this post has been an excellent new resource https://docs.microsoft.com/en-us/learn/modules/get-started-with-devops/. For more useful information on Azure DevOps services, another great resource is https://docs.microsoft.com/en-us/azure/devops/get-started/?view=azure-devops.
The reason for the focus on Azure Boards is that my team is embarking on a new journey: we are beginning to teach DevOps to our first-year students. Microsoft has provided great resources which are assisting us in this endeavour.
For students and faculty, Microsoft offers $100 of Azure credit per year on validation of your status as a student or educator; just follow the link here for Microsoft Azure for Students.
This is far from the only DevOps resource offered by Microsoft. For some more introductory information for educators wishing to become involved with DevOps, a great quick read is https://docs.microsoft.com/en-us/azure/education-hub/azure-dev-tools-teaching/about-program. This provides an introduction on how to get your students started with Azure and gives you and your students the opportunity to claim your free $100 in order to study Azure, download a wealth of free software, get free licences for Azure DevOps and get started with how computing works now and in the future.
My team has only been working with Azure since the beginning of 2021, initially focusing on the fundamentals courses AZ-900 (Azure Fundamentals) and AI-900 (Azure AI Fundamentals).
We are adding DP-900 (Azure Data Fundamentals) and SC-900 (Security, Compliance, and Identity Fundamentals) to the courses we offer to our first year students.
Our second- and third-year students are being given the opportunity to move to role-based certifications through a pilot programme for AZ-104 (Microsoft Azure Administrator) to improve their employment prospects.
Our experience of these courses to date has been great; the students have been very engaged, with many taking multiple courses. Our industry contacts have also taken notice, with one large organisation offering our students a month’s placement in order to develop a talent stream.
My recommendation for how to approach the fundamentals courses is possibly slightly unusual. At this stage I think the most important courses for students are AZ-900, to learn about cloud computing in general and the tools and services within Azure, and DP-900, because data drives everything. However, I would start the students’ journey with AI-900. It is a great introduction to the artificial intelligence services and tools in Azure, and, like the other fundamentals courses, it contains excellent labs for students to complete and does not require coding skills. The reason I recommend starting with AI-900 is that it provides a great “hook”: students love this course and on completion want more. This has made our job of engaging the students in the, arguably, more difficult courses quite straightforward.
If you don’t feel ready to teach complete courses, or have a cohort for whom it wouldn’t be appropriate, Microsoft is happy for you to use their materials in a piecemeal manner; just pick out the parts you need. My team is going to do this with local schools: our plan is to give an introduction to all the fundamentals courses already mentioned over 10 hours.
To get fully involved and access additional great resources, sign up either as an individual educator or as an institution to the Microsoft Learn Educator Programme.
Education needs to move away from just developing software for PC and on-premises environments and embrace the cloud. Services such as Azure are not the future, they are NOW! It’s time to get on board or risk your graduates being irrelevant to the modern workplace.
by Contributed | Aug 28, 2021 | Technology
This article is contributed. See the original author and article here.
Background
The field of Artificial Intelligence is being applied to more and more application areas, such as self-driving cars, natural language processing, visual recognition, fraud detection and many more.
A subset of artificial intelligence is deep learning (DL), which is used to develop some of the more sophisticated training models, using deep neural networks (DNNs) that try to mimic the human brain. Today, some of the largest DL training models can be used to do very complex and creative tasks like writing poetry, writing code and understanding the context of text/speech.


These large DL models are possible because of advancements in DL algorithms (e.g. DeepSpeed), which maximize the efficiency of GPU memory management. Traditionally, DL models were highly parallel and floating-point intensive and so performed well on GPUs; the newer, more memory-efficient algorithms make it possible to run much larger DL models, but at the expense of significantly more inter-node communication, specifically allreduce and alltoall collective operations.
Modern DL training jobs require large clusters of multi-GPU nodes with high floating-point performance connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specifically for these types of workloads. ND96asr_v4 has 8 A100 GPUs connected via NVLink 3.0, and each A100 has access to 200 Gbps HDR InfiniBand, so a total of 1.6 Tbps of inter-node communication is possible.
We will be focusing on HPC+AI Clusters built with the ND96asr_v4 virtual machine type and providing specific performance optimization recommendations to get the best performance.

Deep Learning hardware and software stack
The deep learning hardware and software stack is much more complicated than in traditional HPC. From the hardware perspective, CPU and GPU performance is important, especially floating-point performance and the speed at which data is moved from CPU (host) to GPU (device) or GPU to GPU. There are many popular deep learning frameworks, e.g. PyTorch, TensorFlow, Caffe and CNTK. NCCL is one of the most popular collective communication libraries for Nvidia GPUs, and low-level mathematical operations depend on the CUDA tools and libraries. We will touch on many parts of this H/W and S/W stack in this post.

How to deploy an HPC+AI Cluster (using NDv4)
In this section we discuss some deployment options.
Which image to use
It’s recommended that you start with one of the Azure Marketplace images that support NDv4. The advantage of using one of these Marketplace images is the GPU driver, InfiniBand drivers, CUDA, NCCL and MPI libraries (including rdma_sharp_plugin) are pre-installed and should be fully functional after booting up the image.
- ubuntu-hpc 18.04 (microsoft-dsvm:ubuntu-hpc:1804:latest)
- Ubuntu is a popular DL Linux OS and most of the testing on NDv4 was done with version 18.04.
- Ubuntu-hpc 20.04 (microsoft-dsvm:ubuntu-hpc:2004:latest)
- Popular image in DL community.
- CentOS-HPC 7.9 (OpenLogic:CentOS-HPC:7_9-gen2:latest)
- More popular in HPC, less popular in AI.
- NOTE: By default the NDv4 GPU NUMA topology is not correct, you need to apply this patch.
Another option, especially if you want to customize your image, is to build your own custom image. The best place to start is the azhpc-images GitHub repository, which contains all the scripts used to build the HPC marketplace images.
You can then use Packer or Azure Image Builder to build the image and Azure Shared Image Gallery to store, use, share and distribute images.
Deployment options
In this section we will explore some options to deploy an HPC+AI NDv4 cluster.
- Nvidia Nephele
- Nephele is an open source GitHub repository, primarily developed by Nvidia. It’s based on Terraform and Ansible and also deploys a SLURM scheduler with container support, using enroot and pyxis.
- It’s a good and proven benchmark environment.
- AzureML
- Is the preferred Azure AI platform; it’s an Azure ML service. You can easily deploy a cluster as code using Batch or AKS, upload your environment, create a container, and submit your job. You can monitor and review results using the Azure Machine Learning studio GUI.
- May offer less control over specific tuning optimizations.
- Azure CycleCloud
- Is an Azure dynamic provisioning and VM autoscaling service that supports many traditional HPC schedulers like PBS, SLURM, LSF etc.
- By default, containers are not supported; if you would like SLURM to support containers, you would need to manually integrate enroot and pyxis with CycleCloud+SLURM.
- Currently, does not support Ubuntu 20.04.
- AzureHPC
- Is an open source framework that can combine many different building blocks to create complex and customized deployments in Azure.
- It’s designed as a flexible deployment environment for prototyping, testing and benchmarking; it’s not designed for production.
- Does not support ubuntu (only CentOS).
- Azure HPC on-Demand Platform (az-hop)
- Is designed to be a complete E2E HPC-as-a-service environment; it’s deployed using Terraform and Ansible and uses CycleCloud for its dynamic provisioning and autoscaling capabilities. It also supports OnDemand to provide a web interface to the HPC environment.
- Currently, only supports PBS and does not have any container support.
- Currently, supports CentOS-HPC based images (no Ubuntu).

NDv4 tuning considerations
In this section we will look at a couple of areas that should be carefully considered to make sure your large DL training job is running optimally on NDv4.
GPU tuning
Here is the procedure to set the GPUs to their maximum clock rates and then reset the clocks after your job has completed. The procedure is shown for GPU id 0; you need to repeat it for all GPUs.
First, get the maximum graphics and memory clock frequencies:
max_graphics_freq=$(nvidia-smi -i 0 --query-gpu=clocks.max.graphics --format=csv,noheader,nounits)
max_memory_freq=$(nvidia-smi -i 0 --query-gpu=clocks.max.mem --format=csv,noheader,nounits)
echo "max_graphics_freq=$max_graphics_freq MHz, max_memory_freq=$max_memory_freq MHz"
max_graphics_freq=1410 MHz, max_memory_freq=1215 MHz
Then set the GPU to the maximum memory and graphics clock frequencies.
sudo nvidia-smi -i 0 -ac $max_memory_freq,$max_graphics_freq
Applications clocks set to "(MEM 1215, SM 1410)" for GPU 00000001:00:00.0
All done.
Finally, when the job is finished, reset the application (graphics and memory) clocks.
sudo nvidia-smi -i 0 -rac
All done.
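Since the same steps are needed on every GPU, a small loop (a sketch, not part of the original procedure) can apply the maximum clocks to all 8 GPUs on the VM:
#!/bin/bash
# Query and apply the maximum memory and graphics clocks for each of the 8 GPUs.
for i in $(seq 0 7); do
  max_graphics_freq=$(nvidia-smi -i $i --query-gpu=clocks.max.graphics --format=csv,noheader,nounits)
  max_memory_freq=$(nvidia-smi -i $i --query-gpu=clocks.max.mem --format=csv,noheader,nounits)
  sudo nvidia-smi -i $i -ac ${max_memory_freq},${max_graphics_freq}
done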
NCCL tuning
- It is recommended that you use NCCL version >= 2.9.9, especially at higher parallel scales.
- export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH (or, if necessary, LD_PRELOAD=/path/to/libnccl.so)
- Use a specific topology file for ND96asr_v4 and set its location.
- You can get the ND96asr_v4 topology file here.
- export NCCL_TOPO_FILE=/path/to/topology.txt
- Using relaxed ordering for PCI operations is a key mechanism to get maximum performance when targeting memory attached to AMD 2nd Gen EPYC CPUs.
- export NCCL_IB_PCI_RELAXED_ORDERING=1
- export UCX_IB_PCI_RELAXED_ORDERING=on
- This is needed to make sure the correct topology is recognized.
- export CUDA_DEVICE_ORDER=PCI_BUS_ID
- Use eth0 (front-end network interface) to start up processes but use ib0 for processes to communicate.
- export NCCL_SOCKET_IFNAME=eth0
- It’s recommended to print NCCL debug information (NCCL_DEBUG=INFO) to verify that the correct environment variables are set and the correct plugins are used (e.g. the RDMA SHARP plugin).
- For initial testing and verification, this checks that parameters, environment variables and plugins are set correctly.
- Set it to WARN once you have confidence in your environment:
- export NCCL_DEBUG=WARN
- Enabling the NCCL RDMA SHARP plugin has a big impact on performance and it should always be enabled. There are a couple of ways to enable the plugin.
- source hpcx-init.sh && hpcx_load
- LD_LIBRARY_PATH=/path/to/plugin/{libnccl-net.so,libsharp*.so}:$LD_LIBRARY_PATH (or LD_PRELOAD)
- Note: SHARP is currently not enabled on ND96asr_v4.
- Check NCCL_DEBUG=INFO output to verify its loaded.
- x8a100-0000:60522:60522 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
- Lowering the message size threshold at which messages are broken up to use adaptive routing may improve performance for smaller message sizes.
- export NCCL_IB_AR_THRESHOLD=0
- Also consider experimenting with NCCL_ALGO=Ring or Tree (for experimentation/debugging; the defaults are generally good).
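Putting the recommendations above together, a job script might export something like the following (a sketch only; the library and topology file paths are placeholders, and SHARP itself is not currently enabled on ND96asr_v4):
# NCCL/UCX environment settings recommended above; adjust paths to your installation.
export LD_LIBRARY_PATH=/path/to/nccl/lib:/opt/hpcx/nccl_rdma_sharp_plugin/lib:$LD_LIBRARY_PATH
export NCCL_TOPO_FILE=/path/to/topology.txt
export NCCL_IB_PCI_RELAXED_ORDERING=1
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO            # switch to WARN once the environment is verified
export NCCL_IB_AR_THRESHOLD=0     # optional: may help smaller message sizes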
MPI considerations
When MPI is used with NCCL, MPI is primarily used just to start-up the processes and NCCL is used for efficient collective communication.
You can start processes by explicitly executing mpirun or via a Scheduler MPI integration (e.g SLURM srun command.).
If you have flexibility on the choice of MPI library, then HPCX is the preferred MPI library due to its performance and features.
It is required to disable Mellanox hierarchical Collectives (HCOLL) when using MPI with NCCL.
mpirun --mca coll_hcoll_enable 0, or export OMPI_MCA_COLL_HCOLL_ENABLE=0
Process pinning optimizations
The first step is to determine the correct CPU (NUMA)-to-GPU topology. To see where the GPUs are located, you can use
lstopo or nvidia-smi topo -m
to get this information, or use the check application pinning tool contained in the AzureHPC GitHub repo (see experimental/check_app_pinning_tool):
./check_app_pinning.py
Virtual Machine (Standard_ND96asr_v4) Numa topology
NumaNode id    Core ids    GPU ids
===========    ========    =======
0              0-23        [2, 3]
1              24-47       [0, 1]
2              48-71       [6, 7]
3              72-95       [4, 5]
We can see that 2 GPUs are located in each NUMA domain and that the GPU id order is not 0,1,2,3,4,5,6,7 but 2,3,0,1,6,7,4,5. To make sure all GPUs are used and running optimally, we need to make sure that 2 processes are mapped correctly and running in each NUMA domain. There are several ways to force the correct GPU-to-CPU mapping. In SLURM we can map GPU ids 0,1 to NUMA 1, GPU ids 2,3 to NUMA 0, GPU ids 4,5 to NUMA 3 and GPU ids 6,7 to NUMA 2 with the following explicit mapping, using the SLURM srun command to launch processes.
srun --cpu-bind=mask_cpu:ffffff000000,ffffff000000,ffffff,ffffff,ffffff000000000000000000,ffffff000000000000000000,ffffff000000000000,ffffff000000000000
A similar GPU-to-CPU mapping is possible with HPCX MPI by setting the following environment variable and mpirun arguments:
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
--map-by ppr:2:numa (add :pe=N if running hybrid parallel, i.e. threads in addition to processes)
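For example, a hypothetical two-node (16-rank) HPCX launch might look like the following; the hostfile and application name are placeholders:
# Two ranks per NUMA domain (8 per node); CUDA_VISIBLE_DEVICES reorders the GPUs
# so that rank placement lines up with the NUMA topology shown above.
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
mpirun -np 16 --hostfile hosts \
       --map-by ppr:2:numa --bind-to core --report-bindings \
       -x CUDA_VISIBLE_DEVICES \
       ./my_training_app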
Then you can use the AzureHPC check_app_pinning.py tool as your job runs to verify if processes/threads are pinned optimally.
I/O tuning
Two aspects of I/O need to be addressed.
- Scratch Storage
- This type of storage needs to be fast (high throughput and low latency); the training job needs to read data, process the data and use this storage location as scratch space as the job runs.
- In an ideal case you would use the local SSD on each VM directly. The NDv4 has a local SSD already mounted at /mnt (2.8 TB); it also has 8 NVMe SSD devices which, when configured and mounted (see below), provide ~7 TB of capacity.
- If you need a shared filesystem for scratch, combining all the NVMe SSDs into a parallel filesystem may be a great option from a cost and performance perspective, assuming it has sufficient capacity. One way to do this is with BeeOND; if not, there are other storage options to explore (IaaS Lustre PFS, Azure ClusterStor and Azure NetApp Files).
- Checkpoint Storage
- Large DL training jobs can run for weeks depending on how many VMs are used, and just like in any HPC cluster you can have failures (e.g. InfiniBand issues, memory DIMM failures, GPU memory ECC errors, etc.). It’s critical to have a checkpointing strategy: know the checkpoint interval (i.e. when data is checkpointed) and how much data is transferred each time, and have a storage solution in place that can satisfy those capacity and performance requirements. If Blob storage can meet the storage performance requirements, it’s a great option.
How to set-up and configure the NDv4 local NVMe SSD’s
The ND96asr_v4 virtual machine contains 8 NVMe SSD devices. You can combine the 8 devices into a striped RAID 0 device, which can then be used to create an XFS (or ext4) filesystem and mounted. The script below can be run on all NDv4 VMs with a parallel shell (e.g. pdsh) to create a ~7 TB local scratch space (/mnt/resource_nvme).
The resulting local scratch space has a read and write I/O throughput of ~8 GB/s.
#!/bin/bash
# Create a mount point, stripe the 8 NVMe devices into a RAID 0 array,
# build an XFS filesystem on it and mount it as local scratch space.
mkdir /mnt/resource_nvme
mdadm --create /dev/md128 --level 0 --raid-devices 8 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
mkfs.xfs /dev/md128
mount /dev/md128 /mnt/resource_nvme
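For example, assuming the script above is saved on a shared filesystem as setup_nvme.sh and pdsh can reach the nodes (the hostnames are placeholders), it could be run across the cluster like this:
# Run the NVMe RAID 0 setup on every node in parallel.
pdsh -w 'node[001-004]' 'sudo bash /shared/setup_nvme.sh'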


Restricting data transfer to BLOB storage using the azcopy tool
The process described here is specific to azcopy, but the same principles can be applied to any of the language-specific SDKs (e.g. the Blob API via Python).
In this example, let’s assume that we have a single Blob storage account with an ingress limit of 20 Gbps. At each checkpoint, 8 files (one per GPU) need to be copied to the Blob storage account, and each file will be transferred with its own azcopy instance. We choose a maximum transfer speed of 2300 Mbps per azcopy (2300 x 8 = 18400 Mbps < 20000 Mbps) to avoid throttling. The ND96asr_v4 has 96 vCPUs, so we choose that each azcopy can use 10 of them; each instance of azcopy then gets enough cores and other processes still have some vCPUs to spare.
export AZCOPY_CONCURRENCY_VALUE=10
azcopy cp ./file "blob_storage_acc_container" --cap-mbps 2300
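As a sketch of the full checkpoint upload (the file names, container URL and SAS token below are placeholders), the 8 copies can be launched in parallel with each one capped at 2300 Mbps:
# One azcopy per GPU checkpoint file; 8 x 2300 Mbps stays under the 20 Gbps ingress limit.
export AZCOPY_CONCURRENCY_VALUE=10
for i in $(seq 0 7); do
  azcopy cp "./checkpoint_gpu${i}.pt" \
    "https://<account>.blob.core.windows.net/<container>?<SAS-token>" \
    --cap-mbps 2300 &
done
wait   # wait for all 8 transfers to finish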
DeepSpeed and Onnx Runtime (ORT)
The performance of large-scale DL training models built with the PyTorch framework can be significantly improved by using DeepSpeed and/or ONNX Runtime (ORT). It can be straightforward to enable DeepSpeed and ONNX Runtime by importing a few extra modules and replacing a few lines of code with wrapper functions. If using both DeepSpeed and ONNX Runtime, it’s best practice to apply ONNX Runtime first and then DeepSpeed.
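As a rough illustration (not from the original post), a DeepSpeed training job is typically started with the deepspeed launcher; the script name, hostfile and config file below are hypothetical, and the flags your training script accepts will vary:
# Launch a hypothetical PyTorch training script with DeepSpeed across 4 nodes of 8 GPUs.
deepspeed --num_nodes=4 --num_gpus=8 --hostfile=hostfile \
          train.py --deepspeed --deepspeed_config ds_config.json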
HPC+AI NDv4 cluster health checks
Within Azure there is automated testing to help identify unhealthy VMs. Our testing processes and procedures continue to improve, but it is still possible for an unhealthy VM to go undetected by our testing and be deployed. Large DL training jobs typically require many VMs collaborating and communicating with each other to complete the job. The more VMs deployed, the greater the chance that one of them is unhealthy, resulting in the DL job failing or underperforming. It is recommended that, before starting a large-scale DL training job, you run some health checks on your cluster to verify it’s performing as expected.
Check GPU floating-point performance
Run High Performance Linpack (HPL) on each VM; it’s convenient to use the version contained in the Nvidia hpc-benchmarks container. (Note: this is a non-optimized version of HPL, so the numbers reported are ~5-7% slower than the optimized container, but it will give you good node-to-node variation numbers and identify a system that is having CPU, memory, or GPU issues.)
#!/bin/bash
#SBATCH -t 00:20:00
#SBATCH --ntasks-per-node=8
#SBATCH -o logs/%x_%j.log
CONT='nvcr.io#nvidia/hpc-benchmarks:20.10-hpl'
MOUNT="/nfs2/hpl/dats/hpl-${SLURM_JOB_NUM_NODES}N.dat:/workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-${SLURM_JOB_NUM_NODES}N.dat"
echo "Running on hosts: $(echo $(scontrol show hostname))"
export NCCL_DEBUG=INFO
export OMPI_MCA_pml=ucx
export OMPI_MCA_btl=^openib,smcuda
CMD="hpl.sh --cpu-affinity 24-35:36-47:0-11:12-23:72-83:84-95:48-59:60-71 --cpu-cores-per-rank 8 --gpu-affinity 0:1:2:3:4:5:6:7 --mem-affinity 1:1:0:0:3:3:2:2 --ucx-affinity ibP257p0s0:ibP258p0s0:ibP259p0s0:ibP260p0s0:ibP261p0s0:ibP262p0s0:ibP263p0s0:ibP264p0s0 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-${SLURM_JOB_NUM_NODES}N.dat"
srun --gpus-per-node=8 --container-image="${CONT}" --container-mounts="${MOUNT}" ${CMD}
You should see ~95 double-precision TFLOPS on ND96asr_v4 (which has 8 A100 GPUs).
Check host to device and device to host transfer bandwidth
The CUDA bandwidthTest is a convenient way to verify that host-to-GPU and GPU-to-host data transfer speeds are good. Below is an example testing gpu id = 0; you would run a similar test for the other 7 GPU ids, paying close attention to which NUMA domains they are in.
numactl --cpunodebind=1 --membind=1 ./bandwidthTest --dtoh --htod --device=0
[CUDA Bandwidth Test] – Starting…
Running on…
Device 0: A100-SXM4-40GB
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 26.1
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 25.0
Result = PASS
The expected host to device and device to host transfer speed is > 20 GB/s.
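To cover all 8 GPUs with their matching NUMA domains (using the topology shown earlier), a loop along these lines can be used (a sketch; the bandwidthTest path is a placeholder):
# GPU id -> NUMA node mapping for ND96asr_v4: GPUs 0,1 -> NUMA 1; 2,3 -> 0; 4,5 -> 3; 6,7 -> 2.
numa_of_gpu=(1 1 0 0 3 3 2 2)
for gpu in $(seq 0 7); do
  numa=${numa_of_gpu[$gpu]}
  numactl --cpunodebind=$numa --membind=$numa ./bandwidthTest --dtoh --htod --device=$gpu
done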
This health check and many more detailed tests to diagnose unhealthy VMs can be found in the azhpc-diagnostics Github repository.
Check the InfiniBand network and NCCL performance
Running a NCCL allreduce and/or alltoall benchmark at the scale you plan on running your deep learning training job is a great way to identify problems with the InfiniBand inter-node network or with NCCL performance.
Here is a SLURM script to run a NCCL alltoall benchmark (Note: using SLURM container integration with enroot+pyxis to use the Nvidia pytorch container.)
#!/bin/bash
#SBATCH -t 00:20:00
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH -o logs/%x_%j.log
export UCX_IB_PCI_RELAXED_ORDERING=on
export UCX_TLS=rc
export NCCL_DEBUG=INFO
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/workspace/nccl/nccl-topology.txt
CONT="nvcr.io#nvidia/pytorch:21.05-py3"
MOUNT="/nfs2/nccl:/workspace/nccl_284,/nfs2/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.1-0.6.6.0-ubuntu18.04-x86_64:/opt/hpcx,/nfs2/nccl_2.10.3-1/nccl:/workspace/nccl"
export OMPI_MCA_pml=ucx
export OMPI_MCA_btl=^openib
export OMPI_MCA_COLL_HCOLL_ENABLE=0
srun --ntasks=$SLURM_JOB_NUM_NODES --container-image "${CONT}" \
     --container-name=nccl \
     --container-mounts="${MOUNT}" \
     --ntasks-per-node=1 \
     bash -c "apt update && apt-get install -y infiniband-diags"
srun --gpus-per-node=8 \
     --ntasks-per-node=8 \
     --container-name=nccl \
     --container-mounts "${MOUNT}" \
     bash -c 'export LD_LIBRARY_PATH=/opt/hpcx/nccl_rdma_sharp_plugin/lib:/opt/hpcx/sharp/lib:/workspace/nccl/build/lib:$LD_LIBRARY_PATH && /workspace/nccl/nccl-tests/build/alltoall_perf -b8 -f 2 -g 1 -e 8G'
Then submit the above script on, for example, 4 ND96asr_v4 VMs:
sbatch -N 4 ./nccl.slrm
Similarly, for allreduce, just change the executable to be all_reduce_perf.
The following plots show the expected NCCL allreduce and alltoall performance on ND96asr_v4.


Summary
Large-scale DL models are becoming very complex and sophisticated and are being applied to many application areas. The computational and network resources required to train these large modern DL models can be quite substantial. The Azure NDv4 series is designed specifically for these large-scale DL computational, network and I/O requirements.
Several key performance optimization tips and tricks were discussed to help you get the best possible performance when running your large deep learning model on the Azure NDv4 series.
Credits
I would like to acknowledge the significant contribution of my colleagues at Microsoft to this post. Jithin Jose provided the NCCL scaling performance data and was the primary contributor to the NCCL, MPI and GPU tuning parameters; he also helped review this document. I would also like to thank Kanchan Mehrotra and Jon Shelley for reviewing this document and providing outstanding feedback.
by Scott Muniz | Aug 27, 2021 | Security, Technology
This article is contributed. See the original author and article here.
CISA is aware of a misconfiguration vulnerability in Microsoft’s Azure Cosmos DB that may have exposed customer data. Although the misconfiguration appears to have been fixed within the Azure cloud, CISA strongly encourages Azure Cosmos DB customers to roll and regenerate their certificate keys and to review Microsoft’s guidance on how to Secure access to data in Azure Cosmos DB.
by Contributed | Aug 27, 2021 | Technology
This article is contributed. See the original author and article here.
Integrated authentication provides a secure and easy way to connect to Azure SQL Database and SQL Managed Instance. It leverages hybrid identities that coexist both on traditional Active Directory on-premises and in Azure Active Directory.
At the time of writing, Azure SQL supports Azure Active Directory Integrated authentication with SQL Server Management Studio (SSMS) either by using credentials from a federated domain or via a managed domain that is configured for seamless single sign-on for pass-through and password hash authentication. More information: Configure Azure Active Directory authentication – Azure SQL Database & SQL Managed Instance & Azure Synapse Analytics | Microsoft Docs
We recently worked on an interesting case where our customer was getting the error “Integrated Windows authentication supported only in federation flow” when trying to use AAD Integrated authentication with SSMS.

They had recently migrated from using ADFS (Active Directory Federation Services) to SSSO with PTA (seamless single sign-on for pass-through authentication). To troubleshoot the issue, we performed the following checks.
Validating setup for SSSO for PTA
- Ensure you are using the latest version of Azure AD Connect
- Validate the Azure AD Connect status with the Azure portal https://aad.portal.azure.com
- Verify the below features are enabled
- Sync Status
- Seamless single sign-on
- Pass-through authentication

Testing Seamless single sign on works correctly using a web browser
Follow the steps here and navigate to https://myapps.microsoft.com. Be sure to either clear the browser cache or use a new private browsing session with any of the supported browsers.
If you successfully signed in without providing the password, you have tested that SSSO with PTA is working correctly.
Now the question is: why is the sign-in failing with SSMS?
For that, we turned to capturing a trace with Fiddler.
Collecting a Fiddler trace
The following link has a set of instructions on how to go about setting up Fiddler classic to collect a trace. Troubleshooting problems related to Azure AD authentication with Azure SQL DB and DW – Microsoft Tech Community
- Once Fiddler is ready, I recommend that you pre-filter the capture by process so that you only capture traffic originating from SSMS. That prevents capturing traffic that is unrelated to our troubleshooting.
- Clear the current session if there are any frames that were captured before setting the filter
- Reproduce the issue
- Stop the capture and save the file
When we reviewed the trace, we saw a few interesting things

We can only see a call to login.windows.net which is one of the endpoints that helps us use Azure Active Directory authentication.
For SSSO for PTA we would expect to see subsequent calls to https://autologon.microsoftazuread-sso.com which were not present in the trace.
This Azure AD URL should be present in the Intranet zone settings, and it is rolled out by a group policy object in the on premises Active Directory.
A key part of the investigation was finding that the client version is 1.0.x.x, as captured in the Request Headers. This indicates the client is using the legacy Active Directory Authentication Library (ADAL).

Why is SSMS using a legacy component?
The SSMS version on the developer machine was the latest one, so we needed to understand how the application was loading this library. For that we turned to Process Monitor (thanks, Mark Russinovich).
We found that SSMS queries a key in the registry to find what DLL to use to support the Azure Active Directory Integrated authentication.

Using the below PowerShell cmdlets, we were able to find the location of the library on the filesystem
Set-Location -Path HKLM:
Get-ItemProperty -Path SOFTWARE\WOW6432Node\Microsoft\MSADALSQL | Select-Object -Property TargetDir

Checking on the adalsql.dll details we confirmed this is the legacy library

As SSMS is a 32-bit application, it loads the DLL from the SysWOW64 location. If your application is 64-bit you may opt to check the registry key HKLM:\SOFTWARE\Microsoft\MSADALSQL
A clean install of the most recent version of SSMS points to a different DLL with the most up-to-date library.


In this case, the developer machine ended up having that registry location modified to point to the legacy client (adalsql.dll). As the newer DLL (adal.dll) was already installed on the system, the end user simply changed the registry entry to point to adal.dll.
It is important to be aware of this situation. Installing older versions of software like SSMS, SSDT (SQL Server Data Tools), Visual Studio etc. may end up modifying the registry key and pointing to the legacy ADAL client.
Cheers!
by Scott Muniz | Aug 27, 2021 | Security, Technology
This article is contributed. See the original author and article here.
The Federal Bureau of Investigation (FBI) has released a Flash report detailing indicators of compromise (IOCs) and tactics, techniques, and procedures (TTPs) associated with ransomware attacks by Hive, a likely Ransomware-as-a-Service organization consisting of a number of actors using multiple mechanisms to compromise business networks, exfiltrate data and encrypt data on the networks, and attempt to collect a ransom in exchange for access to the decryption software.
CISA encourages users and administrators to review the technical details, IOCs, and TTPs in FBI Flash MC-000150-MW and apply the recommended mitigations.
by Scott Muniz | Aug 27, 2021 | Security, Technology
This article is contributed. See the original author and article here.
The Industrial Control Systems Joint Working Group (ICSJWG) will hold the virtual 2021 ICSJWG Fall Meeting, September 21-22, 2021. ICSJWG meetings facilitate relationship building among critical infrastructure stakeholders and owners/operators of industrial control systems, idea exchange regarding critical issues affecting industrial control systems (ICS) cybersecurity, and information sharing to reduce the risk to the nation’s industrial control systems.
The ICSJWG bi-annual meeting will feature two full days of presentations, a Table-Top Exercise introductory session, technical workshop activities, and a CISA ICS Training overview. Register no later than September 17, 2021 to attend. Visit the ICSJWG website or the ICSJWG 2021 Fall Virtual Meeting website for more information.
by Contributed | Aug 26, 2021 | Dynamics 365, Microsoft 365, Technology
This article is contributed. See the original author and article here.
Lead routing, the process of distributing incoming leads among sales reps, can be simple or complex. If you have sophisticated requirements for making those lead assignments, review these tips for using the standard seller information and dynamic matching to streamline rule configuration in Sales Premium for Dynamics 365 Sales.
A simple approach to lead routing is to make a list of all your sales reps, and then assign each new lead to the next seller in sequence (round robin) or based on availability (load balancing). This can be achieved by using the assignment rule shown in the following screenshot. This configuration will assign all the leads that are part of the segment Leads from web to sellers in a round-robin way.

More sophisticated requirements can also be configured by using the lead assignment rules. The rules can identify the most appropriate seller based on the fields of incoming leads. Seller availability and capacity can also be considered in the rule. The following two options are available to select sellers:
- Use existing fields from seller records in Dynamics 365.
- Use seller attributes defined for assignment rules. More information: Manage seller attributes
Using the first option to manage lead distribution is simpler in terms of onboarding and management because it uses the seller information that already exists in Dynamics 365. The following example shows a rule to assign leads to sellers who are based out of Seattle:

Dynamic matching eliminates manual rulesetting
Dynamic matching reduces the effort of having to write and maintain multiple static rules for each permutation and combination of values. Suppose we want to distribute leads based on country. For example, we can have leads from the United States assigned to any seller focused on US clients and leads from India assigned to any seller focused on India.
If we try to create static rules for each assignment by country, we’d need as many rules as countries. If the organization is serving 150 countries, we might need to create 150 static rules. If we wanted to use more attributes, such as Zip Code and currency, the rule count would multiply exponentially. A simpler approach is to use the dynamic match capability. In the scenario described above, where leads are assigned to sellers based on country, we can have a single rule as follows:

Here, the rule will check the country of the lead (Lead.Country) and match it with seller’s country. When there is an exact match, the lead is assigned to the appropriate seller.
Note: Country as used here is a global option set defined for both lead and system user entities. You could instead use the lookup mechanism to find a match. In this way, you can use any string-type field (a single line of text). However, the string-type lead field will need a logical name on the right side, as shown in the screenshot below:

ZIP/Postal Code as selected for the first field, on the left side, refers to a system user field that needs to match with the lead field called address_1_postalcode, which is the logical name for the field ZIP/Postal Code in Lead.
Bulk import of user fields for lead assignment
Once the admin identifies the relevant lead and user fields for lead assignment, the challenge can be to populate the user fields that will be used in lead routing. This can be done by the sales manager using a CSV file. The file is reviewed, and then imported in bulk as shown in How to import data.
Rule management
Here’s a scenario in which an organization would like to route leads based on a few parameters: country (option set), ZIP Code (string), and currency (lookup). If all three parameters match with a seller, the lead should be routed to that seller; otherwise it should try to match on country only. In this scenario, we can have two assignment rules in the following evaluation order (rules get evaluated in the order specified):
- The first rule matches on all three parameters (country, Zip Code, and currency).
- The second rule matches based on country.

Configuration for the rule to match country, Zip Code, and currency:

Configuration for the rule that finds a match only based on country:

Note that these scenarios simply depend on the user information that is entered into the system whenever a new seller is added. The new seller can begin to get leads right away, as long as all the routing-relevant fields are populated as part of onboarding.
Next steps
For more information on how to manage assignment rules for lead routing, review the documentation.
The post Route leads with dynamic assignment rules appeared first on Microsoft Dynamics 365 Blog.
by Contributed | Aug 26, 2021 | Technology
This article is contributed. See the original author and article here.
In May 2021, the Biden Administration signed Executive Order (EO) 14028, placing cloud security at the forefront of national security. Federal agencies can tap into Microsoft’s comprehensive cloud security strategy to navigate the EO requirements with ease. The integration between Azure Security Center and Azure Sentinel allows agencies to leverage an existing, cohesive architecture of security products rather than attempting to blend various offerings. Our security products, which operate at cloud-speed, provide the needed visibility into cloud security posture while also offering remediation from the same pane of glass. Built-in automation reduces the burden on security professionals and encourages consistent, real-time responses to alerts or incidents.

The Azure Security suite helps federal agencies and partners improve their cloud security posture and stay compliant with the recent EO. While there are many areas Azure Security can support, this blog will focus on how Azure Security Center and Azure Sentinel can empower federal agencies to address the following EO goals:
Microsoft applies its industry-leading practices to Azure Security products, generating meaningful insights about security posture that simplify the process of protecting federal agencies and result in cost and time savings.
Azure Security Center (ASC) is a unified infrastructure security management system that strengthens the security posture of your data centers. Azure Defender, part of Azure Security Center, provides advanced threat protection across your hybrid workloads in the cloud – whether they’re in Azure or not – as well as on-premises.
Azure Sentinel, our cloud-native security information event management (SIEM) and security orchestration automated response (SOAR) solution, is deeply integrated with Azure Security Center.
Note: For more information on products and features available in Azure Government, please refer to: Azure service cloud feature availability for US government customers | Microsoft Docs
Modernize and Implement Stronger Cybersecurity Standards in the Federal Government
Section three of the EO emphasizes the push toward cloud adoption and the need for proper cloud security. It highlights the necessity of a federal cloud security strategy, governance framework, and reference architecture to drive cloud adoption. There are significant security benefits when using the cloud over traditional on-premises data centers by centralizing data and providing continuous monitoring and analytics.
Azure Sentinel contains workbooks, visual representations of data, that help federal agencies gain insight into their security posture. Section three of the EO mandates Zero Trust planning as a requirement, which can be daunting to implement. The Zero Trust (TIC3.0) Workbook provides a visualization of Zero Trust principles mapped to the Trusted Internet Connections (TIC) framework. After aligning TIC 3.0 Security Capabilities to Zero Trust Principles and Pillars, this workbook shares easy-to-implement recommendations, log sources, automations, and more to empower federal agencies looking to build Zero Trust into cloud readiness. Read more about the Zero Trust (TIC3.0) Workbook.
For federal agencies beginning their digital transformations, ASC provides robust features out of the box to secure your environment and accelerate secure cloud adoption by leveraging existing best practices and guardrails.
ASC continuously scans your hybrid cloud environment and provides recommendations to help you harden your attack surface against threats. Azure Security Benchmark (ASB) is the baseline and driver for these recommendations. ASB is a Microsoft-authored, Azure-specific set of guidelines for security and compliance best practices based on common compliance frameworks. Azure Security Benchmark builds on the controls from the Center for Internet Security (CIS) and the National Institute of Standards and Technology (NIST) with a focus on cloud-centric security. ASB empowers teams to leverage the dynamic nature of the cloud and continuously deploy new resources by providing the needed visibility into the posture of these resources as well as easy to follow steps for remediation. With over 150+ built-in recommendations, ASB evaluates Azure resources across 11 controls, including network security, data protection, logging and threat detection, incident response, governance and strategy, and more.
Government agencies have complex compliance requirements that can be streamlined through Azure Security Benchmark. ASB provides federal agencies with a strong baseline to assess the health of their Azure resources. Teams can complement this visibility by including additional regulatory compliance standards or their own custom policy. Azure Security Center’s regulatory compliance dashboard provides insights into compliance posture against compliance requirements, including NIST SP 800-53, SWIFT CSP CSCF-v2020, Azure CIS 1.3.0, and more.
We recently released Regulatory Compliance in Workflow Automation, where changes in regulatory compliance standards can trigger real-time responses, such as notifying relevant stakeholders, launching a change management process, or applying specific remediation steps. Building in automation allows organizations to improve security posture by ensuring the proper steps are completed consistently and automatically, according to predefined requirements. Automation also reduces the burden on your security teams by streamlining repeatable tasks. Read more about how to build in automation for regulatory compliance.
With visibility and remediation all from the same dashboard, ASB and other out-of-the-box regulatory compliance initiatives empower security teams to get immediate, actionable insights into their security posture. Leveraging Microsoft best practices, built with Azure in mind, federal agencies can tap into the security of the cloud without committing resources to build new frameworks.
Using Azure Security Center’s regulatory compliance feature and workbooks in Azure Sentinel, federal agencies can tap into Microsoft best practices and existing frameworks, regardless of where they may be in their cloud journeys, to get and stay secure. These products not only provide heightened visibility into cloud security posture, but also provide steps for remediation to harden your attack surface and prevent attacks. These tools harness the power of automation, AI/ML, and more to reduce the burden on your security teams and allow them to focus on what matters.
Improve Detection of Cybersecurity Incidents on Federal Government Networks
The objective of section seven of the EO is to promote cross-government collaboration and information sharing by enabling a government-wide endpoint detection and response (EDR) system.
Integrating Azure Security Center and Azure Sentinel provides federal agencies with increased visibility to proactively identify threats and build in automated responses. Through Azure Sentinel, agencies can ensure they have the appropriate tools, whether that be automated responses or access to logs, to contain and remediate threats.
In addition to providing cloud security posture management, Azure Security Center has a cloud workload protection platform, commonly referred to as Azure Defender. Azure Defender provides advanced, intelligent protection for a variety of resource types, including servers, Kubernetes, container registries, SQL database servers, storage, and more. Read more about resource types covered by Azure Defender.
When Azure Defender detects an attempt to compromise your environment, it generates a security alert. Security alerts contain details of the affected resource, suggested remediation steps, and refer to recommendations to help harden your attack surface to protect against similar alerts in the future. In some scenarios, logic apps can also be triggered. Like automated responses to deviations in regulatory compliance standards, logic apps allow for consistent responses to Azure Security Center alerts.
Azure Defender not only has a breadth of coverage across many resource types, but also depth in coverage by resource type. Given the increase in frequency and complexity of attacks, organizations require dynamic threat detections. Azure Defender benefits from security research and data science teams at Microsoft who are continuously monitoring the threat landscape, leading to the constant tuning of detections as well as the inclusion of additional detections for greater coverage. Azure Defender incorporates integrated threat intelligence, behavioral analytics, and anomaly detection to identify threats across your environment.
Azure Sentinel is a central location to collect data at scale – across users, devices, applications, and infrastructure – and to conduct investigation and response.
There are two ways that Azure Sentinel can ingest data: data connectors and continuous export.
Azure Sentinel comes with built-in connectors for many Microsoft products, allowing for out-of-the-box, real-time integration. The Azure Defender connector facilitates the streaming of Azure Defender security alerts into Azure Sentinel, where you can view, analyze, and respond to alerts in a broader organizational threat context.
In addition to bringing Azure Defender alerts, organizations can stream alerts from other Microsoft products, including Microsoft 365 sources such as Office 365, Azure Active Directory, Microsoft Defender for Identity, or Microsoft Cloud App Security.
Continuous export in Azure Security Center allows for the streaming of not only Azure Defender alerts but also secure score and regulatory compliance insights.
After connecting data sources to Azure Sentinel, out-of-the-box, built-in templates guide the creation of threat detection rules. Our team of security experts created rule templates based on known threats, common attack vectors, and suspicious activity escalation chains. Creating rules based on these templates will continuously scan your environment for suspicious activity and create incidents when alerts are generated. You can couple built-in fusion technology, machine learning behavioral analytics, anomaly rules, or scheduled analytics rules with your own custom rules to ensure Azure Sentinel is scanning your environment for relevant threats.
Automation rules in Azure Sentinel help triage incidents. These rules can automatically assign incidents to the right team, close noisy incidents or known false positives, change alert severity, or add tags.
Automation rules are also used to run playbooks in response to incidents. Playbooks, which are based on workflows built-in Azure Logic Apps, are a collection of processes that are run in response to an alert or incident. This feature allows for predefined, consistent, and automated responses to Azure Sentinel activity, reducing the burden on your security team and allowing for close to real-time responses to alerts or incidents.
Due to the integrated nature of our threat protection suite, completing investigation and remediation of an Azure Defender alert in Azure Sentinel will still update the alert’s status in the Azure Security Center portal. For example, when an alert is closed in Azure Sentinel, that alert will display as closed in Azure Security Center as well (and vice versa)!
At Microsoft, we are excited about the opportunity to expand our partnerships with federal agencies as we work to improve cloud security, and in doing so, improve national security.
For more information, please visit our Cyber EO resource center.
by Contributed | Aug 26, 2021 | Business, Microsoft 365, Microsoft Teams, Technology, Windows 365
This article is contributed. See the original author and article here.
This month, we’re announcing the general availability of Windows 365 along with new capabilities across Teams, Yammer, Office, and more.
The post From Windows 365 to Q&A in Microsoft Teams meetings—here’s what’s new in Microsoft 365 appeared first on Microsoft 365 Blog.