End-to-end Stable Diffusion test on Azure NC A100/H100 MIG



You’re welcome to follow my GitHub repo and give it a star: https://github.com/xinyuwei-david/david-share.git


 


E2E Stable Diffusion on A100 MIG


A100/H100 are high-end training GPUs that can also be used for inference. To save compute power and GPU memory, we can use NVIDIA Multi-Instance GPU (MIG) and run Stable Diffusion on MIG instances.
I ran this test on an Azure NC A100 VM.


Configure MIG


Enable MIG on the first physical GPU.


root@david1a100:~# nvidia-smi -i 0 -mig 1

After the VM reboots, MIG is enabled.


(Screenshot: nvidia-smi output confirming MIG mode is enabled)


List all available MIG profiles:


#nvidia-smi mig -lgip

(Screenshot: available MIG profiles listed by nvidia-smi mig -lgip)


At this point, we need to work out how to maximize GPU utilization while still meeting the compute power and GPU memory requirements of Stable Diffusion.


I divide the A100 into four parts: three instances of profile ID 14 (MIG 2g.20gb) and one instance of profile ID 20 (MIG 1g.10gb+me).


root@david1a100:~# sudo nvidia-smi mig -cgi 14,14,14,20 -C
Successfully created GPU instance ID 5 on GPU 0 using profile MIG 2g.20gb (ID 14)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 5 using profile MIG 2g.20gb (ID 1)
Successfully created GPU instance ID 3 on GPU 0 using profile MIG 2g.20gb (ID 14)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 3 using profile MIG 2g.20gb (ID 1)
Successfully created GPU instance ID 4 on GPU 0 using profile MIG 2g.20gb (ID 14)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 4 using profile MIG 2g.20gb (ID 1)
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.10gb+me (ID 20)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.10gb (ID 0)
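To confirm the layout before moving on, you can list the created instances and the MIG device UUIDs with standard nvidia-smi subcommands (output will vary by VM):

# List the GPU instances and compute instances that were just created
nvidia-smi mig -lgi
nvidia-smi mig -lci

# Show the MIG device UUIDs; these are used later when creating the containers
nvidia-smi -L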

(Screenshot: the resulting MIG instances)



Persist the MIG configuration



After the VM reboots, the GPU MIG configuration is lost, so I set up a bash script to recreate it.


#vi /usr/local/bin/setup_mig.sh

 

#!/bin/bash
# Enable MIG on GPU 0, destroy any existing GPU instances, then recreate the 14,14,14,20 layout
nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -dgi
sudo nvidia-smi mig -cgi 14,14,14,20 -C

 


 


Grant execute permission:


chmod +x /usr/local/bin/setup_mig.sh

Create a systemd service:


vi /etc/systemd/system/setup_mig.service

 

[Unit]  
Description=Setup NVIDIA MIG Instances  
After=default.target  

[Service]  
Type=oneshot  
ExecStart=/usr/local/bin/setup_mig.sh  

[Install]  
WantedBy=default.target  

 


 


Enable and start setup_mig.service:


sudo systemctl daemon-reload 
sudo systemctl enable setup_mig.service
sudo systemctl start setup_mig.service
sudo systemctl status setup_mig.service

Prepare the MIG container environment


Install Docker and the NVIDIA Container Toolkit on the VM


 

sudo apt-get update  
sudo apt-get install -y docker.io  
sudo apt-get install -y aptitude  
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)  
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -  
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list  
sudo apt-get update  
sudo aptitude install -y nvidia-docker2  
sudo systemctl restart docker  
sudo aptitude install -y nvidia-container-toolkit  
sudo systemctl restart docker  

 


 


Create the container-creation script on the VM


#vi createcontainer.sh

 

#!/bin/bash

# Container name array
CONTAINER_NAMES=("mig1_tensorrt_container" "mig2_tensorrt_container" "mig3_tensorrt_container" "mig4_tensorrt_container")

# Remove existing containers
for CONTAINER in "${CONTAINER_NAMES[@]}"; do
  if [ "$(sudo docker ps -a -q -f name=$CONTAINER)" ]; then
    echo "Stopping and removing container: $CONTAINER"
    sudo docker stop $CONTAINER
    sudo docker rm $CONTAINER
  fi
done

# Get the UUIDs of the MIG devices
MIG_UUIDS=$(nvidia-smi -L | grep 'MIG' | awk -F 'UUID: ' '{print $2}' | awk -F ')' '{print $1}')
UUID_ARRAY=($MIG_UUIDS)

# Check that enough MIG device UUIDs were found
if [ ${#UUID_ARRAY[@]} -lt 4 ]; then
  echo "Error: Not enough MIG devices found."
  exit 1
fi

# Start the containers
sudo docker run --gpus '"device='${UUID_ARRAY[0]}'"' -v /mig1:/mnt/mig1 -p 8081:80 -d --name mig1_tensorrt_container nvcr.io/nvidia/pytorch:24.05-py3 tail -f /dev/null
sudo docker run --gpus '"device='${UUID_ARRAY[1]}'"' -v /mig2:/mnt/mig2 -p 8082:80 -d --name mig2_tensorrt_container nvcr.io/nvidia/pytorch:24.05-py3 tail -f /dev/null
sudo docker run --gpus '"device='${UUID_ARRAY[2]}'"' -v /mig3:/mnt/mig3 -p 8083:80 -d --name mig3_tensorrt_container nvcr.io/nvidia/pytorch:24.05-py3 tail -f /dev/null
sudo docker run --gpus '"device='${UUID_ARRAY[3]}'"' -v /mig4:/mnt/mig4 -p 8084:80 -d --name mig4_tensorrt_container nvcr.io/nvidia/pytorch:24.05-py3 tail -f /dev/null

# Print container status and open the mapped ports
sudo docker ps
sudo ufw allow 8081
sudo ufw allow 8082
sudo ufw allow 8083
sudo ufw allow 8084
sudo ufw reload
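A quick sanity check, assuming the script above was saved as createcontainer.sh: run it, then confirm that each container sees exactly its assigned MIG device.

sudo bash createcontainer.sh

# Each container should report a single MIG device matching the UUID it was given
for c in mig1 mig2 mig3 mig4; do
  echo "== ${c}_tensorrt_container =="
  sudo docker exec ${c}_tensorrt_container nvidia-smi -L
done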

 


 


Check that the containers are accessible from outside.


In the container, start a listener on port 80:


root@david1a100:~# sudo docker exec -it mig1_tensorrt_container /bin/bash
root@b6abf5bf48ae:/workspace# python3 -m http.server 80
Serving HTTP on 0.0.0.0 port 80 (http://0.0.0.0:80/) ...
167.220.233.184 - - [23/Aug/2024 10:54:47] "GET / HTTP/1.1" 200 -

Curl from my laptop:


(base) PS C:\Users\xinyuwei> curl http://20.5.**.**:8081

StatusCode        : 200
StatusDescription : OK
Content           : <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
                    ...
                    Directory listing fo...
RawContent        : HTTP/1.0 200 OK
                    Content-Length: 594
                    Content-Type: text/html; charset=utf-8
                    Date: Fri, 23 Aug 2024 10:54:47 GMT
                    Server: SimpleHTTP/0.6 Python/3.10.12

In the container, ping google.com:


root@david1a100:~# sudo docker exec -it mig1_tensorrt_container /bin/bash
root@b6abf5bf48ae:/workspace# pip install ping3
root@b6abf5bf48ae:/workspace# ping3 www.google.com
ping 'www.google.com' ... 2ms
ping 'www.google.com' ... 1ms
ping 'www.google.com' ... 1ms
ping 'www.google.com' ... 1ms

 


Run the SD inference test in the container


Check the TensorRT version in the container:


root@david1a100:/workspace# pip show tensorrt
Name: tensorrt
Version: 10.2.0
Summary: A high performance deep learning inference library
Home-page: https://developer.nvidia.com/tensorrt
Author: NVIDIA Corporation
Author-email:
License: Proprietary
Location: /usr/local/lib/python3.10/dist-packages
Requires:
Required-by:

Run the SD test using the GitHub examples, inside the container:


git clone --branch release/10.2 --single-branch https://github.com/NVIDIA/TensorRT.git 
cd TensorRT/demo/Diffusion
pip3 install -r requirements.txt
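The demo scripts below read a Hugging Face token from the HF_TOKEN environment variable, so set it first (placeholder value shown):

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN=<your_hf_token>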

Generate a 1024×1024 test image:


python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN

We can compare the image generation speed in the different MIG instances:


In the mig1 container, which has 2 GPCs and 20 GB of memory:


(Screenshot: generation speed in the mig1 container)


In the mig4 container, which has 1 GPC and 10 GB of memory:


(Screenshot: generation speed in the mig4 container)


The output image is shown below; copy it to the VM and download it.


#cp ./output/* /mnt/mig1
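Because /mnt/mig1 inside the container is backed by the host directory /mig1, the images can then be pulled down to a laptop, for example with scp (hypothetical user name and VM address):

# Run from the laptop; replace the user name and VM address with your own
scp azureuser@<vm-ip>:"/mig1/*.png" .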

(Generated image: Mt. Fuji during cherry blossom)


Compare INT8 inference speed and quality on an H100 GPU


I tested Stable Diffusion XL 1.0 on a single H100 to verify the effect of INT8. NVIDIA claims that INT8 is better optimized on H100 than on A100.


#python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0

(Screenshot: FP16 inference output on the H100)


Image generation effect:


(Generated image: an astronaut riding a horse on Mars)


Use SDXL & INT8 AMMO quantization:


python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8

After executing the above command, the model is first quantized to 8 bits.



Building TensorRT engine for onnx/unetxl-int8.l2.5.bs2.s30.c32.p1.0.a0.8.opt/model.onnx: engine/unetxl-int8.l2.5.bs2.s30.c32.p1.0.a0.8.trt10.0.1.plan

Then run inference:

(Screenshot: INT8 inference output)

Check the generated image:


(Generated image: INT8 result)

We see that the quality of the generated images is the same, and the file sizes are almost identical as well.



We observe that INT8 inference is about 20% faster than FP16.





How to make AI training faster



You’re welcome to follow my GitHub repo and give it a star: https://github.com/xinyuwei-david/david-share.git




 


Factors Affecting AI Training Time


In deep learning training, the calculation of training time involves multiple factors, including the number of epochs, global batch size, micro batch size, and the number of computing devices, among others. Below is a basic formula illustrating the relationship between these parameters (note that this is just a basic illustrative formula, mainly explaining proportional and inversely proportional relationships; actual training may require considering more factors):


(Formula: training time as a function of the number of epochs, total number of samples, global batch size, time per step, and number of devices)


Among them—



  • Epochs refer to the number of times the model processes the entire training dataset.

  • Total Number of Samples is the total number of samples in the training dataset.

  • Global Batch Size is the total number of data samples processed in each training iteration.

  • Time per Step is the time required for each training iteration, which depends on hardware performance, model complexity, optimization algorithms, and other factors.

  • Number of Devices is the number of computing devices used for training, such as the number of GPUs.


This formula provides a basic framework, but please note that the actual training time may be influenced by many other factors, including I/O speed, network latency (for distributed training), CPU-GPU communication speed, the frequency of hardware failures during GPU training, etc. Therefore, this formula can only serve as a rough estimate, and the actual training time may vary.
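Since the formula itself appears only as an image above, here is a plain reconstruction that is consistent with the definitions just listed (treating the number of devices as already folded into the global batch size, which is spread across the devices):

Training Time ≈ Epochs × (Total Number of Samples / Global Batch Size) × Time per Step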


 


Detailed explanations


The training time of a deep learning model is determined by multiple factors, including but not limited to the following:



  • Number of Epochs: An epoch means that the model has processed the entire training dataset once. The more epochs, the more data the model needs to process, and thus the longer the training time.

  • Global Batch Size: The global batch size is the total number of data samples processed in each training iteration. The larger the global batch size, the more data is processed in each iteration, which may reduce the number of iterations required per epoch, potentially shortening the total training time. However, if the global batch size is too large, it may lead to memory overflow.

  • Micro Batch Size: The micro batch size refers to the number of data samples processed by each computing device in each training iteration. The larger the micro batch size, the more data each device processes per iteration, which may improve computational efficiency and thus shorten training time. However, if the micro batch size is too large, it may lead to memory overflow.

  • Hardware Performance: The performance of the computing devices used (such as CPUs, GPUs) will also affect training time. More powerful devices can perform computations faster, thereby shortening training time.

  • Model Complexity: The complexity of the model (such as the number of layers, number of parameters, etc.) will also affect training time. The more complex the model, the more computations are required, and thus the longer the training time.

  • Optimization Algorithm: The optimization algorithm used (such as SGD, Adam, etc.) and hyperparameter settings like learning rate will also affect training time.

  • Parallel Strategy: The use of parallel computing strategies such as data parallelism, model parallelism, etc., will also affect training time.



There are many factors that determine the length of training time, and they need to be considered comprehensively based on the specific training task and environment.

So, in this formula


(the training-time formula shown above)










Time per Step should be understood as primarily related to the computational power of the GPU. “Time per Step”, the time required for each training iteration, is determined by multiple factors, including but not limited to the following:

  • Hardware Performance: The performance of the computing devices used (such as CPUs, GPUs) will directly affect the speed of each training iteration. More powerful devices can perform computations faster.

  • Model Complexity: The complexity of the model (such as the number of layers, number of parameters, etc.) will also affect the time for each training iteration. The more complex the model, the more computations are required.

  • Optimization Algorithm: The optimization algorithm used (such as SGD, Adam, etc.) will also affect the time for each training iteration. Some optimization algorithms may require more complex computational steps to update the model parameters.

  • Data Type Used in Training: The data type used in training has a significant effect on time per step. Common data types include FP32, FP16/BF16, FP8, etc.


Training steps


So, what determines the total training steps? “Total Training Steps” is determined by the number of training epochs and the number of steps per epoch. Specifically, it equals the number of epochs multiplied by the number of steps per epoch, which can be expressed with the following formula:
 









Total Training Steps = Number of Epochs × Steps per Epoch

 


Global Batch Size


So, what determines the Global Batch Size?

 


global_batch_size = 
gradient_accumulation_steps 
* nnodes (number of nodes) 
* nproc_per_node (GPUs per node) 
* per_device_train_batch_size (micro batch size) 
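As a worked example with hypothetical numbers (gradient accumulation steps 2, 2 nodes, 8 GPUs per node, micro batch size 4):

global_batch_size = 2 * 2 * 8 * 4 = 128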









Assume a scenario:






batch_size = 10  # Batch size  
total_num = 1000  # Total number of training data  


When training one batch of data and updating the gradient once (gradient accumulation steps = 1):


 

train_steps = total_num / batch_size = 1000 / 10 = 100  

 


This means there are 100 steps per epoch, and the gradient update steps are also 100.
When the memory is insufficient to support a batch size of 10, we can use gradient accumulation to reduce the size of each micro-batch. Suppose we set the gradient accumulation steps to 2:


 

gradient_accumulation_steps = 2  
micro_batch_size = batch_size / gradient_accumulation_steps = 10 / 2 = 5  

 


This means that for each gradient update, we accumulate data from 2 micro-batches, with each micro-batch size being 5. This reduces memory pressure, but the data size per gradient update remains 10 data points.

Result:



  • The number of training steps per epoch (train_steps) remains 100 because the total amount of data and the number of steps per epoch have not changed.

  • The gradient update steps remain 100 because each gradient update accumulates data from 2 micro-batches.


It is important to note that when using gradient accumulation, each training step handles the accumulation of gradients from multiple micro-batches, which may slightly increase the computation time per step. Therefore, if memory is sufficient, it is better to increase the batch size to reduce the number of gradient accumulations. When memory is insufficient, gradient accumulation is an effective method.

The global batch size significantly impacts the training effectiveness of the model. Generally, a larger global batch size provides more accurate gradient estimates, aiding model convergence. However, it also increases memory pressure on each device. If memory resources are limited, using a large global batch size may not be feasible.

In such cases, gradient accumulation can be used. By training with a smaller micro-batch size on each device, we reduce memory pressure while maintaining a large global batch size for accurate gradient estimates. This allows training large models on limited hardware resources without sacrificing the global batch size.

In summary, gradient accumulation is a trade-off strategy to balance global batch size and training effectiveness when memory resources are limited.



So, if we look at these two formulas:


(the training-time formula and the global batch size formula shown above)


The larger the global batch size, the shorter the total training time, provided that there is no OOM (Out of Memory) and the GPU computational power is not fully utilized.


 


The Relationship Between Data Parallelism and Batch Size












 This section essentially analyzes this formula:


global_batch_size = 
gradient_accumulation_steps 
* nnodes (the number of nodes; in effect, the PP degree) 
* nproc_per_node (the number of cards per node; in effect, the TP degree) 
* per_device_train_batch_size (micro batch size) 


In distributed deep learning, data parallelism is a common strategy. The training data is split into multiple small batches and distributed to different computing nodes. Each node has a copy of the model and trains on its data subset, speeding up the training process.

At the end of each training step, the model weights of all nodes are synchronized using the AllReduce operation. AllReduce aggregates gradients from all nodes and broadcasts the result back, allowing each node to update its model parameters.

If training on a single device, AllReduce is not needed as all computations occur on the same device. However, in distributed training, especially with data parallelism, AllReduce or similar operations are necessary to synchronize model parameters across devices.

Many deep learning frameworks (e.g., PyTorch, TensorFlow) use NVIDIA’s NCCL for communication across multiple GPUs. Each GPU trains on its data subset and synchronizes model weights using NCCL’s AllReduce at the end of each step.

Although AllReduce is commonly used in data parallelism, other NCCL operations may be employed depending on the framework and strategy.

Data parallelism (DP) and micro batch size are interrelated. DP involves training on multiple devices, each processing a portion of the data. Micro batch size is the number of samples each device processes per iteration. With DP, the original batch size is split into micro batches across devices. Without DP or model parallelism (MP), micro batch size equals global batch size. With DP or MP, the global batch size is the sum of all micro batches.

DP can be applied on multiple devices within a single server or across multiple servers. Setting DP to 8 means training on 8 devices, either on the same server or distributed across servers.

Pipeline parallelism (PP) is a different strategy where different model parts run on different devices. Setting DP to 8 in PP means 8 devices process data in parallel at each pipeline stage.

In summary, DP and PP can be used simultaneously on devices within a single server or across multiple servers.












 

Announcing End of Support for Dynamics 365 Project Service Automation (PSA)



On March 19th, 2024, we announced the end of support for Dynamics 365 Project Service Automation on the commercial cloud.

For Project Service Automation customers on US government cloud, we will have a future announcement regarding upgrade and the availability of Project Operations.

Beginning March 31st, 2025, Microsoft will no longer support PSA on commercial cloud environments. There will not be any feature enhancements, updates, bug fixes, or other updates to this offering. Any support ticket logged for the PSA commercial cloud will be closed with instructions to upgrade to Dynamics 365 Project Operations.   

We strongly encourage all customers of PSA commercial cloud to start planning your upgrade process as soon as possible so you can take advantage of many new Project Operations features such as:  

  • Integration with Project for the Web with many new advanced scheduling features 
  • Project Budgeting and Time-phased forecasting   
  • Date Effective price overrides  
  • Revision and Activation on Quotes    
  • Material usage recording in projects and tasks  
  • Subcontract Management  
  • Advances and Retained-based contracts  
  • Contract not-to-exceed  
  • Task and Progress based billing  
  • Multi-customer contracts  
  • AI and Copilot based experiences.  

Upgrade from Project Service Automation to Project Operations | Microsoft Learn 

Project Service Automation end of life FAQ | Microsoft Learn   

Feature changes from Project Service Automation to Project Operations | Microsoft Learn 

Project Service Automation to Project Operations project scheduling conversion process | Microsoft Learn 

Plan your work in Microsoft Project with the Project Operations add-in | Microsoft Learn 

Learn more about Dynamics 365 Project Operations 

Project Operations was first released in October 2020 as a comprehensive product to manage Projects from inception to close by bringing together the strengths of Dataverse, Microsoft Dynamics 365 Finance and Supply Chain Management, and Project for the web assets.

Want to learn more about Project Operations? Check this link and navigate to our detailed documentation!  

Want to try Project Operations? Click here and sign up for a 30-day trial!  



SMB security hardening in Windows Server 2025 & Windows 11



Heya folks, Ned here again. Last November, Microsoft launched the Secure Future Initiative (SFI) to prepare for the increasing scale and high stakes of cyberattacks. SFI brings together every part of Microsoft to advance cybersecurity protection across our company and products.


Windows has focused on security options with each major release, and Windows 11 24H2 and Windows Server 2025 are no exception: they include a dozen new SMB features that make your data, your users, and your organization safer – and most are on by default. Today I’ll explain their usefulness, share some demos, and point to further details.


 


The new OSes will soon be generally available and you can preview them right now: download Windows Server 2025 and Windows 11 24H2.


 


On to the security.


 


SMB signing required by default


 


What it is


We now require signing by default for all Windows 11 24H2 SMB outbound and inbound connections and for all outbound connections in Windows Server 2025. This changes legacy behavior, where we required SMB signing by default only when connecting to shares named SYSVOL and NETLOGON and where Active Directory domain controllers required SMB signing for their clients.


 


How it helps you


SMB signing has been available for decades and prevents data tampering and relay attacks that steal credentials. By requiring signing by default, we ensure that an admin or user must opt out of this safer configuration, instead of requiring them to be very knowledgeable about SMB network protocol security and turn signing on.


 


Learn more



 


SMB NTLM blocking


 




 


What it is


The SMB client now supports blocking NTLM authentication for remote outbound connections. This changes the legacy behavior of always using negotiated authentication that could downgrade from Kerberos to NTLM.


 


How it helps you


Blocking NTLM authentication prevents tricking clients into sending NTLM requests to malicious servers, which counteracts brute force, cracking, relay, and pass-the-hash attacks. NTLM blocking is also required for forcing an organization’s authentication to Kerberos, which is more secure because it verifies identities with its ticket system and better cryptography. Admins can specify exceptions to allow NTLM authentication over SMB to certain servers.


 


Learn more



 


SMB authentication rate limiter


 


What it is


The SMB server service now throttles failed authentication attempts by default. This applies to SMB sharing files on both Windows Server and Windows.


 


How it helps you


Brute force authentication attacks bombard the SMB server with multiple username and password-guesses and the frequency can range from dozens to thousands of attempts per second. The SMB authentication rate limiter is enabled by default with a 2 second delay between each failed NTLM or Local KDC Kerberos-based authentication attempt. An attack that sends 300 guesses per second for 5 minutes, for example – 90,000 attempts – would now take 50 hours to complete. An attacker is far more likely to simply give up than keep trying this method.
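As a quick check of that arithmetic: 300 attempts/second × 300 seconds = 90,000 attempts, and 90,000 attempts × 2 seconds per attempt = 180,000 seconds = 50 hours.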


 


Learn more



 


SMB insecure guest auth now off by default in Windows Pro editions


 


What it is


Windows 11 Pro no longer allows SMB client guest connections or guest fallback to an SMB server by default. This makes Windows 11 Pro operate like Windows 10 and Windows 11 Enterprise, Education, and Pro for Workstation editions have for years.


 


How it helps you


Guest logons don’t require passwords & don’t support standard security features like signing and encryption. Allowing a client to use guest logons makes the user vulnerable to attacker-in-the-middle scenarios or malicious server scenarios – for instance, a phishing attack that tricks a user into opening a file on a remote share or a spoofed server that makes a client think it’s legitimate. The attacker doesn’t need to know the user’s credentials and a bad password is ignored. Only third-party remote devices might require guest access by default. Microsoft-provided operating systems haven’t enabled guest in server scenarios since Windows 2000.


 


Learn more



 


SMB dialect management


 




 


What it is


You can now mandate the SMB 2 and 3 protocol versions used.


 


How it helps you


Previously, the SMB server and client only supported automatically negotiating the highest matched dialect from SMB 2.0.2 to 3.1.1. This means you can intentionally block older protocol versions or devices from connecting. For example, you can specify connections to only use SMB 3.1.1, the most secure dialect of the protocol. The minimum and maximum can be set independently on both the SMB client and server, and you can set just a minimum if desired.


 


Learn more



 


SMB client encryption mandate now supported


 


What it is


The SMB client now supports requiring encryption of all outbound SMB connections.


 


How it helps you


Encryption of all outbound SMB client connections enforces the highest level of network security and brings management parity to SMB signing. When enabled, the SMB client won’t connect to an SMB server that doesn’t support SMB 3.0 or later, or that doesn’t support SMB encryption. For example, a third-party SMB server might support SMB 3.0 but not SMB encryption. Unlike SMB signing, encryption is not required by default.


 


Learn more



 


Remote Mailslots deprecated and disabled by default


 


What it is


Remote Mailslots are deprecated and disabled by default for SMB and for DC locator protocol usage with Active Directory.


 


How it helps you


The Remote Mailslot protocol is an obsolete, simple, unreliable, IPC method first introduced in MS DOS. It is completely unsafe and has no authentication or authorization mechanisms.


 


Learn more



 


SMB over QUIC in Windows Server all editions


 




 


What it is


SMB over QUIC is now included in all Windows Server 2025 editions (Datacenter, Standard, Azure Edition), not just on Azure Edition like it was in Windows Server 2022.


 


How it helps you


SMB over QUIC is an alternative to the legacy TCP protocol and is designed for use on untrusted networks like the Internet. It uses TLS 1.3 and certificates to ensure that all SMB traffic is encrypted and usable through edge firewalls for mobile and remote users without the need for a VPN. The user experience does not change at all.


 


Learn more



 


SMB over QUIC client access control


 


What it is


SMB over QUIC client access control lets you restrict which clients can access SMB over QUIC servers. The legacy behavior allowed connection attempts from any client that trusts the QUIC server’s certificate issuance chain.


 


How it helps you


Client access control creates allow and block lists for devices to connect to the file server. A client would now need its own certificate and be on an allow list to complete the QUIC connection before any SMB connection occurs. Client access control gives organizations more protection without changing the authentication used when making the SMB connection and the user experience does not change. You can also completely disable the SMB over QUIC client or only allow connection to specific servers.


 


Learn more



 


SMB alternative ports


 


What it is


You can use the SMB client to connect to TCP, QUIC, and RDMA ports other than their IANA/IETF defaults of 445, 5445, and 443.


 


How it helps you


With Windows Server, this allows you to host an SMB over QUIC connection on an allowed firewall port other than 443. You can only connect to alternative ports if the SMB server is configured to support listening on that port. You can also configure your deployment to block configuring alternative ports or specify that ports can only connect to certain servers.


 


Learn more



 


SMB Firewall default port changes


 


What it is


The built-in firewall rules don’t contain the SMB NetBIOS ports anymore.


 


How it helps you


The NetBIOS ports were only necessary for SMB1 usage, and that protocol is deprecated and removed by default. This change brings SMB firewall rules more in line with the standard behavior for the Windows Server File Server role. Administrators can reconfigure the rules to restore the legacy ports.


 


Learn more



 


SMB auditing improvements


 


What it is


SMB now supports auditing the use of SMB over QUIC, missing third-party support for encryption, and missing third-party support for signing. These all operate at the SMB server and SMB client level.


 


How it helps you


It is much easier for you to determine if Windows and Windows Server devices are making SMB over QUIC connections. It is also much easier to determine if third parties support signing and encryption before mandating their usage.


 


Learn more



 


Summary


 


With the release of Windows Server 2025 and Windows 11 24H2, we have made the most changes to SMB security since the introduction of SMB 2 in Windows Vista. Deploying these operating systems fundamentally alters your security posture and reduces risk to this ubiquitous remote file and data fabric protocol used by organizations worldwide.


 


For more information on changes in Windows Server 2025, visit Windows Server Summit 2024 – March 26-28, 2024 | Microsoft Event. You will find dozens of presentations and demos on the latest features arriving this fall in our latest operating system.


 


And remember, you can try all of this right now: preview Windows Server 2025 and Windows 11 24H2.


 


Until next time,


 


– Ned Pyle

A better Phi Family is coming – multi-language support, better vision, intelligence MOEs





 




 


Since its release at Microsoft Build 2024, Phi-3 has received a lot of attention, especially for the use of Phi-3-mini and Phi-3-vision on edge devices. In the June update, we improved benchmark results and system-role support by adjusting the high-quality training data. In the August update, based on community and customer feedback, we brought multi-language support to Phi-3.5-mini-128k-instruct, multi-frame image input to Phi-3.5-vision-128k, and added the new Phi-3.5-MoE for AI Agent scenarios. Next, let’s take a look.



Multi-language support


In previous versions, Phi-3-mini had good English corpus support but weak support for non-English languages. When we asked questions in Chinese, we often got wrong answers, such as:


(Screenshot: an incorrect answer to a Chinese question)

 





Obviously, this is a wrong answer.


But in the new version, the model understands Chinese much better thanks to the new Chinese corpus support:

(Screenshot: the improved Chinese answer from the new version)

 







You can also try the enhancements in other languages; even without fine-tuning or RAG, it is a capable model.


Code Sample:  https://github.com/microsoft/Phi-3CookBook/blob/main/code/09.UpdateSamples/Aug/phi3-instruct-demo.ipynb



Better vision



Phi-3.5-vision enables Phi-3 not only to understand text and hold dialogues, but also to handle visual tasks (OCR, object recognition, image analysis, etc.). In real application scenarios, however, we often need to analyze multiple images together to find associations, for example across videos, PPTs, and books. The new Phi-3.5-vision supports multi-frame or multi-image input, so we can better perform this kind of inductive analysis in visual scenarios.



As shown in this video






We can use OpenCV to extract key frames; here we extract 21 key frame images from the video and load them into an array.
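The linked notebook does this with OpenCV; as an alternative sketch (assuming ffmpeg is installed and the source video is named video.mp4), the same keyframe_N.jpg files that the next snippet loads can be produced from the command line:

mkdir -p ../output
# Sample roughly one frame per second; adjust fps so that about 21 frames are produced
ffmpeg -i video.mp4 -vf fps=1 ../output/keyframe_%d.jpg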


from PIL import Image

images = []
placeholder = ""
# Load the 21 extracted key frames and build the multi-image placeholder string;
# each <|image_N|> token tells Phi-3.5-vision where that frame appears in the prompt
for i in range(1, 22):
    images.append(Image.open(f"../output/keyframe_{i}.jpg"))
    placeholder += f"<|image_{i}|>\n"







Combined with Phi-3.5-Vision’s chat template, we can perform a comprehensive analysis of multiple frames.

(Screenshot: multi-frame analysis output)



This allows us to more efficiently perform dynamic vision-based work, especially in edge scenarios.



Code Sample: https://github.com/microsoft/Phi-3CookBook/blob/main/code/09.UpdateSamples/Aug/phi3-vision-demo.ipynb



Intelligence MOEs



In order to achieve higher performance of the model, in addition to computing power, model size is one of the key factors to improve model performance. Under a limited computing resource budget, training a larger model with fewer training steps is often better than training a smaller model with more steps.



Mixture of Experts Models (MoEs) have the following characteristics:




  • Faster pre-training speed than dense models

  • Faster inference speed than models with the same number of parameters

  • Requires a large amount of GPU memory, because all experts need to be loaded into memory

  • Fine-tuning poses many challenges, but recent research shows that instruction tuning of mixture-of-experts models has great potential.




There are now many AI Agent applications, and we can use MoEs to power them; in multi-task scenarios, responses are faster.



Let’s explore a simple scenario: we want AI to help us write a tweet based on some content, translate it into Chinese, and publish it to social networks. We can use Phi-3.5-MoE to complete this, with a prompt that defines and orchestrates the tasks, such as writing the blog content, translating it, and assembling the final answer.



sys_msg = """You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

 Blog: This tool helps you describe a certain knowledge point and content, and finally write it into Twitter or Facebook style content
 Translate: This is a tool that helps you translate into any language, using plain language as required
 Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question "Build Multi Agents with MOE models" you must use the Blog tool like so:

{
    "tool_name": "Blog",
    "input": "Build Multi Agents with MOE models"
}

Or to translate the question "can you introduce yourself in Chinese" you must respond:

{
    "tool_name": "Translate",
    "input": "can you introduce yourself in Chinese"
}

Remember, just output the final result, output in JSON format containing `"agentid"`, `"tool_name"`, `"input"` and `"output"` key-value pairs:

[
    {   "agentid": "step1",
        "tool_name": "Blog",
        "input": "Build Multi Agents with MOE models",
        "output": "........."
    },

    {   "agentid": "step2",
        "tool_name": "Translate",
        "input": "can you introduce yourself in Chinese",
        "output": "........."
    },
    {
        "agentid": "final",
        "tool_name": "Result",
        "output": "........."
    }
]

The user's input is as follows.
"""





We can see that by telling the model which skills it has and how the tasks are arranged, Phi-3.5-MoE can dispatch the work to the appropriate tools and complete it.





Write something about Generative AI with MOEs, translate it to Chinese



The result:



[
    {
        "agentid": "step1",
        "tool_name": "Blog",
        "input": "Generative AI with MOE",
        "output": "Generative AI with MOE (Mixture of Experts) is a powerful approach that combines the strengths of generative models and the flexibility of MOE architecture. This hybrid model can generate high-quality, diverse, and contextually relevant content, making it suitable for various applications such as content creation, data augmentation, and more."
    },
    {
        "agentid": "step2",
        "tool_name": "Translate",
        "input": "Generative AI with MOE is a powerful approach that combines the strengths of generative models and the flexibility of MOE architecture. This hybrid model can generate high-quality, diverse, and contextually relevant content, making it suitable for various applications such as content creation, data augmentation, and more.",
        "output": "基于生成AI的MOE(Mixture of Experts)是一种强大的方法,它结合了生成模型的优势和MOE架构的灵活性。这种混合模型可以生成高质量、多样化且上下文相关的内容,使其适用于各种应用,如内容创建、数据增强等。"
    },
    {
        "agentid": "final",
        "tool_name": "Result",
        "output": "基于生成AI的MOE(Mixture of Experts)是一种强大的方法,它结合了生成模型的优势和MOE架构的灵活性。这种混合模型可以生成高质量、多样化且上下文相关的内容,使其适用于各种应用,如内容创建、数据增强等。"
    }
]






If conditions permit, we can integrate the Phi-3.5-MoE model more smoothly into frameworks such as AutoGen, Semantic Kernel, and LangChain.



Code Sample: https://github.com/microsoft/Phi-3CookBook/blob/main/code/09.UpdateSamples/Aug/phi3_moe_demo.ipynb



Thoughts on SLMs


 



SLMs do not replace LLMs; rather, they give generative AI a broader range of scenarios. The Phi-3 update gives more edge devices better support for text, chat, and vision. In modern AI Agent scenarios, we want higher task-execution efficiency, and beyond raw computing power, MoEs are a key part of the answer. Phi-3 is still iterating, and I hope everyone will keep paying attention and give us feedback.