Azure Neural TTS previews a new contextual voice model for long-form paragraph reading


This article is contributed. See the original author and article here.

This blog is co-authored with Shaofei Zhang, Xi Wang, Lei He, and Sheng Zhao


 


Azure Neural Text-to-Speech (Neural TTS) has made rapid progress in speech quality, with voice models that closely mirror natural speech at the sentence level (see more details here). However, in real-life scenarios, TTS is used not just to read out single sentences but also content within a long-form context, such as paragraphs in web pages, audiobooks, or video subtitles.


 


In general, TTS synthesis generates paragraph or long-form audio sentence by sentence. After each sentence is synthesized, the sentences are concatenated into paragraphs. Sentence-level speech synthesis takes into account only the information within a single sentence, with no consideration of the content before or after it, even in a long-form context. A sentence generated with this approach sounds the same no matter how its context changes within a paragraph. Scenarios like audiobooks or web content read-aloud usually consist of many paragraphs with context, and listeners expect synthesized sentences in different contexts to vary in pitch, rhythm, intonation, and so on, creating a sense of coherence between sentences. Without contextual information in the model, sentence-level synthetic voices can sound monotone and less expressive, leaving listeners less engaged.


 


In this blog, we introduce a new technical innovation that considers contextual information when modeling TTS voices for paragraph or long-form content reading. This new technology significantly improves coherence and expressiveness when generating long audio, measured by Paragraph MOS (Mean Opinion Score). With this new technology, we are glad to announce the public preview of Roger, a contextual voice model in English (US), enabling customers to generate more expressive and natural-sounding long-form audio content using the Azure Neural TTS service.


 


Introducing Roger: the new contextual voice model


 


Roger is a voice that understands the paragraph content and automatically identifies the context for each sentence. He can therefore adjust the pitch, rhythm, and intonation to match the context, and insert natural pauses as needed when reading paragraphs.


 


Listen to the samples below and hear how expressively Roger reads paragraphs. Samples generated with the same voice’s sentence-level model are also provided as a baseline, so you can compare how they differ in pauses, intonation, rhythm, and more, both within and across sentences.


 


Table 1: Samples from Roger (sentence-level voice model as a baseline). For each script, audio is provided from both the sentence-level model and the contextual model.

Script 1: A druid was a man of mystery and power, respected by all save the most arrogant churchmen and lords. “Why have we been brought here, druid?” Farrel asked. “No doubt we will learn that in time,” Valin said. “But I sense our circle is not yet complete. Look, another joins us.” With his short wooden wand he gestured down the hill, to where a maiden was stepping among the fallen stones. She wore a gown of white linen, loose and frayed.

Script 2: Many others were asking different questions. Questions like “if they made it, why are they still sending people out?” and “I’m next, I don’t want to go, what’s down there?” float through our minds like clouds. Even as hushed whispers begin to break the silence, my Squad remains silent.

Script 3: He then gazed at every soldier on that trench giving each of them a nod. “Do and die boys, do and die!” The twenty eight repeated the chant as the sound of the ghastly GEOMs began to resound. “Do and Die!” their voices drowning the GEOM swarm, taunting the monsters to finish them off.



 


With the integration of contextual information, the generated speech sounds more expressive and coherent in prosody, both within and across sentences. For content that is very rich and expressive, the audio from the contextual model is also more stable and natural.


 


The technology behind


 


In this section, we describe the technology behind the new contextual voice model: how we integrate contextual information into neural TTS voice modeling and build the text-based contextual encoder for long-form audio generation, followed by how we measure the quality benefit.


 


Integrating contextual information to neural TTS modeling


Given the same sentence in different contexts, the prosody of the generated speech can differ. Based on this idea, we use a text-based contextual encoder to extract contextual information from the paragraph text and condition the current sentence’s phoneme embedding on it, which relieves the one-to-many problem in conventional TTS. To model long-form context, long-form data such as paragraph text and paragraph-level voice recordings are needed. As shown in Figure 1, we combine the phoneme embedding with the contextual information from the text-based contextual encoder, and then go through the same sentence-level modeling process to generate the mel-spectrograms for the audio waveform.


Figure 1. Overall architecture of the contextual voice model
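As a rough illustration of the conditioning step, here is a minimal PyTorch-style sketch, assuming the context is summarized as a single paragraph-level vector that is broadcast across the current sentence’s phoneme embeddings; all module names and dimensions are illustrative, not the production model.

import torch
import torch.nn as nn

class ContextConditioning(nn.Module):
    """Sketch: fuse a paragraph-level context vector with per-phoneme embeddings
    before the usual sentence-level acoustic model (illustrative only)."""
    def __init__(self, phone_dim=256, ctx_dim=256):
        super().__init__()
        self.proj = nn.Linear(phone_dim + ctx_dim, phone_dim)

    def forward(self, phoneme_emb, ctx_emb):
        # phoneme_emb: [batch, n_phonemes, phone_dim]
        # ctx_emb:     [batch, ctx_dim], broadcast along the phoneme axis
        ctx = ctx_emb.unsqueeze(1).expand(-1, phoneme_emb.size(1), -1)
        # the fused embedding then feeds the sentence-level model that
        # predicts mel-spectrograms, as in Figure 1
        return self.proj(torch.cat([phoneme_emb, ctx], dim=-1))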


 


Text-based contextual encoder


As shown in Figure 2, we use two branches to extract phoneme-level and sequence-level semantic and syntactic information separately. Given a paragraph of text, a token-level statistics extractor calculates syntactic information covering the correlation among tokens, sentences, and the paragraph. For each sentence, centered in a predefined range of context, we use a pretrained model to extract token-level semantic context embeddings, split into a current-sentence token embedding and a context-sentence token embedding.


 


In the phoneme-level information extraction branch, the token-level embedding of the current sentence and the statistical features are concatenated and then passed through a phoneme-level contextual encoder, which up-samples and encodes them into a phoneme-level contextual embedding.


Figure 2. Text-based contextual encoder


 


In the sequence-level information extraction branch, the context-sentence token embedding passes through a sequence-level contextual encoder, which applies pooling and encodes it into a contextual representation containing not only current but also historical and future information. These two levels of contextual features are then combined into the final contextual information.
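Here is a minimal sketch of the two branches, again with illustrative layer choices and dimensions (the pretrained token model and the real encoder layers are not shown): the phoneme-level branch up-samples token features to phoneme length, while the sequence-level branch pools the surrounding sentences into a single vector.

import torch
import torch.nn as nn

class TextContextEncoder(nn.Module):
    """Sketch of the two-branch text-based contextual encoder (illustrative)."""
    def __init__(self, tok_dim=768, stat_dim=8, phone_dim=256, ctx_dim=256):
        super().__init__()
        # phoneme-level branch: current-sentence token embeddings + token statistics
        self.phone_branch = nn.Linear(tok_dim + stat_dim, phone_dim)
        # sequence-level branch: pooled context-sentence token embeddings
        self.seq_branch = nn.Linear(tok_dim, ctx_dim)

    def forward(self, cur_tok_emb, tok_stats, ctx_tok_emb, tok_to_phone):
        # cur_tok_emb: [batch, n_tokens, tok_dim]; tok_stats: [batch, n_tokens, stat_dim]
        phone_feat = self.phone_branch(torch.cat([cur_tok_emb, tok_stats], dim=-1))
        # up-sample to phoneme length: tok_to_phone (long tensor) maps each phoneme
        # position to the index of the token it belongs to
        idx = tok_to_phone.unsqueeze(-1).expand(-1, -1, phone_feat.size(-1))
        phone_ctx = torch.gather(phone_feat, 1, idx)         # [batch, n_phonemes, phone_dim]
        # pool the surrounding sentences into one sequence-level vector
        seq_ctx = self.seq_branch(ctx_tok_emb.mean(dim=1))   # [batch, ctx_dim]
        return phone_ctx, seq_ctx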


 


Quality benefits


We conducted a paragraph MOS test on a 10-point scale to measure the voice quality of the proposed contextual model. Paragraph MOS differs from the general sentence MOS in two main aspects: 1) the audio under test and the cases for judgement are paragraphs, not single sentences; 2) the testers are asked to focus not only on the voice quality of individual sentences but also on coherence across sentences.


 


In this paragraph MOS test, we prepared 30 paragraphs, including conversations and narrations from 10 genre novels, for each TTS system (sentence-level model and contextual model), and 7 paragraphs of human recordings. 20 native speakers listened to the audio and gave a score on a 10-point scale according to the overall impression (see Table 2 for the definition), along with specific metrics including naturalness, pleasantness, speech pauses, stress, intonation, emotion, style matching, and listening effort.


 


Table 2: Metric definition of paragraph MOS

Overall impression: How is your overall impression of this content reading, considering prosody both within and across sentences? Consider whether the voice is clear, natural, expressive, easy to understand, and pleasant to listen to.



 


The paragraph MOS results show that the contextual model significantly reduces the paragraph MOS gap with human recordings, from -0.2 (sentence-level model) to -0.06 (contextual model), a roughly 70% reduction of the gap, and closely mirrors human speech at the paragraph level.
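The roughly 70% figure follows directly from the two reported gaps:

# Gaps to the human recording on the 10-point paragraph MOS scale
sentence_gap = -0.20    # sentence-level model vs. recording
contextual_gap = -0.06  # contextual model vs. recording
reduction = (abs(sentence_gap) - abs(contextual_gap)) / abs(sentence_gap)
print(f"{reduction:.0%}")  # 70%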


 


Here are samples from the contextual model and a human recording for your reference.


 


Table 3: Samples from the contextual model vs. human recording. The script below is provided with audio from both the contextual model and a human recording.

Script: It was two stories high; showed no windows, nothing but a door on the lower story, and a blind forehead of discolored wall on the upper; and bore in every feature the marks of prolonged and sordid negligence. The door, which was equipped with neither bell nor knocker, was blistered and distained. Tramps slouched in the recess and struck matches on the panels; children kept shop upon the steps; the schoolboy had tried his knife on the moldings; and for close on a generation no one had appeared to drive away these random visitors or to repair their ravages. Mr. Enfield and the lawyer were on the other side of the by-street; but when they came abreast of the entry, the former lifted up his cane and pointed. “Did you ever remark that door?” he asked; and when his companion had replied in the affirmative, “It is connected in my mind,” added he, “with a very odd story.” “Indeed?” said Mr. Utterson, with a slight change of voice, “and what was that?”



 


How to use it


 


With the contextual modeling described above, an en-US voice (RogerNeural) has been released to the Azure TTS platform. You can find voice samples and sample code in the Voice Gallery. Non-developers can also create audio content directly, without writing a single line of code, using our audio content creation tool. We recommend providing long-form context (such as paragraph text, not a single sentence) as the text input when using Roger to generate audio content.
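As a quick sketch of paragraph-level input with the Speech SDK for Python: the full voice name en-US-RogerNeural is assumed here from the standard Azure voice naming convention, and the key and region are placeholders for your own Speech resource.

import azure.cognitiveservices.speech as speechsdk

# Placeholders: use your own Speech resource key and region
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-RogerNeural"  # assumed full voice name

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Pass the whole paragraph, not one sentence at a time, so the contextual
# model can use the surrounding sentences when rendering each one.
paragraph = (
    "A druid was a man of mystery and power, respected by all save the most "
    "arrogant churchmen and lords. \"Why have we been brought here, druid?\" Farrel asked."
)
result = synthesizer.speak_text_async(paragraph).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Paragraph synthesized to the default speaker.")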


 


Get started


 


Neural Text-to-Speech (Neural TTS), a powerful speech synthesis capability of Azure Cognitive Services, enables users to convert text to lifelike speech. It can be used in various scenarios including voice assistants, content read-aloud capabilities, accessibility tools, and more. Azure Neural TTS has been integrated into many Microsoft flagship products, such as Edge Read Aloud, Word Read Aloud, and Outlook Read Aloud. It has also been adopted by many customers, such as AT&T, Duolingo, Progressive, and more. To date, 140 languages and variants are supported in Azure TTS; users can choose from more than 400 prebuilt voices or use our Custom Neural Voice service to create their own voice instead.


 


To explore the capabilities of Neural TTS with its different voice offerings, try the demo.


 


For more information:


Providing High Availability (HA) to your appliances in Azure Stack HCI


This article is contributed. See the original author and article here.

Imagine this: you have your favorite network virtual appliances (NVAs) sitting in front of your virtual machines (VMs). For one reason or another, you prefer to use firewall or intrusion detection/prevention systems from third-party providers in front of your production or development virtual machine pools. To ensure that these pools are secured by your firewall of choice, you'd usually want to deploy more than one firewall virtual machine and place a load balancer in front of them to ensure availability. However, with virtual machine pools that can number in the hundreds and multiple port rules for different data streams, this quickly becomes a management nightmare when it comes to assigning load balancing rules. The question then becomes: how do we get the best of both worlds, availability and manageability?


 


Well, now you can configure High Availability (HA) Ports load balancing rules on SDN for a pool of NVAs so that your NVAs will remain available and easy to manage. This is done by configuring your traffic type to All and setting your frontend and backend ports to 0 for your load balancing rule. With this, you can manage high availability of your NVAs with a single load balancing rule.


 


A video demo is linked below to show how you would configure this load balancing rule for yourself through Windows Admin Center:




Windows Admin Center:


 



  1. Ensure that you have configured a public or private VIP for your load balancing rule as well as the backend pool of NVAs for HA ports.

  2. Windows Admin Center will require that you also have a health probe enabled for all load balancing rules.

  3. For HA Ports, ensure that traffic type has been set to ALL and that the Frontend and Backend ports are set to 0.

  4. Every other input is up to your discretion!


For more information about configuring HA Ports, please follow this link to our technical documentation.


Thanks for bearing with me, and I hope you found this blog to be helpful. If you happen to give this new feature a try and would like to give some feedback, then please reach out to sdn_feedback@microsoft.com. Lastly, if you’d like to learn more about Software Load Balancers and SDN, here are some resources to read up on.


Using Azure Container Registry to build Docker images for Java projects


This article is contributed. See the original author and article here.

In this article, we will create a Docker image from a Java project using Azure Container Registry and then deploy it to a Docker-compatible hosting environment, for instance Azure Container Apps.


This process requires:



  • JDK 1.8+

  • Maven

  • Azure CLI

  • GIT


And the following Azure resources:



  • Azure Container Registry

  • Azure Container App. This resource can be replaced with another container hosting service, such as Azure App Service, Azure Functions, or Azure Kubernetes Service.


These Azure resources can be created using the following az cli commands:


 

LOCATION=westeurope
RESOURCE_GROUP=rg-acrbuild-demo
ACR_NAME=acrbuilddemo
CONTAINERAPPS_ENVIRONMENT=containerapp-demo
CONTAINERAPPS_NAME=spring-petclinic
# create a resource group to hold resources for the demo
az group create --location $LOCATION --name $RESOURCE_GROUP
# create an Azure Container Registry (ACR) to hold the images for the demo
az acr create --resource-group $RESOURCE_GROUP --name $ACR_NAME --sku Standard --location $LOCATION

# register container apps extension
az extension add --name containerapp --upgrade
# register Microsoft.App namespace provider
az provider register --namespace Microsoft.App
# create an azure container app environment
az containerapp env create \
    --name $CONTAINERAPPS_ENVIRONMENT \
    --resource-group $RESOURCE_GROUP \
    --location $LOCATION

# Create a user managed identity and assign AcrPull role on the ACR.
USER_IDENTITY=$(az identity create -g $RESOURCE_GROUP -n $CONTAINERAPPS_NAME --location $LOCATION --query clientId -o tsv)
ACR_RESOURCEID=$(az acr show --name $ACR_NAME --resource-group $RESOURCE_GROUP --query "id" --output tsv)
az role assignment create \
    --assignee "$USER_IDENTITY" --role AcrPull --scope "$ACR_RESOURCEID"

# container app will be created once the image is pushed to the ACR

 


Let’s take a sample application, for instance the well known spring pet clinic.


 

git clone https://github.com/spring-projects/spring-petclinic.git
cd spring-petclinic

 


Then create a Dockerfile. For demo purposes it will be as simple as possible.


 

FROM openjdk:8-jdk-slim
# takes the jar file as an argument
ARG ARTIFACT_NAME
# assumes the application entry port is 8080
EXPOSE 8080

# The application's jar file
ARG JAR_FILE=${ARTIFACT_NAME}

# Add the application's jar to the container
ADD ${JAR_FILE} app.jar

# Run the jar file
ENTRYPOINT ["java","-jar","/app.jar"]

 


To build it locally the following command would be used:


 

docker build -t myacr.azurecr.io/spring-petclinic:2.7.0 \
    -f Dockerfile \
    --build-arg ARTIFACT_NAME=target/spring-petclinic-2.7.0-SNAPSHOT.jar \
    .

 


To build it in Azure Container Registry the following command would be used instead:


 

az acr build \
        --resource-group rg-spring-petclinic \
        --registry myacr \
        --image spring-petclinic:2.7.0 \
        --build-arg ARTIFACT_NAME=target/spring-petclinic-2.7.0-SNAPSHOT.jar \
        .

 


Instead of executing it manually, this command can be integrated into the Maven build lifecycle. To do that, we will use org.codehaus.mojo:exec-maven-plugin, a plugin that allows executing system or Java programs; here it will run the previous az cli command. To include it in the build lifecycle, we create a new Maven profile.

<profile>
    <id>buildAcr</id>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <id>acr-package</id>
                        <phase>package</phase>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                        <configuration>
                            <executable>az</executable>
                            <workingDirectory>${project.basedir}</workingDirectory>
                            <arguments>
                                <argument>acr</argument>
                                <argument>build</argument>
                                <argument>--resource-group</argument>
                                <argument>${RESOURCE_GROUP}</argument>
                                <argument>--registry</argument>
                                <argument>${ACR_NAME}</argument>
                                <argument>--image</argument>
                                <argument>${project.artifactId}:${project.version}</argument>
                                <argument>--build-arg</argument>
                                <argument>ARTIFACT_NAME=target/${project.build.finalName}.jar</argument>
                                <argument>-f</argument>
                                <argument>Dockerfile</argument>
                                <argument>.</argument>
                            </arguments>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</profile>


To run this step, execute the following Maven goal with the new profile:

mvn package -PbuildAcr -DRESOURCE_GROUP=$RESOURCE_GROUP -DACR_NAME=$ACR_NAME

The resource group and the Azure Container Registry name are taken from environment variables and passed as Maven properties (-D), but they can also be defined directly in the POM or simply hard-coded.


Now the image already exists in Azure Container Registry.



Now it’s time to create the Container App.

# Create the container app
az containerapp create \
    --name ${CONTAINERAPPS_NAME} \
    --resource-group $RESOURCE_GROUP \
    --environment $CONTAINERAPPS_ENVIRONMENT \
    --container-name spring-petclinic-container \
    --user-assigned ${CONTAINERAPPS_NAME} \
    --registry-server $ACR_NAME.azurecr.io \
    --image $ACR_NAME.azurecr.io/spring-petclinic:2.7.0-SNAPSHOT \
    --ingress external \
    --target-port 8080 \
    --cpu 1 \
    --memory 2


Now the application is deployed and running on Azure Container Apps.


You can find the source code for this application here.

 

New transactable offers from Contentsquare, Connecting Software, and Enlighten Designs


This article is contributed. See the original author and article here.

Microsoft partners like Contentsquare, Connecting Software, and Enlighten Designs deliver transact-capable offers, which allow you to purchase directly from Azure Marketplace. Learn about these offers below:


 



Contentsquare Digital Experience Analytics: More than 1,000 leading brands, including BMW, Giorgio Armani, Samsung, Sephora, and Virgin Atlantic, leverage Contentsquare’s digital experience analytics cloud to understand customer behaviors and empower teams to seize growth-accelerating opportunities. Intuitive visual reporting makes it easy to see how customers are using a brand’s site or app, in turn driving more successful experiences.


CB Dynamics 365 to SharePoint Permissions Replicator: This application from Connecting Software automatically synchronizes your Microsoft Dynamics 365 privileges with your SharePoint permissions, enhancing security and avoiding infringement of the European Union’s General Data Protection Regulation. It replicates the Dynamics 365 permission schema and ensures that your SharePoint folders match your CRM security model.

IDA, the Insights and Discovery Accelerator: Powered by Microsoft Azure Cognitive Search, this solution from Enlighten Designs facilitates journalistic research. The Insights and Discovery Accelerator uses object visioning to identify characteristics like columns and hyphens from text scans to give researchers a full and accurate transcript. Its video indexer can recognize people, topics, and entities and can pull transcripts from video and audio files.