Azure Neural TTS previews a new contextual voice model for long-form paragraph reading

This article is contributed. See the original author and article here.

This blog is co-authored with Shaofei Zhang, Xi Wang, Lei He, Sheng Zhao


 


Azure Neural Text-to-Speech (Neural TTS) has made rapid progress in speech quality, with voice models that closely mirror natural speech at the sentence level (see more details here). However, in real-life scenarios, TTS is used not just to read out single sentences but also content within a long-form context, such as paragraphs in web pages, audiobooks, or video subtitles.


 


In general, TTS synthesizes paragraphs or other long-form audio sentence by sentence: each sentence is synthesized on its own, and the results are concatenated into a paragraph. Sentence-level speech synthesis takes only the single sentence into account, with no consideration of the content before or after it, even in a long-form context. A sentence generated with this approach sounds the same no matter how its surrounding context changes. Scenarios like audiobooks or web content read-aloud usually consist of many paragraphs with rich context, and in such scenarios listeners expect synthesized sentences in different contexts to vary in pitch, rhythm, intonation, and so on, creating a sense of coherence between sentences. Without contextual information in the model, sentence-level synthetic voices sound more monotone and less expressive, leaving listeners less engaged.
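To make the conventional pipeline concrete, here is a minimal Python sketch of per-sentence synthesis followed by concatenation. The naive sentence splitter and the synthesize_sentence function are hypothetical placeholders for illustration, not part of any Azure API:

```python
# Minimal sketch of conventional sentence-level synthesis.
import numpy as np

def synthesize_paragraph_naive(paragraph: str, synthesize_sentence) -> np.ndarray:
    """Split a paragraph into sentences, synthesize each one in isolation,
    and concatenate the waveforms. No sentence sees its neighbors, so the
    prosody of each sentence is independent of its context."""
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    waveforms = [synthesize_sentence(s + ".") for s in sentences]
    return np.concatenate(waveforms)
```

Because each call sees only one sentence, the prosody of every sentence is fixed regardless of what comes before or after it.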


 


In this blog, we introduce a new technical innovation that takes contextual information into account when modeling TTS voices for paragraph or long-form content reading. Measured by paragraph MOS (Mean Opinion Score), this new technology significantly improves coherence and expressiveness when generating long-form audio. With it, we are glad to announce the public preview of Roger, a contextual voice model in English (US), enabling customers to generate more expressive and natural-sounding long-form audio content using the Azure Neural TTS service.


 


Introducing Roger: the new contextual voice model


 


Roger is a voice that understands the paragraph content and automatically identifies the context for each sentence. Hence, he can adjust pitch, rhythm, and intonation to match the context, and insert natural pauses as needed when reading paragraphs.


 


Listen to the samples below to hear how expressively Roger reads paragraphs. Samples generated with the same voice's sentence-level model are also provided as a baseline, so you can compare how they differ in pauses, intonation, rhythm, and so on, both within and across sentences.

Table 1: Samples from Roger (sentence-level voice model as a baseline). Each script below is paired with audio from the Sentence Model and from the Contextual Model.

Scripts:

1. A druid was a man of mystery and power, respected by all save the most arrogant churchmen and lords. “Why have we been brought here, druid?” Farrel asked. “No doubt we will learn that in time,” Valin said. “But I sense our circle is not yet complete. Look, another joins us.” With his short wooden wand he gestured down the hill, to where a maiden was stepping among the fallen stones. She wore a gown of white linen, loose and frayed.

2. Many others were asking different questions. Questions like “if they made it, why are they still sending people out?” and “I’m next, I don’t want to go, what’s down there?” float through our minds like clouds. Even as hushed whispers begin to break the silence, my Squad remains silent.

3. He then gazed at every soldier in that trench, giving each of them a nod. “Do and die boys, do and die!” The twenty-eight repeated the chant as the sound of the ghastly GEOMs began to resound. “Do and Die!” Their voices drowned out the GEOM swarm, taunting the monsters to finish them off.



 


With the integration of contextual information, the generated speech sounds more expressive and coherent in prosody both within and across sentences; for content that is especially rich and expressive, the audio from the contextual model is also more stable and natural.


 


The technology behind


 


In this section, we describe the technology behind the new contextual voice model: how we integrate contextual information into neural TTS voice modeling and build the text-based contextual encoder for long-form audio generation, followed by how we measure the quality benefit.


 


Integrating contextual information into neural TTS modeling


Given the same sentence in different contexts, the prosody of the generated speech can differ. Based on this observation, we use a text-based contextual encoder to extract contextual information from the paragraph text and condition the current sentence's phoneme embedding on it, which alleviates the one-to-many problem in conventional TTS. Modeling long-form context requires long-form data, such as paragraph text and paragraph-level voice recordings. As shown in Figure 1, we combine the phoneme embedding with the contextual information from the text-based contextual encoder, then run the same sentence-level modeling process to generate the mel-spectrograms from which the audio waveform is produced.


Figure 1. Overall architecture of the contextual voice model
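As a rough illustration of the conditioning step in Figure 1, the PyTorch-style sketch below adds a contextual embedding to the phoneme embedding before a stand-in sentence-level encoder. All module choices, names, and shapes here are our assumptions for illustration, not the actual Azure model:

```python
import torch
import torch.nn as nn

class ContextConditionedAcousticModel(nn.Module):
    def __init__(self, n_phonemes: int, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, d_model)
        # Stand-ins for the sentence-level encoder/decoder stack.
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, context_embedding):
        # phoneme_ids: (batch, T); context_embedding: (batch, T, d_model)
        x = self.phoneme_embedding(phoneme_ids)
        # Condition the phoneme embeddings on the contextual information
        # (here, a simple additive combination) before sentence-level modeling.
        x = x + context_embedding
        h, _ = self.encoder(x)
        return self.mel_head(h)  # (batch, T, n_mels) mel-spectrogram frames
```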


 


Text-based contextual encoder


As shown in Figure 2, we use two branches to extract phoneme-level and sequence-level semantic and syntactic information separately. Given a paragraph of text, a token-level statistics extractor computes syntactic information covering the correlations among token, sentence, and paragraph. For each sentence, taken together with a predefined range of surrounding context, we use a pretrained model to extract token-level semantic context embeddings: a current-sentence token embedding and a context-sentence token embedding.


 


In the phoneme-level information extraction branch, the token-level embedding of the current sentence and the statistical features are concatenated and then passed through a phoneme-level contextual encoder, which up-samples and encodes them into a phoneme-level contextual embedding.


Figure 2. Text-based contextual encoder


 


In the sequence-level information extraction branch, the context-sentence token embedding passes through a sequence-level contextual encoder, which pools it into a contextual representation containing not only current but also historical and future information. These two levels of contextual features are then combined into the final contextual information.
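Here is a minimal sketch of how these two branches could be combined, under assumed dimensions and a naive fixed-rate up-sampling (a real system would up-sample using predicted phoneme durations); the production encoder internals are not public:

```python
import torch
import torch.nn as nn

class TextContextEncoder(nn.Module):
    def __init__(self, d_token: int = 768, d_stat: int = 16, d_out: int = 256):
        super().__init__()
        self.phoneme_branch = nn.Linear(d_token + d_stat, d_out)   # phoneme-level branch
        self.sequence_branch = nn.Linear(d_token, d_out)           # sequence-level branch

    def forward(self, cur_tokens, stats, ctx_tokens, phonemes_per_token: int):
        # Phoneme-level branch: concatenate current-sentence token embeddings
        # with token-level statistics, encode, then up-sample to phoneme rate.
        phon = self.phoneme_branch(torch.cat([cur_tokens, stats], dim=-1))
        phon = phon.repeat_interleave(phonemes_per_token, dim=1)
        # Sequence-level branch: pool context-sentence token embeddings
        # (historical + current + future sentences) into one vector.
        seq = self.sequence_branch(ctx_tokens).mean(dim=1, keepdim=True)
        # Combine the two levels into the final contextual information.
        return phon + seq
```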


 


Quality benefits


We conducted a paragraph MOS test on a 10-point scale to measure the voice quality of the proposed contextual model. Paragraph MOS differs from the common sentence MOS in two main aspects: 1) the audios under test are paragraphs, not single sentences; 2) testers are asked to judge not only the voice quality of each sentence but also the coherence across sentences.


 


In this paragraph MOS test, for each TTS system (the sentence-level model and the contextual model) we prepared 30 paragraphs, including conversations and narration from 10 genre novels, plus 7 paragraphs of human recordings. 20 native speakers listened to the audios and gave a score on a 10-point scale according to overall impression (see Table 2 for the definition), along with specific metrics including naturalness, pleasantness, speech pauses, stress, intonation, emotion, style match, and listening effort.

Table 2: Metric definition of paragraph MOS

Overall impression: How is your overall impression of this content reading, considering both within and across sentences? Consider whether the voice is clear, natural, expressive, easy to understand, and pleasant to listen to.



 


The paragraph MOS results show that the contextual model significantly narrows the paragraph MOS gap to the recording, from -0.2 (sentence-level model) to -0.06 (contextual model), a roughly 70% reduction of the gap, closely mirroring human speech at the paragraph level.
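The roughly 70% figure follows directly from the two reported gaps:

```python
# Worked check of the reported ~70% reduction in the paragraph MOS gap.
sentence_gap = 0.20    # sentence-level model vs. human recording
contextual_gap = 0.06  # contextual model vs. human recording
reduction = (sentence_gap - contextual_gap) / sentence_gap
print(f"{reduction:.0%}")  # prints 70%
```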


 


Here are samples from the contextual model and a human recording for your reference.

Table 3: Samples from the contextual model vs. human recording. The script below is paired with audio from the Contextual Model and from a Human recording.

Script: It was two stories high; showed no windows, nothing but a door on the lower story and a blind forehead of discolored wall on the upper; and bore in every feature the marks of prolonged and sordid negligence. The door, which was equipped with neither bell nor knocker, was blistered and distained. Tramps slouched in the recess and struck matches on the panels; children kept shop upon the steps; the schoolboy had tried his knife on the moldings; and for close on a generation no one had appeared to drive away these random visitors or to repair their ravages. Mr. Enfield and the lawyer were on the other side of the by-street; but when they came abreast of the entry, the former lifted up his cane and pointed. “Did you ever remark that door?” he asked; and when his companion had replied in the affirmative, “It is connected in my mind,” added he, “with a very odd story.” “Indeed?” said Mr. Utterson, with a slight change of voice, “and what was that?”



 


How to use it


 


With the contextual modeling described above, an en-US voice (RogerNeural) has been released to the Azure TTS platform. You can find voice samples and sample code in the Voice Gallery. Non-developers can also create audio content directly, without writing a single line of code, using our audio content creation tool. When using Roger to generate audio content, we recommend providing long-form context (such as paragraph text, not a single sentence) as the text input.
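For example, with the Azure Speech SDK for Python (pip install azure-cognitiveservices-speech) you can select the voice by name and pass an entire paragraph as input; the subscription key and region below are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-RogerNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Pass the full paragraph, not one sentence at a time, so the contextual
# model can use the surrounding sentences when shaping prosody.
paragraph = (
    "A druid was a man of mystery and power, respected by all save the most "
    "arrogant churchmen and lords. \"Why have we been brought here, druid?\" "
    "Farrel asked."
)
result = synthesizer.speak_text_async(paragraph).get()
```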


 


Get started


 


Neural Text-to-Speech (Neural TTS), a powerful speech synthesis capability of Azure Cognitive Services, enables users to convert text to lifelike speech. It can be used in various scenarios, including voice assistants, content read-aloud capabilities, accessibility tools, and more. Azure Neural TTS has been integrated into many Microsoft flagship products, such as Edge Read Aloud, Word Read Aloud, and Outlook Read Aloud. It has also been adopted by many customers, such as AT&T, Duolingo, Progressive, and more. To date, Azure TTS supports 140 languages and variants; users can choose from more than 400 prebuilt voices or use our Custom Neural Voice service to create their own.


 


To explore the capabilities of Neural TTS with its different voice offerings, try the demo.


 


For more information:


CISA and ACSC Release Top 2021 Malware Strains

This article is contributed. See the original author and article here.

CISA and the Australian Cyber Security Centre (ACSC) have published a joint Cybersecurity Advisory on the top malware strains observed in 2021. Malicious cyber actors often use malware to covertly compromise and then gain access to a computer or mobile device. As malicious cyber actors have been using most of these top malware strains for more than five years, organizations have opportunities to better prepare, identify, and mitigate attacks from these strains.  

CISA and ACSC encourage organizations to apply the recommendations in the Mitigations sections of the joint CSA. These mitigations include prioritizing patching all systems with known exploited vulnerabilities, enforcing multifactor authentication (MFA), securing remote desktop protocol (RDP) and other risky services, making offline backups of your data, and providing end-user awareness and training about social engineering and phishing. The appendix contains detection signatures organizations can employ in defending their networks. For more information on preventing malicious cyber actors from using 2021 top malware strains to exploit vulnerabilities, see:

•    CISA’s Known Exploited Vulnerabilities Catalog 
•    CISA’s Cyber Hygiene Services
•    CISA’s Choosing and Protecting Passwords
•    ACSC’s Implementing Multi-Factor Authentication
 

Providing High Availability (HA) to your appliances in Azure Stack HCI


This article is contributed. See the original author and article here.

Imagine this: you have your favorite network virtual appliances (NVAs) sitting in front of your virtual machines (VMs). For one reason or another, you prefer firewall or intrusion detection/prevention systems from third-party providers to sit in front of your production or development virtual machine pools. To ensure that those pools are secured by your firewall of choice, you would usually put up more than one firewall virtual machine and place a load balancer in front of them to ensure availability. However, with virtual machine pools that can number in the hundreds, with multiple port rules for different data streams, assigning load balancing rules quickly becomes a management nightmare. The question then becomes: how do we get the best of both worlds, availability and manageability?


 


Well, now you can configure High Availability (HA) Ports load balancing rules on SDN for a pool of NVAs, so that your NVAs remain available and easy to manage. You do this by setting the traffic type to All and setting both the frontend and backend ports to 0 in your load balancing rule. With this, you can manage high availability for your NVAs with a single load balancing rule.
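Conceptually, the rule reduces to a handful of settings. The sketch below shows them as a plain Python data structure; the exact field names in the Network Controller schema may differ, so treat this as an illustration of the settings, not the literal API payload:

```python
# Hypothetical shape of an HA Ports load balancing rule: the essential
# settings are traffic type "All" and frontend/backend ports set to 0.
ha_ports_rule = {
    "protocol": "All",   # traffic type set to All
    "frontendPort": 0,   # 0 means "every port" on the VIP
    "backendPort": 0,    # 0 means "every port" on the NVA backend pool
}
# With a single rule like this, all flows arriving at the VIP are balanced
# across the NVA pool, instead of one rule per port/protocol pair.
```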


 


A video demo showing how to configure this load balancing rule through Windows Admin Center is available in the original post. To configure it yourself in Windows Admin Center:


 



  1. Ensure that you have configured a public or private VIP for your load balancing rule as well as the backend pool of NVAs for HA ports.

  2. Windows Admin Center will require that you also have a health probe enabled for all load balancing rules.

  3. For HA Ports, ensure that traffic type has been set to ALL and that the Frontend and Backend ports are set to 0.

  4. Every other input is up to your discretion!


For more information regarding configuring HA Ports, please see our technical documentation.


Thanks for bearing with me, and I hope you found this blog to be helpful. If you happen to give this new feature a try and would like to give some feedback, then please reach out to sdn_feedback@microsoft.com. Lastly, if you’d like to learn more about Software Load Balancers and SDN, here are some resources to read up on.


Supercharge your CRM with Microsoft Viva Sales—Now in preview


This article is contributed. See the original author and article here.

At the announcement of Microsoft Viva Sales, Paul Greenberg (Founder and Managing Principal of The 56 Group, LLC, often called the "Godfather of CRM" and author of CRM at the Speed of Light) and I were chatting about the state of the customer relationship management (CRM) market and the innovations of the last 20 years. One area we both agreed has been underserved is the seller experience. CRM vendors have all focused on the system of record, the CRM system, but sellers continue to work in the system of productivity, Microsoft Office and Microsoft Teams, with no connection between them.

In independent research conducted by Microsoft, we surveyed over 500 salespeople who live and work in North America and use a CRM on a regular basis. The findings revealed that the tools in use today aren't always helping; in some instances, they even hinder a salesperson's ability to do their job.1 Sellers love the CRM, but they hate manual data entry. This is one area Microsoft wants to focus on with Viva Sales. With Viva Sales, we take the manual work of CRM away from the sellers and put technology to work on their behalf so they can cut the forms, connect the data, and crush the sale.

Today I’m pleased to share that Microsoft Viva Sales is now in preview.

The Viva Sales preview includes surfacing customer contact information from your CRM directly in Microsoft Outlook and Teams, connecting to the CRM of your choice, capturing rich contact information in Outlook, capturing notes and action items using AI, providing AI driven recommendations, and more.

Improve the seller experience with Microsoft Viva Sales

The past two years have had a profound impact on work. Employee expectations concerning how, where, and when they work have significantly changed and continue to evolve. While these work trends are pervasive across the workforce, they have especially reshaped the expectations of sales professionals, who have had to adapt to an increasingly digital workplace while using outdated sales tools. In fact, 74 percent of sellers described sales intelligence tools as critical or extremely critical in closing deals.1 Prior to 2018, digital selling was already gaining popularity, but the forced adoption of remote work brought on by COVID-19 put a spotlight on the top pain points common across sales organizations by highlighting the gaps and limitations in their sales tools.

With Viva Sales we address these gaps to mitigate these top pain points:

  • Manual data entry is time-consuming and frustrating and bogs down digital sellers.  
  • Inability to capture customer engagement data in productivity applications where sellers do their work. According to Futurum research, 82 percent of salespeople shared that faulty data has led to an embarrassing mistake with a customer.2
  • Lack of AI-based recommendations delivered to the point of action.
  • Disconnected processes and tools that slow down productivity. Sellers spend two-thirds of their time on administrative tasks that do not directly generate revenue.

Viva Sales gives sellers more time to focus on selling by eliminating the administrative burden of manual data entry to provide an improved seller experience. Combining the power of Microsoft 365 applications and Teams, this new sales experience application captures, accesses, and registers data into any CRM. With Viva Sales, sellers can cut the forms, connect the data, and crush the sale.

The power of AI combined with enriched customer engagement data from Microsoft 365 applications and Teams provides easy access to sales intelligence in the applications sellers already use every day.

In the video, The future of sales enablement, Paul states “Because Viva Sales is 100% seller focused and all the sales manager stuff is taken care of by the more traditional CRM technologies. That’s pretty awesome. You’ve managed to create it in a way that doesn’t disrupt or destroy the traditional focus but actually enhances it but at the same time makes the seller’s experience much better for that seller, which is not something that has been around much.” 

Learn more

Read our in-depth blog about Viva Sales to learn more and find out how to try out Viva Sales today.


Sources

1Microsoft Viva Sales: Supercharge your CRM, PDF.

2Futurum, 2022. Reimagining the Sales Process – Are You Ready?

The post Supercharge your CRM with Microsoft Viva Sales—Now in preview appeared first on Microsoft Dynamics 365 Blog.
