What’s new in SynapseML v0.11

Announcing SynapseML v0.11. The new version contains many new features to help you build scalable machine learning pipelines.


 


 


We are pleased to announce SynapseML v0.11, a new version of our open-source distributed machine learning library that simplifies and accelerates the development of scalable AI. This release brings many new features from the past year of development, as well as many bug fixes and improvements. This post gives a high-level overview of the most salient additions; curious readers can check out the full release notes for everything that is new.


 


OpenAI Language Models and Embeddings


A new release wouldn’t be complete without joining the large language model (LLM) hype train, and SynapseML v0.11 includes a variety of new capabilities that make large-scale LLM usage simple. In particular, SynapseML v0.11 introduces three new APIs for working with foundation models: `OpenAIPrompt`, `OpenAIEmbedding`, and `OpenAIChatCompletion`. The `OpenAIPrompt` API makes it easy to construct complex LLM prompts from columns of your dataframe. Here’s a quick example of translating a dataframe column called “Description” into emojis (the key, deployment, and service names in the snippet are placeholders for your own Azure OpenAI resource).


 

from synapse.ml.cognitive.openai import OpenAIPrompt

emoji_template = """
  Translate the following into emojis
  Word: {Description}
  Emoji: """

results = (OpenAIPrompt()
    .setSubscriptionKey(openai_key)         # key for your Azure OpenAI resource (placeholder)
    .setDeploymentName(openai_deployment)   # a completion-model deployment you created (placeholder)
    .setCustomServiceName(openai_service)   # your Azure OpenAI resource name (placeholder)
    .setPromptTemplate(emoji_template)
    .setErrorCol("error")
    .setOutputCol("Emoji")
    .transform(inputs))                     # inputs: a DataFrame with a "Description" column

 


 


This code automatically looks for a dataframe column called “Description” and prompts your LLM (ChatGPT, GPT-3, or GPT-4) with the constructed prompts. Our new OpenAI embedding classes make it easy to embed large tables of sentences directly from your Apache Spark clusters. To learn more, see our docs on using the OpenAI embeddings API and the SynapseML KNN model to create an LLM-based vector search engine directly on your Spark cluster. Finally, the new OpenAIChatCompletion transformer allows users to submit large quantities of chat-based prompts to ChatGPT, enabling parallel inference over thousands of conversations at a time. We hope you find the new OpenAI integrations useful for building your next intelligent application.
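
As a rough sketch of the embedding workflow (the service name, deployment name, and key below are placeholders for your own Azure OpenAI resource, and the setter names should be checked against the SynapseML docs for your version):

from synapse.ml.cognitive.openai import OpenAIEmbedding

embeddings = (OpenAIEmbedding()
    .setSubscriptionKey(openai_key)               # placeholder: your Azure OpenAI key
    .setDeploymentName("text-embedding-ada-002")  # placeholder: an embedding deployment you created
    .setCustomServiceName(openai_service)         # placeholder: your Azure OpenAI resource name
    .setTextCol("Description")                    # column of text to embed
    .setErrorCol("error")
    .setOutputCol("embeddings")
    .transform(inputs))

The resulting embedding column can then be fed to the SynapseML KNN model to build the vector search engine described above.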


 


Simple Deep Learning


SynapseML v0.11 introduces a new simple deep learning package that lets you train custom text and vision classifiers with only a few lines of code. The package combines distributed deep network training on PyTorch Lightning with the simple, easy-to-use APIs of SynapseML. It allows users to fine-tune visual foundation models from torchvision as well as a variety of state-of-the-art text backbones from HuggingFace.


 


Here’s a quick example showing how to fine-tune custom vision networks:


 

from synapse.ml.dl import DeepVisionClassifier

# A toy dataset of (image path, label) pairs.
train_df = spark.createDataFrame([
    ("PATH_TO_IMAGE_1.jpg", 1),
    ("PATH_TO_IMAGE_2.jpg", 2)
], ["image", "label"])

deep_vision_classifier = DeepVisionClassifier(
    backbone="resnet50",  # any torchvision backbone name
    num_classes=2,
    batch_size=16,
    epochs=2,
)

deep_vision_model = deep_vision_classifier.fit(train_df)
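
Fine-tuning a text classifier follows the same pattern. Here is a minimal sketch, assuming a HuggingFace checkpoint name and a dataframe of your own with "text" and "label" columns:

from synapse.ml.dl import DeepTextClassifier

deep_text_classifier = DeepTextClassifier(
    checkpoint="bert-base-uncased",  # any HuggingFace text backbone (illustrative choice)
    text_col="text",
    label_col="label",
    num_classes=2,
    batch_size=16,
    epochs=2,
)

deep_text_model = deep_text_classifier.fit(train_text_df)  # train_text_df: your (text, label) DataFrame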

 


 


Keep an eye out for upcoming releases of SynapseML featuring additional simple deep-learning algorithms that will make it easier than ever to train and deploy models at scale.


 


LightGBM v2


LightGBM is one of the most heavily used features of SynapseML, and we heard your feedback asking for better performance! SynapseML v0.11 introduces a completely refactored integration between LightGBM and Spark, called LightGBM v2. This integration aims for high performance by introducing a variety of new streaming APIs in the core LightGBM library that enable fast and memory-efficient data sharing between Spark and LightGBM. In particular, the new “streaming execution mode” has a >10x lower memory footprint than earlier versions of SynapseML, yielding fewer memory issues and faster training. Best of all, you can opt into the new mode by passing a single extra flag to your existing LightGBM models in SynapseML.
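
For example, opting in is a one-parameter change to an existing model. The sketch below uses dataTransferMode as the flag name; please confirm the exact parameter name against the LightGBM documentation for your SynapseML version:

from synapse.ml.lightgbm import LightGBMClassifier

lgbm = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    dataTransferMode="streaming",  # assumption: the single extra flag that enables streaming execution mode
)

lgbm_model = lgbm.fit(train_df)  # train_df: your assembled (features, label) DataFrame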


 


ONNX Model Hub


SynapseML supports a variety of deep learning integrations with the ONNX runtime for fast, hardware-accelerated inference in all of the SynapseML languages (Scala, Java, Python, R, and .NET). In version 0.11 we add support for the ONNX model hub, an open collection of state-of-the-art pre-trained ONNX models that can be quickly downloaded and embedded into Spark pipelines. This allowed us to completely deprecate and remove our old dependency on the CNTK deep learning library.
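
As a sketch of what this can look like in Python (the model name, tensor names, and column names are illustrative assumptions, and the download step uses the onnx package's hub client rather than a SynapseML-specific API):

from onnx import hub
from synapse.ml.onnx import ONNXModel

# Pull a pretrained model from the ONNX model hub (model name is illustrative).
resnet = hub.load("resnet50")

onnx_model = (ONNXModel()
    .setModelPayload(resnet.SerializeToString())  # hand the raw ONNX bytes to SynapseML
    .setDeviceType("CPU")
    .setFeedDict({"data": "features"})            # assumption: model input name -> DataFrame column
    .setFetchDict({"scores": "output"}))          # assumption: output column -> model output tensor name

scored = onnx_model.transform(features_df)        # features_df: your preprocessed feature DataFrame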


 


To learn more about how you can embed deep networks into Spark pipelines, check out our ONNX episode in the new SynapseML video series:


 


 


Causal Learning


SynapseML v0.11 introduces a new package for causal learning that can help businesses and policymakers make more informed decisions. When trying to understand the impact of a “treatment” or intervention on an outcome, traditional approaches like correlation analysis or prediction models fall short as they do not necessarily establish causation. Causal inference aims to overcome these shortcomings by bridging the gap between prediction and decision-making. SynapseML’s causal learning package implements a technique called “Double machine learning”, which allows us to estimate treatment effects without data from controlled experiments. Unlike regression-based approaches, this approach can model non-linear relationships between confounders, treatment, and outcome. Users can run the DoubleMLEstimator using a simple code snippet like the one below:


 

from pyspark.ml.classification import LogisticRegression
from synapse.ml.causal import DoubleMLEstimator

dml = (DoubleMLEstimator()
      .setTreatmentCol("Treatment")
      .setTreatmentModel(LogisticRegression())
      .setOutcomeCol("Outcome")
      .setOutcomeModel(LogisticRegression())
      .setMaxIter(20))

dmlModel = dml.fit(dataset)
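
Once fitted, the model exposes the estimated effect. The accessor names below are taken from the SynapseML causal docs as I recall them, so double-check them against your version:

# Continuing the snippet above: inspect the estimated treatment effect.
ate = dmlModel.getAvgTreatmentEffect()
ci = dmlModel.getConfidenceInterval()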

 


 


For more information, be sure to check out Dylan Wang’s guided tour of the DoubleMLEstimator on the SynapseML video series:


 


Vowpal Wabbit v2


Finally, SynapseML v0.11 introduces Vowpal Wabbit v2, the second-generation integration between the Vowpal Wabbit (VW) online optimization library and Apache Spark. With this update, users can work with data in VW’s native format directly using the new “VowpalWabbitGeneric” model, which makes moving to Spark easier for existing VW users. The more direct integration also adds support for new cost functions and use cases, including “multi-class” and “cost-sensitive one against all” problems. The update also introduces a new progressive validation strategy and a Contextual Bandit offline policy evaluation notebook that demonstrates how to evaluate VW models on large datasets.
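
A minimal sketch of the new model is below. The column name, the parameter name, and the VW arguments are assumptions on my part; consult the VW notebooks that ship with SynapseML for the exact expectations:

from synapse.ml.vw import VowpalWabbitGeneric

# Each row holds one example in VW's native text format.
# Assumption: the model reads examples from a column named "input".
train_df = spark.createDataFrame(
    [("1 | price:0.23 sqft:0.25",), ("-1 | price:0.92 sqft:0.15",)],
    ["input"],
)

vw = VowpalWabbitGeneric(
    passThroughArgs="--loss_function logistic",  # assumption: raw VW command-line arguments pass straight through
)

vw_model = vw.fit(train_df)
predictions = vw_model.transform(train_df)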


 


Conclusion


In conclusion, we are thrilled to share the new SynapseML release with you and hope you will find that it simplifies your distributed machine learning pipelines. This blog only covered the highlights, so be sure to check out the full release notes for all the updates and new features. Whether you are working with large language models, training custom classifiers, or performing causal inference, SynapseML makes it easier and faster to develop and deploy machine learning models at scale.


 




How to create a custom extension for Azure DevOps

In some cases, you need to create a custom extension for Azure DevOps, either to add functionality that is not available natively or to change an existing feature that does not meet your project’s needs. In this article, we will show how to create a custom extension for Azure DevOps and how to publish it to the Azure DevOps Marketplace.

Before you start, make sure you have:

  • An Azure DevOps account. If you do not have one yet, you can create one by following the instructions available here.

  • A code editor installed, such as Visual Studio Code, which can be downloaded from code.visualstudio.com.

  • The LTS version of Node.js installed, available for download from nodejs.org, and the TypeScript compiler installed (version 4.0.2 or later is recommended); it can be installed via npm (npmjs.com).

  • The TFX CLI installed (version 0.14.0 or later is recommended). It can be installed globally via npm with the command npm i -g tfx-cli.


Preparing the development environment

  1. Create a folder for the extension, for example my-extension, and inside it create a subfolder, for example task.

  2. Open a terminal in the folder you created and run npm init -y (the -y flag accepts all default options). You will notice that a file called package.json was created; it contains the extension’s metadata.


    {
      "name": "my-extension",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {
        "build": "tsc ./index.ts"
      },
      "keywords": [],
      "author": "",
      "license": "ISC"
    }



  3. Add azure-pipelines-task-lib as a dependency of the extension by running npm i azure-pipelines-task-lib --save-dev.

  4. Also add the TypeScript typings by running npm i @types/node --save-dev and npm i @types/q --save-dev.

  5. Create a .gitignore file in the extension’s root folder and add the following content:

    node_modules

  6. Install the TypeScript compiler by running npm i typescript --save-dev.

  7. Create a tsconfig.json file in the extension’s root folder and add the following content:


    {
      "compilerOptions": {
        "target": "es6",                         /* Specify ECMAScript target version. */
        "module": "commonjs",                    /* Specify module code generation. */
        "strict": true,                          /* Enable all strict type-checking options. */
        "esModuleInterop": true,                 /* Enable emit interoperability between CommonJS and ES Modules. */
        "skipLibCheck": true,                    /* Skip type checking of declaration files. */
        "forceConsistentCasingInFileNames": true /* Disallow inconsistently-cased references to the same file. */
      }
    }



  8. Create a file called vss-extension.json in the root folder of the my-extension extension and add the following content:


    {
      "manifestVersion": 1,
      "id": "<>",
      "version": "1.0.0",
      "publisher": "<>",
      "name": "My Extension",
      "description": "My Extension",
      "public": false,
      "categories": [
        "Azure Pipelines"
      ],
      "targets": [
        {
          "id": "Microsoft.VisualStudio.Services"
        }
      ],
      "icons": {
        "default": "images/icon.png"
      },
      "files": [
        {
          "path": "task"
        }
      ],
      "contributions": [
        {
          "id": "my-extension",
          "description": "My Extension",
          "type": "ms.vss-distributed-task.task",
          "targets": [
            "ms.vss-distributed-task.tasks"
          ],
          "properties": {
            "name": "my-extension"
          }
        }
      ]
    }

    Replace the first <> with a unique ID for your extension (you can generate an ID here). Replace the second <> with the publisher ID created in step 1 of the publishing section.




  9. In the root folder of your my-extension extension, create a folder called images and add an image named icon.png with a size of 128×128 pixels. This image will be used as your extension’s icon in the Marketplace.




Creating the extension

After setting up the environment, you can create the extension.

  1. In the task folder, create a file called task.json and add the following content:


    {
      "$schema": "https://raw.githubusercontent.com/Microsoft/azure-pipelines-task-lib/master/tasks.schema.json",
      "id": "<>",
      "name": "My Extension",
      "friendlyName": "My Extension",
      "description": "My Extension",
      "helpMarkDown": "",
      "category": "Utility",
      "visibility": [
        "Build",
        "Release"
      ],
      "author": "Your Name",
      "version": {
        "Major": 1,
        "Minor": 0,
        "Patch": 0
      },
      "groups": [],
      "inputs": [],
      "execution": {
        "Node16": {
          "target": "index.js"
        }
      }
    }

    Replace <> with the same GUID generated in step 8 of the environment preparation section.


    This file describes the extension that will be executed in the pipeline. At this point the extension does not do anything yet, but you can add inputs and the logic to run whatever you need.

  2. Next, create a file called index.js and add the following content:


    const tl = require('azure-pipelines-task-lib/task');

    async function run() {
      try {
        tl.setResult(tl.TaskResult.Succeeded, 'My Extension Succeeded!');
      }
      catch (err) {
        if (err instanceof Error) {
          tl.setResult(tl.TaskResult.Failed, err.message);
        }
      }
    }

    run();

    This file is responsible for executing the extension. In this case, it simply returns a success message; you can add logic to run whatever you need.




  3. In the task folder, add an image named icon.png with a size of 32×32 pixels. This image will be used as your extension’s icon in Azure Pipelines.

  4. In the terminal, run the tsc command to compile the TypeScript code to JavaScript. This command generates an index.js file in the task folder.

  5. To run the extension locally, run node index.js. You should see the message My Extension Succeeded!.


    C:\temp\my-extension\task> node index.js
    ##vso[task.debug]agent.TempDirectory=undefined
    ##vso[task.debug]agent.workFolder=undefined
    ##vso[task.debug]loading inputs and endpoints
    ##vso[task.debug]loading INPUT_CLEANTARGETFOLDER
    ##vso[task.debug]loading INPUT_CLIENTID
    ##vso[task.debug]loading INPUT_CLIENTSECRET
    ##vso[task.debug]loading INPUT_CONFLICTBEHAVIOUR
    ##vso[task.debug]loading INPUT_CONTENTS
    ##vso[task.debug]loading INPUT_DRIVEID
    ##vso[task.debug]loading INPUT_failOnEmptySource
    ##vso[task.debug]loading INPUT_FLATTENFOLDERS
    ##vso[task.debug]loading INPUT_SOURCEFOLDER
    ##vso[task.debug]loading INPUT_TARGETFOLDER
    ##vso[task.debug]loading INPUT_TENANTID
    ##vso[task.debug]loaded 11
    ##vso[task.debug]Agent.ProxyUrl=undefined
    ##vso[task.debug]Agent.CAInfo=undefined
    ##vso[task.debug]Agent.ClientCert=undefined
    ##vso[task.debug]Agent.SkipCertValidation=undefined
    ##vso[task.debug]task result: Succeeded
    ##vso[task.complete result=Succeeded;]My Extension Succeeded!
    C:\temp\my-extension\task>



Publishing the extension to the Marketplace

When your extension is ready, you can publish it to the Marketplace. To do so, you will need to create an extension publisher in the Marketplace.

  1. Go to the Marketplace and click Publish Extension. After signing in, you will be redirected to the publisher creation page. Fill in the fields and click Create.

    (Screenshot: creating an extension publisher)

  2. In the my-extension folder, run tfx extension create --manifest-globs vss-extension.json in the terminal. This command generates a file called publishID-1.0.0.vsix, which is the file that will be published to the Marketplace.

    (Screenshot: output of tfx extension create)




  3. Go to the extension publishing page in the Marketplace, click New extension, and then Azure DevOps. Select the my-extension-1.0.0.vsix file and click Upload.

    (Screenshot: uploading the extension)

    If everything goes well, you will see the extension listed as published.

    (Screenshot: extension published)

  4. With the extension published, you need to share it with your organization. To do so, open the extension’s context menu and click Share/Unshare.

    (Screenshot: Share/Unshare option)

    Click + Organization.

    (Screenshot: adding an organization)

    Then type the name of your organization; when you click outside the text box, the name is validated and the extension is shared.

    (Screenshot: organization shared)




Installing the extension in your organization

After publishing the extension to the Marketplace, you can install it in your organization by following the steps below.

  1. Open the extension’s context menu and click View Extension.

    (Screenshot: View Extension option)

    You will be taken to the extension’s Marketplace page.

  2. Click Get it free.

  3. Check that your organization is selected and click Install.

    (Screenshot: selecting the organization)

    If the installation completes successfully, you will see a confirmation page.

    After installation, the extension appears in your organization’s list of installed extensions and can be used in your pipelines.




Conclusion

Custom extensions for Azure DevOps unlock functionality that is not available out of the box. In this article, you learned how to create a custom extension and how to publish it to the Marketplace. I hope you enjoyed it and can apply what you learned in your own projects.

References

  1. Create an organization

  2. Extension manifest reference

  3. Build/Release task examples

  4. Package and publish extensions

Extracting Table data from documents into an Excel Spreadsheet


Documents can contain table data. For example, earning reports, purchase order forms, technical and operational manuals, etc., contain critical data in tables. You may need to extract this table data into Excel for various scenarios.



  • Extract each table into a specific worksheet in Excel.

  • Extract the data from all the similar tables and aggregate that data into a single table.


Here, we present two ways to generate Excel from a document’s table data:



  1. Azure Function (HTTP Trigger based): This function takes a document and generates an Excel file with the table data in the document.

  2. Apache Spark in Azure Synapse Analytics (in case you need to process large volumes of documents).


The Azure function extracts table data from the document using Form Recognizer’s “General Document” model and generates an Excel file with all the extracted tables. The following is the expected behavior:



  • Each table on a page gets extracted and stored to a sheet in the Excel document. The sheet name corresponds to the page number in the document.

  • Sometimes, there are key-value pairs on the page that need to be captured in the table. If you need that feature, leverage the add_key_value_pairs flag in the function.

  • Form Recognizer extracts column and row spans, and we take advantage of this to present the data as it is represented in the actual table.
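
For readers who want a feel for the code, the core of the extraction looks roughly like the sketch below. This is a minimal illustration rather than the repository code: it assumes the azure-ai-formrecognizer and openpyxl packages and placeholder endpoint/key values, and it stacks every table from a page onto that page's sheet.

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from openpyxl import Workbook

endpoint = "https://<your-form-recognizer>.cognitiveservices.azure.com/"  # placeholder
key = "<your-form-recognizer-key>"                                        # placeholder

client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

with open("sample.pdf", "rb") as f:  # any local or downloaded document
    result = client.begin_analyze_document("prebuilt-document", document=f).result()

workbook = Workbook()
workbook.remove(workbook.active)  # drop the default empty sheet

for table in result.tables:
    page = table.bounding_regions[0].page_number
    sheet_name = f"page_{page}"
    sheet = (workbook[sheet_name] if sheet_name in workbook.sheetnames
             else workbook.create_sheet(sheet_name))
    offset = sheet.max_row + 1 if sheet.max_row > 1 else 0  # leave a gap between tables on the same page
    for cell in table.cells:
        # row_span and column_span are also available if you want merged-cell fidelity.
        sheet.cell(row=offset + cell.row_index + 1,
                   column=cell.column_index + 1,
                   value=cell.content)

workbook.save("tables.xlsx")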


 


Following are two sample extractions.









(Sample screenshots: one Excel output with the key-value pairs added to the table, and one without them.)


 









(Sample screenshots: Excel output generated from an earnings report.)







The Excel shown above is the extraction of table data from an earnings report. The earnings report file had multiple pages with tables, and the fourth page had two tables. 


Solution


The Azure Function and the Synapse Spark notebook are available in this Git repository.



  • Deployment Steps 


  • Sample Data: The repository has two sample documents to work with.


  • Note on the Excel output: 

    • If there is a page in the main document with no tables, no sheet will be created for that page.

    • The code has been updated to remove the extracted text from check boxes (“:selected:”, “:unselected:”) in the table.

    • If a cell does not have any alphanumeric text, it will be skipped. Please update the code to reflect different behavior.




 


How to leverage this Solution



  • Use this solution to generate an Excel file as mentioned above.

  • Integrate this with Power Automate so that end-users can use this seamlessly from O365 (email, SharePoint, or Teams).

  • Customize this to generate an aggregated table.


 


Contributors: Ben Ufuk Tezcan, Vinod Kurpad, Matt Nelson, Nicolas Uthurriague, Sreedhar Mallangi

Microsoft Purview in the Real World (April 21, 2023) – Sensitivity Labels and SharePoint Sites



 


Disclaimer


This document is not meant to replace any official documentation, including those found at docs.microsoft.com.  Those documents are continually updated and maintained by Microsoft Corporation.  If there is a discrepancy between this document and what you find in the Compliance User Interface (UI) or inside of a reference in docs.microsoft.com, you should always defer to that official documentation and contact your Microsoft Account team as needed.  Links to the docs.microsoft.com data will be referenced both in the document steps as well as in the appendix.


 


All the following steps should be done with test data, and where possible, testing should be performed in a test environment.  Testing should never be performed against production data.


 


Target Audience


Microsoft customers who want to better understand Microsoft Purview.


 


 


Document Scope


The purpose of this document (and series) is to provide insights into various use cases, announcements, customer-driven questions, and more.


 


Topics for this blog entry


Here are the topics covered in this issue of the blog:



  • Sensitivity Labels relating to SharePoint Lists

  • Sensitivity Label Encryption versus other types of Microsoft tenant encryption

  • How Sensitivity Labels conflicts are resolved

  • How to apply Sensitivity Labels to existing SharePoint Sites

  • Where can I find information on how Sensitivity Labels are applied to data within a SharePoint site (i.e. File label inheritance from the Site label)


 


Out-of-Scope


This blog series and entry are only meant to provide information. For your specific use cases or needs, it is recommended that you contact your Microsoft account team to find other possible solutions.


 


Sensitivity labels and SharePoint Sites – Assorted topics


 


Sensitivity Label encryption versus other types of Microsoft tenant encryption


 


 


Question #1


How does the encryption used by Sensitivity Labels compare to the encryption leveraged by BitLocker?


 


Answer #1


The comparison table in the following Microsoft Learn article breaks this down in detail:


Encryption in Microsoft 365 – Microsoft Purview (compliance) | Microsoft Learn


 




 


Sensitivity Labels relating to SharePoint Lists


 


 


Question #2


Can you apply Sensitivity Labels to SharePoint Lists?


 


Answer #2


The simple answer is NO while in the list, but YES once the list is exported to a file format.


 


Data in a SharePoint List is stored within a SQL table in SharePoint. At the time of writing, you cannot apply a Sensitivity Label to SharePoint Online tables, including SharePoint Lists.


 


SharePoint Lists allow the data in a list to be exported to a file format, and an automatic sensitivity labeling policy can then apply a label to those exported files. The export options are shown below.


 


(Screenshot: SharePoint List export options)


 


 


How to apply Sensitivity Labels to existing SharePoint Sites


 


Question #3


Can you apply Sensitivity Labels to existing SharePoint sites? If so, can this be automated (for example, with PowerShell)?


 


Answer #3


You can leverage PowerShell to apply sensitivity labels to multiple SharePoint sites. The link below explains how to accomplish this.


Look for these two sections in the link below for details:



  • Use PowerShell to apply a sensitivity label to multiple sites

  • View and manage sensitivity labels in the SharePoint admin center


 


 


Use sensitivity labels with Microsoft Teams, Microsoft 365 Groups, and SharePoint sites – Microsoft Purview (compliance) | Microsoft Learn


 


How Sensitivity Labels conflicts are resolved


 


Question #4


If an existing file already has a Sensitivity Label that is stricter than the Sensitivity Label being inherited from the SharePoint site label, which Sensitivity Label is applied to the file?


 


Answer #4


Please refer to the table in the article linked below for how Sensitivity Label conflicts are handled. Notice that a higher-priority label or a user-applied label will not be overridden by a site label or an automatic labeling policy.


 


Configure a default sensitivity label for a SharePoint document library – Microsoft Purview (compliance) | Microsoft Learn


 




 


File label inheritance from the Site label


 


Question #5


Where can you find the documentation on SharePoint Site labels and how label inheritance applies to files in that SharePoint site?


 


Answer #5


 


Here are two links that can help you with Sensitivity Labels and how they relate to SharePoint sites:


 



 



 


 


When it comes to default Sensitivity Labels for SharePoint sites and libraries (what I have called “label inheritance” above), this link is of use.


 



 


“When SharePoint is enabled for sensitivity labels, you can configure a default label for document libraries. Then, any new files uploaded to that library, or existing files edited in the library will have that label applied if they don’t already have a sensitivity label, or they have a sensitivity label but with lower priority.


 


For example, you configure the Confidential label as the default sensitivity label for a document library. A user who has General as their policy default label saves a new file in that library. SharePoint will label this file as Confidential because of that label’s higher priority.”


 


 


Appendix and Links