Meta-data driven key-value pairs extraction with Azure Form Recognizer


Most organizations are now aware of how valuable the forms (PDFs, images, videos…) they keep in their closets are. They are looking for best practices and cost-effective tools to digitize those assets. By extracting the data from those forms and combining it with existing operational systems and data warehouses, they can build powerful AI and ML models that turn this data into insights and deliver value to their customers and business users.

With the Form Recognizer Cognitive Service, we help organizations harness their data, automate processes (invoice payments, tax processing…), save money and time, and improve accuracy.

Figure 1: Typical form

In my first blog post about automated form processing, I described how you can extract key-value pairs from your forms in real time using the Azure Form Recognizer cognitive service. We successfully implemented that solution for many customers.

Often, after a successful PoC or MVP, our customers realize that they not only need this real-time solution but also have a huge backlog of forms they would like to ingest into their relational or NoSQL databases, or their data lake, in batch fashion. They have different types of forms and they don’t want to build a model for each type. They are also looking for a quick and easy way to ingest new types of forms.

In this post, we’ll describe how to dynamically train a Form Recognizer model to extract the key-value pairs from different types of forms, at scale, using Azure services. We’ll also share a GitHub repository where you can download the code and implement the solution we describe here.

The backlog of forms may be in your on-premises environment or on an (S)FTP server. We assume that you were able to upload them into Azure Data Lake Storage Gen2 using Azure Data Factory, Storage Explorer, or AzCopy. The solution we’ll describe here therefore focuses on ingesting the data from the data lake into the (No)SQL database.

Our product team published a great tutorial on how to Train a Form Recognizer model and extract form data by using the REST API with Python. The solution described in that tutorial demonstrates the approach for one model and one type of form and is ideal for real-time form processing.

The value-add of this post is to show how to automatically train a model with new and different types of forms using a meta-data driven approach, in batch mode.

Below is the high-level architecture.

Figure 2: High Level Architecture

Azure services required to implement this solution

To implement this solution, you will need to create the services below:

Form Recognizer resource:

A Form Recognizer resource is needed to set up and configure the Form Recognizer cognitive service and to get the API key and endpoint URI.

Azure SQL single database:

We will create a meta-data table in Azure SQL Database. This table will contain the non-sensitive data required by the Form Recognizer REST API. The idea is that, whenever there is a new type of form, we just insert a new record in this table and trigger the training and scoring pipeline.
The required attributes of this table are:

form_description: This field is not required for training the model or for inference. It just provides a description of the type of forms we are training the model for (for example, client A forms, Hotel B forms, …).

training_container_name: This is the storage account container name where we store the training dataset. It can be the same as scoring_container_name.

training_blob_root_folder: The folder in the storage account where we’ll store the files for the training of the model.

scoring_container_name: This is the storage account container name where we store the files we want to extract the key-value pairs from. It can be the same as training_container_name.

scoring_input_blob_folder: The folder in the storage account where we’ll store the files to extract key-value pairs from.

model_id: The identifier of the model we want to retrain. For the first run, the value must be set to -1 so that a new custom model is created and trained. The training notebook returns the newly created model id to the data factory and, using a stored procedure activity, we update the meta-data table in the Azure SQL database.

Whenever you add a new form type, you need to reset the model id to -1 and retrain the model.

file_type: The supported types are application/pdf, image/jpeg, image/png, image/tiff.

form_batch_group_id: Over time, you might have multiple form types that you train against different models. The form_batch_group_id lets you identify all the form types that have been trained using a specific model.
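To make the shape of this table concrete, here is a minimal sketch of it in Python. It uses an in-memory SQLite database as a local stand-in for Azure SQL Database so that it runs anywhere. Only the column names come from the list above; the table name form_recognizer_metadata, the column types, and the sample values are assumptions for illustration.

```python
import sqlite3

# Hypothetical table name and column types; only the column names are from the article.
CREATE_TABLE = """
CREATE TABLE form_recognizer_metadata (
    form_description          TEXT,
    training_container_name   TEXT NOT NULL,
    training_blob_root_folder TEXT NOT NULL,
    scoring_container_name    TEXT NOT NULL,
    scoring_input_blob_folder TEXT NOT NULL,
    model_id                  TEXT NOT NULL DEFAULT '-1',
    file_type                 TEXT NOT NULL,
    form_batch_group_id       INTEGER NOT NULL
)
"""


def register_form_type(conn, **attrs):
    """Insert one row describing a new form type.

    model_id is left out by callers so it falls back to -1 (no model trained yet).
    Column names come from the keyword arguments in code, never from user input.
    """
    columns = ", ".join(attrs)
    placeholders = ", ".join(f":{name}" for name in attrs)
    conn.execute(
        f"INSERT INTO form_recognizer_metadata ({columns}) VALUES ({placeholders})",
        attrs,
    )


conn = sqlite3.connect(":memory:")  # local stand-in for the Azure SQL database
conn.execute(CREATE_TABLE)

# Onboarding a new form type is a single insert (sample values are made up).
register_form_type(
    conn,
    form_description="Hotel B forms",
    training_container_name="forms",
    training_blob_root_folder="hotel-b/train",
    scoring_container_name="forms",
    scoring_input_blob_folder="hotel-b/score",
    file_type="application/pdf",
    form_batch_group_id=1,
)

row = conn.execute(
    "SELECT form_description, model_id FROM form_recognizer_metadata "
    "WHERE form_batch_group_id = ?",
    (1,),
).fetchone()
print(row)
```

The default of -1 on model_id mirrors the rule above: a freshly registered form type has no model yet, so the next pipeline run trains one.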

Azure Key Vault:

For security reasons, we don’t want to store certain sensitive information in the parametrization table in the Azure SQL database. We store those parameters in Azure Key Vault secrets.

Below are the parameters we store in the key vault:

CognitiveServiceEndpoint: The endpoint of the Form Recognizer cognitive service.

CognitiveServiceSubscriptionKey: The access key of the cognitive service. The screenshot below shows how to get the key and endpoint of the cognitive service.

Figure 3: Cognitive Service Keys and Endpoint

StorageAccountName: The storage account where the training dataset and the forms we want to extract the key-value pairs from are stored. The two storage accounts can be different. The training dataset must be in the same container for all form types, although it can be spread across different folders.

StorageAccountSasKey: The shared access signature of the storage account.

The screenshot below shows the key vault after you create all the secrets.

Figure 4: Key Vault Secrets
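As a rough illustration of how a notebook might read these four secrets, the sketch below takes the secret lookup as a function argument so that it runs outside Databricks too. The secret names are the ones listed above. The FormRecognizerSecrets class, the load_secrets helper, and the secret scope name in the comment are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Secret names as listed in the article.
SECRET_NAMES = (
    "CognitiveServiceEndpoint",
    "CognitiveServiceSubscriptionKey",
    "StorageAccountName",
    "StorageAccountSasKey",
)


@dataclass(frozen=True)
class FormRecognizerSecrets:
    """Hypothetical container for the four sensitive parameters."""
    endpoint: str
    subscription_key: str
    storage_account_name: str
    storage_account_sas_key: str


def load_secrets(get_secret: Callable[[str], str]) -> FormRecognizerSecrets:
    """Fetch every secret through the supplied lookup and fail fast on gaps."""
    values = [get_secret(name) for name in SECRET_NAMES]
    missing = [name for name, value in zip(SECRET_NAMES, values) if not value]
    if missing:
        raise ValueError(f"Missing secrets: {', '.join(missing)}")
    return FormRecognizerSecrets(*values)


# In a Databricks notebook with a Key Vault-backed secret scope, the lookup could be:
#   load_secrets(lambda name: dbutils.secrets.get(scope="form-recognizer-kv", key=name))
# The scope name "form-recognizer-kv" is an assumption, not taken from the article.

# Local stand-in so the sketch runs anywhere (all values are fake).
fake_vault = {
    "CognitiveServiceEndpoint": "https://example.cognitiveservices.azure.com/",
    "CognitiveServiceSubscriptionKey": "not-a-real-key",
    "StorageAccountName": "examplestorage",
    "StorageAccountSasKey": "?sv=fake&sig=fake",
}
secrets = load_secrets(fake_vault.__getitem__)
print(secrets.storage_account_name)
```

Passing the lookup in as a parameter keeps the key and SAS token out of the notebook source and makes the same code testable without a cluster.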

Azure Data Factory:

Azure Data Factory orchestrates the training and scoring of the model. Using a Lookup activity, we retrieve the parameters from the Azure SQL database and run the training and scoring in Databricks notebooks. All the sensitive parameters stored in Key Vault are retrieved in the notebooks.
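The control flow that the pipeline implements can be sketched in plain Python. This is not the Data Factory pipeline itself, only an approximation of its logic under one assumption: a row whose model_id is -1 is trained first and its new model id is written back, while a row that already has a model id goes straight to scoring. The function process_batch and the train, score, and save_model_id callables are hypothetical stand-ins for the training notebook, the scoring notebook, and the stored procedure activity.

```python
from typing import Callable, Dict, List

NEW_MODEL = "-1"  # sentinel from the article: no trained model yet

Row = Dict[str, str]


def process_batch(
    rows: List[Row],
    train: Callable[[Row], str],
    score: Callable[[Row, str], dict],
    save_model_id: Callable[[Row, str], None],
) -> List[dict]:
    """For each meta-data row, train when model_id is -1, then score."""
    results = []
    for row in rows:
        model_id = str(row["model_id"])
        if model_id == NEW_MODEL:
            model_id = train(row)          # training notebook returns the new id
            save_model_id(row, model_id)   # stored procedure updates the table
        results.append(score(row, model_id))
    return results


# Stub implementations so the sketch runs without any Azure service.
saved: Dict[str, str] = {}
rows = [
    {"form_description": "client A forms", "model_id": "-1"},
    {"form_description": "Hotel B forms", "model_id": "model-123"},
]
results = process_batch(
    rows,
    train=lambda row: "model-new",
    score=lambda row, model_id: {"form": row["form_description"], "model_id": model_id},
    save_model_id=lambda row, model_id: saved.update({row["form_description"]: model_id}),
)
print(results)
print(saved)
```

Only the first row triggers training here, so only its description appears in saved.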

Azure Data Lake Storage Gen2:

The data lake stores the training dataset and the forms we want to extract the key-value pairs from. The training and scoring datasets can be in different containers but, as mentioned above, the training dataset must be in the same container for all form types.
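To show how the container and folder settings could feed a training call, the sketch below builds (but does not send) a train request. The URL path and body fields follow the Form Recognizer v2.x REST API for training a custom model as I understand it; the article does not state which API version it uses, so the api_version default is an assumption. The helper names container_sas_url and build_train_request and all sample values are hypothetical.

```python
import json


def container_sas_url(account_name: str, container_name: str, sas_key: str) -> str:
    """Build the container URL with its shared access signature appended."""
    sas = sas_key.lstrip("?")  # tolerate a SAS token stored with or without '?'
    return f"https://{account_name}.blob.core.windows.net/{container_name}?{sas}"


def build_train_request(
    endpoint: str,
    subscription_key: str,
    source_url: str,
    folder_prefix: str,
    api_version: str = "v2.1",  # assumption: adjust to the version you deploy against
):
    """Return (url, headers, body) for a custom-model training request."""
    url = f"{endpoint.rstrip('/')}/formrecognizer/{api_version}/custom/models"
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    body = {
        "source": source_url,
        # training_blob_root_folder narrows training to one form type's folder
        "sourceFilter": {"prefix": folder_prefix, "includeSubFolders": False},
        "useLabelFile": False,  # unlabeled training
    }
    return url, headers, body


source = container_sas_url("examplestorage", "forms", "?sv=fake&sig=fake")
url, headers, body = build_train_request(
    endpoint="https://example.cognitiveservices.azure.com/",
    subscription_key="not-a-real-key",
    source_url=source,
    folder_prefix="hotel-b/train",
)
print(url)
print(json.dumps(body, indent=2))
```

Because all form types share one training container, the folder prefix is what separates one form type's training files from another's.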