Responsible Synthetic Data Creation for Fine-Tuning with RAFT Distillation


This article is contributed. See the original author and article here.

Introduction


In the age of AI and machine learning, data is the key to training and fine-tuning models. However, gathering high-quality, diverse datasets can be challenging. Synthetic data generation offers a promising solution, but how do you ensure the data you’re creating is both valid and responsible?


This blog will explore the process of crafting responsible synthetic data, evaluating it, and using it for fine-tuning models. We’ll also dive into Azure AI’s RAFT distillation recipe, a novel approach to generating synthetic datasets using Meta’s Llama 3.1 model and UC Berkeley’s Gorilla project.


 


Understanding Synthetic Data for Fine-Tuning


What is synthetic data?


Synthetic data is artificially generated rather than collected from real-world events. It is used when gathering real data is expensive, time-consuming, or raises privacy concerns. For example, synthetic images, videos, or text can be generated to mimic real-world datasets.


 


Why synthetic data matters for fine-tuning:


Fine-tuning a machine learning model with real-world data is often limited by the availability of diverse, high-quality datasets. Synthetic data fills this gap by providing additional samples, augmenting the original dataset, or generating new, unseen scenarios. For instance, in AI models like GPT or image classification systems, fine-tuning with synthetic data helps models adapt to specialized tasks or environments.


 


Common use cases:



  • Natural Language Processing (NLP): Generating new text to help models better understand uncommon language structures.

  • Computer Vision: Creating synthetic images to train models for object detection, especially in rare or sensitive cases like medical imaging.

  • Robotics: Simulating environments for AI models to interact with, reducing the need for real-world testing.


What makes data “responsible”?


Synthetic data can exacerbate existing biases or create new ethical concerns. Responsible data creation ensures that datasets are fair and representative, and that they do not introduce harmful consequences when used for fine-tuning AI models.


Key principles of responsible synthetic data include:



  • Fairness: Avoiding biases in race, gender, or other sensitive attributes.

  • Privacy: Ensuring that synthetic data does not leak sensitive information from real-world datasets.

  • Transparency: Ensuring that the origins and processing of the synthetic data are documented.


Quality aspects for validation:



  • Diversity: Does the data capture the range of possible real-world cases?

  • Relevance: Does the synthetic data match the domain and task for which it will be used?

  • Performance Impact: Does the use of synthetic data improve model performance without degrading fairness or accuracy?


 Validating Synthetic Data


Validation ensures that synthetic data meets the required quality and ethical standards before being used to fine-tune a model.


Techniques for validation:



  • Ground-truth comparison: If there’s real data available, compare the synthetic data with real-world datasets to see how closely they match.

  • Model-based validation: Fine-tune a model with both synthetic and real data, then test its performance on a validation dataset. If the synthetic data significantly improves the model’s accuracy or generalization capabilities, it’s considered valid.

  • Bias and fairness evaluation: Use fairness metrics (such as demographic parity or disparate impact) to check whether the synthetic data introduces unintended biases. Tools such as Microsoft Fairlearn or IBM’s AI Fairness 360 can help identify such issues.
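As a sketch of the kind of check these fairness toolkits automate, demographic parity can be computed directly: it compares the positive-prediction rate across sensitive groups. The snippet below is a minimal plain-Python illustration with made-up data, not a substitute for a full fairness library:

```python
# Minimal demographic-parity check: compare the rate of positive
# predictions across two sensitive groups in a toy dataset.

def positive_rate(predictions, groups, group_value):
    """Fraction of positive predictions for one sensitive group."""
    selected = [p for p, g in zip(predictions, groups) if g == group_value]
    return sum(selected) / len(selected)

def demographic_parity_difference(predictions, groups):
    """Absolute gap in positive-prediction rate between groups."""
    values = sorted(set(groups))
    rates = [positive_rate(predictions, groups, v) for v in values]
    return max(rates) - min(rates)

# Toy example: 1 = positive prediction, sensitive attribute "A"/"B".
preds  = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_difference(preds, groups)
print(f"Demographic parity difference: {gap:.2f}")  # 0.75 vs 0.25 -> 0.50
```

A gap near zero suggests the synthetic data does not skew positive outcomes toward one group; larger gaps warrant closer inspection with a dedicated toolkit.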


Tools and methods for validation:



  • Azure Machine Learning offers built-in tools for data validation, including feature importance, explainability dashboards, and fairness assessments.

  • Open-source tools such as Google’s What-If Tool or IBM’s AI Fairness 360 can provide detailed reports on fairness and bias in your synthetic data.


The RAFT Distillation Recipe


The RAFT distillation recipe, available on GitHub, provides a method to generate high-quality synthetic datasets using Meta Llama 3.1 and UC Berkeley’s Gorilla project.


Introduction to RAFT


RAFT (Retrieval Augmented Fine-Tuning) is a technique in which a pre-trained model generates synthetic data, which is then used to fine-tune the same or a similar model. The goal is to create data that is relevant, diverse, and aligned with the task for which the model is being fine-tuned.


Meta Llama 3.1:


A powerful language model deployed on Azure AI. Using RAFT, Meta Llama generates synthetic data that can be used for NLP tasks, such as question answering, summarization, or classification.


UC Berkeley’s Gorilla Project:


The Gorilla project focuses on fine-tuning models for specific tasks using minimal data. By integrating the Gorilla project’s methods into RAFT, users can create a tailored dataset quickly and efficiently.


Steps from the RAFT distillation recipe:



  • Step 1: Deploy Meta Llama 3.1 on Azure AI using the provided instructions in the GitHub repo.

  • Step 2: Use RAFT distillation to generate synthetic datasets. This involves having the model generate relevant text or data samples based on input prompts.

  • Step 3: Evaluate the generated synthetic dataset using metrics such as relevance, diversity, and performance.

  • Step 4: Fine-tune the model using the generated synthetic data to improve performance on specific tasks.
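To illustrate the kind of dataset Step 2 produces, a RAFT-style training example pairs a question with an "oracle" document that contains the answer plus distractor documents that do not, so the fine-tuned model learns to ignore irrelevant context. The field names below are a hypothetical sketch, not the exact schema from the repo:

```python
import json
import random

def build_raft_example(question, answer, oracle_doc, distractor_docs, num_distractors=3):
    """Assemble one RAFT-style training record: the oracle document is
    shuffled in among distractors so the model must find the right one."""
    context = random.sample(distractor_docs, k=num_distractors) + [oracle_doc]
    random.shuffle(context)
    return {
        "question": question,
        "context": context,
        "oracle": oracle_doc,
        "cot_answer": answer,
    }

# Illustrative documents; in practice these come from your domain corpus.
docs = [f"Distractor passage {i}" for i in range(10)]
example = build_raft_example(
    "Which document defines the API quota?",
    "The quota is defined in the service limits document.",
    "Service limits: the API quota is 1000 requests/minute.",
    docs,
)
print(json.dumps(example, indent=2))
```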


The GitHub repo includes code snippets showing how to set up RAFT on Azure, generate synthetic datasets, and fine-tune models.


 


To create a JSONL (JSON Lines) file for training models in Azure Machine Learning, follow these step-by-step instructions:


What is a JSONL File?


A JSONL file is a format where each line is a valid JSON object. It’s commonly used for machine learning tasks like fine-tuning models because it allows you to store structured data in a readable format.


 


Step-by-Step Guide to Creating a JSONL File


Step 1: Prepare Your Data



  • Identify the data you need for fine-tuning. For instance, if you’re fine-tuning a text model, your data may consist of input and output text pairs.

  • Each line in the file should be a JSON object. A typical structure might look like this:


(screenshot: example JSONL record structure)
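For example, one prompt/completion pair per line (the field names are illustrative; the exact schema depends on the fine-tuning task):

```jsonl
{"prompt": "Translate to French: Hello", "completion": "Bonjour"}
{"prompt": "Translate to French: Goodbye", "completion": "Au revoir"}
```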


 


Step 2: Use a Text Editor or Python Script


You can create a JSONL file using a text editor like Notepad or VS Code, or generate it programmatically with a script (e.g., in Python).



  • Method 1: Using a Text Editor


    • Open a plain text editor (like Notepad++ or Visual Studio Code).

    • Write each line as a valid JSON object, e.g.



 


(screenshot: example JSON lines in the editor)


 



  •  Save the file with a .jsonl extension (e.g., training_data.jsonl).


Method 2: Using a Python Script. You can also use Python to generate a JSONL file, especially if you have a large dataset.


Example Python code:


(screenshot: Python script that writes a JSONL file)
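A minimal version of such a script might look like this (the records and file name are illustrative):

```python
import json

# Illustrative training records; replace with your own data.
records = [
    {"prompt": "Translate to French: Hello", "completion": "Bonjour"},
    {"prompt": "Translate to French: Goodbye", "completion": "Au revoir"},
    {"prompt": "Translate to French: Thank you", "completion": "Merci"},
]

# Write one JSON object per line -- the JSONL format.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```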


 


Step 3: Validate the JSON Format


Ensure that:



  • Each line in your file is a valid JSON object.

  • There are no commas between objects (unlike a JSON array).

  • Make sure that every object is properly enclosed in {}.
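These checks can be automated with a short script (a sketch; adapt the file name to your own):

```python
import json

def validate_jsonl(path):
    """Return a list of (line_number, error) for lines that are not valid JSON objects."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, str(e)))
                continue
            if not isinstance(obj, dict):
                errors.append((i, "line is valid JSON but not an object"))
    return errors
```

An empty result (e.g. `validate_jsonl("training_data.jsonl") == []`) means every line is a well-formed JSON object.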


Step 4: Upload to Azure ML


Once your JSONL file is ready:



  1. Upload the file to your Azure Machine Learning workspace. You can do this from the Azure portal or via an SDK command.

  2. Use the file for training or evaluation in the Azure ML pipeline, depending on your task (e.g., fine-tuning).


Step 5: Test the File


To verify the file, you can use a simple Python script to load and print the contents:


 


(screenshot: Python script that loads and prints the JSONL file)
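Such a script might be sketched as follows (file name illustrative):

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `for record in load_jsonl("training_data.jsonl"): print(record)` prints each record in turn.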


Example: JSONL File for Fine-Tuning (with 3 lines)


 

 

(screenshot: three-line JSONL example)
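A three-line file might look like this (contents are illustrative):

```jsonl
{"prompt": "What is the capital of France?", "completion": "Paris"}
{"prompt": "What is 2 + 2?", "completion": "4"}
{"prompt": "Who wrote Hamlet?", "completion": "William Shakespeare"}
```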


 


Summary of Steps:



  1. Prepare your data in a structured JSON format.

  2. Write each line as a separate JSON object in a text editor or using a Python script.

  3. Save the file with a .jsonl extension.

  4. Validate the format to ensure each line is a valid JSON object.

  5. Upload to Azure Machine Learning for model training or fine-tuning.


By following these steps, you’ll have a valid JSONL file ready to be used in Azure Machine Learning for tasks such as model fine-tuning.


 


Resources



  1. Azure Machine Learning documentation: https://learn.microsoft.com/azure/machine-learning/?view=azureml-api-2&viewFallbackFrom=azureml-api-2%3Fstudentamb_263805

  2. Azure AI: https://azure.microsoft.com/solutions/ai/?studentamb_263805


 

How to fix HTTP Error 500.37 – ASP.NET Core app failed to start within the startup time limit error


Introduction


ASP.NET Core applications hosted in IIS are designed to provide robust performance. However, sometimes issues arise that prevent apps from starting properly within the expected time. One common problem is the HTTP Error 500.37, which indicates that the application failed to start within the startup time limit.  This article will walk you through what causes this error and how to resolve it.


 


Problem


HTTP Error 500.37 occurs when an ASP.NET Core application hosted in IIS does not start within the allocated startup time. The default limit is 120 seconds. This can happen for various reasons, such as limited resources (CPU and memory), long initialization tasks, or inadequate startup time configuration in IIS. The error message usually looks like this:


(screenshot: HTTP Error 500.37 page)


This issue can be particularly problematic for larger or resource-intensive applications, where a longer startup time might be required to complete initialization tasks.


 


Solution


The good news is that this issue is easily fixable by increasing the startup time limit for the application in the web.config file. Here’s a step-by-step guide to applying the fix:


 


Locate your application’s web.config file, which should be in the root directory of your ASP.NET Core application, and update the startupTimeLimit attribute in the <aspNetCore> section to a value higher than the default 120 seconds.
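The relevant part of web.config might look like this (the process path and DLL name are illustrative; the startupTimeLimit attribute is the actual fix):

```xml
<configuration>
  <system.webServer>
    <aspNetCore processPath="dotnet"
                arguments=".\MyApp.dll"
                startupTimeLimit="360"
                hostingModel="inprocess" />
  </system.webServer>
</configuration>
```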

In this example, the startupTimeLimit is set to 360 seconds (6 minutes). You can adjust this value based on the needs of your application. This configuration gives your application a longer window to initialize properly, preventing the 500.37 error from occurring.


 


Although this fix will resolve the issue and allow the application to run smoothly, it’s important to review the Startup.cs or Program.cs files to identify whether any modules or dependencies are taking longer than expected to initialize.


 


Conclusion


HTTP Error 500.37 indicates a startup failure of your ASP.NET Core application in IIS because it needs more time to initialize. By adjusting the startupTimeLimit in the web.config file, you can give your application the necessary time to start and avoid startup errors. This simple yet effective solution ensures your application runs smoothly, even under complex startup conditions.


 


 

What’s new with Microsoft Credentials?


Welcome to the October 2024 edition of our Microsoft Credentials roundup, where we highlight the latest portfolio news and updates to inspire and support your training and career journey. 
 


In this article 


What’s new with Microsoft Applied Skills? 



  • Explore the latest Applied Skills scenarios 



  • Applied Skills updates 



  • Take the Microsoft Learn AI Skills Challenge 


 


What’s new with Microsoft Certifications? 



  • Announcing the new Microsoft Certified: Fabric Data Engineer Associate Certification (Exam DP-700) 



  • New pricing for Microsoft Certification exams effective November 1, 2024 



  • Microsoft Certifications to be retired 



  • Build your confidence with exam prep resources 


 


Let’s talk about Applied Skills  


Want to have a bigger impact in your career? Explore Microsoft Applied Skills credentials to showcase your skills. According to a recent IDC study*, 50% of organizations find that micro-credentials like Microsoft Applied Skills help employees get promoted faster! 


*Source: IDC InfoBrief, sponsored by Microsoft, Skilling Up! Leveraging Full and Micro-Credentials for Optimal Skilling Solutions, Doc. #US52019124, June 2024 


 


Explore the latest Applied Skills scenarios  


We’re all about keeping you up to date with in-demand, project-based skills. Here are the new scenarios we’ve launched: 





 


Coming soon 


Stay tuned for the following Applied Skills and more at aka.ms/BrowseAppliedSkills: 



  • Implement retention, eDiscovery, and Communication Compliance in Microsoft Purview  



  • Develop data-driven applications by using Microsoft Azure SQL Database 


 


Applied Skills updates 


We ensure that Applied Skills credentials stay aligned with the latest product updates and trends. Our commitment to helping you build relevant, in-demand skills means that we periodically update or retire older assessments as we introduce new ones.  


 


Now available 


These assessments have been updated and are now back online:   









If you have been preparing for these credentials, go ahead and take the assessment today! 


 


Retirements  


Keep an eye out for new credentials in these areas. 



  • Retired: Build collaborative apps for Microsoft Teams 



  • Retiring on October 31st: Develop generative AI solutions with Azure OpenAI Service 


 


Take the Microsoft Learn AI Skills Challenge   


Dive into the latest AI technologies like Microsoft Copilot, GitHub, Microsoft Azure, and Microsoft Fabric. Explore six curated topics designed to elevate your skills, participate in interactive community events, and gain insights from expert-led sessions. Take your learning to the next level with a selection of the topics to earn a Microsoft Applied Skills Credential and showcase your expertise to potential employers. Start your AI journey today! 


 


Share Your Story 


Have you earned an Applied Skills credential that has made a difference in your career or education? Bala Venkata Veturi earned an Applied Skills credential through the #WICxSkillsReadyChallenge.


 




We’d love to hear from you! Share your story with us and inspire others with your journey. We could feature your success story next! 


 


What’s new with Microsoft Certifications?  


Announcing the new Microsoft Certified: Fabric Data Engineer Associate Certification (Exam DP-700) 


These days, as organizations strive to harness the power of AI, data engineering skills are indispensable. Data engineers play a pivotal role in designing and implementing the foundational elements for any successful data and AI initiative. To support learners who want to build and validate these skills, Microsoft Learn is pleased to announce the new Microsoft Certified: Fabric Data Engineer Associate Certification, along with its related Exam DP-700: Implementing data engineering solutions using Microsoft Fabric (beta), both of which will be available in late October 2024. Read all about this exciting news and explore other Microsoft Credentials for analytics and data science, in our blog post Prove your data engineering skills, and be part of the AI transformation. 


 


New pricing for Microsoft Certification exams effective November 1, 2024 


Microsoft Learn continually reviews and evolves its portfolio of Microsoft Certifications to help learners around the world stay up to date with the latest technologies, especially as AI and cybersecurity skills become increasingly important in the workplace. We also regularly update our exam content, format, and delivery. To reflect their current market value, we’re updating pricing for Microsoft Certification exams, effective November 1, 2024. The new prices will vary depending on the country or region where you take the exam. For most areas, there will be no change in the price. For some areas, the price will decrease to make the exams more affordable. For a few areas, the price will increase to align with global and regional standards. The goal is to make Microsoft Certification exam pricing simpler and more consistent across geographies while still offering a fair and affordable value proposition. Check out all the details in New pricing for Microsoft Certification exams effective November 1, 2024. 


 


Microsoft Certifications to be retired 


As announced in Evolving Microsoft Credentials for Dynamics 365, the following Certifications will be retired on November 30, 2024: 





If you’re currently preparing for Exam MB-210, Exam MB-220, or Exam MB-260, we recommend that you take the exam before November 30, 2024. If you’ve already earned one of the Certifications being retired, it will remain on the transcript in your profile on Microsoft Learn. If you’re eligible to renew your certification before November 30, 2024, we recommend that you consider doing so. 


 


Build your confidence with exam prep resources 


To better prepare you to earn a new Microsoft Certification, we’ve added new Practice Assessments on Microsoft Learn, including:  





 


 


Watch this space for the next Microsoft Credentials roundup as we continue to evolve our portfolio to help support your career growth.  


Follow us on X and LinkedIn, and make sure you’re subscribed to The Spark, our LinkedIn newsletter.   


Previous editions of the Microsoft Credentials roundup  





Microsoft Security announcements and demos at Authenticate 2024


The Microsoft Security team is excited to connect with you next week at the Authenticate 2024 Conference, taking place October 14 to 16 in Carlsbad, CA! With the rise in identity attacks targeting passwords and MFA credentials, it’s becoming increasingly clear that phishing-resistant authentication is critical to counteract these attacks. As the world shifts towards stronger, modern authentication methods, Microsoft is proud to reaffirm our commitment to passwordless authentication and to expanding our support for passkeys across products like Microsoft Entra, Microsoft Windows, and Microsoft consumer accounts (MSA). 


 


To enhance security for both consumers and enterprise customers, we’re excited to showcase some of our latest innovations at this event: 


 



 


We look forward to demonstrating these new advancements and discussing how to take a comprehensive approach to modern authentication at Authenticate Conference 2024. 


 


 Where to find Microsoft Security at Authenticate 2024 Conference   


Please stop by our booth to chat with our product team or join us at the following sessions:  


  


























Session: Passkeys on Windows: Paving the way to a frictionless future! (UX Fundamentals)
October 14, 12:00 – 12:25 PM

Discover the future of passkey authentication on Windows. Explore our enhanced UX, powered by Microsoft AI and designed for seamless experiences across platforms. Join us as we pave the way towards a passwordless world.

Speakers: Sushma K., Principal Program Manager, Microsoft; Ritesh Kumar, Software Engineer, Microsoft


Session: Passkeys on Windows: New platform features (Technical Fundamentals and Features)
October 14, 2:30 – 2:55 PM

This is an exciting year for us as we’re bringing some great passkey features to Windows users. In this session, I’ll discuss our new capabilities for synced passkeys protected by Windows Hello, and I’ll walk through a plugin model for third-party passkey providers to integrate with our Windows experience. Taken together, these features make passkeys more readily available wherever users need them, with the experience, flexibility, and durability that users should expect when using their passkeys on Windows.

Speaker: Bob Gilbert, Software Engineering Manager, Microsoft


Session: We love passkeys – but how can we convince a billion users? (Keynote)
October 14, 5:05 – 5:25 PM

It’s clear that passkeys will be a core component of a passwordless future. The usability and security advantages are clear. What isn’t as clear is how we actually convince billions of users to step away from a decades-long relationship with passwords and move to something new. Join us as we share insights on how to accelerate adoption when user, platform, and application needs are constantly evolving. We will share practical UX patterns and practices, including messaging, security implications, and how going passwordless changes the concept of account recovery.

Speakers: Scott Bingham, Principal Product Manager, Microsoft; Sangeeta Ranjit, Group Product Manager, Microsoft


Stop by our booth #402 to speak with our product team in person!


Session: Stop counting actors… Start describing authentication events (Vision and Future)
October 16, 10:00 – 10:25 AM

We began deploying multifactor authentication because passwords provided insufficient security. More factors equal more security, right? Yes, but we continue to see authentication attacks such as credential stuffing and phishing! The identity industry needs to stop thinking in terms of the quantity of authentication factors and start thinking about the properties of the authentication event. As we transition into the era of passkeys, it’s time to consider how we describe the properties of our authentication event. In this talk, we’ll demonstrate how identity providers and relying parties can communicate a consistent, composable collection of authentication properties. To raise the security bar and provide accountability, these properties must communicate not only about the authentication event, but about the security primitives underlying the event itself. These properties can be used to drive authentication and authorization decisions in standalone and federated environments, enabling clear, consistent application of security controls.

Speakers: Pamela Dingle, Director of Identity Standards, Microsoft; Dean H. Saxe, Principal Engineer, Office of the CTO, Beyond Identity


Session: Bringing passkeys into your passwordless journey (Passkeys in the Enterprise)
October 16, 11:00 – 11:25 AM

Most of our enterprise customers are deploying some form of passwordless credential or planning to in the next few years; however, the industry is all abuzz with excitement about passkeys. What do passkeys mean for your organization’s passwordless journey? Join the Microsoft Entra ID product team as we explore the impact of passkeys on the passwordless ecosystem and share insights from Microsoft’s own passkey implementation and customer experiences.

Speakers: Tim Larson, Senior Product Manager, Identity Network and Access, Security, Microsoft; Micheal Epping, Senior Product Manager, Microsoft



 


We can’t wait to see you in Carlsbad, CA for Authenticate 2024 Conference   


  


 Jarred Boone, Senior Product Marketing Manager, Identity Security  


 


 


Read more on this topic 



 


Learn more about Microsoft Entra  


Prevent identity attacks, ensure least privilege access, unify access controls, and improve the experience for users with comprehensive identity and network access solutions across on-premises and clouds. 


Azure AI Search October Updates: Nearly 100x Compression with Minimal Quality Loss


 


In our continued effort to equip developers and organizations with advanced search tools, we are thrilled to announce the launch of several new features in the latest Preview API for Azure AI Search. These enhancements are designed to optimize vector index size and provide more granular control and understanding of your search index to build Retrieval-Augmented Generation (RAG) applications.


 


MRL Support for Quantization


Matryoshka Representation Learning (MRL) is a new technique that introduces a different form of vector compression, which complements and works independently of existing quantization methods. MRL enables the flexibility to truncate embeddings without significant semantic loss, offering a balance between vector size and information retention.

This technique works by training embedding models so that information density increases towards the beginning of the vector. As a result, even when using only a prefix of the original vector, much of the key information is preserved, allowing for shorter vector representations without a substantial drop in performance.

OpenAI has integrated MRL into their ‘text-embedding-3-small’ and ‘text-embedding-3-large’ models, making them adaptable for use in scenarios where compressed embeddings are needed while maintaining high retrieval accuracy. You can read more about the underlying research in the official paper [1] or learn about the latest OpenAI embedding models in their blog.
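In practice, MRL truncation amounts to keeping the first k dimensions of an embedding and L2-renormalizing the result so cosine similarity still behaves as expected. A minimal plain-Python sketch (the vector values are illustrative):

```python
import math

def truncate_embedding(vector, k):
    """Keep the first k MRL dimensions and L2-renormalize, so that
    cosine similarity on the shortened vectors remains meaningful."""
    head = vector[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim embedding truncated to 4 dims (1/2 of the original).
v = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
short = truncate_embedding(v, 4)
print(len(short), round(sum(x * x for x in short), 6))  # 4 1.0
```

Because MRL-trained models concentrate information at the start of the vector, the truncated prefix retains most of the semantic signal.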


 


Storage Compression Comparison


Table 1.1 below highlights the different configurations for vector compression, comparing standard uncompressed vectors, Scalar Quantization (SQ), and Binary Quantization (BQ) with and without MRL. The compression ratio demonstrates how efficiently the vector index size can be optimized, yielding significant cost savings. You can find more about our Vector Index Size Limits here: Service limits for tiers and skus – Azure AI Search | Microsoft Learn.


 


Table 1.1: Vector Index Size Compression Comparison






























 


| Configuration | Compression Ratio* |
| --- | --- |
| Uncompressed | 1x (baseline) |
| SQ | 4x |
| BQ | 28x |
| MRL + SQ (1/2 and 1/3 truncation dimension respectively)** | 8x – 12x |
| MRL + BQ (1/2 and 1/3 truncation dimension respectively)** | 64x – 96x |



 


Note: Compression ratios depend on embedding dimensions and truncation. For instance, using “text-embedding-3-large” with 3072 dimensions truncated to 1024 dimensions can result in 96x compression with Binary Quantization.


*All compression methods listed above, may experience slightly lower compression ratios due to overhead introduced by the index data structures. See “Memory overhead from selected algorithm” for more details.


**The compression impact when using MRL depends on the value of the truncation dimension. We recommend using either 1/2 or 1/3 of the original dimensions to preserve embedding quality (see below).
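The headline ratios follow from simple arithmetic: float32 vectors cost 32 bits per dimension, binary quantization costs 1 bit, and MRL truncation shrinks the dimension count first. A back-of-the-envelope sketch (ignoring the index overhead that makes the published BQ-only figure 28x rather than the theoretical 32x):

```python
def compression_ratio(dims, truncated_dims, bits_per_dim):
    """Ratio of uncompressed float32 storage to compressed storage."""
    uncompressed_bits = dims * 32          # float32 = 32 bits per dimension
    compressed_bits = truncated_dims * bits_per_dim
    return uncompressed_bits / compressed_bits

# text-embedding-3-large (3072 dims) truncated to 1024 dims, then binarized:
print(compression_ratio(3072, 1024, 1))   # 96.0 -> the 96x headline figure
# Binary quantization alone, no MRL truncation:
print(compression_ratio(3072, 3072, 1))   # 32.0 (theoretical, before overhead)
# Scalar quantization alone (8 bits per dimension):
print(compression_ratio(1536, 1536, 8))   # 4.0
```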


 


Quality Retention Table:


Table 1.2 provides a detailed view of the quality retention when using MRL with quantization across different models and configurations. The results indicate the impact on Mean NDCG@10 across a subset of MTEB datasets, showing that high levels of compression can still preserve up to 99% of search quality, particularly with BQ and MRL.


 


Table 1.2: Impact of MRL on Mean NDCG@10 Across MTEB Subset














































































| Model Name | Original Dimension | MRL Dimension | Quantization Algorithm | No Rerank (% Δ) | Rerank 2x Oversampling (% Δ) |
| --- | --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | 512 | SQ | -2.00% (Δ = 1.155) | -0.0004% (Δ = 0.0002) |
| OpenAI text-embedding-3-small | 1536 | 512 | BQ | -15.00% (Δ = 7.5092) | -0.11% (Δ = 0.0554) |
| OpenAI text-embedding-3-small | 1536 | 768 | SQ | -2.00% (Δ = 0.8128) | -1.60% (Δ = 0.8128) |
| OpenAI text-embedding-3-small | 1536 | 768 | BQ | -10.00% (Δ = 5.0104) | -0.01% (Δ = 0.0044) |
| OpenAI text-embedding-3-large | 3072 | 1024 | SQ | -1.00% (Δ = 0.616) | -0.02% (Δ = 0.0118) |
| OpenAI text-embedding-3-large | 3072 | 1024 | BQ | -7.00% (Δ = 3.9478) | -0.58% (Δ = 0.3184) |
| OpenAI text-embedding-3-large | 3072 | 1536 | SQ | -1.00% (Δ = 0.3184) | -0.08% (Δ = 0.0426) |
| OpenAI text-embedding-3-large | 3072 | 1536 | BQ | -5.00% (Δ = 2.8062) | -0.06% (Δ = 0.0356) |



 


Table 1.2 compares the relative point differences of Mean NDCG@10 when using different MRL dimensions (1/3 and 1/2 from the original dimensions) from an uncompressed index across OpenAI text-embedding models.


 


Key Takeaways:



  • 99% Search Quality with BQ + MRL + Oversampling: Combining Binary Quantization (BQ) with Oversampling and Matryoshka Representation Learning (MRL) retains 99% of the original search quality in the datasets and embeddings combinations we tested, even with up to 96x compression, making it ideal for reducing storage while maintaining high retrieval performance.

  • Flexible Embedding Truncation: MRL enables dynamic embedding truncation with minimal accuracy loss, providing a balance between storage efficiency and search quality.

  • No Latency Impact Observed: Our experiments also indicated that using MRL had no noticeable latency impact, supporting efficient performance even at high compression rates.


For more details on how MRL works and how to implement it, visit the MRL Support Documentation.


 


Targeted Vector Filtering


Targeted Vector Filtering allows you to apply filters specifically to the vector component of hybrid search queries. This fine-grained control ensures that your filters enhance the relevance of vector search results without inadvertently affecting keyword-based searches.


 


Sub-Scores


Sub-Scores provide granular scoring information for each recall set contributing to the final search results. In hybrid search scenarios, where multiple factors like vector similarity and text relevance play a role, Sub-Scores offer transparency into how each component influences the overall ranking.


 


Text Split Skill by Tokens


The Text Split Skill by Tokens feature enhances your ability to process and manage large text data by splitting text based on token counts. This gives you more precise control over passage (chunk) length, leading to more targeted indexing and retrieval, particularly for documents with extensive content.


For any questions or to share your feedback, feel free to reach out through our Azure Search community.


 


Getting started with Azure AI Search 



 


References:
[1] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2024). Matryoshka Representation Learning. arXiv preprint arXiv:2205.13147. Retrieved from https://arxiv.org/abs/2205.13147