This article is contributed. See the original author and article here.
The purpose of this article is to discuss managed and external tables when querying from SQL on-demand, also known as serverless SQL pool.
Thanks to my colleague Dibakar Dharchoudhury for the really nice discussion related to this subject.
- Managed tables
Spark provides many options for storing data in managed tables, such as TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM. The data files for managed tables are stored in the warehouse directory that Spark reserves for them.
- External tables
Spark also provides ways to create external tables over existing data, either by providing the LOCATION option or using the Hive format. Such external tables can be over a variety of data formats, including Parquet.
Azure Synapse currently shares with the SQL engines only those managed and external Spark tables that store their data in Parquet format.
Note “The Spark created, managed, and external tables are also made available as external tables with the same name in the corresponding synchronized database in serverless SQL pool.”
The following is an example of an external table created in Spark based on a Parquet file. First, configure access to the storage account using a SAS token retrieved through the linked service:
from pyspark.sql import SparkSession

blob_account_name = "StorageAccount"
blob_container_name = "ContainerName"

spark = SparkSession.builder.getOrCreate()
token_library = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
blob_sas_token = token_library.getConnectionString("LInkedServerName")

spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
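A detail worth calling out: the Hadoop configuration key used above follows the pattern `fs.azure.sas.<container>.<account>.blob.core.windows.net`, with the container name before the account name. A minimal plain-Python sketch of how the key is assembled (the account and container names are placeholders from the example above):

```python
# Placeholder names; substitute your real container and storage account.
blob_account_name = "StorageAccount"
blob_container_name = "ContainerName"

# The SAS token is registered under a per-container configuration key:
# note that the container name comes first, then the account name.
sas_config_key = 'fs.azure.sas.%s.%s.blob.core.windows.net' % (
    blob_container_name, blob_account_name)

print(sas_config_key)
# → fs.azure.sas.ContainerName.StorageAccount.blob.core.windows.net
```

Swapping the two names is an easy mistake and results in an authorization failure when Spark tries to read the container.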
Note my linked service configuration:
2) External table:
spark.sql('CREATE DATABASE IF NOT EXISTS SeverlessDB')

# Create the external Spark table from the Parquet file
filepath = 'wasbs://Container@StorageAccount.blob.core.windows.net/parquets/file.snappy.parquet'
df = spark.read.load(filepath, format='parquet')
df.write.mode('overwrite').saveAsTable('SeverlessDB.Externaltable')
You can now query the table from SQL Serverless:
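As the note above says, the synchronized table appears with the same name in the matching serverless database. A sketch of the T-SQL you would run against the serverless SQL endpoint, built here as a plain Python string; the `dbo` schema in the fully qualified name is an assumption based on where synchronized Spark tables typically surface:

```python
# Database and table names taken from the Spark example above;
# the dbo schema is assumed for the synchronized table.
database, table = "SeverlessDB", "Externaltable"

# T-SQL to run from the serverless SQL endpoint (e.g. in Synapse Studio).
query = f"SELECT TOP 10 * FROM {database}.dbo.{table}"
print(query)
# → SELECT TOP 10 * FROM SeverlessDB.dbo.Externaltable
```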
If you check the path where your external table was created, you will see it in the data lake as follows. For example, my workspace name is synapseworkspace12:
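As a rough sketch of where to look: tables created with `saveAsTable` land under the workspace's default warehouse directory. The layout below is an assumption based on the default Synapse folder convention (the exact container and prefix depend on the workspace's primary storage settings; workspace, database, and table names come from the examples above):

```python
# Assumed default warehouse layout for Synapse Spark tables;
# database and table names are lower-cased on disk.
workspace = "synapseworkspace12"
database, table = "severlessdb", "externaltable"

path = "synapse/workspaces/%s/warehouse/%s.db/%s" % (workspace, database, table)
print(path)
```

Browsing to this folder in the data lake shows the snappy-compressed Parquet files that back the table.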
3) I can also create a managed table in Parquet format using the same dataset that I used for the external one:
# Managed table
df.write.format("parquet").saveAsTable("SeverlessDB.ManagedTable")
This table is also persisted in the storage account under the same path, but in the managed table folder.
Following the documentation, this is another way to achieve the same result for a managed table; in this case, however, the table will be created empty:
CREATE TABLE SeverlessDB.myparquettable(id int, name string, birthdate date) USING Parquet
These are the commands supported, per the documentation, for creating managed and external tables in Spark that can then be queried from SQL Serverless.
If you want to clean up this lab – Spark SQL:
-- Drop the database and its tables
DROP DATABASE SeverlessDB CASCADE
That is it!
Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.