
HDInsight 4.0 brings upgraded versions for all Apache components, but this lab focuses specifically on the Hive versions.

 

Component    HDInsight 4.0    HDInsight 3.6
Hive         3.1.0            2.1.0; 1.2.1

Some key areas in which Hive 3.x differs from earlier versions are:

Client

  • Hive CLI support is deprecated in Hive 3.x; only the thin client Beeline is supported
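
For reference, connecting to HiveServer2 from a cluster head node with Beeline uses a JDBC URL of the following form (the same form is used by the TPCDS commands later in this lab):

beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p ""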

Transaction Processing

  • ACID tables are the default table type in Hive 3.x, with performance comparable to non-ACID tables (a quick check is sketched below)
  • Automatic query result caching
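
A quick, hedged way to confirm the new default on an HDInsight 4.0 cluster is to create a plain managed table and inspect its parameters; the table name acid_check below is a hypothetical scratch name:

beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -e "CREATE TABLE acid_check (id INT); DESCRIBE FORMATTED acid_check;"

On Hive 3.x the table parameters in the output should include transactional=true, even though no ACID properties were specified at creation time.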

Catalogs

  • Spark and Hive use independent catalogs for Spark SQL and Hive tables, whether in the same or separate clusters
  • The Hive Warehouse Connector is used to integrate Spark and Hive workloads; a launch sketch follows this list.
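
As a sketch of what that integration looks like in practice, a Spark session is launched with the connector on the classpath and pointed at HiveServer2. The assembly jar path and JDBC URL below are illustrative placeholders that vary by cluster:

spark-shell \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly.jar \
  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://<hs2-host>:10001/;transportMode=http"

Inside the shell, a HiveWarehouseSession built from the Spark session then brokers reads and writes against Hive managed tables.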

For an exhaustive overview of advancements in Hive 3.0, listen to this presentation.

 

Lab

This lab is a simulation of a real migration and consists of the following steps, in sequence.

  1. Create an HDInsight Hadoop 3.6 cluster.
  2. Create test data in the HDInsight 3.6 cluster. We use TPCDS data here.
  3. Run TPCDS tests to simulate Hive workloads.
  4. Copy the Hive 2.1 Metastore and upgrade the Metastore copy to 3.1.
  5. Delete the older HDInsight Hadoop 3.6 cluster.
  6. Create a new HDInsight 4.0 cluster with the older storage account and new upgraded 3.1 Hive Metastore.

  7. Run the same TPCDS tests to ensure everything is working as intended.

Create a Storage account

1. On the Azure portal, navigate to the Create storage account blade and click Create.

 

2. Populate the Basics tab of the storage account with the below information.

  • Resource group: Create or use an existing resource group
  • Storage account name: Enter a unique name for your storage account (I used agstorage36 to represent an older cluster)
  • Location: Enter an Azure region (the HDInsight cluster needs to be created in the same Azure region)
  • Performance: Standard
  • Account kind: StorageV2 (general purpose v2)
  • Replication: Locally-redundant storage (LRS)
  • Access tier: Hot


 

3. Populate the Advanced tab of the storage account with the below information; an Azure CLI sketch of this whole step follows the list.

  • Security (Secure transfer required): Enabled
  • Virtual network (Allow access from): All networks
  • Data protection (Blob soft delete): Disabled
  • Data Lake Storage Gen2 (Hierarchical namespace): Disabled
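
If you prefer to script this step, the below Azure CLI sketch creates an equivalent account; the resource group name agrg and the eastus region are assumptions chosen for illustration:

az storage account create \
  --name agstorage36 \
  --resource-group agrg \
  --location eastus \
  --kind StorageV2 \
  --sku Standard_LRS \
  --access-tier Hot \
  --https-only true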

 

 


4. Make no changes on the Tags tab; after validation passes, click Create on the Review + create tab to create the Azure Blob storage account.

 

5. In the next section we will create the external metastore for the HDInsight Hadoop 3.6 cluster.

Create an External Metastore for the HDInsight Hadoop 3.6 cluster

1. On the Azure portal, navigate to the SQL Database blade and click Create.

 


 

2. Populate the Basics tab of the SQL database with the below information.

  • Resource group: Use the same resource group used for storage

  • Database details:

    • Database name: Choose any allowed name for the metastore (I used aghive36db to represent a Hive 2.1 metastore on an HDInsight 3.6 cluster type)

    • Server: Click Create new

      • Server admin login: Create a username
      • Password/Confirm password: Enter the metastore server password
      • Location: Choose the same location as the storage account
    • Want to use SQL elastic pool: No

 


 

3. Compute + storage: Click the Configure database link to navigate to the database configuration page.

 


4. In the database configuration page select the Standard tier with the below settings and click Apply.

  • DTUs: 200
  • Data max size: A few GB is sufficient for this lab


5. In the Networking tab ensure the following settings are met; an Azure CLI sketch covering the server, database, and firewall rule follows this list.

  • Connectivity method: Public endpoint
  • Firewall rules:
    • Allow Azure services and resources to access this server: Yes
    • Add current client IP address: Yes
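
The server, database, and firewall settings above can also be scripted; a minimal Azure CLI sketch, assuming the agrg resource group and a server named agclusterdbserver (the S4 service objective corresponds to the Standard 200 DTU tier):

az sql server create --name agclusterdbserver --resource-group agrg \
  --location eastus --admin-user aguser --admin-password '<password>'
az sql db create --name aghive36db --server agclusterdbserver \
  --resource-group agrg --service-objective S4
az sql server firewall-rule create --name AllowAzureServices \
  --server agclusterdbserver --resource-group agrg \
  --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0

The 0.0.0.0 start/end addresses are the conventional rule for allowing Azure services to reach the server.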

 


6. Leave the Additional settings and Tags tabs in their default state.

7. In the Review + create tab, click Create.


8. In this section we created an external Hive metastore (aghive36db), which we will use subsequently in an HDInsight 3.6 cluster.


Provision an HDInsight Hadoop 3.6 (Hive 2.1.0) cluster from the Azure portal

 

To provision the HDInsight Hadoop 3.6 cluster with the Azure portal, perform the below steps.

  1. Go to the Azure portal at portal.azure.com and log in using your Azure account credentials.

  2. Select Create a resource -> Azure HDInsight -> Create.

  3. Click the Custom (size, settings, apps) option.


 

4. In the Basics tab populate the following values.

  • Resource group: Put the cluster in the same resource group as the storage account and metastore
  • Cluster name: Enter the cluster name. A green tick will appear if the cluster name is available.
  • Region: Choose the same region as the storage account
  • Cluster type: Hadoop
  • Version: Hadoop 2.7.3 (HDI 3.6)


  • Cluster login username: Enter the username for the cluster administrator (default: admin)
  • Cluster login password: Enter the password for the cluster login
  • Confirm cluster login password: Repeat the same password as used earlier
  • Secure Shell (SSH) username: Enter the SSH username for the cluster (default: sshuser)
  • Use cluster login for SSH: Check this box (this makes your SSH password the same as the cluster admin password)


5. In the Storage tab populate the following values.

  • Primary storage type: Azure Storage
  • Selection method: Select from list
  • Primary storage account: Select the Azure Blob storage account created earlier.
  • Container: Enter a name for the storage container

6. Metastore settings: Enter the SQL Database/SQL Server combination that was created in the previous section.


7. Click Authenticate to enter the username and password for the metastore. Enter the username and password that were set for the SQL server in the previous exercise.


8. In the Configuration + pricing tab select the node sizes for the cluster. There are no hard and fast rules; the general recommendation is to select larger nodes for faster data processing, and note that choosing nodes that are too small may result in failures.


 

9. In the Review + create tab, review the cluster specifics and click Create.

 


 

10. In this step we created an HDInsight 3.6 Hadoop cluster with a preconfigured external metastore and storage account.
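
The same cluster can also be provisioned non-interactively. Below is a hedged Azure CLI sketch with the cluster, storage, and container names assumed from this lab; attaching the external Hive metastore would additionally require hive-site values (such as javax.jdo.option.ConnectionURL) passed through --cluster-configurations, omitted here for brevity:

az hdinsight create \
  --name agcluster36 \
  --resource-group agrg \
  --type hadoop \
  --version 3.6 \
  --http-user admin --http-password '<cluster-password>' \
  --ssh-user sshuser --ssh-password '<cluster-password>' \
  --storage-account agstorage36 \
  --storage-container agcontainer \
  --workernode-count 4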

 


 

11. In the next step we will create some test data in the cluster to represent a production workload.

 

 

Create TPCDS test data on the HDInsight 3.6 Hadoop cluster with Beeline CLI

The goal of this step is to generate TPCDS data with Hive as a close representation of production data.

  1. Clone this repo
git clone https://github.com/hdinsight/tpcds-hdinsight && cd tpcds-hdinsight 
  2. Upload the resources to DFS:
hdfs dfs -copyFromLocal resources /tmp
  3. Run TPCDSDataGen.hql with the settings.hql file and set the required config variables (the command below generates roughly 1 GB of data):
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=1 -hiveconf PARTS=1 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=`grep -A 1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml | grep -o "wasb[^<]*"`/tmp/resources

SCALE is the scale factor for TPCDS. Scale factor 10 generates roughly 10 GB of data, scale factor 1000 generates roughly 1 TB, and so on.

PARTS is the number of tasks to use for data generation (parallelization). This should be set to the same value as SCALE.

LOCATION is the directory where the data will be stored on HDFS.

TPCHBIN is where the resources are found. You can specify additional settings in the settings.hql file.
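
Once the data generation job completes, a quick sanity check is to confirm that data landed under LOCATION; with SCALE=1 the total should be on the order of 1 GB:

hdfs dfs -du -s -h /HiveTPCDS/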

  4. Create tables on the generated data:
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCDS/ -hiveconf DBNAME=tpcds
  5. Generate ORC tables and analyze them:
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpcds_orc -hiveconf SOURCE=tpcds
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpcds_orc
  6. Run a few queries to represent a production workload. Change the query number at the end to test various queries:
beeline -u "jdbc:hive2://`hostname -f`:10001/tpcds_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/query12.sql
  7. In this section we created test data on the cluster and then ran a few queries representative of production workloads.

  8. In the next section we will upgrade the Hive metastore from 2.1 to 3.1.

Upgrade the Hive Metastore from Hive 2.1 to Hive 3.1

Create a copy of the Hive Metastore in HDInsight 3.6

  1. From the HDInsight 3.6 cluster blade, click External metastores.

 


2. Click the metastore entry to open the SQL database portal.

 


3. On the SQL database portal, click Restore.

 


 

  4. In the restore blade choose the below values to get the most recent copy of the Hive metastore.
  • Select source: Point-in-time
  • Database name: Assign a new database name (I chose the name aghive40db)
  • Restore point:
    • Date: Current date
    • Time: Current time
  • Target server: Choose the SQL server created earlier
  • Elastic pool: None
  • Pricing tier: Same tier as the older metastore database

Click OK to continue with the creation of the copy. The same restore can also be scripted, as sketched below.
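
A point-in-time restore sketch with the Azure CLI, assuming the agrg resource group; the timestamp is a placeholder that must fall within the database's retention window:

az sql db restore \
  --resource-group agrg \
  --server agclusterdbserver \
  --name aghive36db \
  --dest-name aghive40db \
  --time "2020-06-01T00:00:00Z"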

 

 


 

5. After creation, the new SQL database (aghive40db) appears as an additional database on the same SQL server.

 

 


 

Upgrade the Hive 2.1 Metastore

  1. On the HDInsight cluster, click Script actions.

 


 

2. On the Script action page, populate the parameters for the Hive metastore upgrade script action, pointing it at the copied database (aghive40db), and click Create.


 

3. The script action starts on the cluster

 


 

4. The script comes back with a green check mark, which indicates successful completion.
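
Conceptually, the upgrade corresponds to running Hive's schematool against the copied database. The below is a hedged sketch of an equivalent manual invocation, not the exact script action; the binary path, server name, and credentials are placeholders:

/usr/hdp/current/hive-server2/bin/schematool \
  -dbType mssql \
  -upgradeSchema \
  -url "jdbc:sqlserver://agclusterdbserver.database.windows.net:1433;database=aghive40db" \
  -userName aguser \
  -passWord '<password>'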

 


 

Validate the Hive Metastore version

After the script finishes, we need to validate that the Hive metastore was indeed upgraded.

  1. Launch the new SQL database portal (aghive40db) and click Query editor. You could alternatively use SSMS or Azure Data Studio.

 

 

 


 

  2. Enter the below query in the query editor and click Run.
SELECT * FROM dbo.VERSION;

  3. Validate that the schema version was upgraded to 3.1.0, which indicates that the Hive metastore was successfully upgraded. The same check from sqlcmd is sketched below.
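
The same check works from any SQL client; for example, a hedged sqlcmd equivalent using the server and credentials created earlier in this lab (SCHEMA_VERSION and VERSION_COMMENT are columns of the metastore's VERSION table):

sqlcmd -S agclusterdbserver.database.windows.net -d aghive40db \
  -U aguser -P '<password>' \
  -Q "SELECT SCHEMA_VERSION, VERSION_COMMENT FROM dbo.VERSION;"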


 

4. Post upgrade, delete the older HDInsight 3.6 cluster.

 

 


  5. In this section we upgraded the copied Hive metastore aghive40db from version 2.1 to 3.1.0, after which we deleted the older HDInsight cluster.

  6. In the next section, we will create a new HDInsight 4.0 (Hive 3.1) cluster with the new Hive metastore and the older storage account.

 

Provision an HDInsight Hadoop 4.0 (Hive 3.1.0) cluster from the Azure portal

To provision the HDInsight Hadoop 4.0 cluster with the Azure portal, perform the below steps.

  1. Go to the Azure portal at portal.azure.com and log in using your Azure account credentials.

  2. Select Create a resource -> Azure HDInsight -> Create.

  3. Click the Custom (size, settings, apps) option.

  4. In the Basics tab populate the following values.

  • Resource group: Put the cluster in the same resource group as the storage account and metastore
  • Cluster name: Enter the cluster name. A green tick will appear if the cluster name is available.
  • Region: Choose the same region as the storage account
  • Cluster type: Hadoop
  • Version: Hadoop 3.1.0 (HDI 4.0)

 


 

  • Cluster login username: Enter the username for the cluster administrator (default: admin)
  • Cluster login password: Enter the password for the cluster login
  • Confirm cluster login password: Repeat the same password as used earlier
  • Secure Shell (SSH) username: Enter the SSH username for the cluster (default: sshuser)
  • Use cluster login for SSH: Check this box (this makes your SSH password the same as the cluster admin password)

 

 

 


 

  5. In the Storage tab populate the following values.
  • Primary storage type: Azure Storage
  • Selection method: Select from list
  • Primary storage account: Select the Azure Blob storage account created earlier.
  • Container: Enter the same storage container that was used for the previous HDInsight 3.6 cluster

 


  • Metastore settings: Enter the new SQL Database/SQL Server combination (agclusterdbserver/aghive40db) that was upgraded in the previous section

 


6. Click Authenticate to enter the username and password for the metastore. Enter the username and password that were set for the SQL server.

 

 


 

7. In the Configuration + pricing tab select the node sizes for the cluster. There are no hard and fast rules; the general recommendation is to select larger nodes for faster data processing, and note that choosing nodes that are too small may result in failures.

 


 

8. In the Review + create tab, review the cluster specifics and click Create.

 


 

  9. In this step we created an HDInsight 4.0 Hadoop cluster with a preconfigured, upgraded external metastore and mapped its storage to the preexisting storage container.

  10. In the next step we will rerun the TPCDS tests against the existing data to represent a production regression workload.

 

Run TPCDS tests on the HDInsight 4.0 Hadoop cluster with the Beeline CLI

The goal of this step is to run the TPCDS tests against the upgraded Hive metastore to represent regression tests of Hive workloads.

  1. The TPCDS repo should already be cloned, and the TPCDS data should already exist from the earlier steps.

  2. SSH into the cluster using the SSH credentials provided during cluster creation.

  3. Run a few TPCDS queries to represent a production regression test. Change the query number at the end to test various queries. A quick sanity check of the migrated tables is sketched after the command below.

beeline -u "jdbc:hive2://`hostname -f`:10001/tpcds_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/query12.sql
  4. In this lab we migrated multiple HDInsight 3.6 Hive workloads to HDInsight 4.0.

 

