Delta tables in Databricks are a powerful feature that enables efficient and reliable data management and processing. They are an extension of Apache Spark's data storage capabilities designed to handle large-scale, transactional data workloads.
Delta tables provide several advantages:
- They offer ACID (Atomicity, Consistency, Isolation, Durability) transactions.
- They handle schema evolution seamlessly.
- They leverage advanced indexing and caching techniques to accelerate query execution and data access.
If we create a table without specifying a location, its data is stored under the metastore's default location and it becomes a managed table. If we define a table location, it becomes an external table.
This blog post serves as a tutorial on common methods for creating Delta tables in Databricks. The process is the same in Azure Databricks and Azure Synapse.
Syntax:
Using SQL commands (%sql):
Create the table if it doesn't exist:
CREATE TABLE IF NOT EXISTS [Table_Name]
( [Column_Name] [DataType],
...
)
USING [DATA_SOURCE]
PARTITIONED BY ([Partition Columns])
LOCATION '[FileStoreLocationDetails]'
Create or replace the table if it exists:
CREATE OR REPLACE TABLE [Table_Name]
(
[Column_Name] [DataType],
...
)
USING [DATA_SOURCE]
PARTITIONED BY ([Partition Columns])
LOCATION '[FileStoreLocationDetails]'
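As a concrete illustration, the syntax above might be filled in as follows; the table name, columns, partition column, and storage path are made-up placeholders, and the statement is wrapped in spark.sql() so it can also be run from a Python cell (in a %sql cell the statement inside the quotes would be used directly).
# Hypothetical example: an external, partitioned Delta table.
# Table name, columns, and the path are placeholders for illustration.
spark.sql("""
CREATE OR REPLACE TABLE sales_orders
(
    order_id   INT,
    customer   STRING,
    amount     DOUBLE,
    order_date DATE
)
USING DELTA
PARTITIONED BY (order_date)
LOCATION '/mnt/datalake/tables/sales_orders'
""")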
Using Python commands
To work with Delta tables from Python, we first need to import the Delta library into the session:
from delta.tables import *
Create table if it doesn't exist:
(DeltaTable.createIfNotExists(spark)
    .tableName("[Table_Name]")
    .addColumn("[Column_Name]", "[Data_Type]")
    .addColumn("[Column_Name]", "[Data_Type]", comment="[Comments]")
    .execute())
Create or replace the table if it already exists:
(DeltaTable.createOrReplace(spark)
    .tableName("[Table_Name]")
    .addColumn("[Column_Name]", "[Data_Type]")
    .addColumn("[Column_Name]", "[Data_Type]", comment="[Comments]")
    .location("[Location_Info]")
    .execute())
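As a sketch, the builder template above could be filled in like this for an external table; the table name, columns, and path are placeholders invented for illustration (a .partitionedBy() call is also available on the builder if partitioning is needed).
from delta.tables import DeltaTable

# Hypothetical example: create or replace an external Delta table
# using the builder API. Names and the path are placeholders.
(DeltaTable.createOrReplace(spark)
    .tableName("sales_orders")
    .addColumn("order_id", "INT")
    .addColumn("customer", "STRING", comment="Customer display name")
    .addColumn("amount", "DOUBLE")
    .location("/mnt/datalake/tables/sales_orders")
    .execute())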
Before delving into examples of Delta tables, let's understand how and where they are stored.
The tables created in Databricks fall into two types:
1. Managed Tables:
- These are Spark tables in which both the data and data lifecycle are fully managed by Spark.
- Managed tables are accessible across all clusters.
- By default, these tables are stored under the root folder (root folder configurations are managed by each project).
- Dropping these tables will clear both the data and metadata from Spark.
- Managed tables exclusively support Delta format data.
- This is the default table type when creating tables in Unity Catalog.
2. External Tables:
- In these tables, the metadata is managed by Spark, while the data is managed by individuals or project teams.
- Data in external tables is accessible to all clusters, and dropping the table removes only the metadata while leaving the data intact at its location.
- External tables support all kinds of data sources that Spark supports.
- The user/creator must define where the data should be stored (a short sketch contrasting the two table types follows this list).
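The difference is easy to verify from a notebook. The sketch below, with made-up table names and a made-up path, creates one table of each type, checks the reported type, and then drops them; only the managed table's data files are removed on DROP.
# Hypothetical sketch contrasting managed and external tables.
spark.sql("CREATE TABLE IF NOT EXISTS demo_managed (id INT) USING DELTA")
spark.sql("""
CREATE TABLE IF NOT EXISTS demo_external (id INT)
USING DELTA
LOCATION '/mnt/datalake/tables/demo_external'
""")

# The 'Type' row shows MANAGED for the first table and EXTERNAL for the second.
spark.sql("DESCRIBE TABLE EXTENDED demo_managed").show(truncate=False)
spark.sql("DESCRIBE TABLE EXTENDED demo_external").show(truncate=False)

# Dropping removes data and metadata for the managed table,
# but only the metadata for the external table; its files stay in place.
spark.sql("DROP TABLE demo_managed")
spark.sql("DROP TABLE demo_external")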
Examples: Managed Tables
Create the table if it doesn't exist using %sql commands:
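A minimal sketch of such a statement, assuming a hypothetical customers table (the name and columns are placeholders); it would normally go in a %sql cell and is wrapped in spark.sql() here only so it can also run from Python.
# Hypothetical managed Delta table: no LOCATION clause, so the data
# is stored under the metastore's default (root) storage location.
spark.sql("""
CREATE TABLE IF NOT EXISTS customers
(
    customer_id INT,
    name        STRING,
    city        STRING
)
USING DELTA
""")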
Create or replace the table using Python commands:
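One way to read "Python commands" here is the DataFrame writer API, which creates, or overwrites, a managed Delta table from a DataFrame; the sample data and table name below are placeholders for illustration.
# Hypothetical example: create or replace a managed Delta table
# from a DataFrame using the writer API.
data = [(1, "Alice", "Seattle"), (2, "Bob", "Austin")]
df = spark.createDataFrame(data, ["customer_id", "name", "city"])

(df.write
    .format("delta")
    .mode("overwrite")                   # replaces the table contents if it already exists
    .option("overwriteSchema", "true")   # allow the schema to be replaced as well
    .saveAsTable("customers"))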
Create the table using the Delta library commands in Python:
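A sketch using the DeltaTable builder imported earlier; because this is a managed table, no .location() is given, and the table name and columns are placeholders.
from delta.tables import DeltaTable

# Hypothetical managed table created with the Delta library builder API.
(DeltaTable.createIfNotExists(spark)
    .tableName("customers")
    .addColumn("customer_id", "INT")
    .addColumn("name", "STRING", comment="Customer display name")
    .addColumn("city", "STRING")
    .execute())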
Thanks for reading. Please ask questions in the comments section.