When creating any table, one of the crucial aspects is maintaining data integrity. Each column should adhere to specific rules and constraints to ensure accurate and reliable data. For instance, a company might provide employee IDs in number format and require that basic salary be at least 15K. In such cases, it becomes essential to enforce constraints on these columns, allowing only data that complies with the defined rules.
To accomplish this Delta lake offers four different constraints
- NOT NULL
- CHECK
- PRIMARY KEY
- FOREIGN KEY
So, before delving into the details, it's essential to become familiar with the Metastore in Databricks/Azure Databricks.
In the Lake House architecture, the MetaStore plays a pivotal role by serving as a repository for metadata. As the name suggests, it contains information about all the objects defined within the lake house environment. Further, the Metastore categorized into two types
Built-in Hive Metastore:
- Each Databricks or Azure Databricks resource contains one built-in Hive Metastore.
- An instance of this metastore is attached to each cluster created in Databricks.
- This legacy Hive Metastore provides less centralized access to the tables, and by default, all users connected to the cluster have access to the underlying objects.
- Unless we have user access control set up, it is very difficult to manage user access.
- Due to this, the legacy Hive Metastore supports only NOT NULL and CHECK constraints.
Unity Catalog:
- This provides centralized user access control and audit features.
- In this, the data is secured by default, which means users will not have access initially; either the catalog admin or owner of the object has to provide the user access.
- Unlike the legacy metastore, this supports all constraints on the tables.
CHECK Constraint: The CHECK constraint proves valuable when we need to allow only data within a specific range. For instance, ensuring that basic salary remains at a minimum of 15K or that employee data stored in a table includes individuals born after January 1, 2000.
In Delta tables, any attempt to insert new records that violate the CHECK constraint rules results in a DeltaInvariantViolationException error. Unfortunately, the Delta table create statement does not directly support adding the CHECK constraint. Instead, we must utilize the ALTER table statement to add the CHECK constraint after the table is created.
Let us see this in a practical way in both SQL format and Python DeltaTable format.
<<please click on each image for enlarged view>>
Please leave a comment if you have any doubts or issues while working with CHECK Constraint.
Thanks for reading 🙏
No comments:
Post a Comment