Overview:
Octopai is a cloud-agnostic, automated data lineage platform that provides cross-system, column-level lineage across both legacy and modern environments. By integrating with Databricks Unity Catalog, Octopai extends the platform’s native capabilities, offering detailed file-level and column-level lineage for a centralized view of data flows across AWS, Azure, and Google Cloud environments.
This guide covers the setup for metadata extraction from Databricks Unity Catalog to build lineage within Octopai. It provides two options: setting up lineage through Unity Catalog or configuring it for specific notebooks.
Supported Cloud Platforms:
Octopai supports a multi-cloud environment with comprehensive lineage tracking across these platforms:
- AWS:
- Supports services like S3, Redshift, Glue, and RDS.
- Captures data lineage for ingestion, transformation, and storage processes in the AWS ecosystem.
- Azure:
- Fully integrated with services like Synapse Analytics, Azure Data Factory, and Azure SQL.
- Tracks detailed lineage across hybrid environments, whether on-premises or cloud-based.
- Google Cloud Platform (GCP):
- Supports BigQuery, Cloud Storage, Dataflow, and other GCP services.
- Captures column-level and file-level lineage for end-to-end tracking of data flows within GCP environments.
Databricks Unity Catalog Structure:
Catalogs:
- Catalog Name: The catalog is a logical grouping of databases or schemas within Unity Catalog, representing a container for managing metadata and data lineage.
Schemas (Databases):
- Schema Name: Logical grouping of tables within a catalog, allowing metadata organization at the schema level.
- Lineage: Tracks how data flows into and out of the schema objects.
Tables and Views:
- Table Name: The name of the specific table or view.
- Table Type: Whether the table is a managed table, external table, or Delta table.
- Data Lineage: Shows how data in the table or view is derived or transformed from other source tables.
Files:
- File Metadata: Information about files (e.g., CSV, Parquet) stored in external sources, capturing schema details and data types.
- File Lineage: Captures how data from files is used and transformed within data pipelines.
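These objects are addressed with Unity Catalog's three-level namespace (catalog.schema.table), which is what lineage capture keys on. The snippet below is a minimal, hypothetical sketch: the names main, sales, orders, and the /Volumes path are placeholders, and spark is the session object Databricks predefines in notebooks.
```python
# Hypothetical names for illustration; assumes a Databricks notebook where
# `spark` is predefined and these objects exist in Unity Catalog.
orders = spark.read.table("main.sales.orders")                   # catalog.schema.table
raw = spark.read.format("csv").load("/Volumes/main/sales/raw/")  # files in a UC volume
orders.write.mode("append").saveAsTable("main.sales.orders_clean")
```
Reads and writes expressed this way are what Unity Catalog records in its lineage graph and what Octopai consumes during extraction.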
Databricks ETL Notebooks with Unity Catalog:
- Notebook Metadata with Lineage: Automatically captures metadata lineage for data read from or written to Unity Catalog-managed tables.
- Data Sources and Targets: Tracks the lineage of data sources and targets processed by ETL notebooks.
Databricks Jobs with Unity Catalog:
- Job Metadata with Lineage: Captures lineage metadata for job tasks interacting with Unity Catalog objects (e.g., tables, files).
- Job Tasks with Unity Catalog Integration: Tracks the input and output data lineage for each job task, showing how Unity Catalog tables are used in the context of job executions.
Octopai’s Integration with Databricks Unity Catalog:
Option 1: Supporting Lineage via Unity Catalog Using Octopai Client Extraction
Additional documentation is available in the Octopai Knowledge Center.
- Ensure the Correct Cluster Type:
- Databricks Unity Catalog requires a Premium license and supports only certain cluster types, including Standard, High Concurrency, and Single-Node clusters. It also requires Databricks Runtime 9.1 LTS or higher for metadata extraction.
- Configuring Permissions in Databricks:
- API Management:
Use the Databricks API to manage access permissions. For example, to list the metastores:
```bash
curl --location 'https://adb-90442919623923.3.azuredatabricks.net/api/2.0/unity-catalog/metastores' \
--header 'Authorization: Bearer <token>'
```
To enable access to a system schema (here, the access schema that holds audit and lineage data):
```bash
curl --location --request PUT 'https://adb-90442919623923.3.azuredatabricks.net/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/access' \
--header 'Authorization: Bearer <token>'
```
- Assign Permissions:
- Access Unity Catalog > Manage Permissions in the Databricks UI.
- Add users/groups that need access and assign permissions such as "Can View", "Can Run", "Can Edit", or "Is Owner".
- Ensure the Octopai service principal has access to the necessary Unity Catalog objects.
- Audit Logs:
Enable audit logs to track metadata access. Example query to monitor Octopai’s access:
```sql
SELECT *
FROM system.access.audit
WHERE user_agent LIKE '%octopai%'
```
- Unity Catalog Supported Features:
- Supported:
- Multi-language support (Python, SQL, Scala).
- Automatic notebook metadata capture.
- Not Supported:
- Does not support file-level lineage unless the files are mapped to a volume.
- Lineage is not captured for tables registered only in the legacy Hive metastore (hive_metastore); objects must be registered in Unity Catalog.
- Delta Live Tables (DLT) streaming patterns are not captured.
- Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables (and not other file formats), and requires the CREATE or MODIFY privilege.
- Octopai Advantage:
- Octopai extends lineage visibility to include detailed file-level and column-level lineage, handling gaps in streaming live table lineage and other unsupported patterns in Unity Catalog. Octopai can parse lineage where Unity Catalog doesn’t, particularly with non-standard transformations and files.
Option 2: Building Lineage for Specific Notebooks
- Identify Notebooks:
- Determine which notebooks should be included in the data lineage process.
- Configuring Permissions for Notebooks:
- Set up permissions for the relevant notebooks through the sharing (permissions) dialog in Databricks. Assign roles such as "Can Read", "Can Run", "Can Edit", or "Can Manage".
- Validate Permissions:
- Ensure permissions are set correctly by reviewing through the Databricks UI or using the Databricks API.
Setting Up Octopai for Databricks Metadata Extraction:
- Connection Name: Provide a meaningful name for the connection.
- Databricks Server URL: Enter the correct Databricks server URL (e.g., https://abc-1234.5.azuredatabricks.net).
- Token: Generate a personal access token (PAT), or an OAuth2 token for a service principal, and enter it in Octopai. Refer to the Databricks documentation for detailed instructions.
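A PAT can also be created programmatically through the Databricks Token API. The following is a minimal sketch, assuming Python with the requests library; the workspace URL, admin token, and lifetime are placeholders:
```python
import requests

# Sketch only: placeholder workspace URL and token; adjust the lifetime
# to match your rotation policy.
resp = requests.post(
    "https://abc-1234.5.azuredatabricks.net/api/2.0/token/create",
    headers={"Authorization": "Bearer <admin-token>"},
    json={"comment": "Octopai metadata extraction", "lifetime_seconds": 7776000},
)
resp.raise_for_status()
pat = resp.json()["token_value"]  # enter this value in the Octopai connection form
```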
Testing and Validation:
- Unity Catalog Validation: Run test extractions from Unity Catalog and verify lineage capture in the Octopai UI. Ensure that column-level lineage and file-level lineage are accurately reflected.
- Notebook Lineage Validation: Validate specific notebook lineage by ensuring data read/write operations are correctly captured in Octopai.
- Error Handling: If lineage extraction fails, verify that permissions are correctly configured and the cluster supports Unity Catalog.
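As a quick cross-check during validation, you can compare what Octopai displays against the lineage Unity Catalog itself recorded. A hedged sketch, assuming the system.access.table_lineage system table is enabled in the workspace and using a placeholder table name:
```python
# Assumes a Databricks notebook session (`spark` predefined) and that the
# system.access.table_lineage system table is enabled.
rows = spark.sql("""
    SELECT source_table_full_name, target_table_full_name, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.sales.orders_clean'  -- placeholder
    ORDER BY event_time DESC
    LIMIT 10
""")
rows.show(truncate=False)
```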
Security Considerations:
- Token Management: Rotate OAuth2 tokens periodically to prevent unauthorized access. Ensure tokens are generated with the minimum privileges necessary.
- IAM Roles and Managed Identities: Use IAM roles for AWS and Managed Identities for Azure to securely manage Octopai’s access to cloud services without hardcoded credentials.
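Token rotation can be scripted against the Databricks Token API. The sketch below is illustrative only, assuming Python with requests and placeholder credentials; it revokes PATs older than 90 days (replacements would then be issued via /api/2.0/token/create):
```python
import time
import requests

BASE = "https://abc-1234.5.azuredatabricks.net"  # placeholder workspace URL
HEADERS = {"Authorization": "Bearer <admin-token>"}
MAX_AGE_MS = 90 * 24 * 3600 * 1000               # rotate tokens older than 90 days

tokens = requests.get(f"{BASE}/api/2.0/token/list", headers=HEADERS).json()
now_ms = int(time.time() * 1000)
for info in tokens.get("token_infos", []):
    if now_ms - info["creation_time"] > MAX_AGE_MS:
        # Revoke the aged token by ID.
        requests.post(f"{BASE}/api/2.0/token/delete",
                      headers=HEADERS, json={"token_id": info["token_id"]})
```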
Performance and Scaling Considerations:
- Scaling Unity Catalog: For large datasets or environments with high-volume queries, ensure that Unity Catalog clusters are optimized for efficient metadata extraction.
- Scaling Octopai: Octopai is built to handle multi-system, high-volume environments. Ensure that Octopai’s processing resources are scaled to handle large metadata sets without performance degradation.
Key Benefits of Octopai with Databricks Unity Catalog:
By integrating with Databricks Unity Catalog, Octopai enables automated, cross-system lineage capture, filling in gaps where Unity Catalog may lack detailed lineage tracking. This integration provides comprehensive, end-to-end visibility across files, tables, and streaming operations, empowering organizations to manage and govern complex data ecosystems more effectively.
Databricks Architecture Overview
To ensure accurate data lineage tracking with Unity Catalog in Databricks, it is essential to enable audit logs in the workspace. This process involves querying the system tables to monitor activity associated with data access and operations.
Steps to Enable Audit Logs in Databricks:
- Obtain the Metastore ID:
- This ID is required to manage and access metadata within Unity Catalog.
- List All Available Schemas:
- Retrieve a list of schemas from the metastore to understand which schemas contain data that will be tracked.
- Enable Schema Access:
- Ensure that the appropriate permissions are set for schema access to allow metadata extraction.
- Create a SQL Warehouse and Run the Query:
Execute the following query to retrieve audit logs, filtering for activities initiated by Octopai:
```sql
SELECT *
FROM system.access.audit
WHERE user_agent LIKE '%octopai%'
```
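Steps 1–3 map onto the Unity Catalog REST endpoints shown in the next section. As a convenience, here is a hedged Python sketch that retrieves the metastore ID and lists the available system schemas; the workspace URL and token are placeholders, and the endpoint paths mirror the curl examples in this guide:
```python
import requests

BASE = "https://adb-xxx.azuredatabricks.net"  # placeholder workspace URL
HEADERS = {"Authorization": "Bearer <token>"}

# Step 1: obtain the metastore ID.
metastores = requests.get(f"{BASE}/api/2.0/unity-catalog/metastores",
                          headers=HEADERS).json()
metastore_id = metastores["metastores"][0]["metastore_id"]

# Step 2: list the system schemas available in that metastore.
schemas = requests.get(
    f"{BASE}/api/2.0/unity-catalog/metastores/{metastore_id}/systemschemas",
    headers=HEADERS).json()
print(metastore_id, schemas)
```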
Managing Permissions in Databricks Using the API
To enable access to the Unity Catalog, administrators can manage permissions and settings via API calls. Below are the essential steps:
- Set Up the Metastore:
Make a request to the Databricks API to retrieve the metastore details:
```bash
curl --location 'https://adb-xxx.azuredatabricks.net/api/2.0/unity-catalog/metastores' \
--header 'Authorization: Bearer <token>'
```
- Enable Schema Access:
Once the metastore is configured, enable schema access with the following API request:
```bash
curl --location --request PUT 'https://adb-xxx.azuredatabricks.net/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/access' \
--header 'Authorization: Bearer <token>'
```
- Generate a Service Principal and Secret:
- A service principal must be created (or an existing one used) for access control. After creating or selecting the service principal, generate a secret for authentication.
- Grant Permissions to the Service Principal:
- Assign the service principal the necessary permissions on each catalog so that Octopai can extract metadata seamlessly; a sketch of the corresponding GRANT statements follows.
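The grants themselves can be issued as Unity Catalog SQL statements. A hedged sketch (the catalog name main and the service principal's application ID are placeholders; run from a notebook or SQL warehouse session):
```python
# Placeholders: "main" and the Octopai service principal's application ID.
for stmt in [
    "GRANT USE CATALOG ON CATALOG main TO `<octopai-sp-application-id>`",
    "GRANT USE SCHEMA ON CATALOG main TO `<octopai-sp-application-id>`",
    "GRANT SELECT ON CATALOG main TO `<octopai-sp-application-id>`",
]:
    spark.sql(stmt)
```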
OAuth Token Generation for API Access
To authenticate API calls for Unity Catalog, an OAuth token must be generated using client credentials. Here’s how:
```bash
curl --location 'https://adb-xxx.azuredatabricks.net/oidc/v1/token' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'client_id=<client-id>' \
--data-urlencode 'client_secret=<client-secret>' \
--data-urlencode 'scope=all-apis'
```
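The same exchange in Python, as a hedged sketch (requests assumed; all values are placeholders). The access_token field of the response is then passed as the Bearer token on subsequent API calls:
```python
import requests

# Hedged Python equivalent of the curl call above; all values are placeholders.
resp = requests.post(
    "https://adb-xxx.azuredatabricks.net/oidc/v1/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "<client-id>",
        "client_secret": "<client-secret>",
        "scope": "all-apis",
    },
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # use as: Authorization: Bearer <access_token>
```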
Unity Catalog Lineage Features
Supported Features:
- Automatic Notebook Name Retrieval: No scripts are required to capture notebook names.
- Supports Multiple Access Modes: Works with both shared and single-user cluster access modes.
- Multi-Language Support: Unity Catalog can track lineage across different programming languages such as SQL, Python, and Scala.
Unsupported Features:
- File-Level Lineage: Unity Catalog does not support file-level lineage unless the files are mapped to a volume.
- Legacy Hive Metastore Tables: Lineage is not captured for tables registered only in the legacy Hive metastore (hive_metastore); objects must be registered in Unity Catalog.
- License Requirement: Unity Catalog is only available for Databricks users on the premium tier.
- DataFrame Write Limitations: Overwrite mode for writing DataFrames to Unity Catalog is only supported for Delta tables. Users must have the CREATE privilege on the parent schema and be the owner or have MODIFY privilege on the existing object.
Delta Live Tables (DLT) Limitations
Unsupported Patterns:
- Lineage for Streaming Live Tables: The lineage between a streaming live table and the files it processes is not captured automatically. For example, the creation of a temporary table may not reflect lineage between the input files and the resulting table.
- Lineage Example with Delta Live Table:
A Delta Live Table can be created to read data from a streaming source, but the lineage of the streaming file will not automatically appear:
```python
import dlt

@dlt.table(
    name="raw_data_table_name",
    comment="Source data from SQL Server"
)
def load_sql_server_data():
    # Auto Loader ingests the streaming files; Unity Catalog does not link
    # these input files to the resulting table in its lineage graph.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/path/to/source")
    )
```
Enhancements by Octopai
Octopai enhances lineage tracking for patterns that Unity Catalog does not natively support, particularly for streaming live tables and file-level operations. Octopai’s engine is capable of parsing lineage data from complex patterns, offering more comprehensive data lineage coverage than what Unity Catalog provides alone.
For instance, in patterns such as CREATE OR REFRESH STREAMING LIVE TABLE processing, Octopai can capture metadata and track lineage more thoroughly:
```sql
CREATE OR REFRESH STREAMING LIVE TABLE BrandName_ach_reject_bronze
LOCATION 'abfss://datalake/path'
-- the source path and file format below are placeholders
AS SELECT *, _metadata.file_modification_time AS receipt_time
FROM cloud_files('abfss://datalake/source-path', 'csv');
```