Processing Databricks files¶
This document describes how the Snowpark Migration Accelerator (SMA) processes Databricks files based on their file extensions during the inventory and migration phases.
File processing by extension¶
The SMA recognizes and processes various Databricks file formats. Each file type is handled according to its structure and origin.
SQL files¶
| Extension | Format | Description |
|---|---|---|
| `.sql` | JSON cells | Inventoried by the SMA. Typically extracted from a `.dbc` archive. |
| `.sql` | First-line-comment | Databricks notebook exported to SQL format. Inventoried by the SMA. |
Example: SQL with JSON cells format¶
```json
{
  "version": "NotebookV1",
  "commands": [
    {
      "command": "CREATE TABLE customers (\n id INT,\n name STRING\n)",
      "commandType": "sql"
    },
    {
      "command": "SELECT * FROM customers",
      "commandType": "sql"
    }
  ]
}
```
Example: SQL with first-line-comment format¶
```sql
-- Databricks notebook source
CREATE TABLE customers (
  id INT,
  name STRING
)

-- COMMAND ----------

SELECT * FROM customers
```
Python files¶
| Extension | Format | Description |
|---|---|---|
| `.python` | JSON cells | Inventoried by the SMA. Typically extracted from a `.dbc` archive. |
| `.py` | First-line-comment | Databricks notebook exported to Python format. Inventoried by the SMA. |
Example: Python with JSON cells format¶
```json
{
  "version": "NotebookV1",
  "commands": [
    {
      "command": "df = spark.read.table(\"customers\")",
      "commandType": "python"
    },
    {
      "command": "df.filter(df.status == \"active\").show()",
      "commandType": "python"
    }
  ]
}
```
Example: Python with first-line-comment format (.py)¶
```python
# Databricks notebook source
df = spark.read.table("customers")

# COMMAND ----------

df.filter(df.status == "active").show()
```
Scala files¶
| Extension | Format | Description |
|---|---|---|
| `.scala` | JSON cells | Inventoried by the SMA. Typically extracted from a `.dbc` archive. |
| `.scala` | First-line-comment | Databricks notebook exported to Scala format. Inventoried by the SMA. |
Example: Scala with JSON cells format¶
```json
{
  "version": "NotebookV1",
  "commands": [
    {
      "command": "val df = spark.read.table(\"customers\")",
      "commandType": "scala"
    },
    {
      "command": "df.filter($\"status\" === \"active\").show()",
      "commandType": "scala"
    }
  ]
}
```
Example: Scala with first-line-comment format¶
```scala
// Databricks notebook source
val df = spark.read.table("customers")

// COMMAND ----------

df.filter($"status" === "active").show()
```
Databricks archive files¶
| Extension | Description |
|---|---|
| `.dbc` | Databricks compressed archive file. The SMA extracts and analyzes its contents. |
Example: DBC file structure¶
A `.dbc` file is a ZIP archive containing notebook files. When extracted, the structure looks like the following:

```text
my_project.dbc (extracted)
|-- notebook1.python
|-- notebook2.sql
|-- folder/
|   |-- notebook3.python
|   |-- notebook4.scala
|-- utils/
    |-- helpers.python
```
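Because a `.dbc` file is ordinary ZIP data, it can be unpacked with standard tooling. The following is a minimal sketch of the extract-then-process flow, using Python's standard `zipfile` module and hypothetical file paths; it is not SMA's internal implementation:

```python
import zipfile
from pathlib import Path

# Hypothetical paths for illustration.
archive = Path("my_project.dbc")
target = Path("my_project_extracted")

# A .dbc file is a standard ZIP archive, so zipfile can unpack it.
with zipfile.ZipFile(archive) as dbc:
    dbc.extractall(target)

# Each unpacked file is then handled according to its extension.
for path in sorted(target.rglob("*")):
    if path.suffix in {".python", ".sql", ".scala"}:
        print(path)
```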
How it works¶
- **DBC Files**: When the SMA encounters a `.dbc` file, it automatically extracts the compressed contents and processes each file individually based on its extension.
- **JSON Cells Format**: Files with a JSON cell structure are native Databricks notebook formats, typically found inside `.dbc` archives. These contain cell definitions with metadata, source code, and outputs.
- **First-Line-Comment Format**: Files exported from Databricks using the export functionality contain a special comment on the first line that identifies them as Databricks notebooks. The SMA recognizes this pattern and processes them accordingly (a detection sketch follows this list).
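The recognition logic described above can be approximated in a few lines. This sketch is illustrative only; the `NotebookV1` and `commands` checks mirror the JSON examples earlier in this document, and the function name is hypothetical:

```python
import json

# First-line markers used by Databricks notebook exports,
# one per comment syntax (SQL, Python, Scala).
FIRST_LINE_MARKERS = (
    "-- Databricks notebook source",
    "# Databricks notebook source",
    "// Databricks notebook source",
)

def detect_format(path):
    """Classify a file as 'json_cells', 'first_line_comment', or 'unknown'."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    lines = text.strip().splitlines()
    # First-line-comment format: the marker comment opens the file.
    if lines and lines[0].strip() in FIRST_LINE_MARKERS:
        return "first_line_comment"
    # JSON cells format: the whole file parses as a notebook document.
    try:
        doc = json.loads(text)
        if isinstance(doc, dict) and doc.get("version") == "NotebookV1" and "commands" in doc:
            return "json_cells"
    except ValueError:
        pass
    return "unknown"
```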
Inventory process¶
During the inventory phase, the SMA:
1. Scans all provided files and directories.
2. Identifies file types based on extension and internal structure.
3. Catalogs each notebook with its language, cell count, and dependencies.
4. Prepares the files for the translation phase.
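As a rough illustration of steps 1 and 2, the sketch below walks a directory tree and buckets files by extension. The mapping reflects the tables above, but the function and names are hypothetical, not SMA internals:

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-language map based on the tables above.
LANGUAGE_BY_EXTENSION = {
    ".sql": "sql",
    ".py": "python",
    ".python": "python",
    ".scala": "scala",
}

def inventory(root):
    """Tally candidate notebook files by language under a directory tree."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        language = LANGUAGE_BY_EXTENSION.get(path.suffix)
        if path.is_file() and language:
            counts[language] += 1
    return counts

# Example: summarize the contents of an extracted .dbc archive.
print(inventory("my_project_extracted"))
```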