Using Snowpark Submit

Snowpark Submit lets you run batch-oriented Spark workloads directly on Snowflake’s infrastructure. You package your application as a Python script or a Scala/Java JAR, then use the Snowpark Submit CLI to submit it. The job runs in cluster mode on Snowpark Container Services with no external Spark cluster required.

Prerequisites

On the machine that submits the job, you need:

  • Python 3.10, 3.11, or 3.12. The Snowpark Submit CLI is written in Python, but your application itself can be Python, Java, or Scala.

  • Snowflake CLI (snow). Used to manage connections and upload files to stages. For installation instructions, see Installing Snowflake CLI.

  • A Snowflake warehouse. The virtual warehouse on which your Spark job executes. The role you use for Snowpark Submit needs USAGE privilege on that warehouse.

  • A Snowflake compute pool. An SPCS compute pool that hosts the Spark driver. If you don’t have one, create it in Step 3.

For Scala or Java jobs, you also need:

  • A JDK (Java 11 or later) to compile your application.

  • sbt (Scala) or Maven (Java) to build the JAR.
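For reference, a minimal build.sbt for a Scala job might look like the following sketch. The Scala and Spark versions are illustrative placeholders; Spark is marked provided because the Snowflake-managed runtime supplies it at execution time:

```scala
// build.sbt — minimal sketch; versions shown are placeholders
ThisBuild / scalaVersion := "2.13.14"

lazy val root = (project in file("."))
  .settings(
    name := "my-spark-job",
    // "provided" keeps Spark classes out of the assembled JAR
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
  )
```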

Required privileges

The role you use with Snowpark Submit must have the following privileges:

| Privilege | Object | Notes |
| --- | --- | --- |
| USAGE | Database | The parent database of the schema where the job runs. |
| USAGE | Schema | The parent schema where the job service is created. |
| CREATE SERVICE | Schema | The schema where the job service is created. |
| USAGE | Compute pool | The pool specified with --compute-pool or in connections.toml. |
| USAGE | Warehouse | The virtual warehouse on which the Spark job executes. |
| READ | Stage | The stage where workload files are stored (if using --snowflake-stage). |
| READ | Image repository | The repository containing images referenced by the job spec. |
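Assuming a role named MY_SPARK_ROLE and the object names used elsewhere in this guide (all placeholders), the corresponding grants might look like this:

```sql
GRANT USAGE ON DATABASE MY_DB TO ROLE MY_SPARK_ROLE;
GRANT USAGE ON SCHEMA MY_DB.MY_SCHEMA TO ROLE MY_SPARK_ROLE;
GRANT CREATE SERVICE ON SCHEMA MY_DB.MY_SCHEMA TO ROLE MY_SPARK_ROLE;
GRANT USAGE ON COMPUTE POOL MY_SPARK_POOL TO ROLE MY_SPARK_ROLE;
GRANT USAGE ON WAREHOUSE MY_WH TO ROLE MY_SPARK_ROLE;
-- Only needed if you use --snowflake-stage:
GRANT READ ON STAGE MY_DB.MY_SCHEMA.MY_STAGE TO ROLE MY_SPARK_ROLE;
-- Only needed if the job spec references images in this repository:
GRANT READ ON IMAGE REPOSITORY MY_DB.MY_SCHEMA.MY_REPO TO ROLE MY_SPARK_ROLE;
```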

Step 1: Prepare your application

Create a Python file with your Spark application. Use SparkSession.builder to obtain a session; Snowpark Submit handles the connection to the Snowflake-managed Spark server automatically.

# app.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("MyApp").getOrCreate()

data = [
    (1, "alice", "engineering", 95000),
    (2, "bob", "marketing", 72000),
    (3, "carol", "engineering", 105000),
]
df = spark.createDataFrame(data, ["id", "name", "department", "salary"])

df.withColumn("name", upper(col("name"))).show()

spark.stop()

Step 2: Install the Snowpark Submit CLI

Snowpark Submit is a Python package that acts as the submission CLI. Your application still runs as Python, Java, or Scala on Snowflake.

pip install snowpark-submit

Verify the installation:

snowpark-submit -h

Optionally, install it inside a Python virtual environment to keep it isolated from your system Python:

python3 -m venv ~/.venvs/snowpark-submit
source ~/.venvs/snowpark-submit/bin/activate
pip install --upgrade pip
pip install snowpark-submit

Step 3: Set up a compute pool

If you already have a compute pool, skip to Step 4.

Create a compute pool in Snowsight or with the Snowflake CLI using a role that has the CREATE COMPUTE POOL privilege (for example, ACCOUNTADMIN). For testing, you can start with the smallest CPU instance family and scale up later. For more information, see Snowpark Container Services: Working with compute pools.

CREATE COMPUTE POOL MY_SPARK_POOL
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_XS
  AUTO_SUSPEND_SECS = 300
  AUTO_RESUME = TRUE;

Note

  • AUTO_SUSPEND_SECS and AUTO_RESUME mean the pool suspends itself when idle and resumes on the next submit, so you don’t pay for idle nodes between runs. Expect around 30-60 seconds of cold-start latency on a resumed pool.

  • If your job needs more memory, step up to CPU_X64_S (3 vCPU, 13 GiB) or CPU_X64_M (6 vCPU, 28 GiB). For the full list of instance families and their sizes, see the Snowpark Container Services documentation.

  • The role that runs Snowpark Submit needs USAGE on the compute pool. If it’s a different role from the one that created the pool, grant it:

    GRANT USAGE ON COMPUTE POOL MY_SPARK_POOL TO ROLE <your_role>;
    

Step 4: Configure your Snowflake connection

Snowpark Submit reads the connections.toml file used by the Snowflake Python connector and Snowflake CLI. The Snowflake CLI provides an interactive onboarding experience for adding connections and a dedicated command for testing them. For details, see Configuring Snowflake CLI.

You can create or update a connection with the Snowflake CLI:

snow connection add

Or edit connections.toml directly. An example entry using OAuth:

[snowpark-submit]
account = "<account>"
user = "<user>"
authenticator = "OAUTH_AUTHORIZATION_CODE"
client_store_temporary_credential = true
warehouse = "<warehouse>"
database = "<database>"
schema = "<schema>"
compute_pool = "<compute_pool>"

Password authentication is also supported. Replace the authenticator and client_store_temporary_credential lines with password = "<password>".

The compute pool can either be set here as compute_pool or passed at submit time with --compute-pool (see Step 5).

Verify that the connection works:

snow connection test --connection snowpark-submit

Step 5: Submit the job

See the Snowpark Submit reference for the full list of CLI flags.

snowpark-submit \
  --snowflake-workload-name MY_PYTHON_JOB \
  --snowflake-connection-name snowpark-submit \
  --compute-pool MY_SPARK_POOL \
  --wait-for-completion \
  path/to/app.py

Common flags:

  • --class – required for Java/Scala if the main class isn’t set in the JAR manifest.

  • --compute-pool – the SPCS compute pool that hosts the Spark driver. Can also be set in connections.toml.

  • --wait-for-completion – blocks until the workload finishes. Without it, Snowpark Submit starts the job and returns immediately.

  • --scala-version 2.13 – required only if the JAR is built against Scala 2.13.

  • --jars – comma-separated list of additional JAR dependencies. These must also be registered with spark.addArtifact() in your application code.

Step 6: Check status and logs

The submit command from Step 5 prints two things you need for status lookups:

  • Snowflake Workload Name – the full workload name with the timestamp suffix the CLI appends (for example, MY_PYTHON_JOB_260417_200531). Status lookups only work with this suffixed name.

  • Job History URL – a Snowsight link for the workload, which is the fastest way to see status and logs in the UI.

To check status and logs from the CLI, pass the suffixed name:

snowpark-submit \
  --snowflake-connection-name snowpark-submit \
  --snowflake-workload-name MY_PYTHON_JOB_260417_200531 \
  --workload-status --display-logs

--workload-status returns the current state (DEPLOYING, RUNNING, SUCCEEDED, or FAILED), start time, duration, and service details. --display-logs prints the application logs.

Note

Logs become available for fetching after a short delay, ranging from a few seconds to about a minute. If an event table isn't configured to store log data, logs are retained only briefly (roughly five minutes or less), so fetch them promptly.

For production monitoring and detailed log configuration, see Monitoring Snowpark Connect for Spark workloads.