Using Snowpark Submit¶
Snowpark Submit lets you run batch-oriented Spark workloads directly on Snowflake’s infrastructure. You package your application as a Python script or a Scala/Java JAR, then use the Snowpark Submit CLI to submit it. The job runs in cluster mode on Snowpark Container Services with no external Spark cluster required.
Prerequisites¶
On the machine that submits the job, you need:
Python 3.10 or later (earlier than 3.13). The Snowpark Submit CLI is written in Python, but your application itself can be Python, Java, or Scala.
Snowflake CLI (snow). Used to manage connections and upload files to stages. For installation instructions, see Installing Snowflake CLI.
A Snowflake warehouse. The virtual warehouse on which your Spark job executes. The role you use for Snowpark Submit needs USAGE privilege on that warehouse.
A Snowflake compute pool. An SPCS compute pool that hosts the Spark driver. If you don’t have one, create it in Step 3.
For Scala or Java jobs, you also need:
A JDK (Java 11 or later) to compile your application.
sbt (Scala) or Maven (Java) to build the JAR.
Required privileges¶
The role you use with Snowpark Submit must have the following privileges:
| Privilege | Object | Notes |
|---|---|---|
| USAGE | Database | The parent database of the schema where the job runs. |
| USAGE | Schema | The parent schema where the job service is created. |
| CREATE SERVICE | Schema | The schema where the job service is created. |
| USAGE | Compute pool | The pool specified with `--compute-pool` or in `connections.toml`. |
| USAGE | Warehouse | The virtual warehouse on which the Spark job executes. |
| READ | Stage | The stage where workload files are stored (if files are uploaded to a stage). |
| READ | Image repository | The repository containing images referenced by the job spec. |
Step 1: Prepare your application¶
Create a Python file with your Spark application. Use SparkSession.builder to obtain a session; Snowpark Submit handles the
connection to the Snowflake-managed Spark server automatically.
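A minimal application might look like the following sketch; the table and column names are illustrative placeholders:

```python
# app.py -- a minimal batch job. Table and column names are placeholders.
from pyspark.sql import SparkSession


def main():
    # Snowpark Submit configures the connection to the Snowflake-managed
    # Spark server; no master URL or connection settings are needed here.
    spark = SparkSession.builder.appName("my_batch_job").getOrCreate()

    # Read a source table, aggregate, and write the result back.
    df = spark.read.table("MY_DB.MY_SCHEMA.SOURCE_TABLE")
    result = df.groupBy("CATEGORY").count()
    result.write.mode("overwrite").saveAsTable("MY_DB.MY_SCHEMA.CATEGORY_COUNTS")

    spark.stop()


if __name__ == "__main__":
    main()
```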
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Add the Snowpark Connect for Spark Java client and Spark Connect client dependencies to your pom.xml:
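The coordinates below are placeholders; check the Snowpark Connect for Spark Java/Scala client reference for the exact group ID, artifact ID, and current version:

```xml
<!-- Placeholder coordinates; verify against the client reference. -->
<dependencies>
  <dependency>
    <groupId>com.snowflake</groupId>
    <artifactId>snowpark-connect-client_2.12</artifactId>
    <version>LATEST_VERSION</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-connect-client-jvm_2.12</artifactId>
    <version>3.5.1</version>
  </dependency>
</dependencies>
```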
Write your application entry point using SnowparkConnectSession:
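A sketch of the entry point, under the assumption that `SnowparkConnectSession` exposes a factory method returning a standard `SparkSession`; the exact API and package are documented in the client reference:

```java
// Sketch only: the exact SnowparkConnectSession API and its import are in the
// Java/Scala client reference; the create() method name here is an assumption.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MyBatchJob {
    public static void main(String[] args) {
        // Assumption: SnowparkConnectSession wires up the connection to the
        // Snowflake-managed Spark server and yields a SparkSession.
        SparkSession spark = SnowparkConnectSession.create();

        // Table name is a placeholder.
        Dataset<Row> df = spark.read().table("MY_DB.MY_SCHEMA.SOURCE_TABLE");
        df.groupBy("CATEGORY").count().show();

        spark.stop();
    }
}
```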
Build a fat JAR (uber JAR) that bundles all dependencies into a single file. Snowpark Submit uploads
this JAR to Snowflake, so all classes must be included. Add the Maven Shade Plugin to your
pom.xml:
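A typical Shade configuration binds the `shade` goal to the `package` phase:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.5.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```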
Then build:
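```shell
mvn clean package
```

The shaded JAR is written to the `target/` directory.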
For details on the Java client API, see Snowpark Connect for Spark Java/Scala client reference.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Scala 2.12 is the default. Add the Snowpark Connect for Spark Java client and Spark Connect client dependencies
to your build.sbt:
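The coordinates below are placeholders; check the client reference for the exact organization, module name, and current version:

```scala
// build.sbt -- placeholder coordinates; verify against the client reference.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "com.snowflake"    %% "snowpark-connect-client"  % "LATEST_VERSION", // placeholder
  "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.1"
)
```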
Write your application entry point using SnowparkConnectSession:
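A sketch of the entry point, under the assumption that `SnowparkConnectSession` exposes a factory method returning a standard `SparkSession`; see the client reference for the exact API:

```scala
// Sketch only: the exact SnowparkConnectSession API and its import are in the
// Java/Scala client reference; the create() method name here is an assumption.
object MyBatchJob {
  def main(args: Array[String]): Unit = {
    // Assumption: SnowparkConnectSession yields a standard SparkSession
    // connected to the Snowflake-managed Spark server.
    val spark = SnowparkConnectSession.create()

    // Table name is a placeholder.
    val df = spark.read.table("MY_DB.MY_SCHEMA.SOURCE_TABLE")
    df.groupBy("CATEGORY").count().show()

    spark.stop()
  }
}
```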
Build a fat JAR (uber JAR) that bundles all dependencies into a single file. Snowpark Submit uploads
this JAR to Snowflake, so all classes must be included. Add the sbt-assembly plugin to
project/plugins.sbt:
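```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.2.0")
```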
Then build:
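```shell
sbt assembly
```

The assembled JAR is written under `target/scala-2.12/`.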
For details on the Java client API, see Snowpark Connect for Spark Java/Scala client reference.
If you must use Scala 2.13, two changes are required:

- Pass `--scala-version 2.13` when submitting (Step 5).
- Set `snowpark.connect.scala.version` in your application code:
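One way to set the property is through the session's runtime configuration; this is a sketch, and the exact mechanism is documented in the client reference:

```scala
// Assumption: the property can be set on the session's runtime conf
// before any Spark operations run.
spark.conf.set("snowpark.connect.scala.version", "2.13")
```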
Step 2: Install the Snowpark Submit CLI¶
Snowpark Submit is a Python package that acts as the submission CLI. Your application still runs as Python, Java, or Scala on Snowflake.
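Install it with pip (assuming the package is published on PyPI as `snowpark-submit`):

```shell
pip install snowpark-submit
```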
Verify the installation:
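The `--version` flag here is an assumption; see the Snowpark Submit reference for the CLI's exact flags:

```shell
snowpark-submit --version
```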
Optionally, install it inside a Python virtual environment to keep it isolated from your system Python:
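```shell
python -m venv .venv
source .venv/bin/activate
pip install snowpark-submit
```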
Step 3: Set up a compute pool¶
If you already have a compute pool, skip to Step 4.
Create a compute pool in Snowsight or with the Snowflake CLI using a role that has the CREATE COMPUTE POOL privilege (for example, ACCOUNTADMIN). For testing, you can start with the smallest CPU instance family and scale up later. For more information, see Snowpark Container Services: Working with compute pools.
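For example (the pool name is a placeholder):

```sql
CREATE COMPUTE POOL my_spark_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_XS
  AUTO_SUSPEND_SECS = 300
  AUTO_RESUME = TRUE;
```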
Note
AUTO_SUSPEND_SECS and AUTO_RESUME mean the pool suspends itself when idle and resumes on the next submit, so you don’t pay for idle nodes between runs. Expect around 30-60 seconds of cold-start latency on a resumed pool.
If your job needs more memory, step up to `CPU_X64_S` (3 vCPU, 13 GiB) or `CPU_X64_M` (6 vCPU, 28 GiB). Use `SHOW COMPUTE POOL INSTANCE FAMILIES;` to list all available sizes.

The role that runs Snowpark Submit needs USAGE on the compute pool. If it's a different role from the one that created the pool, grant it:
Step 4: Configure your Snowflake connection¶
Snowpark Submit reads the connections.toml file used by the Snowflake Python connector and Snowflake CLI. The Snowflake CLI provides an interactive onboarding experience for adding connections and a dedicated command for testing them. For details, see Configuring Snowflake CLI.
You can create or update a connection with the Snowflake CLI:
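```shell
snow connection add
```

The command prompts interactively for the connection name, account, user, and other settings.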
Or edit connections.toml directly. An example entry using OAuth:
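An illustrative entry; the connection name and all values are placeholders:

```toml
[connections.spark_submit]
account = "<account_identifier>"
user = "<username>"
authenticator = "oauth"
client_store_temporary_credential = true
database = "<database>"
schema = "<schema>"
warehouse = "<warehouse>"
compute_pool = "<compute_pool>"
```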
Password authentication is also supported. Replace the authenticator and client_store_temporary_credential lines with
password = "<password>".
The compute pool can either be set here as compute_pool or passed at submit time with --compute-pool (see Step 5).
Verify that the connection works:
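```shell
snow connection test --connection spark_submit
```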
Step 5: Submit the job¶
See the Snowpark Submit reference for the full list of CLI flags.
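An illustrative submission for a Python workload; the `--snowflake-connection-name` flag and all names are assumptions or placeholders, so check the reference for the exact flags:

```shell
# Pool, connection, and file names are placeholders.
snowpark-submit \
  --compute-pool my_spark_pool \
  --snowflake-connection-name spark_submit \
  --wait-for-completion \
  app.py
```

For a Java or Scala workload, pass the fat JAR instead of `app.py` and add `--class` if the main class isn't in the JAR manifest.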
For Scala 2.13 JARs, add --scala-version 2.13.
Common flags:
- `--class` – required for Java/Scala if the main class isn't set in the JAR manifest.
- `--compute-pool` – the SPCS compute pool that hosts the Spark driver. Can also be set in `connections.toml`.
- `--wait-for-completion` – blocks until the workload finishes. Without it, Snowpark Submit starts the job and returns immediately.
- `--scala-version 2.13` – required only if the JAR is built against Scala 2.13.
- `--jars` – comma-separated list of additional JAR dependencies. These must also be registered with `spark.addArtifact()` in your application code.
Step 6: Check status and logs¶
The submit command from Step 5 prints two things you need for status lookups:
- Snowflake Workload Name – the full workload name with the timestamp suffix the CLI appends (for example, `MY_SCALA_JOB_260417_200531`). Status lookups only work with this suffixed name.
- Job History URL – a Snowsight link for the workload, which is the fastest way to see status and logs in the UI.
To check status and logs from the CLI, pass the suffixed name:
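An illustrative lookup; the `--snowflake-workload-name` and `--snowflake-connection-name` flags are assumptions, so check the reference for the exact flags:

```shell
# Workload and connection names are placeholders.
snowpark-submit \
  --workload-status \
  --display-logs \
  --snowflake-workload-name MY_SCALA_JOB_260417_200531 \
  --snowflake-connection-name spark_submit
```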
--workload-status returns the current state (DEPLOYING, RUNNING, SUCCEEDED, or FAILED), start time, duration,
and service details. --display-logs prints the application logs.
Note
Logs become available with a small delay, from a few seconds to a minute. If no event table is configured to store log data, logs are retained only briefly, typically five minutes or less.
For production monitoring and detailed log configuration, see Monitoring Snowpark Connect for Spark workloads.