Develop with a local IDE¶
You can run Spark workloads interactively from Jupyter Notebooks, VS Code, IntelliJ, or any Python/Java/Scala interface without managing a Spark cluster. The workloads run on Snowflake infrastructure.
There are two ways to connect:
- Snowpark Connect package (recommended): Install the `snowpark-connect` Python package, which is required for all languages (Python, Java, and Scala). For Java and Scala projects, also add the `snowpark-connect-java-client` Maven dependency. To establish a connection, use a TOML connection file. This approach handles server lifecycle, authentication, and session management automatically.
- Direct endpoint (server-side): Connect to Snowflake's hosted Spark Connect endpoint using standard PySpark or Spark Java/Scala clients with programmatic access tokens (PATs). No Snowflake-specific packages are required.
Prerequisites¶
- You have a Snowflake account with access to Snowpark Connect for Spark.
- Python 3.10 or later (earlier than 3.13) is installed. Confirm your version by running `python3 --version`.
- Your Java and Python installations use the same CPU architecture. For example, if Python is arm64, install an arm64 build of Java (not x86_64).
Connection configuration¶
Snowpark Connect for Spark connects to Snowflake using a TOML connection file. You can create this file manually or by using Snowflake CLI.
If you have Snowflake CLI installed, you can use it to define a connection. Otherwise, you can write the connection parameters manually in a `connections.toml` file.
Add a connection by using Snowflake CLI¶
You can use Snowflake CLI to add connection properties that Snowpark Connect for Spark uses to connect to Snowflake. Your changes are saved to a `config.toml` file.
Run the following command to add a connection:
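```bash
snow connection add
```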
Follow the prompts to define a connection.
Specify `spark-connect` as the connection name. This command adds a connection to your `config.toml` file:
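For example, with placeholder values (your prompts produce your own account identifier and credentials):

```toml
[connections.spark-connect]
account = "myorg-myaccount"
user = "jdoe"
password = "my-password"
warehouse = "my_wh"
database = "my_db"
schema = "public"
```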
Confirm the connection works:
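```bash
snow connection test -c spark-connect
```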
Add a connection manually¶
You can write or update a connections.toml file so that your code can connect to Snowpark Connect for Spark on Snowflake.
Ensure that the file permissions allow only the owner to read and write:
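For example, assuming the default file location:

```bash
chmod 0600 ~/.snowflake/connections.toml
```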
Edit the file to contain a `[spark-connect]` connection with your specifics:
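For example, with placeholder values (use your own account identifier and credentials):

```toml
[spark-connect]
account = "myorg-myaccount"
user = "jdoe"
password = "my-password"
warehouse = "my_wh"
database = "my_db"
schema = "public"
```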
Install Snowpark Connect for Spark¶
Create a Python virtual environment and install the Snowpark Connect for Spark package:
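For example (the `scos-venv` name is illustrative):

```bash
python3 -m venv scos-venv
source scos-venv/bin/activate
pip install snowpark-connect
```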
Note
The Java client for Snowpark Connect for Spark is a preview feature.
The Java/Scala client library manages the Snowpark Connect for Spark Python gRPC server as a child process. You
need both the library and a Python virtual environment with snowpark-connect installed.
Create a Python virtual environment:
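For example (the `scos-venv` path is illustrative and matches the `/path/to/scos-venv` placeholder used below):

```bash
python3 -m venv scos-venv
source scos-venv/bin/activate
pip install snowpark-connect
```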
Add the following dependencies to your `pom.xml`. The library is available on Maven Central: snowpark-connect-java-client_2.12 (https://central.sonatype.com/artifact/com.snowflake/snowpark-connect-java-client_2.12) and snowpark-connect-java-client_2.13 (https://central.sonatype.com/artifact/com.snowflake/snowpark-connect-java-client_2.13).
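A sketch of the dependency entry for Scala 2.12 (the version is a placeholder; use the latest release from Maven Central):

```xml
<dependency>
    <groupId>com.snowflake</groupId>
    <artifactId>snowpark-connect-java-client_2.12</artifactId>
    <!-- Replace with the latest version published on Maven Central -->
    <version>x.y.z</version>
</dependency>
```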
On Java 9+, add the required `--add-opens` JVM arguments for Apache Arrow compatibility. See JVM module system arguments for the full list and how to configure them in Maven, IntelliJ, or on the command line.

Point the library to the venv using one of these methods (in order of precedence):
- Code API: `.pythonVenv("/path/to/scos-venv")` on the session builder
- Environment variable: `SNOWPARK_CONNECT_PYTHON_VENV=/path/to/scos-venv`
If neither is set, the library falls back to the system `python3` (or `python` on Windows) and checks whether `snowpark-connect` is importable.
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
The Java/Scala client library manages the Snowpark Connect for Spark Python gRPC server as a child process. You
need both the library and a Python virtual environment with snowpark-connect installed.
Create a Python virtual environment:
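For example (the `scos-venv` path is illustrative and matches the `/path/to/scos-venv` placeholder used below):

```bash
python3 -m venv scos-venv
source scos-venv/bin/activate
pip install snowpark-connect
```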
Add the following dependencies to your `build.sbt`. The library is available on Maven Central: snowpark-connect-java-client_2.12 (https://central.sonatype.com/artifact/com.snowflake/snowpark-connect-java-client_2.12) and snowpark-connect-java-client_2.13 (https://central.sonatype.com/artifact/com.snowflake/snowpark-connect-java-client_2.13).
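A sketch of the dependency line (the version is a placeholder; use the latest release from Maven Central):

```scala
// %% appends the Scala binary version (_2.12 or _2.13) to the artifact name
libraryDependencies += "com.snowflake" %% "snowpark-connect-java-client" % "x.y.z"
```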
On Java 9+, add the required `--add-opens` JVM arguments for Apache Arrow compatibility. See JVM module system arguments for the full list and how to configure them in sbt, IntelliJ, or on the command line.

Point the library to the venv using one of these methods (in order of precedence):
- Code API: `.pythonVenv("/path/to/scos-venv")` on the session builder
- Environment variable: `SNOWPARK_CONNECT_PYTHON_VENV=/path/to/scos-venv`
If neither is set, the library falls back to the system `python3` (or `python` on Windows) and checks whether `snowpark-connect` is importable.
Start a session and run code¶
Once you have Snowpark Connect for Spark installed and an authenticated connection in place, start a session and run Spark code.
Start the Snowpark Connect for Spark server and create a session:
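A minimal Python sketch, assuming the package exposes `start_session`/`get_session` helpers under `snowflake.snowpark_connect` and selecting the `spark-connect` TOML connection defined earlier:

```python
import os
from snowflake import snowpark_connect

# Select the TOML connection by name via the standard Snowflake
# default-connection environment variable.
os.environ["SNOWFLAKE_DEFAULT_CONNECTION_NAME"] = "spark-connect"

snowpark_connect.start_session()        # starts the local Spark Connect server
spark = snowpark_connect.get_session()  # returns a PySpark SparkSession
```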
Then run Spark DataFrame code:
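For example:

```python
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.filter(df.id > 1).show()
```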
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Compile and run:
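For example, with Maven (the main class is a placeholder; on Java 9+ also pass the `--add-opens` flags described above):

```bash
mvn compile exec:java -Dexec.mainClass="com.example.App"
```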
UDF support
When using user-defined functions or custom code with Java, do one of the following:
- Register a class finder to monitor and upload class files.
- Upload JAR dependencies. You can include the workload JAR itself if a class finder isn't used.
- Use a staged JAR.
Using Scala 2.13
By default, Snowpark Connect for Spark uses Scala 2.12. If your dependencies are built with Scala 2.13, you
must specify the Scala version using the snowpark.connect.scala.version configuration option.
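A hedged sketch of setting the option; it is shown on a generic Spark session builder, and the exact builder API for the client may differ:

```java
// Assumes org.apache.spark.sql.SparkSession is on the classpath
SparkSession spark = SparkSession.builder()
    .config("snowpark.connect.scala.version", "2.13")
    .getOrCreate();
```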
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Compile and run:
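For example (on Java 9+, also pass the `--add-opens` flags described above, for instance via `javaOptions` with forking enabled):

```bash
sbt run
```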
UDF support
When using user-defined functions or custom code with Scala, do one of the following:
- Register a class finder to monitor and upload class files.
- Upload JAR dependencies. You can include the workload JAR itself if a class finder isn't used.
- Use a staged JAR.
Using Scala 2.13
By default, Snowpark Connect for Spark uses Scala 2.12. Workloads built with Scala 2.13 must specify the Scala
version using the snowpark.connect.scala.version configuration option.
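A hedged sketch of setting the option; it is shown on a generic Spark session builder, and the exact builder API for the client may differ:

```scala
// Assumes org.apache.spark.sql.SparkSession is on the classpath
val spark = SparkSession.builder()
  .config("snowpark.connect.scala.version", "2.13")
  .getOrCreate()
```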
Common installation issues¶
Use the following checks to resolve common Snowpark Connect for Spark installation issues.
- Ensure that Java and Python are based on the same architecture.
- Use the most recent Snowpark Connect for Spark package, as described in Install Snowpark Connect for Spark.
- Confirm that the `python` command runs PySpark code correctly for local execution, without Snowflake connectivity. For example, execute a command such as the following:
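A minimal local smoke test (it assumes a full local pyspark installation that supports the classic `local[*]` master):

```python
from pyspark.sql import SparkSession

# Runs entirely on the local machine; no Snowflake connection involved.
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.range(5).show()
```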
Connect directly to Snowflake’s Spark Connect endpoint¶
You can connect to Snowflake’s hosted Spark Connect endpoint using standard, off-the-shelf Spark client packages such as PySpark or Spark clients for Java and Scala. You don’t need to install any Snowflake-specific packages.
With this approach, all Spark processing runs on Snowflake’s infrastructure. Your client sends Spark Connect protocol messages directly to Snowflake, which executes the workload and returns results. Authentication uses programmatic access tokens (PATs).
This option is useful when you want to:
- Avoid installing Snowflake-specific packages in your environment.
- Use your existing Spark tooling (Jupyter, VS Code, terminals) with Snowflake compute and governance.
- Simplify dependency management by relying only on the standard PySpark package.
Step 1: Install required packages¶
Install the Spark Connect client for your language. You don’t need to install any Snowflake packages.
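For Python, one option is the PySpark package with the Spark Connect extra, which pulls in the gRPC client dependencies (pick a version compatible with the endpoint):

```bash
pip install "pyspark[connect]"
```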
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Add the Spark Connect client dependency to your pom.xml:
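A sketch of the dependency entry (the Spark version shown is illustrative; match it to what the endpoint supports):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-connect-client-jvm_2.12</artifactId>
    <version>3.5.1</version>
</dependency>
```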
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Add the Spark Connect client dependency to your build.sbt file:
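A sketch of the dependency line (the Spark version shown is illustrative; match it to what the endpoint supports):

```scala
// %% appends the Scala binary version to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.1"
```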
Step 2: Set up authentication¶
Generate a programmatic access token (PAT).
For more information, see the Snowflake documentation on programmatic access tokens.
The following example adds a PAT named `TEST_PAT` for the user `sysadmin` and sets the expiration to 30 days.
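A sketch of the SQL, based on the `ALTER USER … ADD PROGRAMMATIC ACCESS TOKEN` syntax:

```sql
ALTER USER sysadmin ADD PROGRAMMATIC ACCESS TOKEN TEST_PAT
  DAYS_TO_EXPIRY = 30;
```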
Find your Snowflake Spark Connect host URL. Run the following SQL in Snowflake to find the hostname for your account:
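One possible query, assuming the `SYSTEM$ALLOWLIST` function is available in your account; it lists the deployment hostname among the account's endpoints:

```sql
SELECT t.VALUE:host::VARCHAR AS host
FROM TABLE(FLATTEN(INPUT => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t
WHERE t.VALUE:type::VARCHAR = 'SNOWFLAKE_DEPLOYMENT';
```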
Step 3: Connect and run Spark code¶
Connect to the Snowflake Spark Connect endpoint using the host URL and PAT from the previous steps.
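A minimal PySpark sketch; the environment variable names are placeholders, and the exact connection-string parameters (port, options) are assumptions, so check the endpoint documentation for your account:

```python
import os
from pyspark.sql import SparkSession

host = os.environ["SNOWFLAKE_HOST"]  # e.g. myorg-myaccount.snowflakecomputing.com
pat = os.environ["SNOWFLAKE_PAT"]    # the programmatic access token from step 2

# Standard Spark Connect URL: TLS plus bearer-token authentication.
spark = (
    SparkSession.builder
    .remote(f"sc://{host}:443/;use_ssl=true;token={pat}")
    .getOrCreate()
)
```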
Once connected, you can write regular Spark DataFrame code:
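For example:

```python
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.where("id >= 2").show()
```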
Note
The Java client for Snowpark Connect for Spark is a preview feature.
Compile and run:
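For example, with Maven (the main class is a placeholder):

```bash
mvn compile exec:java -Dexec.mainClass="com.example.SparkConnectApp"
```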
Note
The Scala client for Snowpark Connect for Spark is a preview feature.
Compile and run:
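For example:

```bash
sbt run
```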