Run Spark workloads from Snowflake Notebooks¶
You can run Spark workloads interactively from Snowflake Notebooks without needing to manage a Spark cluster; the workloads run on Snowflake infrastructure.
To use Snowflake Notebooks as a client for developing Spark workloads to run on Snowflake:
Launch Snowflake Notebooks.
Within the notebook, start a Spark session.
Write PySpark code to load, transform, and analyze data, such as filtering high-value customer orders or aggregating revenue (see the sketch that follows these steps).
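For example, a minimal sketch of step 3 might look like the following. The table name sales.public.orders and its columns customer_id and order_total are hypothetical, and the sketch assumes a Spark session named spark has already been started as described in the sections below; substitute a table that your role can access.

import pyspark.sql.connect.functions as F

# Load order data from a Snowflake table (hypothetical name; replace with your own).
orders = spark.read.table("sales.public.orders")

# Keep only high-value orders, then aggregate revenue per customer.
high_value = orders.filter(F.col("order_total") > 1000)
revenue_by_customer = (
    high_value.groupBy("customer_id")
    .agg(F.sum("order_total").alias("total_revenue"))
    .orderBy(F.col("total_revenue").desc())
)
revenue_by_customer.show()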
Use a Snowflake Notebook that runs on a warehouse¶
For more information about Snowflake Notebooks, see Create a notebook.
Create a Snowflake Notebook by completing the following steps:
Sign in to Snowsight.
At the top of the navigation menu, select + (Create) » Notebook » New Notebook.
In the Create notebook dialog, enter a name, database, and schema for the new notebook.
For more information, see Create a notebook.
For Runtime, select Run on warehouse.
For Runtime version, select Snowflake Warehouse Runtime 2.0.
Selecting version 2.0 ensures that you have the dependency support you need, including Python 3.10. For more information, see Notebook runtimes.
For Query warehouse and Notebook warehouse, select the warehouses that run your query code and your notebook kernel and Python code, respectively, as described in Create a notebook.
Select Create.
In the notebook you created, under Packages, ensure that the following packages are listed to support the code in your notebook:
Python, version 3.10 or later
snowpark-connect, latest version
If you need to add these packages, use the following steps:
Under Anaconda Packages, type the package name in the search box.
Select the package name.
Select Save.
To connect to the Snowpark Connect for Spark server and test the connection, copy the following code and paste it in the Python cell of the notebook you created:
from snowflake import snowpark_connect

spark = snowpark_connect.server.init_spark_session()
df = spark.sql("show schemas").limit(10)
df.show()
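Once the session initializes, the spark object behaves like a standard PySpark SparkSession, so you can also build and transform DataFrames directly in the notebook. A minimal sketch using in-memory sample data (the names and values are illustrative only):

# Create a small DataFrame from in-memory data and filter it.
data = [("Alice", 1200.0), ("Bob", 300.0)]
df = spark.createDataFrame(data, schema="name STRING, amount DOUBLE")
df.filter(df.amount > 500).show()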
Use a Snowflake Notebook that runs in a workspace¶
For more information about Snowflake Notebooks in Workspaces, see Snowflake Notebooks in Workspaces.
Create a PyPI external access integration.
You must use the ACCOUNTADMIN role and have a database you can access.
Run the following commands from a SQL file in a workspace.
USE DATABASE mydb;
USE ROLE accountadmin;

CREATE OR REPLACE NETWORK RULE pypi_network_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('pypi.org', 'pypi.python.org', 'pythonhosted.org', 'files.pythonhosted.org');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION pypi_access_integration
  ALLOWED_NETWORK_RULES = (pypi_network_rule)
  ENABLED = true;
Enable PyPI integration in a notebook.
In the notebook, for Service name, select a service.
For External access integrations, select the PyPI integration you created.
For Python version, select Python 3.11.
Select Create.
Install the snowpark-connect package from PyPI in the notebook, using code such as the following:

pip install snowpark-connect[jdk]
Restart the kernel.
From the Connect button, select Restart kernel.
Start the snowpark_connect server using code such as the following:

import snowflake.snowpark_connect

spark = snowflake.snowpark_connect.server.init_spark_session()
Run your Spark code, as shown in the following example:
from pyspark.sql.connect.functions import *
from pyspark.sql.connect.types import *
from pyspark.sql import Row

# Sample nested data
data = [(1, ("Alice", 30))]
schema = "id INT, info STRUCT<name:STRING, age:INT>"
df = spark.createDataFrame(data, schema=schema)
df.show()

spark.sql("show databases").show()
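Because the session runs SQL against your Snowflake account, you can also query existing tables through it. A minimal sketch, assuming a hypothetical table mydb.public.orders with customer_id and order_total columns that your role can access:

# Aggregate rows from an existing Snowflake table (hypothetical name; replace with your own).
result = spark.sql("""
    SELECT customer_id, SUM(order_total) AS total_revenue
    FROM mydb.public.orders
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")
result.show()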
