snowflake.snowpark.DataFrame¶
- class snowflake.snowpark.DataFrame(session: Optional[Session] = None, plan: Optional[LogicalPlan] = None, is_cached: bool = False)[source] (https://github.com/snowflakedb/snowpark-python/blob/v1.30.0/snowpark-python/src/snowflake/snowpark/dataframe.py#L347-L6324)¶
- Bases: - object- Represents a lazily-evaluated relational dataset that contains a collection of - Rowobjects with columns defined by a schema (column name and type).- A DataFrame is considered lazy because it encapsulates the computation or query required to produce a relational dataset. The computation is not performed until you call a method that performs an action (e.g. - collect()).- Creating a DataFrame - You can create a DataFrame in a number of different ways, as shown in the examples below. - Creating tables and data to run the sample code:
- >>> session.sql("create or replace temp table prices(product_id varchar, amount number(10, 2))").collect() [Row(status='Table PRICES successfully created.')] >>> session.sql("insert into prices values ('id1', 10.0), ('id2', 20.0)").collect() [Row(number of rows inserted=2)] >>> # Create a CSV file to demo load >>> import tempfile >>> with tempfile.NamedTemporaryFile(mode="w+t") as t: ... t.writelines(["id1, Product A", "\n" "id2, Product B"]) ... t.flush() ... create_stage_result = session.sql("create temp stage test_stage").collect() ... put_result = session.file.put(t.name, "@test_stage/test_dir") 
- Example 1
- Creating a DataFrame by reading a table in Snowflake: - >>> df_prices = session.table("prices") 
- Example 2
- Creating a DataFrame by reading files from a stage: - >>> from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType >>> df_catalog = session.read.schema(StructType([StructField("id", StringType()), StructField("name", StringType())])).csv("@test_stage/test_dir") >>> df_catalog.show() --------------------- |"ID" |"NAME" | --------------------- |id1 | Product A | |id2 | Product B | --------------------- 
- Example 3
- Creating a DataFrame by specifying a sequence or a range: - >>> session.create_dataframe([(1, "one"), (2, "two")], schema=["col_a", "col_b"]).show() --------------------- |"COL_A" |"COL_B" | --------------------- |1 |one | |2 |two | --------------------- >>> session.range(1, 10, 2).to_df("col1").show() ---------- |"COL1" | ---------- |1 | |3 | |5 | |7 | |9 | ---------- 
- Example 4
- Create a new DataFrame by applying transformations to other existing DataFrames: - >>> df_merged_data = df_catalog.join(df_prices, df_catalog["id"] == df_prices["product_id"]) 
 - Performing operations on a DataFrame - Broadly, the operations on DataFrame can be divided into two types: - Transformations produce a new DataFrame from one or more existing DataFrames. Note that transformations are lazy and don’t cause the DataFrame to be evaluated. If the API does not provide a method to express the SQL that you want to use, you can use - functions.sqlExpr()as a workaround.
- Actions cause the DataFrame to be evaluated. When you call a method that performs an action, Snowpark sends the SQL query for the DataFrame to the server for evaluation. 
 - Transforming a DataFrame - The following examples demonstrate how you can transform a DataFrame. - Example 5
- Using the - select()method to select the columns that should be in the DataFrame (similar to adding a- SELECTclause):- >>> # Return a new DataFrame containing the product_id and amount columns of the prices table. >>> # This is equivalent to: SELECT PRODUCT_ID, AMOUNT FROM PRICES; >>> df_price_ids_and_amounts = df_prices.select(col("product_id"), col("amount")) 
- Example 6
- Using the - Column.as_()method to rename a column in a DataFrame (similar to using- SELECT col AS alias):- >>> # Return a new DataFrame containing the product_id column of the prices table as a column named >>> # item_id. This is equivalent to: SELECT PRODUCT_ID AS ITEM_ID FROM PRICES; >>> df_price_item_ids = df_prices.select(col("product_id").as_("item_id")) 
- Example 7
- Using the - filter()method to filter data (similar to adding a- WHEREclause):- >>> # Return a new DataFrame containing the row from the prices table with the ID 1. >>> # This is equivalent to: >>> # SELECT * FROM PRICES WHERE PRODUCT_ID = 1; >>> df_price1 = df_prices.filter((col("product_id") == 1)) 
- Example 8
- Using the - sort()method to specify the sort order of the data (similar to adding an- ORDER BYclause):- >>> # Return a new DataFrame for the prices table with the rows sorted by product_id. >>> # This is equivalent to: SELECT * FROM PRICES ORDER BY PRODUCT_ID; >>> df_sorted_prices = df_prices.sort(col("product_id")) 
- Example 9
- Using - agg()method to aggregate results.- >>> import snowflake.snowpark.functions as f >>> df_prices.agg(("amount", "sum")).collect() [Row(SUM(AMOUNT)=Decimal('30.00'))] >>> df_prices.agg(f.sum("amount")).collect() [Row(SUM(AMOUNT)=Decimal('30.00'))] >>> # rename the aggregation column name >>> df_prices.agg(f.sum("amount").alias("total_amount"), f.max("amount").alias("max_amount")).collect() [Row(TOTAL_AMOUNT=Decimal('30.00'), MAX_AMOUNT=Decimal('20.00'))] 
- Example 10
- Using the - group_by()method to return a- RelationalGroupedDataFramethat you can use to group and aggregate results (similar to adding a- GROUP BYclause).- RelationalGroupedDataFrameprovides methods for aggregating results, including:- RelationalGroupedDataFrame.avg()(equivalent to AVG(column))
- RelationalGroupedDataFrame.count()(equivalent to COUNT())
- RelationalGroupedDataFrame.max()(equivalent to MAX(column))
- RelationalGroupedDataFrame.median()(equivalent to MEDIAN(column))
- RelationalGroupedDataFrame.min()(equivalent to MIN(column))
- RelationalGroupedDataFrame.sum()(equivalent to SUM(column))
 - >>> # Return a new DataFrame for the prices table that computes the sum of the prices by >>> # category. This is equivalent to: >>> # SELECT CATEGORY, SUM(AMOUNT) FROM PRICES GROUP BY CATEGORY >>> df_total_price_per_category = df_prices.group_by(col("product_id")).sum(col("amount")) >>> # Have multiple aggregation values with the group by >>> import snowflake.snowpark.functions as f >>> df_summary = df_prices.group_by(col("product_id")).agg(f.sum(col("amount")).alias("total_amount"), f.avg("amount")).sort(col("product_id")) >>> df_summary.show() ------------------------------------------------- |"PRODUCT_ID" |"TOTAL_AMOUNT" |"AVG(AMOUNT)" | ------------------------------------------------- |id1 |10.00 |10.00000000 | |id2 |20.00 |20.00000000 | ------------------------------------------------- 
- Example 11
- Using windowing functions. Refer to - Windowfor more details.- >>> from snowflake.snowpark import Window >>> from snowflake.snowpark.functions import row_number >>> df_prices.with_column("price_rank", row_number().over(Window.order_by(col("amount").desc()))).show() ------------------------------------------ |"PRODUCT_ID" |"AMOUNT" |"PRICE_RANK" | ------------------------------------------ |id2 |20.00 |1 | |id1 |10.00 |2 | ------------------------------------------ 
- Example 12
- Handling missing values. Refer to - DataFrameNaFunctionsfor more details.- >>> df = session.create_dataframe([[1, None, 3], [4, 5, None]], schema=["a", "b", "c"]) >>> df.na.fill({"b": 2, "c": 6}).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |3 | |4 |5 |6 | ------------------- 
 - Performing an action on a DataFrame - The following examples demonstrate how you can perform an action on a DataFrame. - Example 13
- Performing a query and returning an array of Rows: - >>> df_prices.collect() [Row(PRODUCT_ID='id1', AMOUNT=Decimal('10.00')), Row(PRODUCT_ID='id2', AMOUNT=Decimal('20.00'))] 
- Example 14
- Performing a query and print the results: - >>> df_prices.show() --------------------------- |"PRODUCT_ID" |"AMOUNT" | --------------------------- |id1 |10.00 | |id2 |20.00 | --------------------------- 
- Example 15
- Calculating statistics values. Refer to - DataFrameStatFunctionsfor more details.- >>> df = session.create_dataframe([[1, 2], [3, 4], [5, -1]], schema=["a", "b"]) >>> df.stat.corr("a", "b") -0.5960395606792697 
- Example 16
- Performing a query asynchronously and returning a list of - Rowobjects:- >>> df = session.create_dataframe([[float(4), 3, 5], [2.0, -4, 7], [3.0, 5, 6], [4.0, 6, 8]], schema=["a", "b", "c"]) >>> async_job = df.collect_nowait() >>> async_job.result() [Row(A=4.0, B=3, C=5), Row(A=2.0, B=-4, C=7), Row(A=3.0, B=5, C=6), Row(A=4.0, B=6, C=8)] 
- Example 17
- Performing a query and transforming it into - pandas.DataFrameasynchronously:- >>> async_job = df.to_pandas(block=False) >>> async_job.result() A B C 0 4.0 3 5 1 2.0 -4 7 2 3.0 5 6 3 4.0 6 8 
 - Methods - agg(*exprs)- Aggregate the data in the DataFrame. - alias(name)- Returns an aliased dataframe in which the columns can now be referenced to using col(<df alias>, <column name>). - approxQuantile(col, percentile, *[, ...])- For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles. - approx_quantile(col, percentile, *[, ...])- For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles. - cache_result(*[, statement_params])- Caches the content of this DataFrame to create a new cached Table DataFrame. - col(col_name)- Returns a reference to a column in the DataFrame. - collect()- Executes the query representing this DataFrame and returns the result as a list of - Rowobjects.- collect_nowait(*[, statement_params, ...])- Executes the query representing this DataFrame asynchronously and returns: class:AsyncJob. - copy_into_table(table_name, *[, files, ...])- Executes a COPY INTO <table> command to load data from files in a stage location into a specified table. - corr(col1, col2, *[, statement_params])- Calculates the correlation coefficient for non-null pairs in two numeric columns. - count()- Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL). - cov(col1, col2, *[, statement_params])- Calculates the sample covariance for non-null pairs in two numeric columns. - createOrReplaceTempView(name, *[, comment, ...])- Creates or replace a temporary view that returns the same results as this DataFrame. - createOrReplaceView(name, *[, comment, ...])- Creates a view that captures the computation expressed by this DataFrame. - createTempView(name, *[, comment, ...])- Creates a temporary view that returns the same results as this DataFrame. - create_or_replace_dynamic_table(name, *, ...)- Creates a dynamic table that captures the computation expressed by this DataFrame. - create_or_replace_temp_view(name, *[, ...])- Creates or replace a temporary view that returns the same results as this DataFrame. - create_or_replace_view(name, *[, comment, ...])- Creates a view that captures the computation expressed by this DataFrame. - create_temp_view(name, *[, comment, ...])- Creates a temporary view that returns the same results as this DataFrame. - crossJoin(right, *[, lsuffix, rsuffix])- Performs a cross join, which returns the Cartesian product of the current - DataFrameand another- DataFrame(- right).- cross_join(right, *[, lsuffix, rsuffix])- Performs a cross join, which returns the Cartesian product of the current - DataFrameand another- DataFrame(- right).- crosstab(col1, col2, *[, statement_params])- Computes a pair-wise frequency table (a - contingency table) for the specified columns.- cube(*cols)- Performs a SQL GROUP BY CUBE. - describe(*cols)- Computes basic statistics for numeric columns, which includes - count,- mean,- stddev,- min, and- max.- distinct()- Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame. - drop(*cols)- Returns a new DataFrame that excludes the columns with the specified names from the output. - dropDuplicates(*subset)- Creates a new DataFrame by removing duplicated rows on given subset of columns. - drop_duplicates(*subset)- Creates a new DataFrame by removing duplicated rows on given subset of columns. - dropna([how, thresh, subset])- Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns. - except_(other)- Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the - otherDataFrame.- explain()- Prints the list of queries that will be executed to evaluate this DataFrame. - fillna(value[, subset, include_decimal])- Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided. - filter(expr)- Filters rows based on the specified conditional expression (similar to WHERE in SQL). - first()- Executes the query representing this DataFrame and returns the first - nrows of the results.- flatten(input[, path, outer, recursive, mode])- Flattens (explodes) compound values into multiple rows. - groupBy(*cols)- Groups rows by the columns specified by expressions (similar to GROUP BY in SQL). - group_by(*cols)- Groups rows by the columns specified by expressions (similar to GROUP BY in SQL). - group_by_grouping_sets(*grouping_sets)- Performs a SQL GROUP BY GROUPING SETS. - intersect(other)- Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame ( - other).- join(right[, on, how, lsuffix, rsuffix, ...])- Performs a join of the specified type ( - how) with the current DataFrame and another DataFrame (- right) on a list of columns (- on).- join_table_function(func, *func_arguments, ...)- Lateral joins the current DataFrame with the output of the specified table function. - limit(n[, offset])- Returns a new DataFrame that contains at most - nrows from the current DataFrame, skipping- offsetrows from the beginning (similar to LIMIT and OFFSET in SQL).- minus(other)- Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the - otherDataFrame.- natural_join(right[, how])- Performs a natural join of the specified type ( - how) with the current DataFrame and another DataFrame (- right).- orderBy(*cols[, ascending])- Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL). - order_by(*cols[, ascending])- Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL). - pivot(pivot_col[, values, default_on_null])- Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values. - printSchema([level])- Prints the schema of a dataframe in tree format. - print_schema([level])- Prints the schema of a dataframe in tree format. - randomSplit(weights[, seed, statement_params])- Randomly splits the current DataFrame into separate DataFrames, using the specified weights. - random_split(weights[, seed, statement_params])- Randomly splits the current DataFrame into separate DataFrames, using the specified weights. - rename(col_or_mapper[, new_column])- Returns a DataFrame with the specified column - col_or_mapperrenamed as- new_column.- replace(to_replace[, value, subset, ...])- Returns a new DataFrame that replaces values in the specified columns. - rollup(*cols)- Performs a SQL GROUP BY ROLLUP. - sample([frac, n])- Samples rows based on either the number of rows to be returned or a percentage of rows to be returned. - sampleBy(col, fractions[, seed])- Returns a DataFrame containing a stratified sample without replacement, based on a - dictthat specifies the fraction for each stratum.- sample_by(col, fractions[, seed])- Returns a DataFrame containing a stratified sample without replacement, based on a - dictthat specifies the fraction for each stratum.- select(*cols)- Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL). - selectExpr(*exprs)- Projects a set of SQL expressions and returns a new - DataFrame.- select_expr(*exprs)- Projects a set of SQL expressions and returns a new - DataFrame.- show([n, max_width, statement_params])- Evaluates this DataFrame and prints out the first - nrows with the specified maximum number of characters per column.- sort(*cols[, ascending])- Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL). - subtract(other)- Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the - otherDataFrame.- take([n, statement_params, block])- Executes the query representing this DataFrame and returns the first - nrows of the results.- toDF(*names)- Creates a new DataFrame containing columns with the specified names. - toLocalIterator(*[, statement_params, ...])- Executes the query representing this DataFrame and returns an iterator of - Rowobjects that you can use to retrieve the results.- toPandas(*[, statement_params, block])- Executes the query representing this DataFrame and returns the result as a pandas DataFrame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). - to_arrow(*[, statement_params, block])- Executes the query representing this DataFrame and returns the result as a pyarrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>. - to_arrow_batches(*[, statement_params, block])- Executes the query representing this DataFrame and returns an iterator of pyarrow Tables (containing a subset of rows) that you can use to retrieve the results. - to_df(*names)- Creates a new DataFrame containing columns with the specified names. - Executes the query representing this DataFrame and returns an iterator of - Rowobjects that you can use to retrieve the results.- Executes the query representing this DataFrame and returns the result as a pandas DataFrame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). - Executes the query representing this DataFrame and returns an iterator of pandas dataframes (containing a subset of rows) that you can use to retrieve the results. - to_snowpark_pandas([index_col, columns, ...])- Convert the Snowpark DataFrame to Snowpark pandas DataFrame. - union(other)- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), excluding any duplicate rows.- unionAll(other)- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), including any duplicate rows.- unionAllByName(other[, allow_missing_columns])- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), including any duplicate rows.- unionByName(other[, allow_missing_columns])- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), excluding any duplicate rows.- union_all(other)- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), including any duplicate rows.- union_all_by_name(other[, allow_missing_columns])- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), including any duplicate rows.- union_by_name(other[, allow_missing_columns])- Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame ( - other), excluding any duplicate rows.- unpivot(value_column, name_column, column_list)- Rotates a table by transforming columns into rows. - where(expr)- Filters rows based on the specified conditional expression (similar to WHERE in SQL). - withColumn(col_name, col, *[, ...])- Returns a DataFrame with an additional column with the specified name - col_name.- withColumnRenamed(existing, new)- Returns a DataFrame with the specified column - existingrenamed as- new.- with_column(col_name, col, *[, ...])- Returns a DataFrame with an additional column with the specified name - col_name.- with_column_renamed(existing, new)- Returns a DataFrame with the specified column - existingrenamed as- new.- with_columns(col_names, values, *[, ...])- Returns a DataFrame with additional columns with the specified names - col_names.- Attributes - analytics- Returns all column names as a list. - dtypes- Returns a - DataFrameNaFunctionsobject that provides functions for handling missing values in the DataFrame.- Returns a - dictthat contains a list of queries that will be executed to evaluate this DataFrame with the key queries, and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key post_actions.- The definition of the columns in this DataFrame (the "relational schema" for the DataFrame). - Returns a - snowflake.snowpark.Sessionobject that provides access to the session the current DataFrame is relying on.- Returns a new - DataFrameWriterobject that you can use to write the data in the- DataFrameto a Snowflake database or a stage location- Whether the dataframe is cached.