Tutorial: Getting started with data metric functions¶
Introduction¶
You can complete this tutorial using a worksheet in Snowsight or using a CLI client such as SnowSQL. Simply paste the code examples and run them.
By the end of this tutorial, you will learn how to:
- Create a custom data metric function (DMF) to measure data quality.
- Manage the DMF to optimize serverless credit usage.
- Monitor the serverless credit usage associated with calling the scheduled DMF.
Access control setup¶
To complete this tutorial, use a single custom role that has all of the required access, which includes the following:
- Creating a database, which subsequently allows creating a schema, creating a DMF in the schema, and creating a table in the schema
- Creating a warehouse to perform query operations
- Querying the view that contains the results of calling the scheduled DMF
- Querying the view that contains serverless compute usage information
Create the dq_tutorial_role role to use throughout the tutorial:
Grant privileges, and grant the application role and database roles to the dq_tutorial_role:
Create a warehouse to query the table that contains the data and grant the USAGE privilege on the role to the dq_tutorial_role role:
Confirm the grants to the dq_tutorial_role role:
Establish a role hierarchy and grant the role to a user who can complete this tutorial (replace the jsmith value):
Data setup¶
To facilitate managing the data and the DMF for this tutorial, create a dedicated database to contain these objects:
Create a table¶
Insert values into a table¶
Add data to the table:
Create and work with DMFs¶
In the following sections, we will create a user-defined DMF to measure the count of invalid email addresses and subsequently do the following:
- Schedule the DMF to run every 5 minutes.
- Check the DMF table references (find the tables the DMF is set on).
- Query a built-in view that contains the result of calling the scheduled DMF.
- Unset the DMF from the table to avoid unnecessary serverless credit usage.
Create a DMF¶
Create a data metric function (DMF) to return the number of email addresses in a column that don’t match the specified regular expression:
Set the schedule on the table¶
The DMF schedule defines when all DMFs on the table run. Currently, 5 minutes is the shortest possible time interval:
Note
For the purpose of the tutorial, the schedule is set for 5 minutes. However, after you optimize your DMF use cases, experiment with the other schedule settings, such as cron expressions or trigger events associated with DML operations that affect the table.
Set the DMFs on the table and check the references¶
Associate the DMF to the table:
Because the schedule is set for 5 minutes, we need to wait 5 minutes in order for Snowflake to call the DMF and process the results. For now, we can check to see that the DMF is associated with the table by calling the DATA_METRIC_FUNCTION_REFERENCES Information Schema table function:
View the DMF results¶
The results of calling the scheduled DMF are stored in the DATA_QUALITY_MONITORING_RESULTS view. To determine the number of invalid email addresses, query the DATA_QUALITY_MONITORING_RESULTS view to see the results of calling the scheduled DMF:
The results show that the value column contains 1. This number corresponds to one improperly formatted email
address, which corresponds to the first INSERT statement in the Insert values into a table section.
Unset the DMFs from the table¶
You have established that the DMF is working as expected based on the definition of the DMF, the schedule, and the expected results.
To avoid unnecessary serverless credit usage, unset the DMF from the table:
Use DMF to return failed records¶
In this section, you will return records that failed a data quality check because they had blank values.
The data quality metric function identifies rows that contain data that failed the quality check. You can run a data metric scan to extract and return these records.
To return the rows identified by a DMF, follow these steps:
- Create a table.
- Add bad records to the table.
- Run the data metric scan to return records with blank values.
- View the scan results.
- Update records with a new value.
Create a table¶
Paste and run the following statement to create a table.
Insert values into a table¶
Add data with a few bad records, such as blank values, to the table:
Return the number of blank values by running the BLANK_ COUNT data metric function¶
Execute the BLANK_COUNT data metric function to return the number of blank values:
Return rows by running the SYSTEM$DATA_ METRIC_ SCAN function¶
To return the table rows containing blank values in the name column, execute the SYSTEM$DATA_METRIC_SCAN function on the name
column.
View the system metric scan results¶
The results show the rows of the employeeTable table that contain blank values.
Update records with a new value¶
To replace the blank values in the name column, run a query on the target table that includes the SYSTEM$DATA_METRIC_SCAN function.
It sets the blank values in the name column to NULL by running the UPDATE command on each of the rows returned by the system
function:
After you update the values, running the following returns 0:
In this section, you extracted records with data that failed the quality check. In the next section, you will learn how to view your serverless credit consumption.
View your serverless credit consumption¶
Calling scheduled data metric functions (DMFs) requires serverless compute resources. You can query the Account Usage view DATA_QUALITY_MONITORING_USAGE_HISTORY to view the DMF serverless compute cost.
Because the view has a latency of 1-2 hours, wait for that time to pass before querying the view. You can come back to this step later.
Query the view and filter the results to include the time interval of your scheduled DMF:
Clean up, summary, and additional resources¶
Congratulations! You’ve completed this tutorial.
Take a few minutes to review the summary and the key points covered in this tutorial.
Consider cleaning up by dropping the objects you created in this tutorial. Learn more by reviewing other topics in the Snowflake documentation.
Summary and key points¶
In summary, you learned how to do the following:
- Create a custom DMF to measure data quality and manage the DMF to optimize serverless credit usage.
- Monitor the serverless credit usage associated with calling the scheduled DMF.
Drop the tutorial objects¶
If you plan to repeat the tutorial, you can keep the objects that you created.
Otherwise, drop the tutorial objects as follows:
What’s next?¶
Continue learning about Snowflake using the following resources:
- Learn more about DMFs by starting with Introduction to data quality checks.
- Complete the other tutorials provided by Snowflake in the Tutorials to get started with Snowflake topic.