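Implementing differential privacy

This topic contains information for data providers who are implementing differential privacy in their account.

When you implement differential privacy for a dataset, your tasks involve three key concepts: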

  • Privacy policies. A table or view is not protected by differential privacy until you assign a privacy policy to it. A table or view with a privacy policy is considered to be privacy-protected.
  • Privacy budgets. As analysts query a privacy-protected table, you can manage the privacy budgets associated with those analysts.
  • Privacy domains. You should define a privacy domain for fact and dimension columns in a privacy-protected table or view.

Limitations

  • You cannot assign a privacy policy to the same table or view as an aggregation policy or a masking policy.
  • Apart from querying the noise interval, analysts don’t know whether they’re querying a privacy-protected table, so the data provider should inform them that query results contain noise.
  • Data providers cannot monitor the privacy loss incurred by analysts who run queries in another account.
  • Applying multiple privacy policies to one table is currently not supported. Because of this, protecting more than one entity with entity-level differential privacy in a single table is not possible.
  • Queries on replicated or cloned tables that have a privacy policy associated with an entity key are currently blocked.

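About entity-level privacy

An entity is a class of data subject that should be protected, such as a person, an organization, or a location. If each individual entity appears in only one row, row-level privacy is enough to protect its identity. However, if data that belongs to a single entity appears in multiple rows (for example, in transactional data), differential privacy must be configured with entity-level privacy to properly protect each entity.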

To achieve entity-level privacy, Snowflake lets you specify which attribute can be used to identify an entity (an entity key). This lets Snowflake identify all of the records that belong to a particular entity within a dataset. For example, if the entity key is defined as the column email, then Snowflake can determine that all records where email=joe.smith@example.com belong to the same entity.

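In most cases, entity-level privacy is preferred over row-level privacy, but row-level privacy might be a better fit if the table meets the following conditions:

  • No column in the table uniquely identifies an entity. Entity-level privacy requires an identifying column.
  • Each individual entity appears only once.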
  • The table will not be used in a join. Joins with tables protected by row-level privacy are possible, but have limitations.
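The "appears only once" condition can be checked before you choose a privacy level. The following is a minimal sketch in plain Python, not a Snowflake feature: the helper names `rows_per_entity` and `each_entity_appears_once` and the sample rows are hypothetical, and the check covers only that one condition, not the join consideration.

```python
from collections import Counter


def rows_per_entity(rows, entity_key):
    """Count how many rows each entity key value appears in."""
    return Counter(row[entity_key] for row in rows)


def each_entity_appears_once(rows, entity_key):
    """Return True when every entity key value occurs in exactly one row."""
    return all(count == 1 for count in rows_per_entity(rows, entity_key).values())


# Hypothetical transactional data: one entity spans several rows.
transactions = [
    {"email": "joe.smith@example.com", "amount": 12.50},
    {"email": "joe.smith@example.com", "amount": 7.25},
    {"email": "ann.lee@example.com", "amount": 30.00},
]

# Hypothetical directory data: one row per entity.
directory = [
    {"email": "joe.smith@example.com", "city": "Austin"},
    {"email": "ann.lee@example.com", "city": "Boston"},
]

print(each_entity_appears_once(transactions, "email"))  # False
print(each_entity_appears_once(directory, "email"))     # True
```

In this sketch, the transactional data would call for entity-level privacy, while the directory data satisfies the one-row-per-entity condition.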

You choose whether to implement entity-level or row-level privacy when assigning a privacy policy to a table or view. For more information, see Assign a privacy policy. If you choose to implement entity-level privacy, the data must also meet structural requirements to ensure that the entity identifier is used correctly.

Tip

If you want to protect two separate tables with the same privacy policy, but they do not have the same entity key values, you can create a new table that maps the two identifying columns, create a view that joins two of the tables, and apply the privacy policy to the view. For example, you could use this strategy if the entity key in one table is email and in another table it is user_id, but both refer to the same entities.
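The mapping strategy in this tip can be sketched with an in-memory SQLite database from Python. This only illustrates the data shape: the table names (`signups`, `purchases`, `entity_map`), the view name, and the sample rows are hypothetical, and assigning the privacy policy to the resulting view in Snowflake is not shown.

```python
import sqlite3

map_conn = sqlite3.connect(":memory:")
map_conn.executescript("""
    -- One table identifies entities by email, the other by user_id.
    CREATE TABLE signups (email TEXT, plan TEXT);
    CREATE TABLE purchases (user_id INTEGER, amount REAL);

    -- Mapping table that links the two identifying columns.
    CREATE TABLE entity_map (email TEXT, user_id INTEGER);

    INSERT INTO signups VALUES ('joe@example.com', 'pro'), ('ann@example.com', 'free');
    INSERT INTO purchases VALUES (1, 20.0), (1, 5.0), (2, 12.0);
    INSERT INTO entity_map VALUES ('joe@example.com', 1), ('ann@example.com', 2);

    -- View that exposes purchases under the same identifier used by signups.
    CREATE VIEW purchases_by_email AS
    SELECT m.email, p.amount
    FROM purchases AS p
    JOIN entity_map AS m ON m.user_id = p.user_id;
""")

purchase_summary = map_conn.execute(
    "SELECT email, COUNT(*), SUM(amount) "
    "FROM purchases_by_email GROUP BY email ORDER BY email"
).fetchall()
print(purchase_summary)
# [('ann@example.com', 1, 12.0), ('joe@example.com', 2, 25.0)]
```

After this restructuring, both `signups` and the `purchases_by_email` view carry the same `email` values, so the same column could serve as the entity key for both.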

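Structural requirements for entity-level privacy

Data that is protected by entity-level differential privacy must be structured to meet certain requirements. Snowflake can accurately track the privacy loss associated with an entity only when these requirements are met.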

You should structure your data to meet these requirements before applying privacy policies to implement differential privacy. Snowflake cannot determine whether data conforms to these structural requirements because they concern the meaning of the data, not the differential privacy implementation. For example, if the entity keys for two different tables are both set to the column user_id, but one of the columns contains values for a numeric identifier while the other column contains email addresses, Snowflake cannot correctly link entity information across the two tables.
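Because this kind of mismatch is about meaning rather than configuration, it has to be caught before the policy is assigned. The following is a rough sketch of such a pre-check in plain Python, assuming you can export the entity key values of both tables. The helper names and the simple email pattern are hypothetical, and the heuristic only flags obvious mismatches.

```python
import re

# Deliberately simple pattern; an assumption for this sketch, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def key_kind(value):
    """Classify one entity key value as 'email', 'numeric', or 'other'."""
    text = str(value)
    if EMAIL_PATTERN.match(text):
        return "email"
    if text.isdigit():
        return "numeric"
    return "other"


def keys_look_compatible(values_a, values_b):
    """True when both columns hold one shared kind of identifier and overlap in value."""
    kinds_a = {key_kind(v) for v in values_a}
    kinds_b = {key_kind(v) for v in values_b}
    same_kind = kinds_a == kinds_b and len(kinds_a) == 1
    overlap = bool({str(v) for v in values_a} & {str(v) for v in values_b})
    return same_kind and overlap


# Both columns are named user_id, but they hold different kinds of identifier.
orders_user_id = [101, 102, 103]
profiles_user_id = ["joe@example.com", "ann@example.com"]
print(keys_look_compatible(orders_user_id, profiles_user_id))  # False

# Same kind of identifier, with shared values.
print(keys_look_compatible([101, 102], [102, 104]))  # True
```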

To implement entity-level privacy, your data must meet the following requirements:

  • Each row belongs to only one individual within an entity — As an example, suppose a table contains users and households. If the entity that needs to be protected is users, the table cannot be structured such that each row is a household and all the users in that household are captured in other columns. You would need to restructure the table so there is only one row per user, with a household_id column to indicate which household a user belongs to.

  • Consistent entity identifier across all tables — You can create a privacy policy that represents the protection needed for a single entity, then apply that policy to multiple tables that contain information about the entity. When you assign the privacy policy to each table, you need to specify the column that uniquely identifies the entity (that is, the entity key). The value that uniquely identifies an entity within these entity key columns must be the same. For example, suppose the email column is the entity key for two tables that contain information about an entity. If the email address of an entity is joe@example.com in one table, then the email address in the other table must also be joe@example.com.

  • Entity identifier in all tables: Although including the entity identifier in every table is not required to implement entity-level privacy, doing so lets analysts minimize noise in query joins across the tables related to an entity. In some cases, you might need to denormalize the entity key column to meet this requirement. For example, suppose you had the following tables where the entity is customers:

    Table               Description
    customer            Customer directory, where each row is a customer and has a customer_id.
    transactions        Customer transactions, where each row is a transaction and has a transaction_id. Each customer can have multiple transactions.
    transaction_lines   Unique items that were purchased in a transaction. There can be multiple rows in a single transaction.

    Under best practices for normalization, the transaction_lines table would have the transaction_id but not the customer_id. The transaction_lines table would link to the transactions table, which could then be linked to the customers table with customer_id.

    However, for differential privacy, you probably want to optimize the data for the analyst by adding the customer_id identifier to the transaction_lines table. This allows the analyst to minimize noise by including customer_id in the join key when joining the transaction_lines table with another table.
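The denormalization described in the last requirement can be sketched with an in-memory SQLite database from Python. The table and column names follow the example above; the sample rows are hypothetical, and the SQL is generic rather than Snowflake-specific.

```python
import sqlite3

denorm_conn = sqlite3.connect(":memory:")
denorm_conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, customer_id INTEGER);
    -- Normalized form: no customer_id on the line items yet.
    CREATE TABLE transaction_lines (transaction_id INTEGER, item TEXT);

    INSERT INTO customer VALUES (1, 'Joe'), (2, 'Ann');
    INSERT INTO transactions VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO transaction_lines VALUES (10, 'pen'), (10, 'ink'), (11, 'pad'), (12, 'cup');

    -- Denormalize: copy the entity key onto every line item.
    ALTER TABLE transaction_lines ADD COLUMN customer_id INTEGER;
    UPDATE transaction_lines
    SET customer_id = (
        SELECT t.customer_id
        FROM transactions AS t
        WHERE t.transaction_id = transaction_lines.transaction_id
    );
""")

# The analyst can now include customer_id in the join key.
lines_per_customer = denorm_conn.execute("""
    SELECT l.customer_id, COUNT(*)
    FROM transaction_lines AS l
    JOIN transactions AS t
      ON t.transaction_id = l.transaction_id
     AND t.customer_id = l.customer_id
    GROUP BY l.customer_id
    ORDER BY l.customer_id
""").fetchall()
print(lines_per_customer)  # [(1, 3), (2, 1)]
```

The trade-off is a redundant column that must be kept in sync with the transactions table whenever rows are loaded.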

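Interactions with Snowflake features

This section describes how differential privacy interacts with other Snowflake features, including the effects on privacy policies, privacy budgets, and privacy domains.

Data sharing

Secure views and tables that have a privacy policy are protected by differential privacy when they are added to a share. Views that are not secure are not protected by a privacy policy when they are queried through a share.

Replication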

For considerations when replicating privacy policies and privacy-protected tables and views, see Privacy policies.

Note

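Queries on replicated tables that have a privacy policy associated with an entity key are currently restricted. Until this restriction is lifted, queries on these tables are blocked.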

Cross-Cloud Auto-Fulfillment

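When you use Cross-Cloud Auto-Fulfillment to replicate a data product, keep the following in mind:

  • Administrators in the account that receives the replicated data product cannot adjust privacy budgets.
  • Administrators cannot use a single account to view privacy loss across all regions.

Cloning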

For the effects of cloning privacy-protected tables and views, see Cloning and differential privacy.

Note

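Queries on cloned tables that have a privacy policy associated with an entity key are currently restricted. Until this restriction is lifted, queries on these tables are blocked.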

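Views on privacy-protected base objects

You can build a view on a privacy-protected table or view. However, the view does not inherit the privacy domains of the base table or view. As a result, keep the following in mind:

  • You must set privacy domains on the columns of the new view.
  • Adjusting the privacy domain of the base table does not affect the privacy domain of views built on it.

Materialized views

You can assign a privacy policy to a materialized view to make it privacy-protected.

Other interactions between privacy policies and materialized views include the following:

  • You cannot create a materialized view on a privacy-protected table or view.
  • You cannot assign a privacy policy to a table that is used as the base table of a materialized view.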

UDFs

Analysts cannot use user-defined functions to query privacy-protected tables.

Streams

You cannot query a stream that was created on a privacy-protected table.

You cannot assign a privacy policy to a stream.

Other policies

Privacy policies interact with other Snowflake policies in the following ways:

Masking policies

You cannot assign a privacy policy and a masking policy to the same table or view.

Row access policies

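A row access policy takes precedence over a privacy policy. If a row is blocked by a row access policy, it is not included in the results of a differentially private query.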
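The ordering can be pictured as "filter first, then add noise". The following is a conceptual sketch in plain Python, not Snowflake's implementation: the row access rule, the sample rows, and the choice of a Laplace-noised count with scale 1/epsilon are all assumptions made for illustration.

```python
import random


def visible_rows(rows, is_allowed):
    """Apply the row access rule first; blocked rows never reach the aggregate."""
    return [row for row in rows if is_allowed(row)]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(rows, is_allowed, epsilon, rng):
    """Count only the visible rows, then add Laplace noise with scale 1/epsilon."""
    allowed = visible_rows(rows, is_allowed)
    return len(allowed) + laplace_noise(1.0 / epsilon, rng)


def only_eu(row):
    """Hypothetical row access rule: only rows for the 'EU' region are visible."""
    return row["region"] == "EU"


# Hypothetical sample rows.
sales = [
    {"region": "EU", "amount": 10},
    {"region": "EU", "amount": 25},
    {"region": "US", "amount": 40},
]

print(len(visible_rows(sales, only_eu)))  # 2
print(noisy_count(sales, only_eu, epsilon=1.0, rng=random.Random(7)))
```

Because the blocked row is removed before the aggregate runs, adding or changing blocked rows has no effect on the noisy result in this sketch.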

Projection policies

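Protecting a table with a privacy policy while protecting its columns with a projection policy is currently not supported. Although you can assign the policies this way, queries against the table fail.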

Aggregation policies

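You cannot assign a privacy policy and an aggregation policy to the same table or view.

Dynamic tables

You cannot create a dynamic table if the source table that it references is privacy-protected.

You can assign a privacy policy to a table that an existing dynamic table references; however, after the policy is assigned, the dynamic table no longer refreshes.

External tables

You can assign a privacy policy to an external table. If an analyst tries to aggregate on the VARIANT column, the query fails. However, if an analyst aggregates on a virtual column, the query succeeds.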

Time Travel

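With Time Travel, when an older version of a table is copied into a new table, the new table uses the current version of the privacy domains, because Snowflake does not store the privacy domains of older versions in the table metadata.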