EXTRACT_SEMANTIC_CATEGORIES Function: International Tag Values

Attention

This behavior change is in the 2023_05 bundle.

For the current status of the bundle, refer to Bundle History.

The EXTRACT_SEMANTIC_CATEGORIES function behaves as follows:

Previously:

The output of the function takes the following form:

{
    "<col1_name>": {
    "extra_info" : {
        "alternates" : [<semantic_categories>],
        "probability" : "<number>"
    },
    "privacy_category" : "<value>",
    "semantic_category" : "<value>"
    },
...
...
    "<colN_name>": {
    "extra_info" : {
        "alternates" : [<semantic_categories>],
        "probability" : "<number>"
    },
    "privacy_category" : "<value>",
    "semantic_category" : "<value>"
    }
}
Copy

The possible values for the SNOWFLAKE.CORE.SEMANTIC_CATEGORY tag are defined in this section, which does not specify SEMANTIC_CATEGORY tag values for countries other than the United States of America (i.e. US_).

Currently:

The output of the function will change in its formatting, and the output will include support for SEMANTIC_CATEGORY tag values that pertain to Australia, Canada, the United Kingdom, and the United States. To support these countries, the tag values correspond to certain parent category groups. A parent category contains information about the classification result, including whether the column consists of values largely from one country or another.

The formatting changes are:

  • Remove the extra_info and probability fields.

  • Move the alternates field to a different position in the output.

  • Add these new fields:

    • valid_value_ratio, which specifies the ratio of valid values in the sample size. Invalid values include NULL, an empty string, and a string with more than 256 characters.

    • recommendation, which includes information about each tag and value.

    • confidence, where the possible values are either HIGH, MEDIUM, or LOW.

    • coverage, which indicates the percent of sampled cell values that match the rules for a particular category.

    • details, which contains fields and values that can specify a geographical tag value for the SEMANTIC_CATEGORY tag.

For example:

{
  "valid_value_ratio": 1.0,
  "recommendation": {
    "semantic_category": "PASSPORT",
    "privacy_category": "IDENTIFIER",
    "confidence": "HIGH",
    "coverage": 0.7,
    "details": [
      {
        "semantic_category": "US_PASSPORT",
        "coverage": 0.7
      },
      {
        "semantic_category": "CA_PASSPORT",
        "coverage": 0.1
      }
    ]
  },
  "alternates": [
    {
      "semantic_category": "NATIONAL_IDENTIFIER",
      "privacy_category": "IDENTIFIER",
      "confidence": "LOW",
      "coverage": 0.3,
      "details": [
        {
          "semantic_category": "US_SSN",
          "privacy_category": "IDENTIFIER",
          "coverage": 0.3
        }
      ]
    }
  ]
}
Copy

The following table summarizes the relationship between the classification tags, new category groups and group members, and supported countries. The country codes are based on the ISO-3166-1 alpha-2 (link removed) standard. Other semantic categories, such as EMAIL and GENDER, are not affected.

PRIVACY_CATEGORY Tag Values

SEMANTIC_CATEGORY Tag Values (Parent Group)

Group Members

Country Code

IDENTIFIER

BANK_ACCOUNT

CA_BANK_ACCOUNT . US_BANK_ACCOUNT . IBAN

CA . US

ORGANIZATION_IDENTIFIER

AU_BUSINESS_NUMBER . AU_COMPANY_NUMBER

AU

DRIVERS_LICENSE

AU_DRIVERS_LICENSE . CA_DRIVERS_LICENSE . US_DRIVERS_LICENSE

AU . CA . US

MEDICARE_NUMBER

AU_MEDICARE_NUMBER

AU

PASSPORT

AU_PASSPORT . CA_PASSPORT . US_PASSPORT

AU . CA . US

PHONE_NUMBER

AU_PHONE_NUMBER . CA_PHONE_NUMBER . UK_PHONE_NUMBER . US_PHONE_NUMBER

AU . CA . GB . US

STREET_ADDRESS

CA_STREET_ADDRESS . US_STREET_ADDRESS

CA . US

TAX_IDENTIFIER

AU_TAX_NUMBER

AU

NATIONAL_IDENTIFIER

CA_SOCIAL_INSURANCE_NUMBER . UK_NATIONAL_INSURANCE_NUMBER . US_SSN

CA . GB . US

QUASI_IDENTIFIER

CITY

US_CITY . CA_CITY .

US . CA .

POSTAL_CODE

AU_POSTAL_CODE . CA_POSTAL_CODE . UK_POSTAL_CODE . US_POSTAL_CODE

AU . CA . GB . US

ADMINISTRATIVE_AREA_1

CA_PROVINCE_OR_TERRITORY . US_STATE_OR_TERRITORY

CA . US

ADMINISTRATIVE_AREA_2

US_COUNTY

US

The data engineer can use the pending tag values by manually specifying the tag value in the ALTER TABLE or ALTER VIEW statement. Alternatively, the data engineer can call the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure to set the tag.

For example, use an ALTER TABLE statement to set the PASSPORT tag value on the PASSPORT table column manually.

ALTER TABLE mydb.myschema.mytable
  MODIFY COLUMN passport
  SET TAG SNOWFLAKE.CORE.SEMANTIC_CATEGORY = 'PASSPORT';
Copy

There are no changes to the overall classification process or the steps to classify a table, all tables in a schema, or all tables in a database.

Tip

If you pass the EXTRACT_SEMANTIC_CATEGORIES function as an argument to the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure, be sure to double-check any custom handling that you might have configured to ensure that your workflows do not break due to the pending formatting changes.

Ref: 1110

Language: English