EXTRACT_SEMANTIC_CATEGORIES Function: International Tag Values¶
Attention
This behavior change is in the 2023_05 bundle.
For the current status of the bundle, refer to Bundle History.
The EXTRACT_SEMANTIC_CATEGORIES function behaves as follows:
- Previously:
The output of the function takes the following form:
{ "<col1_name>": { "extra_info" : { "alternates" : [<semantic_categories>], "probability" : "<number>" }, "privacy_category" : "<value>", "semantic_category" : "<value>" }, ... ... "<colN_name>": { "extra_info" : { "alternates" : [<semantic_categories>], "probability" : "<number>" }, "privacy_category" : "<value>", "semantic_category" : "<value>" } }
The possible values for the SNOWFLAKE.CORE.SEMANTIC_CATEGORY tag are defined in this section, which does not specify SEMANTIC_CATEGORY tag values for countries other than the United States of America (i.e.
US_
).- Currently:
The output of the function will change in its formatting, and the output will include support for SEMANTIC_CATEGORY tag values that pertain to Australia, Canada, the United Kingdom, and the United States. To support these countries, the tag values correspond to certain parent category groups. A parent category contains information about the classification result, including whether the column consists of values largely from one country or another.
The formatting changes are:
Remove the
extra_info
andprobability
fields.Move the
alternates
field to a different position in the output.Add these new fields:
valid_value_ratio
, which specifies the ratio of valid values in the sample size. Invalid values include NULL, an empty string, and a string with more than 256 characters.recommendation
, which includes information about each tag and value.confidence
, where the possible values are eitherHIGH
,MEDIUM
, orLOW
.coverage
, which indicates the percent of sampled cell values that match the rules for a particular category.details
, which contains fields and values that can specify a geographical tag value for the SEMANTIC_CATEGORY tag.
For example:
{ "valid_value_ratio": 1.0, "recommendation": { "semantic_category": "PASSPORT", "privacy_category": "IDENTIFIER", "confidence": "HIGH", "coverage": 0.7, "details": [ { "semantic_category": "US_PASSPORT", "coverage": 0.7 }, { "semantic_category": "CA_PASSPORT", "coverage": 0.1 } ] }, "alternates": [ { "semantic_category": "NATIONAL_IDENTIFIER", "privacy_category": "IDENTIFIER", "confidence": "LOW", "coverage": 0.3, "details": [ { "semantic_category": "US_SSN", "privacy_category": "IDENTIFIER", "coverage": 0.3 } ] } ] }
The following table summarizes the relationship between the classification tags, new category groups and group members, and supported countries. The country codes are based on the ISO-3166-1 alpha-2 (link removed) standard. Other semantic categories, such as EMAIL and GENDER, are not affected.
PRIVACY_CATEGORY Tag Values
SEMANTIC_CATEGORY Tag Values (Parent Group)
Group Members
Country Code
IDENTIFIER
BANK_ACCOUNT
CA_BANK_ACCOUNT
.US_BANK_ACCOUNT
.IBAN
CA . US
ORGANIZATION_IDENTIFIER
AU_BUSINESS_NUMBER
.AU_COMPANY_NUMBER
AU
DRIVERS_LICENSE
AU_DRIVERS_LICENSE
.CA_DRIVERS_LICENSE
.US_DRIVERS_LICENSE
AU . CA . US
MEDICARE_NUMBER
AU_MEDICARE_NUMBER
AU
PASSPORT
AU_PASSPORT
.CA_PASSPORT
.US_PASSPORT
AU . CA . US
PHONE_NUMBER
AU_PHONE_NUMBER
.CA_PHONE_NUMBER
.UK_PHONE_NUMBER
.US_PHONE_NUMBER
AU . CA . GB . US
STREET_ADDRESS
CA_STREET_ADDRESS
.US_STREET_ADDRESS
CA . US
TAX_IDENTIFIER
AU_TAX_NUMBER
AU
NATIONAL_IDENTIFIER
CA_SOCIAL_INSURANCE_NUMBER
.UK_NATIONAL_INSURANCE_NUMBER
.US_SSN
CA . GB . US
QUASI_IDENTIFIER
CITY
US_CITY
.CA_CITY
.US . CA .
POSTAL_CODE
AU_POSTAL_CODE
.CA_POSTAL_CODE
.UK_POSTAL_CODE
.US_POSTAL_CODE
AU . CA . GB . US
ADMINISTRATIVE_AREA_1
CA_PROVINCE_OR_TERRITORY
.US_STATE_OR_TERRITORY
CA . US
ADMINISTRATIVE_AREA_2
US_COUNTY
US
The data engineer can use the pending tag values by manually specifying the tag value in the ALTER TABLE or ALTER VIEW statement. Alternatively, the data engineer can call the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure to set the tag.
For example, use an ALTER TABLE statement to set the
PASSPORT
tag value on the PASSPORT table column manually.ALTER TABLE mydb.myschema.mytable MODIFY COLUMN passport SET TAG SNOWFLAKE.CORE.SEMANTIC_CATEGORY = 'PASSPORT';
There are no changes to the overall classification process or the steps to classify a table, all tables in a schema, or all tables in a database.
Tip
If you pass the EXTRACT_SEMANTIC_CATEGORIES function as an argument to the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure, be sure to double-check any custom handling that you might have configured to ensure that your workflows do not break due to the pending formatting changes.
Ref: 1110