类别：: 字符串函数（正则表达式）

REGEXP_SUBSTR¶

Returns the substring that matches a regular expression within a string.

语法¶

REGEXP_SUBSTR( <subject> ,
               <pattern>
                 [ , <position>
                   [ , <occurrence>
                     [ , <regex_parameters>
                       [ , <group_num> ]
                     ]
                   ]
                 ]
)

Copy

实参¶

必填：

subject

用于搜索匹配项的字符串。

pattern

要匹配的模式。

有关指定模式的准则，请参阅字符串函数（正则表达式）。

可选：

position

Number of characters from the beginning of the string where the function starts searching for matches. The value must be a positive integer.

默认值：``1``（搜索匹配项时，从左边的第一个字符开始）

occurrence

指定要从中开始返回匹配项的模式的第一次出现。

该函数会跳过第一个 occurrence - 1 匹配项。例如，如果存在 5 个匹配项，并且您为 occurrence 实参指定 3 个匹配项，则该函数将忽略前两个匹配项，并返回第三个、第四个和第五个匹配项。

默认：1

regex_parameters

包含一个或多个字符的字符串，指定用于搜索匹配项的参数。支持的值：

参数	描述
`c`	区分大小写的匹配
`i`	不区分大小写的匹配
`m`	多行模式
`e`	提取子匹配项
`s`	Single-line mode POSIX wildcard character `.` matches `\n`

默认：c

For more information, see 为正则表达式指定参数.

备注

默认情况下，REGEXP_SUBSTR 会返回主题的整个匹配部分。但是，如果指定了 e （用于“提取”）参数，则 REGEXP_SUBSTR 返回字符串中与模式中第一组匹配的部分。如果指定了 e，但未同时指定 group_num，则 group_num 的值将默认为 1（第一组）。如果模式中没有子表达式，则 REGEXP_SUBSTR 表现得如同未设置 e。有关使用 e 的示例，请参阅本主题中的示例。

group_num

指定要提取的组。通过在正则表达式中使用括号来指定组。

如果指定了 group_num，则 Snowflake 允许在未同时指定 'e' 选项的情况下执行提取。隐含了 'e'。

Snowflake 至多支持 1024 个组。

有关使用 group_num 的示例，请参阅本主题中的示例。

返回¶

该函数返回一个匹配子字符串的 VARCHAR 类型值。

在以下情况下，该函数会返回 NULL：

未找到匹配项。
任意实参为 NULL。

使用说明¶

有关使用正则表达式的其他信息，请参阅字符串函数（正则表达式）。

排序规则详细信息¶

Arguments with collation specifications currently aren't supported.

示例¶

此 REGEXP_INSTR 函数的文档包含许多同时使用 REGEXP_SUBSTR 和 REGEXP_INSTR 的示例。您可能也想看看这些例子。

这些示例使用下面创建的字符串：

CREATE OR REPLACE TABLE demo2 (id INT, string1 VARCHAR);

INSERT INTO demo2 (id, string1) VALUES
    (2, 'It was the best of times, it was the worst of times.'),
    (3, 'In    the   string   the   extra   spaces  are   redundant.'),
    (4, 'A thespian theater is nearby.');

SELECT * FROM demo2;

Copy

+----+-------------------------------------------------------------+
| ID | STRING1                                                     |
|----+-------------------------------------------------------------|
|  2 | It was the best of times, it was the worst of times.        |
|  3 | In    the   string   the   extra   spaces  are   redundant. |
|  4 | A thespian theater is nearby.                               |
+----+-------------------------------------------------------------+

字符串具有以下特征：

id 为 2 的字符串中多次出现单词“the”。
id 为 3 的字符串中多次出现单词“the”，单词之间有额外的空格。
id 为 4 字符的字符串中，字符序列“the”出现在多个单词中（如“thespian”和“theater”），但未单独出现“the”这个单词。

以下示例调用 REGEXP_SUBSTR 函数：

在 SELECT 列表中调用 REGEXP_SUBSTR 函数
在 WHERE 子句中调用 REGEXP_SUBSTR 函数

在 SELECT 列表中调用 REGEXP_SUBSTR 函数¶

在 SELECT 列表中调用 REGEXP_SUBSTR 函数以提取或显示与模式匹配的值。

This example looks for first occurrence of the word the, followed by one or more non-word characters --- for example, the whitespace separating words --- followed by one or more word characters.

“单词字符”不仅包括字母 a-z 和 A-Z，还包括下划线（“_”）和十进制数字 0-9，但不包括空格、标点符号等。

SELECT id,
       REGEXP_SUBSTR(string1, 'the\\W+\\w+') AS result
  FROM demo2
  ORDER BY id;

Copy

+----+--------------+
| ID | RESULT       |
|----+--------------|
|  2 | the best     |
|  3 | the   string |
|  4 | NULL         |
+----+--------------+

从字符串的位置 1 开始，查找单词 the 的第二次出现，后跟一个或多个非单词字符，后跟一个或多个单词字符。

SELECT id,
       REGEXP_SUBSTR(string1, 'the\\W+\\w+', 1, 2) AS result
  FROM demo2
  ORDER BY id;

Copy

+----+-------------+
| ID | RESULT      |
|----+-------------|
|  2 | the worst   |
|  3 | the   extra |
|  4 | NULL        |
+----+-------------+

从字符串的位置 1 开始，查找单词 the 的第二次出现，后跟一个或多个非单词字符，后跟一个或多个单词字符。

与其返回整个匹配项，不如只返回“组”（例如，子字符串中与括号中正则表达式部分匹配的部分）。在这种情况下，返回值应该是“the”之后的单词。

SELECT id,
       REGEXP_SUBSTR(string1, 'the\\W+(\\w+)', 1, 2, 'e', 1) AS result
  FROM demo2
  ORDER BY id;

Copy

+----+--------+
| ID | RESULT |
|----+--------|
|  2 | worst  |
|  3 | extra  |
|  4 | NULL   |
+----+--------+

此示例说明如何从其中第一个单词为 A 的双字模式的第一、第二和第三个匹配项中检索第二个单词。此示例也表明，如果尝试获取超出最后一个模式的词，Snowflake 会返回 NULL。

首先，创建表并插入数据：

CREATE OR REPLACE TABLE test_regexp_substr (string1 VARCHAR);;
INSERT INTO test_regexp_substr (string1) VALUES ('A MAN A PLAN A CANAL');

Copy

运行查询：

SELECT REGEXP_SUBSTR(string1, 'A\\W+(\\w+)', 1, 1, 'e', 1) AS result1,
       REGEXP_SUBSTR(string1, 'A\\W+(\\w+)', 1, 2, 'e', 1) AS result2,
       REGEXP_SUBSTR(string1, 'A\\W+(\\w+)', 1, 3, 'e', 1) AS result3,
       REGEXP_SUBSTR(string1, 'A\\W+(\\w+)', 1, 4, 'e', 1) AS result4
  FROM test_regexp_substr;

Copy

+---------+---------+---------+---------+
| RESULT1 | RESULT2 | RESULT3 | RESULT4 |
|---------+---------+---------+---------|
| MAN     | PLAN    | CANAL   | NULL    |
+---------+---------+---------+---------+

此示例说明如何在模式首次出现时检索第一、第二和第三组。在这种情况下，返回值是单词 MAN 的各个字母。

SELECT REGEXP_SUBSTR(string1, 'A\\W+(\\w)(\\w)(\\w)', 1, 1, 'e', 1) AS result1,
       REGEXP_SUBSTR(string1, 'A\\W+(\\w)(\\w)(\\w)', 1, 1, 'e', 2) AS result2,
       REGEXP_SUBSTR(string1, 'A\\W+(\\w)(\\w)(\\w)', 1, 1, 'e', 3) AS result3
  FROM test_regexp_substr;

Copy

+---------+---------+---------+
| RESULT1 | RESULT2 | RESULT3 |
|---------+---------+---------|
| M       | A       | N       |
+---------+---------+---------+

以下是一些其他示例。

创建表并插入数据：

CREATE OR REPLACE TABLE message(body VARCHAR(255));

INSERT INTO message VALUES
  ('Hellooo World'),
  ('How are you doing today?'),
  ('the quick brown fox jumps over the lazy dog'),
  ('PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS');

Copy

通过匹配单词边界 (\b) 返回包含小写字母 o 的第一个匹配项，后跟零个或多个单词字符 (\S)、字母 o，然后是零或多个单词字符，直到下一个单词边界为止：

SELECT body,
       REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b') AS result
  FROM message;

Copy

+---------------------------------------------+---------+
| BODY                                        | RESULT  |
|---------------------------------------------+---------|
| Hellooo World                               | Hellooo |
| How are you doing today?                    | How     |
| the quick brown fox jumps over the lazy dog | brown   |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS     | NULL    |
+---------------------------------------------+---------+

返回包含小写 o 的第一个匹配项，从主题中的第三个字符开始：

SELECT body,
       REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b', 3) AS result
  FROM message;

Copy

+---------------------------------------------+--------+
| BODY                                        | RESULT |
|---------------------------------------------+--------|
| Hellooo World                               | llooo  |
| How are you doing today?                    | you    |
| the quick brown fox jumps over the lazy dog | brown  |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS     | NULL   |
+---------------------------------------------+--------+

返回包含小写 o 的第三个匹配项，从主题中的第三个字符开始：

SELECT body,
       REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b', 3, 3) AS result
  FROM message;

Copy

+---------------------------------------------+--------+
| BODY                                        | RESULT |
|---------------------------------------------+--------|
| Hellooo World                               | NULL   |
| How are you doing today?                    | today  |
| the quick brown fox jumps over the lazy dog | over   |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS     | NULL   |
+---------------------------------------------+--------+

返回包含小写 o 的第三个匹配项，从主题中的第三个字符开始，不区分大小写的匹配：

SELECT body,
       REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b', 3, 3, 'i') AS result
  FROM message;

Copy

+---------------------------------------------+--------+
| BODY                                        | RESULT |
|---------------------------------------------+--------|
| Hellooo World                               | NULL   |
| How are you doing today?                    | today  |
| the quick brown fox jumps over the lazy dog | over   |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS     | LIQUOR |
+---------------------------------------------+--------+

此示例说明，您可以通过指定空字符串来显式省略任何正则表达式参数。

SELECT body,
       REGEXP_SUBSTR(body, '(H\\S*o\\S*\\b).*', 1, 1, '') AS result
  FROM message;

Copy

+---------------------------------------------+--------------------------+
| BODY                                        | RESULT                   |
|---------------------------------------------+--------------------------|
| Hellooo World                               | Hellooo World            |
| How are you doing today?                    | How are you doing today? |
| the quick brown fox jumps over the lazy dog | NULL                     |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS     | NULL                     |
+---------------------------------------------+--------------------------+

以下示例说明了重叠出现的情况。首先，创建表并插入数据：

CREATE OR REPLACE TABLE overlap (
  id NUMBER,
  a STRING);

INSERT INTO overlap VALUES (1, ',abc,def,ghi,jkl,');
INSERT INTO overlap VALUES (2, ',abc,,def,,ghi,,jkl,');

SELECT * FROM overlap;

Copy

+----+----------------------+
| ID | A                    |
|----+----------------------|
|  1 | ,abc,def,ghi,jkl,    |
|  2 | ,abc,,def,,ghi,,jkl, |
+----+----------------------+

运行查询，在每行中查找以下模式的第二次出现：标点符号后是数字和字母，然后是标点符号。

SELECT id,
       REGEXP_SUBSTR(a,'[[:punct:]][[:alnum:]]+[[:punct:]]', 1, 2) AS result
  FROM overlap;

Copy

+----+--------+
| ID | RESULT |
|----+--------|
|  1 | ,ghi,  |
|  2 | ,def,  |
+----+--------+

以下示例说明如何使用模式匹配和连接从 Apache HTTP Server 访问日志中创建 JSON 对象。首先，创建表并插入数据：

CREATE OR REPLACE TABLE test_regexp_log (logs VARCHAR);

INSERT INTO test_regexp_log (logs) VALUES
  ('127.0.0.1 - - [10/Jan/2018:16:55:36 -0800] "GET / HTTP/1.0" 200 2216'),
  ('192.168.2.20 - - [14/Feb/2018:10:27:10 -0800] "GET /cgi-bin/try/ HTTP/1.0" 200 3395');

SELECT * from test_regexp_log

Copy

+-------------------------------------------------------------------------------------+
| LOGS                                                                                |
|-------------------------------------------------------------------------------------|
| 127.0.0.1 - - [10/Jan/2018:16:55:36 -0800] "GET / HTTP/1.0" 200 2216                |
| 192.168.2.20 - - [14/Feb/2018:10:27:10 -0800] "GET /cgi-bin/try/ HTTP/1.0" 200 3395 |
+-------------------------------------------------------------------------------------+

运行查询：

SELECT '{ "ip_addr":"'
       || REGEXP_SUBSTR (logs,'\\b\\d{1,3}\.\\d{1,3}\.\\d{1,3}\.\\d{1,3}\\b')
       || '", "date":"'
       || REGEXP_SUBSTR (logs,'([\\w:\/]+\\s[+\-]\\d{4})')
       || '", "request":"'
       || REGEXP_SUBSTR (logs,'\"((\\S+) (\\S+) (\\S+))\"', 1, 1, 'e')
       || '", "status":"'
       || REGEXP_SUBSTR (logs,'(\\d{3}) \\d+', 1, 1, 'e')
       || '", "size":"'
       || REGEXP_SUBSTR (logs,'\\d{3} (\\d+)', 1, 1, 'e')
       || '"}' as Apache_HTTP_Server_Access
  FROM test_regexp_log;

Copy

+-----------------------------------------------------------------------------------------------------------------------------------------+
| APACHE_HTTP_SERVER_ACCESS                                                                                                               |
|-----------------------------------------------------------------------------------------------------------------------------------------|
| { "ip_addr":"127.0.0.1", "date":"10/Jan/2018:16:55:36 -0800", "request":"GET / HTTP/1.0", "status":"200", "size":"2216"}                |
| { "ip_addr":"192.168.2.20", "date":"14/Feb/2018:10:27:10 -0800", "request":"GET /cgi-bin/try/ HTTP/1.0", "status":"200", "size":"3395"} |
+-----------------------------------------------------------------------------------------------------------------------------------------+

在 WHERE 子句中调用 REGEXP_SUBSTR 函数¶

在 WHERE 子句中调用 REGEXP_SUBSTR 函数，用于筛选包含与模式匹配的值的行。通过使用该函数，您可以避免多个 OR 条件。

以下示例查询您之前创建的 demo2 表，用于返回包含字符串 best 或字符串 thespian 的行。将 IS NOT NULL 添加到条件，以返回与模式匹配的行。也就是说，REGEXP_SUBSTR 函数未返回 NULL 的行：

SELECT id, string1
  FROM demo2
  WHERE REGEXP_SUBSTR(string1, '(best|thespian)') IS NOT NULL;

Copy

+----+------------------------------------------------------+
| ID | STRING1                                              |
|----+------------------------------------------------------|
|  2 | It was the best of times, it was the worst of times. |
|  4 | A thespian theater is nearby.                        |
+----+------------------------------------------------------+

您可以使用 AND 条件来查找与多个模式匹配的行。例如，以下查询返回包含字符串 best 或字符串 thespian 并以字符串 It 开头的行：

SELECT id, string1
  FROM demo2
  WHERE REGEXP_SUBSTR(string1, '(best|thespian)') IS NOT NULL
    AND REGEXP_SUBSTR(string1, '^It') IS NOT NULL;

Copy

+----+------------------------------------------------------
| ID | STRING1                                              |
|----+------------------------------------------------------|
|  2 | It was the best of times, it was the worst of times. |
+----+------------------------------------------------------+