Set up the Openflow Connector for PostgreSQL

Note

This connector is subject to the Snowflake Connector Terms.

This topic describes the steps to set up the Openflow Connector for PostgreSQL.

Note

This connector can be configured to immediately start replicating incremental changes for newly added tables, bypassing the snapshot load phase. This option is often useful when reinstalling the connector in an account where previously replicated data exists and you want to continue replication without having to re-snapshot tables.

For details on the incremental load process, see Incremental replication.

Prerequisites

  1. Ensure that you have reviewed About Openflow Connector for PostgreSQL.

  2. Ensure that you have reviewed Supported PostgreSQL versions.

  3. Recommended: Ensure that you add only one connector instance per runtime.

  4. Ensure that you have completed Set up Openflow - BYOC or Set up Openflow - Snowflake Deployments.

  5. If using Openflow - Snowflake Deployments, ensure that you've reviewed configuring required domains and have granted access to the required domains for the PostgreSQL connector.

  6. As a database administrator, perform the following tasks:

    1. Configure wal_level

    2. Create a publication

    3. Ensure that the PostgreSQL server has sufficient disk space for the WAL. Once a replication slot is created, PostgreSQL retains WAL data starting from the position recorded by that slot, until the connector acknowledges and advances the position.
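To keep an eye on how much WAL a slot is retaining, a query along these lines can help (available on PostgreSQL 10 and later):

```sql
-- Retained WAL per replication slot
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```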

    4. Ensure that every table enabled for replication has a primary key. The key can be a single column or a composite key.

    5. Set the tables' REPLICA IDENTITY (https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY) to DEFAULT. This ensures that primary keys are represented in the WAL and can be read by the connector.
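For example, assuming a table public.my_table (a placeholder name), the replica identity can be set and verified as follows:

```sql
-- DEFAULT records the primary key of changed rows in the WAL
ALTER TABLE public.my_table REPLICA IDENTITY DEFAULT;

-- Verify: relreplident = 'd' means DEFAULT
SELECT relname, relreplident FROM pg_class WHERE relname = 'my_table';
```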

    6. Create a user for the connector. The connector requires a user with the REPLICATION attribute and SELECT privileges on all tables to be replicated. Create the user with a password to enter into the connector's configuration. For more information about replication security, see Security (https://www.postgresql.org/docs/current/logical-replication-security.html).
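A minimal sketch of creating such a user, assuming the replicated tables live in the public schema (the user name and password are placeholders):

```sql
-- User with the REPLICATION attribute; the password goes into the connector config
CREATE USER openflow_pg_user WITH REPLICATION PASSWORD '<password>';

-- SELECT on all tables to be replicated
GRANT USAGE ON SCHEMA public TO openflow_pg_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO openflow_pg_user;

-- Optionally cover tables created in the schema later
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO openflow_pg_user;
```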

  7. As a Snowflake account administrator, perform the following tasks:

    1. Create a Snowflake user of type SERVICE. Create a database to store the replicated data, and grant the Snowflake user the privileges required to create objects in that database, namely the USAGE and CREATE SCHEMA privileges:

      CREATE DATABASE <destination_database>;
      CREATE USER <openflow_user> TYPE=SERVICE COMMENT='Service user for automated access of Openflow';
      CREATE ROLE <openflow_role>;
      GRANT ROLE <openflow_role> TO USER <openflow_user>;
      GRANT USAGE ON DATABASE <destination_database> TO ROLE <openflow_role>;
      GRANT CREATE SCHEMA ON DATABASE <destination_database> TO ROLE <openflow_role>;
      CREATE WAREHOUSE <openflow_warehouse>
        WITH
          WAREHOUSE_SIZE = 'XSMALL'
          AUTO_SUSPEND = 300
          AUTO_RESUME = TRUE;
      GRANT USAGE, OPERATE ON WAREHOUSE <openflow_warehouse> TO ROLE <openflow_role>;
      
    2. Create a secure key pair (public and private keys). Store the user's private key in a file for use when configuring the connector. Assign the public key to the Snowflake service user:

      ALTER USER <openflow_user> SET RSA_PUBLIC_KEY = 'thekey';
      

      For more information, see Key-pair authentication.
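One way to generate such a key pair is with OpenSSL. This sketch produces an unencrypted PKCS8 private key (rsa_key.p8) and the matching public key (rsa_key.pub); for production use, consider generating an encrypted key instead:

```shell
# Generate an unencrypted PKCS8 private key (omit -nocrypt to set a passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

# Derive the matching public key
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
```

The contents of rsa_key.pub, without the PEM header and footer lines, is the value to use in the ALTER USER ... SET RSA_PUBLIC_KEY statement.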

    3. Designate a warehouse for the connector to use. Start with the XSMALL warehouse size, then experiment with the size depending on the number of tables being replicated and the amount of data transferred. Large numbers of tables typically scale better with multi-cluster warehouses than with larger warehouse sizes.

Configure wal_level

The Openflow Connector for PostgreSQL requires wal_level (https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-LEVEL) to be set to logical.

Depending on where your PostgreSQL server is hosted, you can configure wal_level as follows:

On-premises

Execute the following query as a superuser or a user with the ALTER SYSTEM privilege:

ALTER SYSTEM SET wal_level = logical;
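Note that a change to wal_level takes effect only after a server restart. Once the server is back up, you can confirm the setting:

```sql
SHOW wal_level;  -- expected result: logical
```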

RDS

The user used by the agent needs to be assigned the rds_superuser or rds_replication role.

You also need to do the following:

  • Set the rds.logical_replication static parameter to 1.

  • Set the max_replication_slots, max_connections, and max_wal_senders parameters according to your database and replication setup.

AWS Aurora

Set the rds.logical_replication static parameter to 1.

GCP

Set the following flags:

  • cloudsql.logical_decoding=on

  • cloudsql.enable_pglogical=on

For more information, see the Google Cloud documentation (https://cloud.google.com/sql/docs/postgres/replication/configure-logical-replication#set-up-logical-replication-with-pglogical).

Azure

Set replication support to Logical. For more information, see the Azure documentation (https://learn.microsoft.com/en-us/azure/postgresql/single-server/concepts-logical#set-up-your-server).

Create a publication

Before replication starts, the Openflow Connector for PostgreSQL requires you to create and configure a publication (https://www.postgresql.org/docs/current/logical-replication-publication.html#LOGICAL-REPLICATION-PUBLICATION) in PostgreSQL. You can create a publication for all tables, for a subset of tables, or for specific tables with only specified columns. Ensure that every table and column you plan to replicate is included in the publication. You can also modify the publication later, while the connector is running. To create and configure a publication, do the following:

  1. Log in to the database as a user with the CREATE privilege and run the following query:

    • For PostgreSQL 13 and later:

      CREATE PUBLICATION <publication name> WITH (publish_via_partition_root = true);
      

      The additional publish_via_partition_root option is needed for correct replication of partitioned tables. To learn more about ingestion of partitioned tables, see Replicate a partitioned table.

    • For PostgreSQL versions earlier than 13:

      CREATE PUBLICATION <publication name>;
      
  2. Define the tables visible to the database agent using the following command:

ALTER PUBLICATION <publication name> ADD TABLE <table name>;

For partitioned tables, it's enough to add just the root partition table to the publication. See Replicate a partitioned table for more details.

Important

PostgreSQL 15 and later versions support configuring a publication for a specified subset of a table's columns. For the connector to support this feature correctly, you must use the column filter setting to include the same columns that are set on the publication.

Without this setting, the connector behaves as follows:

  • In the destination table, columns not included in the filter are suffixed with __DELETED. All data replicated during the snapshot phase is retained.

  • After you add new columns to the publication, the table enters a permanently failed state, and you will need to restart its replication.

For more information, see ALTER PUBLICATION (https://www.postgresql.org/docs/current/sql-alterpublication.html).

Install the connector

To install the connector, do the following as a data engineer:

  1. Navigate to the Openflow overview page. In the Featured connectors section, select View more connectors.

  2. On the Openflow connectors page, find the connector and select Add to runtime.

  3. In the Select runtime dialog, select your runtime from the Available runtimes drop-down list and click Add.

    Note

    Before installing the connector, ensure that you have created a database and schema in Snowflake for the connector to store ingested data.

  4. Authenticate to the deployment with your Snowflake account credentials, and select Allow when prompted to allow the runtime application to access your Snowflake account. The connector installation process takes a few minutes to complete.

  5. Authenticate to the runtime with your Snowflake account credentials.

The Openflow canvas appears with the connector process group added to it.

Configure the connector

To configure the connector, do the following as a data engineer:

  1. Right-click the imported process group and select Parameters.

  2. Populate the required parameter values as described in Flow parameters.

Flow parameters

Start by setting the parameters of the PostgreSQL Source Parameters context, then the PostgreSQL Destination Parameters context. Once this is done, you can enable the connector; it should connect to both PostgreSQL and Snowflake and start running. However, it will not replicate any data until tables are explicitly added to its configuration.

To configure specific tables for replication, edit the PostgreSQL Ingestion Parameters context. Shortly after you apply changes to the parameter context, the connector detects the configuration and starts the replication lifecycle for each table.

PostgreSQL Source Parameters context

Parameter

Description

PostgreSQL Connection URL

The full JDBC URL pointing to the source database. Example: jdbc:postgresql://example.com:5432/public

If you're connecting to a PostgreSQL replica server, see Replicate tables from PostgreSQL replica servers.

PostgreSQL JDBC Driver

The path to the PostgreSQL JDBC driver jar (https://jdbc.postgresql.org/). Download the jar from the website, then select the Reference asset checkbox to upload and attach it.

PostgreSQL Username

The username for the connector.

PostgreSQL Password

The password for the connector.

Publication Name

The name of the publication you created earlier.

Replication Slot Name

Optional. When no value is provided, the connector will create a new, uniquely-named slot. When given a value, the connector will use the existing slot, or create a new one with the provided name.

Changing the value for a running connector will restart reading the incremental change data capture (CDC) stream from the updated slot's position.

PostgreSQL Destination Parameters context

Parameter

Description

Required

Destination Database

The database where data will be persisted. It must already exist in Snowflake. The name is case-sensitive; for unquoted identifiers, provide the name in uppercase.

Snowflake Authentication Strategy

When using:

  • Snowflake Openflow Deployment or BYOC: Use SNOWFLAKE_MANAGED_TOKEN. This token is managed automatically by Snowflake. BYOC deployments must have previously configured runtime roles to use SNOWFLAKE_MANAGED_TOKEN.

  • BYOC: Alternatively, BYOC deployments can use KEY_PAIR as the authentication strategy value.

Snowflake Account Identifier

When using:

  • Session Token Authentication Strategy: Must be left empty.

  • KEY_PAIR: The Snowflake account name, in the format [organization-name]-[account-name], where the data will be persisted.

Snowflake Connection Strategy

When using KEY_PAIR, specify the strategy for connecting to Snowflake:

  • STANDARD (default): Connect using standard public routing to Snowflake services.

  • PRIVATE_CONNECTIVITY: Connect using private addresses associated with the supporting cloud platform such as AWS PrivateLink.

Required for BYOC with KEY_PAIR only, otherwise ignored.

Snowflake Private Key

When using:

  • Session Token Authentication Strategy: Must be left empty.

  • KEY_PAIR: Must be the RSA private key used for authentication.

    The RSA key must be formatted according to PKCS8 standards and have standard PEM headers and footers. Note that either a Snowflake Private Key File or a Snowflake Private Key must be defined.

Snowflake Private Key File

When using:

  • Session Token Authentication Strategy: The private key file must be empty.

  • KEY_PAIR: Upload the file containing the RSA private key used to authenticate to Snowflake, formatted according to PKCS8 standards and including standard PEM headers and footers. The header line begins with -----BEGIN PRIVATE. To upload the private key file, select the Reference asset checkbox.

Snowflake Private Key Password

When using:

  • Session Token Authentication Strategy: Must be left empty.

  • KEY_PAIR: Provide the password associated with the Snowflake Private Key File.

Snowflake Role

When using:

  • Session Token Authentication Strategy: Use the Snowflake role assigned to the runtime, or a child role granted to that role. You can find your runtime's Snowflake role in the Openflow UI by expanding the More Options [⋮] button for your runtime and selecting Set Snowflake role.

  • KEY_PAIR authentication strategy: Use a valid role configured for your service user.

Snowflake Username

When using:

  • Session Token Authentication Strategy: Must be left empty.

  • KEY_PAIR: Provide the username used to connect to the Snowflake instance.

Oversized Value Strategy

Determines how the connector handles values that exceed its internal size limits (16 MB) during replication. Possible values are:

  • Fail Table (default): The table is marked as permanently failed, and replication stops for that table.

  • Set Null: The value is replaced with NULL in the destination table. Use this to prevent table failures when losing the oversized values is acceptable.

Snowflake Warehouse

The Snowflake warehouse used to run queries.

PostgreSQL Ingestion Parameters context

Parameter

Description

Included Table Names

A comma-separated list of table paths, including their schemas. Example: public.my_table, other_schema.other_table

Select tables by name or by regular expression. If both options are used, all tables matching either option are included.

Tables that are sub-partitions are always excluded from ingestion. See Replicate a partitioned table for more information.

Included Table Regex

A regular expression to match table paths. Every path matching the expression is replicated, and new tables matching the pattern that are created later are included automatically. Example: public\.auto_.*

Select tables by name or by regular expression. If both options are used, all tables matching either option are included.

Tables that are sub-partitions are always excluded from ingestion. See Replicate a partitioned table for more information.

Column Filter JSON

Optional. A JSON array containing fully qualified table names and regular expression patterns for the column names that should be included in replication. Example: [ {"schema":"public", "table":"table1", "includedPattern":".*name"} ] includes all columns ending with name in table1 of the public schema.

Merge Task Schedule CRON

A CRON expression that defines when merge tasks from the journal to the destination table are triggered. Set it to * * * * * ? to merge continuously, or use a schedule to limit warehouse running time to planned windows.

For example:

  • The string * 0 * * * ? schedules a merge at the top of every hour, lasting one minute

  • The string * 20 14 ? * MON-FRI schedules merges to trigger at 2:20 PM every Monday through Friday

For additional information and examples, see the cron trigger tutorial in the Quartz documentation (https://www.quartz-scheduler.org/documentation/quartz-2.2.2/tutorials/tutorial-lesson-06.html).

Object Identifier Resolution

Specifies how source object identifiers, such as the names of schemas, tables, and columns, are stored and queried in Snowflake. This setting determines whether you must use double quotes in SQL queries.

Option 1: Default, case-sensitive. For backwards compatibility.

  • Transformation: Case is preserved. For example, My_Table remains My_Table.

  • Queries: SQL queries must use double quotes to match the exact case for database objects. For example, SELECT * FROM "My_Table";.

Note

Snowflake recommends using this option if you must preserve source casing for legacy or compatibility reasons. For example, the source database might include table names that differ only in case, such as MY_TABLE and my_table, which would result in a name collision when using case-insensitive comparisons.

Option 2: Recommended, case-insensitive

  • Transformation: All identifiers are converted to uppercase. For example, My_Table becomes MY_TABLE.

  • Queries: SQL queries are case-insensitive and don't require SQL double quotes. For example, SELECT * FROM my_table; returns the same results as SELECT * FROM MY_TABLE;.

Note

Snowflake recommends using this option if database objects are not expected to have mixed case names.

Important

Do not change this setting after the connector has begun ingesting data. Changing this setting after ingestion has begun breaks the existing ingestion. If you must change this setting, create a new connector instance.

Replicate tables from PostgreSQL replica servers

The connector can ingest data using logical replication (https://www.postgresql.org/docs/current/logical-replication.html) from a primary server, a hot standby replica (https://www.postgresql.org/docs/current/hot-standby.html), or a subscriber server. Before configuring the connector to connect to a PostgreSQL replica, make sure that replication between the primary and replica nodes works correctly. When investigating missing data in the connector, first confirm that the missing rows exist on the replica server used by the connector.

Additional considerations when connecting to a standby replica:

  • Only connections to hot standby replicas are supported. Note that warm standby replicas cannot accept client connections until they are promoted to primary.

  • The PostgreSQL version of the server must be 16 or later.

  • The publication required by the connector must be created on the primary server, not on the standby. Standby servers are read-only and don't allow creating publications.

If you connect to a hot standby instance and see the error Trying to create the replication slot '<replication slot>' timed out. If connecting to a standby instance, ensure there is some traffic on the primary PostgreSQL instance, otherwise the call to create a replication slot will never return., or if the Read PostgreSQL CDC Stream processor doesn't start, log in to the primary PostgreSQL instance and execute the following query:

SELECT pg_log_standby_snapshot();

This error occurs when there are no data changes on the primary server, which can cause the connector to stall while creating a replication slot on the replica server. The replica needs information about the transactions running on the primary in order to create a replication slot, and an idle primary doesn't send that information. The pg_log_standby_snapshot() function forces the primary to send information about running transactions to the replica.

Restart table replication

A table in a FAILED state (for example, due to a missing primary key or an unsupported schema change) does not restart automatically. If a table enters a FAILED state or you need to restart replication from scratch, use the following procedure to remove and re-add the table to replication.

Note

If the failure was caused by an issue in the source table such as a missing primary key, resolve that issue in the source database before continuing.

  1. Remove the table from flow parameters: In the Ingestion Parameters context, either remove the table from the Included Table Names or modify the Included Table Regex so the table is no longer matched.

  2. Verify the table has been removed:

    1. In the Openflow runtime canvas, right-click a processor group and choose Controller Services.

    2. In the table listing controller services, locate the Table State Store row, click the three vertical dots on the right side of the row, then choose View State.

    Important

    You must wait until the table's state is fully removed from this list before proceeding. Do not continue until this configuration change has completed.

  3. Clean up the destination: Once the table's state shows as fully removed, manually DROP the destination table in Snowflake. Note that the connector will not overwrite an existing destination table during the snapshot phase; if the table still exists, replication will fail again. Optionally, the journal table and stream can also be removed if they are no longer needed.

  4. Re-add the table: Update the Included Table Names or Included Table Regex parameters to include the table again.

  5. Verify the restart: Check the Table State Store using the instructions given previously. The state of the table should appear with the status NEW, then transition to SNAPSHOT_REPLICATION, and finally INCREMENTAL_REPLICATION.

Replicate a subset of a table's columns

The connector can filter the data replicated for each table down to a configured subset of columns.

To apply a filter to columns, modify the Column Filter property in the ingestion parameters context, adding an array of configurations, one entry per table whose columns need filtering.

Columns can be included or excluded by name or by pattern. You can apply a single criterion per table or combine several; exclusions always take precedence over inclusions.

The following example shows the available fields. schema and table are required, and at least one of the following fields must be set: included, excluded, includedPattern, excludedPattern.

[
    {
        "schema": "<source table schema>",
        "table" : "<source table name>",
        "included": ["<column name>", "<column name>"],
        "excluded": ["<column name>", "<column name>"],
        "includedPattern": "<regular expression>",
        "excludedPattern": "<regular expression>"
    }
]

Replicate a partitioned table

The connector supports replication of partitioned tables for PostgreSQL servers version 15 and later. A PostgreSQL partitioned table is replicated into Snowflake as a single destination table.

For example, if you have a partitioned table orders with sub-partitions orders_2023 and orders_2024, and you configured the connector to ingest all tables matching the orders.* pattern, then only the orders table is replicated to Snowflake, and it includes data from all sub-partitions.

To support replication of partitioned tables, ensure that the publication created in PostgreSQL has the publish_via_partition_root option set to true.
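If the publication already exists, you can check and, if needed, enable the option; the pubviaroot column of the pg_publication catalog reflects this setting (the publication name below is a placeholder):

```sql
-- true in pubviaroot means publish_via_partition_root is enabled
SELECT pubname, pubviaroot FROM pg_publication;

-- Enable the option on an existing publication
ALTER PUBLICATION my_publication SET (publish_via_partition_root = true);
```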

Ingestion of partitioned tables currently has the following limitations:

  • When a table is attached as a partition to a partitioned table after ingestion has started, the connector won't fetch data that existed in the partition table before it was attached.

  • When a sub-partition table is detached from the partitioned table after ingestion has started, the connector won't mark the data from this sub-partition as deleted in the root partition table.

  • A truncate operation on sub-partitions will not mark affected records as deleted.

Track data changes in tables

The connector replicates not only the current state of the data in the source tables, but also every state of every row in each change set. This data is stored in journal tables created in the same schema as the destination tables.

The journal table names are formatted as: <source_table_name>_JOURNAL_<timestamp>_<schema_generation> where <timestamp> is the value of epoch seconds when the source table was added to replication, and <schema_generation> is an integer increasing with every schema change on the source table. As a result, source tables that undergo schema changes will have multiple journal tables.

When a table is removed from replication, then added back, the <timestamp> value will change, and <schema_generation> will start again from 1.

Important

Snowflake recommends that you do not alter the structure of journal tables in any way. They are used by the connector to update the destination table as part of the replication process.

The connector never drops journal tables, but it uses only the latest journal for each replicated source table, reading append-only streams on top of the journals. To reclaim storage, you can:

  • Truncate all journal tables at any time.

  • Drop the journal tables related to source tables that were removed from replication.

  • Drop all but the latest generation journal tables for actively replicated tables.

For example, if your connector is set up to actively replicate a source table orders, and you previously removed the table customers from replication, the following journal tables may exist. In this case, you can drop all of them except orders_5678_2:

customers_1234_1
customers_1234_2
orders_5678_1
orders_5678_2

Configure the merge task schedule

The connector uses a warehouse to merge change data capture (CDC) data into the destination tables. This operation is triggered by the MergeSnowflakeJournalTable processor. If there are no new changes and no new FlowFiles waiting in the MergeSnowflakeJournalTable queue, no merge is triggered and the warehouse suspends automatically.

To limit warehouse costs and restrict merging to scheduled times only, use a CRON expression in the Merge Task Schedule CRON parameter. It throttles the FlowFiles flowing to the MergeSnowflakeJournalTable processor so that merges are triggered only during specific time windows. For more information about scheduling, see Scheduling strategy (https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#scheduling-strategy).

Stop or delete the connector

When stopping or removing the connector, pay attention to the replication slot (https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS) used by the connector.

The connector creates its own replication slot, with a name starting with snowflake_connector_ followed by a random suffix. The slot's position is advanced as the connector reads the replication stream, allowing PostgreSQL to clean up WAL logs and free disk space.

While the connector is paused, the replication slot is not advanced, and changes to the source database keep growing the WAL. Therefore, pausing the connector for long periods is not recommended, especially on high-traffic databases.

When the connector is removed, whether by deleting it from the Openflow canvas or by any other means (such as deleting the whole Openflow instance), the replication slot stays in place and must be dropped manually.

If multiple connector instances replicate from the same PostgreSQL database, each instance creates its own uniquely named replication slot. When dropping a replication slot manually, make sure you drop the correct one. You can check which replication slot a given connector instance uses by inspecting the state of the CaptureChangePostgreSQL processor.
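For example, you can list the connector's slots and drop a leftover one with queries along these lines, run against the source database (the slot name shown is illustrative):

```sql
-- List replication slots created by the connector
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'snowflake_connector_%';

-- Drop a slot once you're sure no connector is using it
SELECT pg_drop_replication_slot('snowflake_connector_abc123');
```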

Run the flow

  1. Right-click the plane icon and select Enable all Controller Services.

  2. Right-click the imported process group and select Start. The connector starts data ingestion.