Ingestion model

This topic describes how the connector ingests data from the Google Analytics Data API (https://developers.google.com/analytics/devguides/reporting/data/v1) and how sampling may affect ingested data.

Ingestion strategies

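The connector uses two ingestion modes:

  • The initial load of data starts immediately after a report is configured. After the initial load completes successfully, data from the selected start date through the current day has been ingested.
  • The ongoing load of data starts after the initial load completes. Incremental updates run on the selected recurring schedule.

Each report is ingested in a separate process. Ingestion processes can run in parallel.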

See Set up data ingestion for your instance to learn how to configure a report or choose a sync schedule and a start date.

Choosing interval length

The Google Analytics Data API (https://developers.google.com/analytics/devguides/reporting/data/v1) requires a date range (startDate and endDate) for each request. The connector may make multiple requests during one ingestion load and adjust the interval length as required. The default interval is 31 days. The interval may be shortened automatically in the following situations:

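  • The API returned an error that the connector can mitigate by retrying the request with a shorter interval.
  • The API returned sampled data (only if the Avoid sampling option was selected during report configuration).
  • The report contains a large amount of data. In this case, the interval is shortened to reduce the risk of API errors when retrieving subsequent result pages.

Users cannot set the interval length.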
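
The interval handling described above can be illustrated with a minimal sketch. It is not the connector's implementation: the halving strategy, the function names, and the fetch callable are assumptions made for illustration. Only the 31-day default and the idea of retrying with a shorter interval come from this topic.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple

DEFAULT_INTERVAL_DAYS = 31  # default interval length stated in this topic


def split_range(start: date, end: date,
                days: int = DEFAULT_INTERVAL_DAYS) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (startDate, endDate) pairs that cover start..end."""
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, chunk_end
        cursor = chunk_end + timedelta(days=1)


def ingest(start: date, end: date,
           fetch: Callable[[date, date], bool],
           days: int = DEFAULT_INTERVAL_DAYS) -> List[Tuple[date, date]]:
    """Fetch start..end, shortening the interval when a request is rejected.

    fetch is a hypothetical callable. It returns True when the request
    produced usable data and False when the interval should be shortened
    (API error, sampled data, or too much data).
    Halving the interval is an assumption, not documented behavior.
    """
    done = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days - 1), end)
        if fetch(cursor, chunk_end):
            done.append((cursor, chunk_end))
            cursor = chunk_end + timedelta(days=1)
        elif days > 1:
            days = max(1, days // 2)  # retry the same start date with a shorter range
        else:
            raise RuntimeError(f"cannot ingest {cursor} even with a one-day interval")
    return done
```

In this sketch the shortened interval stays in effect for the rest of the load. Whether the connector restores the default length afterwards is not stated in this topic.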

Monitoring ingestion

Ingestion metadata is available in the CONNECTOR_STATS view. See more: Monitoring the connector.

SELECT * FROM PUBLIC.CONNECTOR_STATS ORDER BY COMPLETED_AT DESC;

The METADATA column contains, among other things, the request body that was sent in a request to the Google Analytics Data API (https://developers.google.com/analytics/devguides/reporting/data/v1). The request body contains information about startDate and endDate.

The STATUS column contains one of the following values:
  • COMPLETED - a successful ingestion.
  • CANCELED - the interval length was shortened and the ingestion will continue with adjusted date ranges.
  • FAILED - ingestion failed and was not continued.

Note

A FAILED ingestion doesn't necessarily mean that data was lost. The connector may recover from some errors by attempting to download all missing data during the next scheduled report update. If subsequent ingestion runs succeed, the connector has ingested all missing data.
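
The status semantics above can be sketched in a few lines. This is a hedged illustration, not connector code: it assumes rows from CONNECTOR_STATS for a single report have already been fetched into dictionaries with the hypothetical keys completed_at and status. CANCELED runs are skipped because, as described above, the ingestion continues with adjusted date ranges.

```python
from datetime import datetime
from typing import Iterable, Mapping, Optional


def latest_outcome(runs: Iterable[Mapping]) -> Optional[str]:
    """Return the status of the most recent run that is not CANCELED.

    Each run is a mapping with the assumed keys "completed_at" (datetime)
    and "status" (COMPLETED, CANCELED, or FAILED).
    Returns None when no such run exists.
    """
    terminal = [run for run in runs if run["status"] != "CANCELED"]
    if not terminal:
        return None
    return max(terminal, key=lambda run: run["completed_at"])["status"]


# Example: an earlier failure followed by a successful run.
runs = [
    {"completed_at": datetime(2024, 5, 1, 2, 0), "status": "FAILED"},
    {"completed_at": datetime(2024, 5, 2, 2, 0), "status": "CANCELED"},
    {"completed_at": datetime(2024, 5, 2, 2, 5), "status": "COMPLETED"},
]
outcome = latest_outcome(runs)
```

A result of COMPLETED after an earlier FAILED run matches the recovery case described in the note. A result of FAILED suggests checking the next scheduled update.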

To receive email notifications about failed ingestion runs, set up alerting. See more: Manage the connector.

About sampling

Sampling is the process of selecting and analyzing a subset of data from a larger dataset in order to extrapolate the result. Because the result is extrapolated from a subset, sampled data is less accurate than unsampled data, and its quality depends on the number of samples used in the process. For more information, see Google Analytics sampling (https://support.google.com/analytics/answer/13331292?hl=en).

Note

By default, the connector doesn't attempt to avoid sampling. This setting can be changed only during the initial report configuration.

Getting sampling metadata

The METADATA column of the CONNECTOR_STATS view also contains sampling metadata. It can be joined with the data saved in a destination table.

Use the following statement to get information about sampled data:

SELECT
    d.date,
    d.raw,
    d.last_update_date,
    cs.metadata:samplingMetadata:samplesReadCount::INTEGER as samplesReadCount,
    cs.metadata:samplingMetadata:samplingSpaceSize::INTEGER as samplingSpaceSize,
    samplesReadCount / samplingSpaceSize as ratio
FROM <destination_table> as d
LEFT JOIN <connector_stats_view> as cs
ON d.ingestion_run_id = cs.run_id
WHERE cs.metadata:samplingMetadata:samplingOccurred::BOOLEAN = true;

Replace the placeholders with the actual values, as in the following example for a report named REPORT_1.

SELECT
    d.date,
    d.raw,
    d.last_update_date,
    cs.metadata:samplingMetadata:samplesReadCount::INTEGER as samplesReadCount,
    cs.metadata:samplingMetadata:samplingSpaceSize::INTEGER as samplingSpaceSize,
    samplesReadCount / samplingSpaceSize as ratio
FROM google_analytics_aggregate_data_dest_db.google_analytics_aggregate_data_dest_schema.report_1__raw as d
LEFT JOIN snowflake_connector_for_google_analytics_aggregate_data.public.connector_stats as cs
ON d.ingestion_run_id = cs.run_id
WHERE cs.metadata:samplingMetadata:samplingOccurred::BOOLEAN = true;

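The result contains the following sampling-related information.

Name                 Description
samplesReadCount     The total number of events read in this sampled report for the specified date range.
samplingSpaceSize    The total number of events in this property's data that this report could have analyzed for the specified date range.
ratio                The ratio of the number of analyzed events to the number of events that could have been analyzed.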

The Google Analytics sampling metadata documentation (https://developers.google.com/analytics/devguides/reporting/data/v1/rest/v1beta/ResponseMetaData#SamplingMetadata) provides more information about the meaning of the sampling metadata values.
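
As a worked illustration of the ratio, the following sketch computes it from a METADATA value that has been exported as a JSON string. The field names under samplingMetadata are the ones used in the statements above. The function name and the sample values are hypothetical.

```python
import json
from typing import Optional


def sampling_ratio(metadata_json: str) -> Optional[float]:
    """Return samplesReadCount / samplingSpaceSize, or None if not applicable.

    Returns None when samplingOccurred is missing or false, or when the
    sampling space size is zero.
    """
    sampling = json.loads(metadata_json).get("samplingMetadata") or {}
    if not sampling.get("samplingOccurred"):
        return None
    # int() accepts both JSON numbers and numeric strings.
    read = int(sampling["samplesReadCount"])
    space = int(sampling["samplingSpaceSize"])
    if space == 0:
        return None
    return read / space


# Made-up values: 250,000 events analyzed out of 1,000,000 available.
example = json.dumps({
    "samplingMetadata": {
        "samplingOccurred": True,
        "samplesReadCount": "250000",
        "samplingSpaceSize": "1000000",
    }
})
ratio = sampling_ratio(example)  # 0.25
```

A ratio of 0.25 means that a quarter of the available events were analyzed. Values closer to 1 indicate that the sampled result rests on a larger share of the data.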

Note

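Before the upgrade to version 1.4.0, ingestion metadata did not contain information about whether sampling occurred. You can be certain that data was not sampled only when the samplingOccurred flag is False.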