Snowpark Container Services: Monitoring Services¶
Accessing container logs¶
Snowflake collects whatever your application containers output to standard output and standard error. You should ensure that your code outputs useful information to debug a service.
Snowflake provides two ways to access container logs for services (including job services):
Using the SYSTEM$GET_SERVICE_LOGS system function: Gives access to logs from a specific container. After a container exits, you can continue to access the logs using the system function for a short time. System functions are most useful during development and testing, when you are initially authoring a service or a job. For more information, see SYSTEM$GET_SERVICE_LOGS.
Using an event table: The account’s event table gives you access to logs from multiple containers for services that enable log collection in their specification. Snowflake persists the logs in the event table for later access. Event tables are best used for the retrospective analysis of services and jobs. For more information, see Using event table.
Using SYSTEM$GET_SERVICE_LOGS¶
You provide the service name, instance ID, container name, and optionally the number of most recent log lines to retrieve. If only one service instance is running, the service instance ID is 0. For example, the following statement retrieves the trailing 10 lines from the log of a container named echo that belongs to instance 0 of a service named echo_service:
SELECT SYSTEM$GET_SERVICE_LOGS('echo_service', '0', 'echo', 10);
Example output:
+--------------------------------------------------------------------------+
| SYSTEM$GET_SERVICE_LOGS |
|--------------------------------------------------------------------------|
| 10.16.6.163 - - [11/Apr/2023 21:44:03] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:08] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:13] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:18] "GET /healthcheck HTTP/1.1" 200 - |
+--------------------------------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.878s
If you don’t have the information about the service that you need to call the function (such as the instance ID or container name), you can first run the SHOW SERVICE CONTAINERS IN SERVICE command to get information about the service instances and containers running in each instance.
The SYSTEM$GET_SERVICE_LOGS function has the following limitations:
It merges standard output and standard error streams. The function provides no indication of which stream the output came from.
It reports the captured data for a specific container in a single service instance.
It only reports logs for a running container. The function cannot fetch logs from a previous container that was restarted or from a container of a service that is stopped or deleted.
The function returns up to 100 KB of data.
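As a rough illustration of the call shape described above, the following Python sketch builds the SYSTEM$GET_SERVICE_LOGS statement from its arguments and splits the returned text into lines. The helper names (build_get_service_logs_sql, split_log_lines) are hypothetical, and the sketch assumes the function returns the log as one newline-separated string, as the single-row example output suggests. Executing the statement through a Snowflake client is left out.

```python
def _quote(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_get_service_logs_sql(service, instance_id, container, num_lines=None):
    """Build a SELECT statement that calls SYSTEM$GET_SERVICE_LOGS.

    instance_id is passed as a string literal, matching the example above.
    num_lines is optional, as in the function itself.
    """
    args = [_quote(service), _quote(instance_id), _quote(container)]
    if num_lines is not None:
        if int(num_lines) <= 0:
            raise ValueError("num_lines must be a positive integer")
        args.append(str(int(num_lines)))
    return "SELECT SYSTEM$GET_SERVICE_LOGS(" + ", ".join(args) + ");"


def split_log_lines(log_text):
    """Split the returned log text into non-empty lines (assumed newline-separated)."""
    return [line for line in log_text.splitlines() if line.strip()]
```

Because the function merges standard output and standard error, the split lines carry no stream information; use the event table when you need to tell the two apart.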
Using event table¶
Snowflake can capture logs sent from containers to the standard output and standard error streams into the event table configured for your account. For more information about configuring an event table, see Logging, tracing, and metrics.
You control which streams (all, standard error only, or none) are stored in the event table by using the spec.logExporters field in the service specification file.
You can then query the event table for events. For example, the following SELECT statement retrieves Snowflake service and job events recorded in the past hour:
SELECT TIMESTAMP, RESOURCE_ATTRIBUTES, RECORD_ATTRIBUTES, VALUE
FROM <current-event-table-for-your-account>
WHERE timestamp > dateadd(hour, -1, current_timestamp())
AND RESOURCE_ATTRIBUTES:"snow.service.name" = <service-name>
AND RECORD_TYPE = 'LOG'
ORDER BY timestamp DESC
LIMIT 10;
Snowflake recommends that you include a timestamp in the WHERE clause of event table queries, as shown in this example. This is particularly important because of the potential volume of data generated by various Snowflake components. By applying filters, you can retrieve a smaller subset of data, which improves query performance.
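The query pattern above can be sketched as a small Python helper that always applies the timestamp filter. This is an illustrative sketch only: build_event_query and its parameters are hypothetical names, and the helper accepts only unquoted table identifiers so that the generated text stays predictable.

```python
import re

# Accepts table, schema.table, or db.schema.table with unquoted identifiers only.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")


def build_event_query(event_table, service_name, record_type="LOG", hours=1, limit=10):
    """Return SQL that reads recent service events, always filtered by timestamp."""
    if not _IDENT.match(event_table):
        raise ValueError("event_table must be an unquoted [db.][schema.]table name")
    if record_type not in ("LOG", "METRIC"):
        raise ValueError("record_type must be 'LOG' or 'METRIC'")
    hours, limit = int(hours), int(limit)
    if hours <= 0 or limit <= 0:
        raise ValueError("hours and limit must be positive")
    # Double embedded single quotes so the service name is a safe string literal.
    service_literal = "'" + service_name.replace("'", "''") + "'"
    return "\n".join([
        "SELECT TIMESTAMP, RESOURCE_ATTRIBUTES, RECORD_ATTRIBUTES, VALUE",
        f"FROM {event_table}",
        f"WHERE timestamp > DATEADD(hour, -{hours}, CURRENT_TIMESTAMP())",
        f'AND RESOURCE_ATTRIBUTES:"snow.service.name" = {service_literal}',
        f"AND RECORD_TYPE = '{record_type}'",
        "ORDER BY timestamp DESC",
        f"LIMIT {limit};",
    ])
```

For example, build_event_query("my_db.my_schema.my_events", "ECHO_SERVICE") produces a query for the last hour of log records.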
The event table includes the following columns, which provide useful information regarding the logs collected by Snowflake from your container:
TIMESTAMP: Shows when Snowflake collected the log.
RESOURCE_ATTRIBUTES: Provides a JSON object that identifies the Snowflake service and the container in the service that generated the log message. For example, it furnishes details such as the service name, container name, and compute pool name that were specified when the service was run.
{ "snow.account.name": "SPCSDOCS1", "snow.compute_pool.id": 20, "snow.compute_pool.name": "TUTORIAL_COMPUTE_POOL", "snow.compute_pool.node.id": "a17e8157", "snow.compute_pool.node.instance_family": "CPU_X64_XS", "snow.database.id": 26, "snow.database.name": "TUTORIAL_DB", "snow.schema.id": 212, "snow.schema.name": "DATA_SCHEMA", "snow.service.container.instance": "0", "snow.service.container.name": "echo", "snow.service.container.run.id": "b30566", "snow.service.id": 114, "snow.service.name": "ECHO_SERVICE2", "snow.service.type": "Service" }
RECORD_ATTRIBUTES: For a Snowflake service, it identifies the stream that the log line came from (standard output or standard error).
{ "log.iostream": "stdout" }
VALUE: Standard output and standard error are broken into lines, and each line generates a record in the event table.
"echo-service [2023-10-23 17:52:27,429] [DEBUG] Sending response: {'data': [[0, 'Joe said hello!']]}"
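To show how these columns fit together, the following Python sketch condenses one log row into a small summary. It assumes the RESOURCE_ATTRIBUTES and RECORD_ATTRIBUTES values arrive either as JSON text or as already-parsed dictionaries, depending on the client; describe_log_record is a hypothetical helper name.

```python
import json


def _as_dict(attributes):
    """Accept either JSON text or an already-parsed dictionary."""
    return json.loads(attributes) if isinstance(attributes, str) else dict(attributes)


def describe_log_record(resource_attributes, record_attributes, value):
    """Summarize one LOG row: which service, container, and stream produced the line."""
    resource = _as_dict(resource_attributes)
    record = _as_dict(record_attributes)
    return {
        "service": resource.get("snow.service.name"),
        "container": resource.get("snow.service.container.name"),
        "instance": resource.get("snow.service.container.instance"),
        "compute_pool": resource.get("snow.compute_pool.name"),
        "stream": record.get("log.iostream"),
        "line": value,
    }
```

Applied to the sample values above, the summary would name ECHO_SERVICE2 as the service, echo as the container, and stdout as the stream.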
Accessing metrics¶
Snowflake provides metrics for compute pools in your account and services running on those compute pools. These metrics, provided by Snowflake, are also referred to as platform metrics.
Event-table service metrics: Individual services publish metrics. These are a subset of the compute pool metrics that provide information specific to the service. The target use case for this is to observe the resource utilization of a specific service. In the service specification, you define which metrics you want Snowflake to record in the event table while the service is running.
Compute pool metrics: Each compute pool also publishes metrics that provide information about what is happening inside that compute pool. The target use case for this is to observe the compute pool utilization. To access your compute pool metrics, you need to write a service that uses a Prometheus-compatible API to poll the metrics that the compute pool publishes.
Accessing event-table service metrics¶
To log metrics from a service into the event table configured for your account, include the following section in your service specification:
platformMonitor:
  metricConfig:
    groups:
    - <group 1>
    - <group 2>
    - ...
Where each group N refers to a predefined metrics group that you are interested in (for example, system, network, or storage). For more information, see the spec.platformMonitor field section in the documentation on the service specification.
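As a small sketch of the configuration described above, the following Python helper renders the platformMonitor section for a chosen list of metric groups. The helper name platform_monitor_fragment is hypothetical, and the accepted group names are taken from the metric groups listed on this page (system, system_limits, status, network, storage); treat that list as an assumption and check the service specification reference for the authoritative set.

```python
# Group names as they appear in the platform metrics table on this page.
KNOWN_GROUPS = ("system", "system_limits", "status", "network", "storage")


def platform_monitor_fragment(groups, indent=2):
    """Render the platformMonitor section for the given metric groups as YAML text."""
    if not groups:
        raise ValueError("at least one metric group is required")
    unknown = [g for g in groups if g not in KNOWN_GROUPS]
    if unknown:
        raise ValueError("unknown metric group(s): " + ", ".join(unknown))
    pad = " " * indent
    lines = ["platformMonitor:", f"{pad}metricConfig:", f"{pad * 2}groups:"]
    lines += [f"{pad * 2}- {g}" for g in groups]
    return "\n".join(lines)
```

The returned text belongs under the top-level spec key of the service specification, so indent it to match the rest of your specification before embedding it.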
While the service is running, Snowflake records these metrics to the event table in your account. You can query your event table to read the metrics. The following query retrieves the service metrics that were recorded in the past hour for the service my_service:
SELECT timestamp, value
FROM my_event_table_db.my_event_table_schema.my_event_table
WHERE timestamp > DATEADD(hour, -1, CURRENT_TIMESTAMP())
AND RESOURCE_ATTRIBUTES:"snow.service.name" = 'MY_SERVICE'
AND RECORD_TYPE = 'METRIC'
ORDER BY timestamp DESC
LIMIT 10;
If you don’t know the name of the active event table for the account, execute the SHOW PARAMETERS command to display the value of the account-level EVENT_TABLE parameter:
SHOW PARAMETERS LIKE 'event_table' IN ACCOUNT;
For more information about event tables, see Using event table.
Example
Follow these steps to create an example service that records metrics to the event table configured for your account.
Follow Tutorial 1 to create a service named echo_service, with one change. In step 3, where you create a service, use the following CREATE SERVICE command, which adds the platformMonitor field to the service specification.
CREATE SERVICE echo_service
  IN COMPUTE POOL tutorial_compute_pool
  FROM SPECIFICATION $$
    spec:
      containers:
      - name: echo
        image: /tutorial_db/data_schema/tutorial_repository/my_echo_service_image:latest
        env:
          SERVER_PORT: 8000
          CHARACTER_NAME: Bob
        readinessProbe:
          port: 8000
          path: /healthcheck
      endpoints:
      - name: echoendpoint
        port: 8000
        public: true
      platformMonitor:
        metricConfig:
          groups:
          - system
          - system_limits
    $$
  MIN_INSTANCES=1
  MAX_INSTANCES=1;
After the service is running, Snowflake starts recording the metrics in the specified metric groups to the event table, which you can then query. The following query retrieves metrics reported in the last hour by the Echo service.
SELECT timestamp, value
FROM my_events
WHERE timestamp > DATEADD(hour, -1, CURRENT_TIMESTAMP())
AND RESOURCE_ATTRIBUTES:"snow.service.name" = 'ECHO_SERVICE'
AND RECORD_TYPE = 'METRIC'
ORDER BY timestamp DESC
LIMIT 10;
Accessing compute pool metrics¶
Compute pool metrics offer insights into the nodes in the compute pool and the services running on them. Each node reports node-specific metrics, such as the amount of available memory for containers, as well as service metrics, like the memory usage by individual containers. The compute pool metrics provide information from a node’s perspective.
Each node has a metrics publisher that listens on TCP port 9001. Other services can make an HTTP GET request with the path /metrics to port 9001 on the node. To discover the node’s IP address, retrieve SRV records (or A records) from DNS for the discover.monitor.compute_pool_name.snowflakecomputing.internal hostname. Then, create another service in your account that actively polls each node to retrieve the metrics.
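The discovery step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the compute pool name is substituted into the hostname exactly as given, and the lookup uses A records through the standard resolver, which only succeeds when the code runs inside a service in the same account.

```python
import socket

METRICS_PORT = 9001  # port of the per-node metrics publisher, as stated above


def discovery_hostname(compute_pool_name):
    """Build the DNS name that lists the nodes of a compute pool."""
    return f"discover.monitor.{compute_pool_name}.snowflakecomputing.internal"


def metrics_urls(node_ips):
    """Turn node IP addresses into per-node /metrics URLs, without duplicates."""
    return [f"http://{ip}:{METRICS_PORT}/metrics" for ip in sorted(set(node_ips))]


def resolve_node_ips(compute_pool_name):
    """Look up the IPv4 addresses (A records) behind the discovery hostname."""
    infos = socket.getaddrinfo(
        discovery_hostname(compute_pool_name),
        METRICS_PORT,
        family=socket.AF_INET,
        type=socket.SOCK_STREAM,
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

A polling service would call resolve_node_ips periodically, because the set of nodes can change, and then issue an HTTP GET request to each URL returned by metrics_urls.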
The body in the response provides the metrics using the Prometheus format (https://prometheus.io/docs/instrumenting/exposition_formats/#text-based-format) as shown in the following example metrics:
# HELP node_memory_capacity Defines SPCS compute pool resource capacity on the node
# TYPE node_memory_capacity gauge
node_memory_capacity{snow_compute_pool_name="MY_POOL",snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"} 7.21397383168e+09
node_cpu_capacity{snow_compute_pool_name="MY_POOL",snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"} 1
Note the following:
The response body starts with # HELP and # TYPE, which provide a short description and the type of the metric. In this example, the node_memory_capacity metric is of type gauge.
It is then followed by the metric’s name, a list of labels describing a specific resource (data point), and its value. In this example, the metric (named node_memory_capacity) provides memory information, indicating that the node has 7.2 GB of available memory. The metric also includes metadata in the form of labels as shown:
snow_compute_pool_name="MY_POOL", snow_compute_pool_node_instance_family="CPU_X64_S", snow_compute_pool_node_id="10.244.3.8"
You can process these metrics any way you choose; for example, you might store metrics in a database and use a UI (such as a Grafana dashboard) to display the information.
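As one possible starting point for such processing, the following Python sketch parses the text format described above into (name, labels, value) tuples. It is a simplified parser written for illustration, not a replacement for a Prometheus client library: it skips comment lines, ignores lines with trailing timestamps, and does not unescape label values. The name parse_metrics is hypothetical.

```python
import re

# metric_name, optional {labels}, then a single value token.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# label_name="label value" pairs inside the braces.
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, label_text, value = match.groups()
        labels = dict(_LABEL.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each tuple can then be written to a database table or forwarded to a dashboard, keyed by the snow_compute_pool_node_id label so that samples from different nodes stay distinguishable.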
Note
Snowflake does not provide any aggregation of metrics. For example, to get metrics for a given service, you must query all nodes that are running instances of that service.
The compute pool must have a DNS-compatible name for you to access the metrics.
The endpoint exposed by a compute pool can be accessed by a service using a role that has the OWNERSHIP or MONITOR privilege on the compute pool.
For a list of available compute pool metrics, see Available platform metrics.
Example
For an example of configuring Prometheus to poll your compute pool for metrics, see the compute pool metrics tutorials (https://github.com/Snowflake-Labs/spcs-templates/tree/main/user-metrics).
Available platform metrics¶
The following is a list of available platform metrics groups and metrics within each group. Note that storage metrics are currently only collected from block storage volumes.
Metric group . Metric name | Unit | Type | Description |
---|---|---|---|
system . container.cpu.usage | cpu cores | gauge | Average number of CPU cores used since the last measurement. 1.0 indicates full utilization of 1 CPU core. The maximum value is the number of CPU cores available to the container. |
system . container.memory.usage | bytes | gauge | Memory used, in bytes. |
system . container.gpu.memory.usage | bytes | gauge | Per-GPU memory used, in bytes. The source GPU is denoted in the ‘gpu’ attribute. |
system . container.gpu.utilization | ratio | gauge | Ratio of per-GPU usage to capacity. The source GPU is denoted in the ‘gpu’ attribute. |
system_limits . container.cpu.limit | cpu cores | gauge | CPU resource limit from the service specification. If no limit is defined, defaults to node capacity. |
system_limits . container.gpu.limit | gpus | gauge | GPU count limit from the service specification. If no limit is defined, the metric is not emitted. |
system_limits . container.memory.limit | bytes | gauge | Memory limit from the service specification. If no limit is defined, defaults to node capacity. |
system_limits . container.cpu.requested | cpu cores | gauge | CPU resource request from the service specification. If no limit is defined, this defaults to a value chosen by Snowflake. |
system_limits . container.gpu.requested | gpus | gauge | GPU count from the service specification. If no limit is defined, the metric is not emitted. |
system_limits . container.memory.requested | bytes | gauge | Memory request from the service specification. If no limit is defined, this defaults to a value chosen by Snowflake. |
system_limits . container.gpu.memory.capacity | bytes | gauge | Per-GPU memory capacity. The source GPU is denoted in the ‘gpu’ attribute. |
status . container.restarts | restarts | gauge | Number of times Snowflake restarted the container. |
status . container.state.finished | boolean | gauge | When the container is in the ‘finished’ state, this metric will be emitted with the value 1. |
status . container.state.last.finished.reason | boolean | gauge | If the container has restarted previously, this metric will be emitted with the value 1. The ‘reason’ label describes why the container last finished. |
status . container.state.last.finished.exitcode | integer | gauge | If a container has restarted previously, this metric will contain the exit code of the previous run. |
status . container.state.pending | boolean | gauge | When a container is in the ‘pending’ state, this metric will be emitted with the value 1. |
status . container.state.pending.reason | boolean | gauge | When a container is in the ‘pending’ state, this metric will be emitted with the value 1. The ‘reason’ label describes why the container was most recently in the pending state. |
status . container.state.running | boolean | gauge | When a container is in the ‘running’ state, this metric will have the value 1. |
status . container.state.started | boolean | gauge | When a container is in the ‘started’ state, this metric will have the value 1. |
network . network.egress.denied.packets | packets | gauge | Network egress total denied packets due to policy validation failures. |
network . network.egress.received.bytes | bytes | gauge | Network egress total bytes received from remote destinations. |
network . network.egress.received.packets | packets | gauge | Network egress total packets received from remote destinations. |
network . network.egress.transmitted.bytes | bytes | gauge | Network egress total bytes transmitted out to remote destinations. |
network . network.egress.transmitted.packets | packets | gauge | Network egress total packets transmitted out to remote destinations. |
storage . volume.capacity | bytes | gauge | Size of the filesystem. |
storage . volume.io.inflight | operations | gauge | Number of active filesystem I/O operations. |
storage . volume.read.throughput | bytes/sec | gauge | Filesystem read throughput in bytes per second. |
storage . volume.read.iops | operations/sec | gauge | Filesystem read operations per second. |
storage . volume.usage | bytes | gauge | Total number of bytes used in the filesystem. |
storage . volume.write.throughput | bytes/sec | gauge | Filesystem write throughput in bytes per second. |
storage . volume.write.iops | operations/sec | gauge | Filesystem write operations per second. |
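Because usage and limit metrics share units (cpu cores for CPU, bytes for memory), they can be combined into a utilization ratio. The following Python sketch shows that arithmetic. It assumes you have already extracted the latest numeric value of each metric into a dictionary keyed by metric name; how the metric name and value are laid out in event table rows is not covered here, and the helper names are hypothetical.

```python
def utilization(usage, limit):
    """Return usage as a fraction of limit, or None when the limit is missing or zero."""
    if limit is None or limit <= 0:
        return None
    return usage / limit


def summarize(latest):
    """Compute CPU and memory utilization from a dict of latest metric values."""
    return {
        "cpu": utilization(
            latest.get("container.cpu.usage", 0.0),
            latest.get("container.cpu.limit"),
        ),
        "memory": utilization(
            latest.get("container.memory.usage", 0.0),
            latest.get("container.memory.limit"),
        ),
    }
```

For instance, a container using 0.5 CPU cores against a limit of 2 cores yields a CPU utilization of 0.25. Both the system and system_limits groups must be enabled in the service specification for these inputs to be recorded.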