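Evaluate AI applications¶

To evaluate a generative AI application, follow these steps:

Build and instrument the application using the TruLens SDK (applications built in Python are supported).
Register the application in Snowflake.
Create a run by specifying an input dataset.
Execute the run to generate traces and compute evaluation metrics.
View the evaluation results in Snowsight.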

Instrument the application

After you create your generative AI application in Python, import the TruLens SDK to instrument it. The TruLens SDK provides an @instrument() decorator that instruments the functions in your application so that traces can be generated and metrics computed.

To use the decorator, add the following import to your Python application:
```
from trulens.core.otel.instrument import instrument
```

You can change the granularity of the @instrument() decorator depending on your requirements.

Scenario 1: Trace a function¶

You can add @instrument() before the function you need to trace. This automatically captures the inputs to the function, the outputs (return values), and the latency of execution. For example, the following code traces an answer_query function and automatically captures the input query and the final response:

@instrument()
def answer_query(self, query: str) -> str:
    context_str = self.retrieve_context(query)
    return self.generate_completion(query, context_str)
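
To make concrete what "inputs, outputs, and latency" means, here is a minimal, stdlib-only sketch of a decorator that records those three things for each call. It only illustrates the idea and is not how TruLens implements @instrument(); the names trace_call, TRACES, and EchoApp are hypothetical.

```python
import functools
import inspect
import time

# Hypothetical in-memory store used only for this illustration.
TRACES = []


def trace_call(func):
    """Record inputs, output, and latency for each call (illustration only)."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments to parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        start = time.perf_counter()
        result = func(*args, **kwargs)
        latency = time.perf_counter() - start
        TRACES.append({
            "name": func.__name__,
            # Skip `self` so only the call's real inputs are recorded.
            "inputs": {k: v for k, v in bound.arguments.items() if k != "self"},
            "output": result,
            "latency_s": latency,
        })
        return result

    return wrapper


class EchoApp:
    @trace_call
    def answer_query(self, query: str) -> str:
        return f"You asked: {query}"


answer = EchoApp().answer_query("What is a span?")
```

After the call, TRACES holds one record with the function name, the query argument, the returned string, and the measured latency.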

Scenario 2: Trace a function with a specific span type¶

A span type specifies the nature of the function and improves the readability and understanding of the traces. For example, in a RAG application you can specify span type as RETRIEVAL for your search service (or retriever) and specify the span type as GENERATION for the LLM inference call. The following span types are supported:

RETRIEVAL: Span type for retrieval or search functions
GENERATION: Span type for model inference calls from an LLM
RECORD_ROOT: Span type for the main function in your application

If you don’t specify a span type with @instrument(), an UNKNOWN span type is assigned by default. To use span attributes, add the following import to your Python application.

from trulens.otel.semconv.trace import SpanAttributes

The following code snippet demonstrates tracing a RAG application. The span type must always be prefixed with SpanAttributes.SpanType.

@instrument(span_type=SpanAttributes.SpanType.RETRIEVAL)
def retrieve_context(self, query: str) -> list:
    """
    Retrieve relevant text from vector store.
    """
    return self.retrieve(query)

@instrument(span_type=SpanAttributes.SpanType.GENERATION)
def generate_completion(self, query: str, context_str: list) -> str:
    """
    Generate answer from context by calling an LLM.
    """
    return response

@instrument(span_type=SpanAttributes.SpanType.RECORD_ROOT)
def answer_query(self, query: str) -> str:
    context_str = self.retrieve_context(query)
    return self.generate_completion(query, context_str)
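
With span types in place, one call to answer_query is expected to produce a RECORD_ROOT span that contains a RETRIEVAL span and a GENERATION span. The stdlib-only sketch below mimics that nesting so the resulting trace shape is easier to picture. The typed_span decorator and the ToyRAG class are hypothetical stand-ins written for this illustration, not part of the TruLens SDK.

```python
import functools

SPANS = []   # (depth, span_type, function name), in the order spans start
_open = []   # names of the spans that are currently open


def typed_span(span_type="UNKNOWN"):
    """Hypothetical stand-in for a decorator that tags calls with a span type."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The number of open spans is the nesting depth of this span.
            SPANS.append((len(_open), span_type, func.__name__))
            _open.append(func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                _open.pop()
        return wrapper
    return decorator


class ToyRAG:
    DOCS = [
        "Snowflake stores traces in your account.",
        "A span records one step of a trace.",
    ]

    @typed_span("RETRIEVAL")
    def retrieve_context(self, query: str) -> list:
        # Naive keyword overlap instead of a real vector store.
        words = set(query.lower().split())
        return [d for d in self.DOCS if words & set(d.lower().split())]

    @typed_span("GENERATION")
    def generate_completion(self, query: str, context_str: list) -> str:
        # A template instead of a real LLM call.
        return f"Q: {query} | context: {' '.join(context_str)}"

    @typed_span("RECORD_ROOT")
    def answer_query(self, query: str) -> str:
        context_str = self.retrieve_context(query)
        return self.generate_completion(query, context_str)


response = ToyRAG().answer_query("what is a span")
for depth, span_type, name in SPANS:
    print("  " * depth + f"{span_type}: {name}")
```

The printed outline shows the root span first, with the retrieval and generation spans indented beneath it.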

Scenario 3: Trace a function and compute evaluations¶

In addition to providing span types, you must assign the relevant parameters in your application to span attributes to compute the metrics. For example, to compute context relevance in a RAG application, you must assign the query parameter and the retrieval results to the attributes RETRIEVAL.QUERY_TEXT and RETRIEVAL.RETRIEVED_CONTEXTS, respectively. The attributes required to compute each individual metric are listed on the Metrics page.

Each span type supports the following span attributes:

RECORD_ROOT: INPUT, OUTPUT, GROUND_TRUTH_OUTPUT
RETRIEVAL: QUERY_TEXT, RETRIEVED_CONTEXTS
GENERATION: None

To use span attributes, add the following import to your Python application.

from trulens.otel.semconv.trace import SpanAttributes

The following code snippet provides an example to compute context relevance for a retrieval service. The attributes must always follow the format SpanAttributes.<span type>.<attribute name> (e.g., SpanAttributes.RETRIEVAL.QUERY_TEXT).

@instrument(
    span_type=SpanAttributes.SpanType.RETRIEVAL,
    attributes={
        SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
        SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
    }
)
def retrieve_context(self, query: str) -> list:
    """
    Retrieve relevant text from vector store.
    """
    return self.retrieve(query)

In the preceding example, query represents the input parameter to retrieve_context() and return represents the value returned. These are assigned to the attributes RETRIEVAL.QUERY_TEXT and RETRIEVAL.RETRIEVED_CONTEXTS to compute context relevance.
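
One way to picture this mapping: each value in the attributes dictionary names either a parameter of the decorated function or the special string "return". The sketch below resolves such a mapping against a real call using inspect. resolve_attributes is a hypothetical helper written for illustration, and the plain-string keys stand in for the SpanAttributes constants.

```python
import inspect


def resolve_attributes(func, mapping, args, kwargs):
    """Call func and resolve each mapped name to an argument or the return value."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    result = func(*args, **kwargs)
    resolved = {}
    for attribute, source in mapping.items():
        if source == "return":
            resolved[attribute] = result
        else:
            # Raises KeyError if the mapping names a parameter that does not exist.
            resolved[attribute] = bound.arguments[source]
    return resolved


def retrieve_context(query: str) -> list:
    # Toy lookup table instead of a real vector store.
    corpus = {"span": ["A span records one step of a trace."]}
    return corpus.get(query, [])


attributes = resolve_attributes(
    retrieve_context,
    {"RETRIEVAL.QUERY_TEXT": "query", "RETRIEVAL.RETRIEVED_CONTEXTS": "return"},
    args=("span",),
    kwargs={},
)
```

Here attributes pairs the query text with the list the function returned, which are the two values a context relevance metric needs.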

Auto-instrument framework applications¶

In addition to manual instrumentation using the @instrument() decorator, TruLens provides specialized wrappers that automatically instrument applications built with popular LLM frameworks. These wrappers provide integration and automatic tracing without requiring manual decoration of individual functions.

TruChain for LangChain¶

TruChain provides automatic instrumentation for applications built with LangChain (https://www.langchain.com/). It automatically captures the execution of key LangChain classes including chains, LLMs, prompts, and retrievers.

from trulens.apps.langchain import TruChain

# Wrap your LangChain application
tru_recorder = TruChain(
    rag_chain,
    app_name="my_langchain_app",
    app_version="v1.0"
)

# Use the recorder as a context manager
with tru_recorder as recording:
    response = rag_chain.invoke(input_query)

TruChain supports:

Automatic instrumentation of LangChain Expression Language (LCEL) chains
Async support through the ainvoke method
Built-in selectors (on_input, on_output, on_context) for RAG triad evaluation

TruGraph for LangGraph¶

TruGraph provides automatic instrumentation for applications built with LangGraph (https://langchain-ai.github.io/langgraph/). It automatically detects LangGraph applications and instruments both LangChain and LangGraph components.

from trulens.apps.langgraph import TruGraph

# Wrap your LangGraph application
tru_recorder = TruGraph(
    graph,
    app_name="my_langgraph_app",
    app_version="v1.0"
)

# Use the recorder as a context manager
with tru_recorder as recording:
    response = graph.invoke({"messages": [("user", input_query)]})

TruGraph supports:

Automatic @task instrumentation with intelligent attribute extraction
Multi-agent evaluation capabilities
Combined instrumentation of both LangChain and LangGraph components

TruLlama for LlamaIndex¶

TruLlama provides automatic instrumentation for applications built with LlamaIndex (https://www.llamaindex.ai/). It automatically captures the execution of key LlamaIndex classes including query engines, retrievers, and response synthesizers.

from trulens.apps.llamaindex import TruLlama

# Wrap your LlamaIndex query engine
tru_recorder = TruLlama(
    query_engine,
    app_name="my_llamaindex_app",
    app_version="v1.0"
)

# Use the recorder as a context manager
with tru_recorder as recording:
    response = query_engine.query(input_query)

TruLlama supports:

Automatic instrumentation of query engines, chat engines, and retrievers
Async support through aquery, achat, and astream_chat methods
Streaming support for LlamaIndex applications
Built-in selectors (on_input, on_output, on_context) for RAG triad evaluation

For more information about framework-specific instrumentation, see the TruLens documentation (https://www.trulens.org/component_guides/instrumentation/).

Register the application in Snowflake¶

To register your generative AI application in Snowflake for capturing traces and conducting evaluations, create a TruApp object using the TruLens SDK. The TruApp object records each invocation (execution) of your app and exports the traces to Snowflake.

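tru_app = TruApp(
    app,                           # Any: instance of your application, e.g. RAG()
    app_name=app_name,             # str
    app_version=app_version,       # str
    connector=connector,           # SnowflakeConnector
    main_method=app.answer_query,  # callable (optional): entry point of the app
)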

Note

If your application is built using LangChain, LangGraph, or LlamaIndex, you can use TruChain, TruGraph, or TruLlama respectively in place of TruApp. These framework-specific wrappers provide the same registration functionality while also enabling automatic instrumentation of your application. See Auto-instrument framework applications for more details.

Parameters:

app: Any: an instance of the user-defined application that is invoked later during a run for evaluation, e.g. app = RAG().
app_name: str: the name of the application. You choose this name, and it is maintained in your Snowflake account.
app_version: str: the version you assign to the app, which allows experiment tracking and comparison.
connector: SnowflakeConnector: a wrapper class that manages the Snowpark session and the Snowflake database connection.
main_method: callable (Optional): the entry point method of your application. It tells the SDK how users are expected to call the app and where to start tracing the invocation of the app (specified by app). For the example RAG class, main_method can be specified as app.answer_query, assuming the answer_query method is the entry point of the app. Alternatively, instrument the entry point method with the span type RECORD_ROOT; in that case, this parameter is not required.

Create a run

To begin an evaluation job, you need to create a run. Creating a run requires a run configuration to be specified. The add_run() function uses the run configuration to create a new run.

Run configuration

A run is created from a RunConfig:

run_config = RunConfig(
    run_name=run_name,
    description="desc",
    label="custom tag useful for grouping comparable runs",
    source_type="DATAFRAME",
    dataset_name="My test dataframe name",
    dataset_spec={
        "RETRIEVAL.QUERY_TEXT": "user_query_field",
        "RECORD_ROOT.INPUT": "user_query_field",
        "RECORD_ROOT.GROUND_TRUTH_OUTPUT": "golden_answer_field",
    },
    llm_judge_name="mistral-large2",
)

run_name: str: name of the run. It should be unique under the same TruApp.
description: str (Optional): string description of the run.
label: str (Optional): label used to group runs together.
source_type: str: specifies the source of the dataset. It can be either DATAFRAME for a Python dataframe or TABLE for a user table in the Snowflake account.
dataset_name: str: if source_type is DATAFRAME, any arbitrary name specified by the user. Otherwise, a valid Snowflake table name in the user’s account under the current context (database and schema), or a fully qualified Snowflake name in the form “database.schema.table_name”.
dataset_spec: Dict[str, str]: a dictionary mapping supported span attributes to the column names in the user’s dataframe or table. The allowed keys are the span attributes specified on the Dataset page, and the allowed values are column names in the specified dataframe or table. For example, “golden_answer_field” in the run configuration example above must be a valid column name.
llm_judge_name: str (Optional): name of the LLM to use as the judge during LLM-based metric computation. See the models page for supported judges. If not specified, the default value is llama3.1-70b.
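
Because every dataset_spec value must be a real column name, it can help to check the mapping before creating the run. The following is a minimal sketch of such a check. check_dataset_spec is a hypothetical helper, not part of the SDK, and SUPPORTED_KEYS is assembled from the span attributes listed earlier on this page, so it may not match the full list on the Dataset page.

```python
# Span attributes listed earlier on this page, written as dataset_spec keys.
# Assumption: keys are spelled exactly as in the run configuration example.
SUPPORTED_KEYS = {
    "RECORD_ROOT.INPUT",
    "RECORD_ROOT.OUTPUT",
    "RECORD_ROOT.GROUND_TRUTH_OUTPUT",
    "RETRIEVAL.QUERY_TEXT",
    "RETRIEVAL.RETRIEVED_CONTEXTS",
}


def check_dataset_spec(dataset_spec, columns):
    """Return (unknown_keys, missing_columns) for a dataset_spec, both sorted."""
    unknown_keys = sorted(k for k in dataset_spec if k not in SUPPORTED_KEYS)
    available = set(columns)
    missing_columns = sorted(
        {c for c in dataset_spec.values() if c not in available}
    )
    return unknown_keys, missing_columns


spec = {
    "RETRIEVAL.QUERY_TEXT": "user_query_field",
    "RECORD_ROOT.INPUT": "user_query_field",
    "RECORD_ROOT.GROUND_TRUTH_OUTPUT": "golden_answer_field",
}
ok = check_dataset_spec(spec, ["user_query_field", "golden_answer_field"])
bad = check_dataset_spec({"RECORD_ROOT.INPUT": "question"}, ["user_query_field"])
```

Two empty lists mean the mapping is consistent with the dataset. In the second call, the column question is reported as missing.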

run = tru_app.add_run(run_config=run_config)

Request parameters:

run_config: RunConfig: contains the configuration for the run.

Retrieve a run

Retrieves the corresponding run.

run = tru_app.get_run(run_name=run_name)

Request parameters:

run_name: str: name of the run

View run metadata

Describes the run details.

run.describe()

Invoke the run

You can invoke the run using the run.start() function. It reads the inputs from the dataset specified in the run configuration, invokes the application for each input, generates the traces, and ingests the information for storage in your Snowflake account. run.start() blocks until the application has been invoked for all inputs in your dataset and ingestion has completed or timed out.

run.start()  # if source_type is "TABLE"

run.start(input_df=user_input_df)  # if source_type is "DATAFRAME"

Request parameters:

input_df: DataFrame (Optional): a pandas dataframe supplied through the SDK. If the source_type in the run configuration is DATAFRAME, this field is mandatory. If the source_type is TABLE, this field is not required.
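
The rule above can be wrapped in a small guard so that a missing dataframe fails early with a clear message. This is a sketch only: start_run and FakeRun are hypothetical names, FakeRun stands in for a real run object, and a list of dictionaries stands in for a pandas dataframe so the example runs without extra dependencies.

```python
def start_run(run, source_type, input_df=None):
    """Start a run with the arguments that match its source_type."""
    if source_type == "DATAFRAME":
        if input_df is None:
            raise ValueError("input_df is required when source_type is DATAFRAME")
        return run.start(input_df=input_df)
    if source_type == "TABLE":
        return run.start()
    raise ValueError(f"unsupported source_type: {source_type!r}")


class FakeRun:
    """Stand-in for a run object so the helper can be exercised locally."""

    def __init__(self):
        self.calls = []

    def start(self, input_df=None):
        # Remember what each call received instead of invoking an application.
        self.calls.append(input_df)


fake = FakeRun()
rows = [{"user_query_field": "What is a span?"}]  # placeholder for a dataframe
start_run(fake, "TABLE")
start_run(fake, "DATAFRAME", input_df=rows)
```

With a real run, the same helper would call run.start() or run.start(input_df=...) exactly as shown in the two lines above this section's parameter list.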

Compute metrics

You can start metric computation using run.compute_metrics() after the application is invoked and all traces are ingested. Computation cannot be started while the status of the run is INVOCATION_IN_PROGRESS. Once the status is INVOCATION_COMPLETED or INVOCATION_PARTIALLY_COMPLETED, run.compute_metrics() can be initiated. run.compute_metrics() is an asynchronous, non-blocking function. You can call compute_metrics multiple times on the same run with a different set of metrics, and each call triggers a new computation job. Note that once a metric has been computed, it cannot be recomputed for the same run.

run.compute_metrics(metrics=[
    "coherence",
    "answer_relevance",
    "groundedness",
    "context_relevance",
    "correctness",
])

Request parameters:

metrics: List[str]: list of string names of the metrics listed in Metrics. The metric names should be specified in snake case; for example, Context Relevance should be specified as context_relevance.
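
Since metric names must be in snake case and a computed metric cannot be recomputed for the same run, a caller may want to normalize names and skip metrics it has already requested. The sketch below shows one way to do that. Both helpers are hypothetical, and the list of already computed metrics is assumed to be tracked by the caller.

```python
def to_metric_name(display_name):
    """Convert a display name such as 'Context Relevance' to snake case."""
    return "_".join(display_name.strip().lower().split())


def pending_metrics(requested, already_computed):
    """Drop metrics already computed for this run, keeping the request order."""
    done = {to_metric_name(name) for name in already_computed}
    pending = []
    for name in requested:
        metric = to_metric_name(name)
        if metric not in done and metric not in pending:
            pending.append(metric)
    return pending


metrics = pending_metrics(
    ["Context Relevance", "groundedness", "Answer Relevance"],
    already_computed=["groundedness"],
)
```

The resulting list can be passed as the metrics argument of run.compute_metrics().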

Check run status

You can check the status of a run after it has started. The list of statuses is in the Run status section.

run.get_status()
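
Because metric computation can only start once invocation has finished, a simple polling loop around run.get_status() is one way to wait for that point. The sketch below assumes that get_status() returns either a string or an enum whose value is a string, and it treats any status other than the three named on this page as an error for the caller to inspect. wait_until_invoked and FakeRun are hypothetical names used for illustration.

```python
import time

READY = {"INVOCATION_COMPLETED", "INVOCATION_PARTIALLY_COMPLETED"}


def wait_until_invoked(run, poll_seconds=5.0, timeout_seconds=600.0,
                       sleep=time.sleep):
    """Poll run.get_status() until invocation finishes; return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = run.get_status()
        # Assumption: the status is a string or an enum with a string value.
        name = str(getattr(status, "value", status))
        if name in READY:
            return name
        if name != "INVOCATION_IN_PROGRESS":
            raise RuntimeError(f"unexpected run status: {name}")
        if time.monotonic() >= deadline:
            raise TimeoutError("run is still INVOCATION_IN_PROGRESS")
        sleep(poll_seconds)


class FakeRun:
    """Stand-in run that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


fake = FakeRun([
    "INVOCATION_IN_PROGRESS",
    "INVOCATION_IN_PROGRESS",
    "INVOCATION_COMPLETED",
])
final = wait_until_invoked(fake, poll_seconds=0, sleep=lambda seconds: None)
```

Once the helper returns, run.compute_metrics(metrics=[...]) can be called on the real run.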

Cancel a run

You can cancel an existing run using run.cancel(). This operation will prevent any future updates to the run, including run status and metadata fields.

run.cancel()

Delete a run

You can delete an existing run using run.delete(). This operation deletes the metadata associated with the run, after which the evaluation results can no longer be accessed. However, the traces and evaluations generated as part of the run are not deleted and remain stored. See the Observability data section for more information about storage and deletion of evaluations and traces.

run.delete()

List runs for an application

You can see the list of all available runs corresponding to a specific TruApp application object using the list_runs() function.

tru_app.list_runs()

Response:

Returns a list of all runs created under tru_app.

View evaluations and traces

To view the evaluation results, do the following:

Sign in to Snowsight.
In the navigation menu, select AI & ML » Evaluations.

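To view the evaluation results for application runs, do the following:

To view the runs corresponding to a specific application, select the application.
To view the evaluation results for a run, select the run. You can view the aggregated results and the results for each record.
To view the traces for a record, select the record. You can view detailed traces, latency, inputs and outputs for each stage of the application, evaluation results, and the explanation from the LLM judge for the generated accuracy score.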

To compare runs that use the same dataset, select multiple runs and select Compare to compare the outputs and the evaluation scores.