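Tutorial 1: Set up and test a CKE as a provider

Introduction

This tutorial shows providers how to set up and test a Cortex Knowledge Extension (CKE).

What you will learn

In this tutorial, you will learn how to:

  • Create Snowflake objects

  • Load your data into Snowflake

  • Chunk your documents

  • Create a Cortex Search Service

  • Verify that the CKE works

  • Share and test the CKE with a consumer account

Prerequisites

To complete this tutorial, you need the following:

  • A Snowflake account and a user with a role that grants the privileges required to create a database, tables, virtual warehouse objects, Cortex Search services, and Streamlit apps.

See Snowflake in 20 minutes for instructions on meeting these requirements.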

Step 1: Create Snowflake objects

The first step is to create the Snowflake objects.

Use the accountadmin role.

use role accountadmin;

Create a warehouse named xsmall_cke_getting_started, which is used to create and update the index.

create warehouse xsmall_cke_getting_started warehouse_size=xsmall;

Create a separate role named cke_owner.

create role cke_owner;
grant role cke_owner to user admin;
grant usage on warehouse xsmall_cke_getting_started to role cke_owner;

Create and use a database named cke_getting_started.

grant create database on account to role cke_owner;
use role cke_owner;
create database cke_getting_started;
use database cke_getting_started;

Create and use a schema named articles.

create schema articles;
use schema articles;

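Step 2: Load your data into Snowflake

The next step is to load your data into Snowflake. For more information, see Load data into Snowflake.

The sample code below stores data in a Snowflake table named cke_simple_article, in the following format: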

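Column name       Type      Description
DOCUMENT_ID       VARCHAR   Unique identifier of the document. This is the primary key of the table.
DOCUMENT_TITLE    VARCHAR   Title of the document.
SOURCE_URL        VARCHAR   URL link to the source of the document.
TEXT              VARCHAR   Content of the document, parsed as text. This is the content that is indexed and searched.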

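Note that you can include additional document metadata in the indexed dataset. The example below includes only SOURCE_URL and DOCUMENT_ID, but you can add more columns depending on the source of your documents.

Create a simple table.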

create or replace table cke_simple_article (
    DOCUMENT_ID VARCHAR,
    DOCUMENT_TITLE VARCHAR,
    SOURCE_URL VARCHAR,
    text VARCHAR
);

Now insert some sample data into the table.

INSERT INTO cke_simple_article (DOCUMENT_ID, DOCUMENT_TITLE, SOURCE_URL, TEXT)
VALUES
    ('DOC_001', 'Sample Article 1', 'https://example.com/article1', 'This is some sample text for the first article.'),
    ('DOC_002', 'Sample Article 2', 'https://example.com/article2', 'Another sample text entry for the second article.'),
    ('DOC_003', 'Sample Article 3', 'https://example.com/article3', 'Yet another piece of text for the third article.');

INSERT INTO cke_simple_article (
    DOCUMENT_ID,
    DOCUMENT_TITLE,
    SOURCE_URL,
    text
)
VALUES (
    'DOC-GREEN-001',
    'The Grand Opening of Greenfield Biosphere',
    'https://www.example.com/news/greenfield-biosphere',
    'Greenfield Biosphere, nestled in the heart of a once-industrial landscape, opened its doors to the public today amid great fanfare and curiosity. This ambitious environmental initiative, spanning over 120 acres of reclaimed land, has been designed to house thousands of diverse plant species and animals under one vast, transparent dome. Over the past decade, teams of botanists, engineers, and conservationists collaborated intensively to restore the soil quality, implement renewable energy solutions, and establish sustainable water sources. Their efforts have resulted in an oasis that stands as a testament to nature''s resilience and humanity''s unwavering determination to coexist with it.

    Upon entering the biosphere, visitors pass through a series of controlled airlocks that maintain precise temperature and humidity levels, ensuring the delicate balance required for each habitat. The moment they step inside, a multitude of colors and scents envelops them. Towering palm trees sway gently, nurtured by a carefully engineered irrigation system that recycles water across various sections of the dome. Exotic butterflies flutter past patches of vibrant orchids, while small reptiles scurry along the edge of meandering pathways. Every detail, from lighting angles to seed selection, has been meticulously planned to promote biodiversity in a space that once lay barren.

    Local officials and environmental organizations herald this project as a bold step toward reversing ecological decline. The region had suffered decades of industrial pollution, leaving the soil depleted and wildlife populations on the brink of collapse. Public interest soared once the Greenfield Biosphere project was announced, prompting unprecedented fundraising campaigns and private investments. Citizens volunteered their time to plant seedlings, build composting facilities, and educate children on the importance of ecological stewardship. Now, as thousands explore the dome on opening day, excitement mingles with a sense of responsibility, fueling hope that this initiative can serve as a catalyst for broader restoration efforts.

    Beyond merely a tourist attraction, the Greenfield Biosphere plays a crucial role in scientific research. Biologists and ecologists from universities around the globe have established research stations within the dome to study plant migration, cross-pollination, and microclimates. Through advanced sensor networks, they collect data on everything from soil moisture levels to carbon sequestration rates, aiming to develop cutting-edge conservation strategies. Already, preliminary findings suggest that certain flora species exhibit faster growth rates under partial shade, which could help inform future reforestation projects. This research extends to aquatic ecosystems as well, with scientists closely monitoring newly formed ponds and streams for indicators of ecosystem health.

    During the grand opening ceremony, Mayor Allison Pierce praised the community for its unwavering dedication to the biosphere''s development. She emphasized how interagency cooperation and community outreach were pivotal in transforming a polluted wasteland into a verdant sanctuary. In her address, she remarked on the significance of involving local youth, who contributed to the design through art projects and educational workshops. According to Mayor Pierce, the next phase of the project will include expanding the biosphere''s capacity for endangered species breeding programs. This could cement the region''s reputation as a global leader in ecological preservation and innovation.

    For many, the real highlight of the day was the unveiling of the arboretum wing, a temperature-controlled section featuring ancient tree species that have long faced threats from illegal logging and habitat loss. Towering redwoods, thought to be too large to grow under a dome, stand proudly after years of careful nurturing. Visitors stood in awe as the directors revealed that these trees'' root systems, painstakingly preserved and transplanted, are now thriving in custom-engineered soil mixtures. A sense of reverence filled the air, with many attendees describing the experience as spiritual. The seed of hope planted in the community has visibly taken root.

    The venture''s economic impact is another key talking point. Local shops and restaurants anticipate an influx of tourists, and hotels report reservations scheduled months in advance. Construction of new eco-lodges in the surrounding areas is already underway, promising a blend of comfortable accommodations with sustainable building practices. The city council has also approved additional funding to improve roads and public transportation to accommodate the expected rise in visitor numbers. Environmental advocates caution, however, that increased foot traffic could inadvertently strain the biosphere''s delicate ecosystems, calling for balanced planning and continued emphasis on conservation education.

    Inside the administrative office, a dedicated operations team monitors real-time data feeds, adjusting temperature, humidity, and nutrient levels to meet each species'' unique needs. Modular solar panels installed around the dome generate sufficient electricity to power the entire facility, showcasing how renewable energy can be integrated seamlessly with large-scale infrastructure. Outside, an innovative wastewater treatment plant recycles greywater for irrigation, minimizing resource consumption. The architects behind the biosphere believe these sustainable technologies can be replicated in other communities looking to rehabilitate degraded land, turning once-polluted sites into living laboratories for environmental stewardship.

    While the facility is only in its first phase, future expansions are already on the drawing board. There are plans to introduce a marine habitat zone featuring coral reef tanks that highlight threats to underwater ecosystems. Specially designed walkways will give visitors a close-up view of these aquatic wonders without disturbing the delicate organisms within. Meanwhile, education programs will be expanded to local schools, offering field trips where students can learn about biodiversity, climate change, and sustainable technologies. The hope is that exposure to this living exhibit will inspire the next generation of environmental scientists, engineers, and policymakers.

    As dusk settled over the glass dome, a soft, multi-colored illumination replaced the natural daylight, casting enchanting shadows across the tropical foliage. Families strolled slowly along the paths, pausing to read plaques about the origins of each plant or to marvel at the occasional flutter of nocturnal pollinators. Meanwhile, a gentle hum of conversation reverberated in the background, carrying sentiments of astonishment and gratitude. The first day at Greenfield Biosphere ended with a collective realization that, with mindful planning, community collaboration, and respect for nature''s inherent wisdom, it is indeed possible to transform a scarred landscape into a flourishing haven for life and innovation.'
);

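Step 3: Chunk your documents

Before creating the Cortex Search Service, make sure that each "chunk" of indexed text is no longer than roughly 375 words. To do this, apply a chunking algorithm through a Snowpark UDF that imports LangChain. First, create the chunking UDF. Then apply the UDF to the cke_simple_article table and store the resulting chunks in the cke_simple_article_chunks table. Finally, verify that the chunks were created.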
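The chunking UDF defined in this step uses a chunk size of 2000 characters with a 300-character overlap. To make those two settings concrete, here is a minimal sketch in plain Python. It is a simplified fixed-window splitter, not the algorithm LangChain's RecursiveCharacterTextSplitter actually uses (that splitter prefers to break on separators such as paragraph and line breaks), and the function name chunk_text is hypothetical. As a rough assumption for English text, 2000 characters at about 5.3 characters per word (spaces included) lands near the 375-word guideline.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 300) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: real splitters usually try to break
    on paragraph or sentence boundaries instead of arbitrary offsets.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(4500))
    pieces = chunk_text(sample)
    print([len(p) for p in pieces])
```

Each window starts 1700 characters after the previous one, so the last 300 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary searchable from either side.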

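Run the following example to create the text_chunker function, which splits an article into sections that the Cortex Search Service can process easily. This can take a few minutes to complete.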

CREATE OR REPLACE FUNCTION text_chunker(text STRING)
    RETURNS TABLE (chunk VARCHAR)
    LANGUAGE PYTHON
    RUNTIME_VERSION = '3.9'
    HANDLER = 'text_chunker'
    PACKAGES = ('snowflake-snowpark-python', 'langchain')
    AS
$$
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

class text_chunker:

    def process(self, text: str):
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 2000,  # Adjust this as needed
            chunk_overlap = 300,  # Overlap to keep chunks contextual
            length_function = len
        )

        chunks = text_splitter.split_text(text)
        df = pd.DataFrame(chunks, columns=['chunk'])

        yield from df.itertuples(index=False, name=None)
$$;

Run the following example to split the documents into chunks for indexing.

CREATE OR REPLACE TABLE cke_simple_article_chunks AS
    SELECT
        c.DOCUMENT_ID,
        c.DOCUMENT_TITLE,
        c.SOURCE_URL,
        t.chunk
    FROM cke_simple_article AS c, TABLE(text_chunker(CONCAT(c.DOCUMENT_TITLE, '\n', c.TEXT))) AS t;

Run the following command to verify that the chunks were created.

select * from cke_simple_article_chunks;
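Beyond checking that rows exist, you may want to confirm that the chunks respect the roughly 375-word guideline stated at the start of this step. The sketch below assumes you have already fetched the chunk column into a Python list by whatever client you use; the function name oversized_chunks and the whitespace-based word count are illustrative assumptions, not part of Snowflake or LangChain.

```python
from typing import List, Tuple


def oversized_chunks(chunks: List[str], max_words: int = 375) -> List[Tuple[int, int]]:
    """Return (index, word_count) for every chunk longer than max_words.

    Words are counted by splitting on whitespace, which is only an
    approximation of how a search service tokenizes text.
    """
    report = []
    for index, chunk in enumerate(chunks):
        word_count = len(chunk.split())
        if word_count > max_words:
            report.append((index, word_count))
    return report


if __name__ == "__main__":
    rows = ["a short chunk", " ".join(["word"] * 400)]
    print(oversized_chunks(rows))
```

An empty result suggests the chunk size is small enough; if some chunks are flagged, lowering chunk_size in the UDF is the obvious knob to try.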

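Step 4: Create the Cortex Search Service

Now configure a Cortex Search Service named cke_simple_cortex_search_service to run on the warehouse xsmall_cke_getting_started and reference the chunked document table cke_simple_article_chunks. Note that this step can take a considerable amount of time, depending on the size of the database.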

CREATE OR REPLACE CORTEX SEARCH SERVICE cke_simple_cortex_search_service
  ON chunk
  ATTRIBUTES document_title
  WAREHOUSE = xsmall_cke_getting_started
  TARGET_LAG = '1 hour'
  AS (
    SELECT
        chunk,
        document_title,
        source_url
      FROM cke_simple_article_chunks
  );

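Step 5: Test the CKE

To verify that the CKE works, issue a simple query against the Cortex Search Service. This confirms that the service has indexed your documents correctly and that relevant documents are returned for a query. The query below should return the first chunk of the article "The Grand Opening of Greenfield Biosphere" along with a link to the source URL.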

select snowflake.cortex.search_preview(
 'cke_getting_started.articles.cke_simple_cortex_search_service',
 '{ "query": "whats happening with the greenfield biosphere?", "columns": ["chunk","document_title","source_url"] }');
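The second argument in the statement above is a JSON document embedded in a SQL string literal, which is easy to get wrong by hand once the query text contains quotes. The following sketch builds the same statement from Python values and pulls the hits out of a response. The helper names are hypothetical, the optional limit key is included on the assumption that the service accepts it, and the response shape (a JSON object with a results array) is likewise an assumption to check against what your account actually returns.

```python
import json
from typing import Any, Dict, List, Optional, Sequence


def sql_literal(value: str) -> str:
    """Quote a string as a single-quoted SQL literal.

    Backslashes and single quotes are doubled, assuming the dialect
    treats backslash as an escape character inside single quotes.
    """
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def build_search_preview_sql(service_name: str, query: str,
                             columns: Sequence[str],
                             limit: Optional[int] = None) -> str:
    """Return a search_preview statement for the given query."""
    request: Dict[str, Any] = {"query": query, "columns": list(columns)}
    if limit is not None:
        request["limit"] = limit  # assumed optional request key
    payload = json.dumps(request)
    return ("select snowflake.cortex.search_preview("
            + sql_literal(service_name) + ", " + sql_literal(payload) + ");")


def extract_results(response_json: str) -> List[Dict[str, Any]]:
    """Return the list under 'results', or an empty list if absent."""
    return json.loads(response_json).get("results", [])


if __name__ == "__main__":
    print(build_search_preview_sql(
        "cke_getting_started.articles.cke_simple_cortex_search_service",
        "whats happening with the greenfield biosphere?",
        ["chunk", "document_title", "source_url"],
    ))
```

When your client library supports bind parameters, passing the service name and JSON payload as parameters is generally safer than assembling literals like this.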

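Step 6: Share the CKE privately for testing

After the Cortex Search Service is created and responds to queries correctly, you can share it. The shared Cortex Search Service is the Cortex Knowledge Extension. In this step, you create a private listing and share it with another account for testing. You then test the listing in the consumer account that the CKE was shared with.

Create the share

  1. Sign in to Snowsight and navigate to Data Products » Provider Studio.

  2. In the upper-right corner, select Listing, then select Specified Consumers.

  3. Give the listing a title, then click Next.

  4. For What's in the listing?, click + Select.

  5. Select CKE_GETTING_STARTED.

  6. Expand ARTICLES.

  7. Expand Cortex Search Service.

  8. Select CKE_SIMPLE_CORTEX_SEARCH_SERVICE, then select Done.

  9. Enter a description for the listing.

  10. Under Add consumer accounts, add the Snowflake account that you want to share with and use for testing the Cortex Knowledge Extension. Note that this account must be in the same region as the provider, and you must have access to it.

Test the share in the consumer account

  1. Sign in to Snowsight in the consumer account that you shared the CKE with above.

  2. Navigate to Data Products » Private Sharing.

  3. Here you should see the CKE_GETTING_STARTED listing that you shared above. Select Get.

  4. Open a new worksheet and run the following SQL command to verify that the account can access the shared data.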

    select
      snowflake.cortex.search_preview(
       'CKE_GETTING_STARTED.ARTICLES.CKE_SIMPLE_CORTEX_SEARCH_SERVICE',
       '{ "query": "whats happening with the biosphere?", "columns": ["chunk","document_title"] }'
      );
    

    Note

    If you specified a name other than CKE_GETTING_STARTED in the Get dialog, change the database name in the snippet above to match.
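As the note implies, the service name passed to search_preview has three parts: the database name chosen in the Get dialog, the schema, and the service. The sketch below assembles that name and rejects parts that would not be valid as unquoted identifiers. The function name is hypothetical, and the defaults simply mirror the object names used in this tutorial.

```python
import re

# Unquoted identifiers: a letter or underscore, then letters, digits, _ or $.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def qualified_service_name(database: str,
                           schema: str = "ARTICLES",
                           service: str = "CKE_SIMPLE_CORTEX_SEARCH_SERVICE") -> str:
    """Join database, schema, and service into a fully qualified name."""
    parts = [database, schema, service]
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError("not a valid unquoted identifier: %r" % part)
    return ".".join(parts)


if __name__ == "__main__":
    # Use whatever database name was entered in the Get dialog.
    print(qualified_service_name("CKE_GETTING_STARTED"))
```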

At this point, you have a functional Cortex Knowledge Extension!
