Hugging Face pipeline

The Model Registry supports Hugging Face model classes defined as transformers (https://huggingface.co/docs/transformers/index) that derive from transformers.Pipeline. For example:

import transformers

lm_hf_model = transformers.pipeline(
    task="text-generation",
    model="bigscience/bloom-560m",
    token="...",  # Put your HuggingFace token here.
    return_full_text=False,
    max_new_tokens=100,
)

lmv = reg.log_model(lm_hf_model, model_name='bloom', version_name='v560m')

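When calling log_model, you can use the following additional options in the options dictionary:

  • target_methods: A list of the names of the methods available on the model object. By default, Hugging Face models have the following target method, assuming the method exists: __call__

  • cuda_version: The version of the CUDA runtime to be used when deploying to a platform with a GPU; defaults to 11.8. If manually set to None, the model cannot be deployed to a platform that has a GPU.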
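The options described above are passed as a plain dictionary. The following is a minimal sketch that only builds that dictionary; the commented-out call shows where it would go, assuming log_model takes it through an options keyword argument and that cuda_version is given as a string. The names reg and lm_hf_model refer to the objects from the earlier example and are not created here.

```python
# Illustrative only: build the options dictionary described above.
# Assumption: cuda_version is passed as a string such as "11.8".
options = {
    "target_methods": ["__call__"],  # methods exposed on the logged model
    "cuda_version": "11.8",          # None would rule out GPU deployment
}

# With a live registry (not created in this sketch), the dictionary
# would be passed along with the pipeline:
# lmv = reg.log_model(
#     lm_hf_model,
#     model_name="bloom",
#     version_name="v560m",
#     options=options,
# )
print(options)
```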

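Important

Models based on huggingface_pipeline.HuggingFacePipelineModel contain only configuration data; the model weights are downloaded from the Hugging Face Hub each time you use the model.

Currently, the model registry supports only self-contained models that can run without external network access. Best practice is to use transformers.Pipeline instead, as shown in the example above. This downloads the model weights to your local system, and log_model then uploads a self-contained model object that does not need internet access.

The registry infers the signatures argument only if the pipeline contains one task from the following list: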

  • conversational

  • fill-mask

  • question-answering

  • summarization

  • table-question-answering

  • text2text-generation

  • text-classification (also called sentiment-analysis)

  • text-generation

  • token-classification (also called ner)

  • translation

  • translation_xx_to_yy

  • zero-shot-classification

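The sample_input_data argument to log_model is completely ignored for Hugging Face models. When logging a Hugging Face model whose task is not in the preceding list, specify the signatures argument so that the registry knows the signatures of the target methods.

To see the inferred signature, use the show_functions method. For example, the following is the result of lmv.show_functions(), where lmv is the model logged above: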

[{'name': '__CALL__',
  'target_method': '__call__',
  'signature': ModelSignature(
                      inputs=[
                          FeatureSpec(dtype=DataType.STRING, name='inputs')
                      ],
                      outputs=[
                          FeatureSpec(dtype=DataType.STRING, name='outputs')
                      ]
                  )}]

Use the following code to call the lmv model:

import pandas as pd
remote_prediction = lmv.run(pd.DataFrame(["Hello, how are you?"], columns=["inputs"]))

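Usage notes

  • Many Hugging Face models are large and do not fit in a standard warehouse. Use a Snowpark-optimized warehouse or select a smaller version of the model. For example, instead of using the Llama-2-70b-chat-hf model, try Llama-2-7b-chat-hf.

  • Snowflake warehouses do not have GPUs. Use only CPU-optimized Hugging Face models.

  • Some Hugging Face transformers return an array of dictionaries per input row. The registry converts the array of dictionaries to a string containing a JSON representation of the array. For example, multi-output question-answering output looks similar to this: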

    '[{"score": 0.61094731092453, "start": 139, "end": 178, "answer": "learn more about the world of athletics"},
    {"score": 0.17750297486782074, "start": 139, "end": 180, "answer": "learn more about the world of athletics.\""}]'
    
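Because such results arrive as a JSON string rather than as a native list, the caller has to decode them before use. The following minimal sketch decodes the question-answering string shown above with Python's standard json module and picks the highest-scoring answer; the variable names are illustrative.

```python
import json

# The string value from the usage note above, as it would appear in one cell.
raw_output = (
    '[{"score": 0.61094731092453, "start": 139, "end": 178, '
    '"answer": "learn more about the world of athletics"}, '
    '{"score": 0.17750297486782074, "start": 139, "end": 180, '
    '"answer": "learn more about the world of athletics.\\""}]'
)

# Decode the JSON text into a list of dictionaries.
answers = json.loads(raw_output)

# Pick the candidate with the highest confidence score.
best = max(answers, key=lambda a: a["score"])
print(best["answer"])
```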

Example

# Prepare model
import transformers
import pandas as pd

finbert_model = transformers.pipeline(
    task="text-classification",
    model="ProsusAI/finbert",
    top_k=2,
)

# Log the model
mv = registry.log_model(
    finbert_model,
    model_name="finbert",
    version_name="v1",
)

# Use the model
mv.run(pd.DataFrame(
        [
            ["I have a problem with my Snowflake that needs to be resolved asap!!", ""],
            ["I would like to have udon for today's dinner.", ""],
        ]
    )
)

Result:

0  [{"label": "negative", "score": 0.8106237053871155}, {"label": "neutral", "score": 0.16587384045124054}]
1  [{"label": "neutral", "score": 0.9263970851898193}, {"label": "positive", "score": 0.05286872014403343}]

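Inferred signatures for Hugging Face pipelines

The Snowflake Model Registry automatically infers the signatures of Hugging Face pipelines that contain a single task from the following list: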

  • conversational

  • fill-mask

  • question-answering

  • summarization

  • table-question-answering

  • text2text-generation

  • text-classification (alias: sentiment-analysis)

  • text-generation

  • token-classification (alias: ner)

  • translation

  • translation_xx_to_yy

  • zero-shot-classification

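This section describes the signatures of these types of Hugging Face pipelines, including a description and example of the required inputs and expected outputs. All inputs and outputs are Snowpark DataFrames.

Conversational pipeline

A pipeline whose task is conversational (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.ConversationalPipeline) has the following inputs and outputs.

Inputs

  • user_inputs: A list of strings that represent the user's previous and current inputs. The last one in the list is the current input.

  • generated_responses: A list of strings that represent the model's previous responses.

Example: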

---------------------------------------------------------------------------
|"user_inputs"                                    |"generated_responses"  |
---------------------------------------------------------------------------
|[                                                |[                      |
|  "Do you speak French?",                        |  "Yes I do."          |
|  "Do you know how to say Snowflake in French?"  |]                      |
|]                                                |                       |
---------------------------------------------------------------------------

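Outputs

  • generated_responses: A list of strings that represent the model's previous and current responses. The last one in the list is the current response.

Example: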

-------------------------
|"generated_responses"  |
-------------------------
|[                      |
|  "Yes I do.",         |
|  "I speak French."    |
|]                      |
-------------------------
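To continue a conversation, the output of one call feeds the input of the next. The following is a minimal sketch of that bookkeeping, assuming (as in the example above) that user_inputs is always one element longer than generated_responses; the helper name next_turn is hypothetical and not part of any library.

```python
def next_turn(user_inputs, generated_responses, new_message):
    """Build the input row for the next turn of a conversation.

    Every earlier user input must already have a response; the new
    message becomes the last, still unanswered, element of user_inputs.
    """
    if len(user_inputs) != len(generated_responses):
        raise ValueError("each earlier user input needs a matching response")
    return {
        "user_inputs": list(user_inputs) + [new_message],
        "generated_responses": list(generated_responses),
    }

# Reproduces the input row shown in the conversational example.
row = next_turn(
    ["Do you speak French?"],
    ["Yes I do."],
    "Do you know how to say Snowflake in French?",
)
print(row)
```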

Fill-mask pipeline

A pipeline whose task is fill-mask has the following inputs and outputs.

Inputs

  • inputs: A string containing a mask to fill.

Example:

--------------------------------------------------
|"inputs"                                        |
--------------------------------------------------
|LynYuu is the [MASK] of the Grand Duchy of Yu.  |
--------------------------------------------------

Outputs

  • outputs: A string containing a JSON representation of a list of objects, each of which may contain keys such as score, token, token_str, or sequence. For details, see FillMaskPipeline.

Example:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"outputs"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|[{"score": 0.9066258072853088, "token": 3007, "token_str": "capital", "sequence": "lynyuu is the capital of the grand duchy of yu."}, {"score": 0.08162177354097366, "token": 2835, "token_str": "seat", "sequence": "lynyuu is the seat of the grand duchy of yu."}, {"score": 0.0012052370002493262, "token": 4075, "token_str": "headquarters", "sequence": "lynyuu is the headquarters of the grand duchy of yu."}, {"score": 0.0006560495239682496, "token": 2171, "token_str": "name", "sequence": "lynyuu is the name of the grand duchy of yu."}, {"score": 0.0005427763098850846, "token": 3200, "token_str"...  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Token classification

A pipeline whose task is ner or token-classification (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TokenClassificationPipeline) has the following inputs and outputs.

Inputs

  • inputs: A string containing the tokens to be classified.

Example:

------------------------------------------------
|"inputs"                                      |
------------------------------------------------
|My name is Izumi and I live in Tokyo, Japan.  |
------------------------------------------------

Outputs

  • outputs: A string containing a JSON representation of a list of result objects, each of which may contain keys such as entity, score, index, word, name, start, or end. For details, see TokenClassificationPipeline (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TokenClassificationPipeline).

Example:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"outputs"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|[{"entity": "PRON", "score": 0.9994392991065979, "index": 1, "word": "my", "start": 0, "end": 2}, {"entity": "NOUN", "score": 0.9968984127044678, "index": 2, "word": "name", "start": 3, "end": 7}, {"entity": "AUX", "score": 0.9937735199928284, "index": 3, "word": "is", "start": 8, "end": 10}, {"entity": "PROPN", "score": 0.9928083419799805, "index": 4, "word": "i", "start": 11, "end": 12}, {"entity": "PROPN", "score": 0.997334361076355, "index": 5, "word": "##zumi", "start": 12, "end": 16}, {"entity": "CCONJ", "score": 0.999173104763031, "index": 6, "word": "and", "start": 17, "end": 20}, {...  |
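In the example output, the name "Izumi" is split into two word pieces, "i" and "##zumi". The following is a minimal sketch of one way to stitch such pieces back together after decoding the JSON string. It assumes that a "##" prefix marks a continuation piece and that start and end are character offsets into the input, which matches the values shown above; the helper name merge_word_pieces is hypothetical.

```python
import json

text = "My name is Izumi and I live in Tokyo, Japan."

# First six entries of the example output above (the rest is truncated there).
outputs = json.dumps([
    {"entity": "PRON", "score": 0.9994392991065979, "index": 1, "word": "my", "start": 0, "end": 2},
    {"entity": "NOUN", "score": 0.9968984127044678, "index": 2, "word": "name", "start": 3, "end": 7},
    {"entity": "AUX", "score": 0.9937735199928284, "index": 3, "word": "is", "start": 8, "end": 10},
    {"entity": "PROPN", "score": 0.9928083419799805, "index": 4, "word": "i", "start": 11, "end": 12},
    {"entity": "PROPN", "score": 0.997334361076355, "index": 5, "word": "##zumi", "start": 12, "end": 16},
    {"entity": "CCONJ", "score": 0.999173104763031, "index": 6, "word": "and", "start": 17, "end": 20},
])

def merge_word_pieces(text, tokens):
    """Fold '##' continuation pieces into the token that precedes them."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            # Extend the previous token's span to cover this piece.
            merged[-1]["end"] = tok["end"]
        else:
            merged.append(
                {"entity": tok["entity"], "start": tok["start"], "end": tok["end"]}
            )
    # Recover each word from the original text using its character span.
    for item in merged:
        item["word"] = text[item["start"]:item["end"]]
    return merged

words = merge_word_pieces(text, json.loads(outputs))
print([(w["word"], w["entity"]) for w in words])
```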

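Question answering (single output)

A pipeline whose task is question-answering (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.QuestionAnsweringPipeline), where top_k is either unset or set to 1, has the following inputs and outputs.

Inputs

  • question: A string containing the question to answer.

  • context: A string that may contain the answer.

Example: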

-----------------------------------------------------------------------------------
|"question"                  |"context"                                           |
-----------------------------------------------------------------------------------
|What did Doris want to do?  |Doris is a cheerful mermaid from the ocean dept...  |
-----------------------------------------------------------------------------------

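Outputs

  • score: Floating-point confidence score from 0.0 to 1.0.

  • start: Integer index of the start of the answer in the context. In the example below this is a character offset: end minus start equals the length of the answer.

  • end: Integer index of the end of the answer in the original context.

  • answer: A string containing the answer that was found.

Example: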

--------------------------------------------------------------------------------
|"score"           |"start"  |"end"  |"answer"                                 |
--------------------------------------------------------------------------------
|0.61094731092453  |139      |178    |learn more about the world of athletics  |
--------------------------------------------------------------------------------

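Question answering (multiple outputs)

A pipeline whose task is question-answering (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.QuestionAnsweringPipeline), where top_k is set to a value greater than 1, has the following inputs and outputs.

Inputs

  • question: A string containing the question to answer.

  • context: A string that may contain the answer.

Example: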

-----------------------------------------------------------------------------------
|"question"                  |"context"                                           |
-----------------------------------------------------------------------------------
|What did Doris want to do?  |Doris is a cheerful mermaid from the ocean dept...  |
-----------------------------------------------------------------------------------

Outputs

  • outputs: A string containing a JSON representation of a list of result objects, each of which may contain keys such as score, start, end, or answer.

Example:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"outputs"                                                                                                                                                                                                                                                                                                                                        |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|[{"score": 0.61094731092453, "start": 139, "end": 178, "answer": "learn more about the world of athletics"}, {"score": 0.17750297486782074, "start": 139, "end": 180, "answer": "learn more about the world of athletics.\""}, {"score": 0.06438097357749939, "start": 138, "end": 178, "answer": "\"learn more about the world of athletics"}]  |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Summarization

A pipeline whose task is summarization (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.SummarizationPipeline), where return_tensors is False or unset, has the following inputs and outputs.

Inputs

  • documents: A string containing the text to summarize.

Example:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"documents"                                                                                                                                                                                               |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Neuro-sama is a chatbot styled after a female VTuber that hosts live streams on the Twitch channel "vedal987". Her speech and personality are generated by an artificial intelligence (AI) system  wh...  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

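Outputs

  • summary_text: A string containing the generated summary or, if num_return_sequences is greater than 1, a string containing a JSON representation of a list of results, each a dictionary that contains fields including summary_text.

Example: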

---------------------------------------------------------------------------------
|"summary_text"                                                                 |
---------------------------------------------------------------------------------
| Neuro-sama is a chatbot styled after a female VTuber that hosts live streams  |
---------------------------------------------------------------------------------

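Table question answering

A pipeline whose task is table-question-answering (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TableQuestionAnsweringPipeline) has the following inputs and outputs.

Inputs

  • query: A string containing the question to answer.

  • table: A string containing a JSON-serialized dictionary in the form {column -> [values]} representing the table that may contain the answer.

Example: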

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"query"                                  |"table"                                                                                                                                                                                                                                                   |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Which channel has the most subscribers?  |{"Channel": ["A.I.Channel", "Kaguya Luna", "Mirai Akari", "Siro"], "Subscribers": ["3,020,000", "872,000", "694,000", "660,000"], "Videos": ["1,200", "113", "639", "1,300"], "Created At": ["Jun 30 2016", "Dec 4 2017", "Feb 28 2014", "Jun 23 2017"]}  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
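The table input is a string, not a nested object, so the column dictionary has to be serialized before it is placed in the DataFrame. The following is a minimal sketch using Python's standard json module with two of the columns from the example above; the equal-length check is an added sanity check for illustration, not a documented requirement.

```python
import json

# Column-oriented table: each key is a column name, each value its cells.
table = {
    "Channel": ["A.I.Channel", "Kaguya Luna", "Mirai Akari", "Siro"],
    "Subscribers": ["3,020,000", "872,000", "694,000", "660,000"],
}

# Sanity check (assumption): every column lists the same number of rows.
row_counts = {len(values) for values in table.values()}
if len(row_counts) != 1:
    raise ValueError("all columns must have the same number of values")

# One input row: both fields are plain strings.
row = {
    "query": "Which channel has the most subscribers?",
    "table": json.dumps(table),
}
print(row["table"])
```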

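Outputs

  • answer: A string containing a possible answer.

  • coordinates: A list of integers representing the coordinates of the cells where the answer was located.

  • cells: A list of strings containing the content of the cells where the answer was located.

  • aggregator: A string containing the name of the aggregator used.

Example: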

----------------------------------------------------------------
|"answer"     |"coordinates"  |"cells"          |"aggregator"  |
----------------------------------------------------------------
|A.I.Channel  |[              |[                |NONE          |
|             |  [            |  "A.I.Channel"  |              |
|             |    0,         |]                |              |
|             |    0          |                 |              |
|             |  ]            |                 |              |
|             |]              |                 |              |
----------------------------------------------------------------

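Text classification (single output)

A pipeline whose task is text-classification, where top_k is not set or is None, has the following inputs and outputs.

Inputs

  • text: A string to classify.

  • text_pair: A string to classify along with text, used with models that compute text similarity. Leave empty if the model does not use it.

Example: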

----------------------------------
|"text"       |"text_pair"       |
----------------------------------
|I like you.  |I love you, too.  |
----------------------------------

Outputs

  • label: A string representing the classification label of the text.

  • score: Floating-point confidence score from 0.0 to 1.0.

Example:

--------------------------------
|"label"  |"score"             |
--------------------------------
|LABEL_0  |0.9760091304779053  |
--------------------------------

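Text classification (multiple outputs)

A pipeline whose task is text-classification, where top_k is set to a number, has the following inputs and outputs.

Note

A text classification task is considered multiple-output if top_k is set to any number, even if that number is 1. To get a single output, set top_k to None.

Inputs

  • text: A string to classify.

  • text_pair: A string to classify along with text, used with models that compute text similarity. Leave empty if the model does not use it.

Example: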

--------------------------------------------------------------------
|"text"                                              |"text_pair"  |
--------------------------------------------------------------------
|I am wondering if I should have udon or rice fo...  |             |
--------------------------------------------------------------------

Outputs

  • outputs: A string containing a JSON representation of a list of results, each of which contains fields that include label and score.

Example:

--------------------------------------------------------
|"outputs"                                             |
--------------------------------------------------------
|[{"label": "NEGATIVE", "score": 0.9987024068832397}]  |
--------------------------------------------------------
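Since the single-output and multiple-output variants return differently shaped rows, downstream code may want one common form. The following is a minimal sketch, assuming each result row has been converted to a Python dictionary keyed by column name; the helper name classification_results is hypothetical.

```python
import json

def classification_results(row):
    """Return a list of {'label', 'score'} dicts for either output shape."""
    if "outputs" in row:
        # Multiple-output shape: one JSON string holding a list of results.
        return json.loads(row["outputs"])
    # Single-output shape: separate label and score columns.
    return [{"label": row["label"], "score": row["score"]}]

# Rows taken from the two text classification examples above.
single = {"label": "LABEL_0", "score": 0.9760091304779053}
multi = {"outputs": '[{"label": "NEGATIVE", "score": 0.9987024068832397}]'}

print(classification_results(single))
print(classification_results(multi))
```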

Text generation

A pipeline whose task is text-generation (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TextGenerationPipeline), where return_tensors is False or unset, has the following inputs and outputs.

Note

Text generation pipelines where return_tensors is True are not supported.

Inputs

  • inputs: A string that contains a prompt.

Example:

--------------------------------------------------------------------------------
|"inputs"                                                                      |
--------------------------------------------------------------------------------
|A descendant of the Lost City of Atlantis, who swam to Earth while saying, "  |
--------------------------------------------------------------------------------

Outputs

  • outputs: A string containing a JSON representation of a list of result objects, each of which contains fields that include generated_text.

Example:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"outputs"                                                                                                                                                                                                 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|[{"generated_text": "A descendant of the Lost City of Atlantis, who swam to Earth while saying, \"For my life, I don't know if I'm gonna land upon Earth.\"\n\nIn \"The Misfits\", in a flashback, wh...  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Text-to-text generation

A pipeline whose task is text2text-generation (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.Text2TextGenerationPipeline), where return_tensors is False or unset, has the following inputs and outputs.

Note

Text-to-text generation pipelines where return_tensors is True are not supported.

Inputs

  • inputs: A string that contains a prompt.

Example:

--------------------------------------------------------------------------------
|"inputs"                                                                      |
--------------------------------------------------------------------------------
|A descendant of the Lost City of Atlantis, who swam to Earth while saying, "  |
--------------------------------------------------------------------------------

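Outputs

  • generated_text: A string containing the generated text if num_return_sequences is 1. If num_return_sequences is greater than 1, a string containing a JSON representation of a list of result dictionaries, each of which contains fields that include generated_text.

Example: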

----------------------------------------------------------------
|"generated_text"                                              |
----------------------------------------------------------------
|, said that he was a descendant of the Lost City of Atlantis  |
----------------------------------------------------------------
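Because the meaning of the generated_text cell depends on num_return_sequences, the caller needs to know which setting the pipeline was created with. The following is a minimal sketch of a helper that returns a list of texts in both cases; the helper name generated_texts and the placeholder strings "first" and "second" are made up for illustration.

```python
import json

def generated_texts(cell, num_return_sequences=1):
    """Return the generated texts held in one generated_text cell."""
    if num_return_sequences == 1:
        # The cell is the generated text itself.
        return [cell]
    # The cell is a JSON list of dictionaries with a generated_text field.
    return [item["generated_text"] for item in json.loads(cell)]

# Single sequence: the value from the example table above.
one = generated_texts(", said that he was a descendant of the Lost City of Atlantis")

# Several sequences: a made-up cell in the JSON form described above.
many = generated_texts(
    json.dumps([{"generated_text": "first"}, {"generated_text": "second"}]),
    num_return_sequences=2,
)
print(one)
print(many)
```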

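Translation generation

A pipeline whose task is translation (https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TranslationPipeline), where return_tensors is False or unset, has the following inputs and outputs.

Note

Translation generation pipelines where return_tensors is True are not supported.

Inputs

  • inputs: A string containing the text to translate.

Example: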

------------------------------------------------------------------------------------------------------
|"inputs"                                                                                            |
------------------------------------------------------------------------------------------------------
|Snowflake's Data Cloud is powered by an advanced data platform provided as a self-managed service.  |
------------------------------------------------------------------------------------------------------

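Outputs

  • translation_text: A string representing the generated translation if num_return_sequences is 1. Otherwise, a string containing a JSON representation of a list of result dictionaries, each of which contains fields that include translation_text.

Example: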

---------------------------------------------------------------------------------------------------------------------------------
|"translation_text"                                                                                                             |
---------------------------------------------------------------------------------------------------------------------------------
|Le Cloud de données de Snowflake est alimenté par une plate-forme de données avancée fournie sous forme de service autogérés.  |
---------------------------------------------------------------------------------------------------------------------------------

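Zero-shot classification

A pipeline whose task is zero-shot-classification has the following inputs and outputs.

Inputs

  • sequences: A string that contains the text to be classified.

  • candidate_labels: A list of strings containing the labels to be applied to the text.

Example: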

-----------------------------------------------------------------------------------------
|"sequences"                                                       |"candidate_labels"  |
-----------------------------------------------------------------------------------------
|I have a problem with Snowflake that needs to be resolved asap!!  |[                   |
|                                                                  |  "urgent",         |
|                                                                  |  "not urgent"      |
|                                                                  |]                   |
|I have a problem with Snowflake that needs to be resolved asap!!  |[                   |
|                                                                  |  "English",        |
|                                                                  |  "Japanese"        |
|                                                                  |]                   |
-----------------------------------------------------------------------------------------

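Outputs

  • sequence: The input string.

  • labels: A list of strings representing the labels that were applied.

  • scores: A list of floating-point confidence scores, one for each label.

Example: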

--------------------------------------------------------------------------------------------------------------
|"sequence"                                                        |"labels"        |"scores"                |
--------------------------------------------------------------------------------------------------------------
|I have a problem with Snowflake that needs to be resolved asap!!  |[               |[                       |
|                                                                  |  "urgent",     |  0.9952737092971802,   |
|                                                                  |  "not urgent"  |  0.004726255778223276  |
|                                                                  |]               |]                       |
|I have a problem with Snowflake that needs to be resolved asap!!  |[               |[                       |
|                                                                  |  "Japanese",   |  0.5790848135948181,   |
|                                                                  |  "English"     |  0.42091524600982666   |
|                                                                  |]               |]                       |
--------------------------------------------------------------------------------------------------------------
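The labels and scores columns are parallel lists, so the score at a given position belongs to the label at the same position. The following is a minimal sketch that pairs them up and picks the best label for each row, using the values from the example output above; the helper name top_label is hypothetical.

```python
# The labels and scores from the two output rows in the example above.
rows = [
    {
        "labels": ["urgent", "not urgent"],
        "scores": [0.9952737092971802, 0.004726255778223276],
    },
    {
        "labels": ["Japanese", "English"],
        "scores": [0.5790848135948181, 0.42091524600982666],
    },
]

def top_label(row):
    """Return the (label, score) pair with the highest score in a row."""
    return max(zip(row["labels"], row["scores"]), key=lambda pair: pair[1])

best = [top_label(row) for row in rows]
print(best)
```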