Cortex AI Functions: Audio¶
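Cortex AI Audio provides advanced LLM-powered audio processing capabilities, including:
Transcription: Converts spoken language to text.
Speaker identification: Determines who is speaking in each part of a multi-speaker audio file.
Timestamp extraction: Identifies the timestamp of each spoken word.
These capabilities are available through the AI_TRANSCRIBE function. Because AI_TRANSCRIBE is managed and hosted within Snowflake, you can integrate audio processing into your data workflows without complex setup or infrastructure management.
Note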
The AI_TRANSCRIBE function also processes audio tracks in video files.
AI_TRANSCRIBE¶
AI_TRANSCRIBE is a fully managed SQL function that transcribes audio and video files stored in a stage, extracting text, timestamps, and speaker information. See Create stage for media files for information on creating a stage suitable for storing files for processing by AI_TRANSCRIBE.
Under the hood, AI_TRANSCRIBE orchestrates optimized AI models for transcription and speaker diarization, processing audio files of up to two hours in length. AI_TRANSCRIBE scales horizontally, so batches of files can be processed in parallel. Audio can be processed directly from object storage to avoid unnecessary data movement.
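By default, AI_TRANSCRIBE converts audio files into clean, readable text. You can also specify a timestamp granularity to extract timestamps for each word or for each change of speaker. Word-level timestamps are useful for applications such as subtitles, or for letting users jump to a specific part of the audio by clicking a word in the transcript. Speaker-level timestamps help you understand who said what in a meeting, interview, or phone call.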
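| Timestamp granularity mode | Result |
|---|---|
| Default | Transcribes the entire audio file as a single segment |
| Word | Each transcribed word has a timestamp |
| Speaker | Indicates who is speaking at each change of speaker, with a timestamp |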
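As an illustration of the subtitle use case, the following Python sketch groups word-level segments into SRT-style caption blocks. The response shape used here (a `segments` list whose items carry `start`, `end`, and `text`, with times in seconds) is an assumption, as are the helper names and the sample values; check the actual AI_TRANSCRIBE output before relying on it.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(transcript: dict, words_per_caption: int = 3) -> str:
    """Group consecutive word segments into numbered SRT caption blocks."""
    segments = transcript["segments"]  # assumed field name
    blocks = []
    for i in range(0, len(segments), words_per_caption):
        group = segments[i:i + words_per_caption]
        text = " ".join(seg["text"] for seg in group)
        index = i // words_per_caption + 1
        # Each caption spans from its first word's start to its last word's end.
        span = f"{srt_time(group[0]['start'])} --> {srt_time(group[-1]['end'])}"
        blocks.append(f"{index}\n{span}\n{text}")
    return "\n\n".join(blocks)


# Hypothetical word-level output, invented for this example.
sample = {
    "segments": [
        {"start": 0.5, "end": 0.9, "text": "Hello"},
        {"start": 0.9, "end": 1.25, "text": "and"},
        {"start": 1.25, "end": 1.8, "text": "welcome"},
        {"start": 2.0, "end": 2.4, "text": "everyone"},
    ]
}

print(words_to_srt(sample))
```

Grouping a fixed number of words per caption keeps the sketch short; a real subtitle pipeline would more likely break on pauses or punctuation.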
Supported languages¶
AI_TRANSCRIBE supports the following languages, which are automatically detected. Files can contain multiple supported languages.
Note
Language detection requires audio to begin within the first five seconds of the file. For best results, trim excess silence before uploading.
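Arabic
Bulgarian
Cantonese
Catalan
Chinese
Czech
Dutch
English
French
German
Greek
Hebrew
Hindi
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Malay
Norwegian
Polish
Portuguese
Romanian
Russian
Serbian
Slovenian
Spanish
Swedish
Thai
Turkish
Ukrainian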
Supported media formats¶
AI_TRANSCRIBE supports the following audio and video file formats:
| Audio | FLAC, MP3, MP4, OGG, WAV, WEBM |
|---|---|
| Video | MKV, MP4, OGV, WEBM |
Video files must contain at least one audio track in FLAC, MP3, OPUS, VORBIS, or WAV format.
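Before staging a large batch of files, it can help to screen them against the formats listed above. The following Python sketch does this by file extension only. The function name and the extension-based approach are assumptions made for illustration: an extension does not prove the container format, and the check does not inspect the audio track of a video file.

```python
from pathlib import PurePosixPath

# Extensions taken from the supported-format lists above.
AUDIO_EXTENSIONS = {"flac", "mp3", "mp4", "ogg", "wav", "webm"}
VIDEO_EXTENSIONS = {"mkv", "mp4", "ogv", "webm"}


def media_kinds(path: str) -> set[str]:
    """Return which supported categories a file's extension falls into."""
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    kinds = set()
    if ext in AUDIO_EXTENSIONS:
        kinds.add("audio")
    if ext in VIDEO_EXTENSIONS:
        kinds.add("video")
    return kinds


# Hypothetical file names: print each one with its matching categories.
for name in ["calls/intro.WAV", "podcast/episode.mp4", "notes.txt"]:
    print(name, sorted(media_kinds(name)))
```

An empty result means the extension is not in either list, so the file is probably worth converting or excluding before it is processed.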
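Examples¶
Text transcription¶
The following example transcribes an audio file stored in the financial_consultation stage and returns a transcript of the entire file. The TO_FILE function converts the staged file to a file reference.
Response:
Word-level segmentation with timestamps¶
Set the timestamp granularity to "word" to extract precise timestamps for each spoken word, enabling searchable, navigable transcripts. Note that this audio file is in Spanish.
Response:
Note
The output is truncated for brevity. The full output contains a segment for each word in the audio file.
Speaker identification¶
Set the timestamp granularity to "speaker" to detect, separate, and identify the unique speakers in a conversation or meeting. This example uses an audio file that contains two speakers, one speaking English and the other speaking Spanish.
Response:
Note
The output is truncated for brevity. The full output contains a segment for each conversational "turn" in the audio file.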
Use with other AI Functions¶
Call transcript analysis¶
You can pass the output of AI_TRANSCRIBE to other AI Functions for further processing. For example, you can use
AI_SUMMARIZE to summarize the transcription, or AI_CLASSIFY to classify the content of the transcription. This example
uses AI_SENTIMENT and AI_COMPLETE to analyze the text transcribed from
customer call audio and provide sentiment on four dimensions
and an assessment of the agent.
Note
AI_SENTIMENT analyzes text only and does not take vocal characteristics such as tone of voice into account.
AI_SENTIMENT response:
AI_COMPLETE response:
Video transcript analysis¶
The following example transcribes a video file stored in the podcast_videos_S3 stage.
Response:
Once you have the transcript, you can use AI_COMPLETE to perform additional analysis. This example identifies retail brands mentioned in the conversation for use in advertising or sponsorship analytics.
Response:
Cost considerations¶
Billing for all AI Functions is based on token consumption. For transcription, each second of audio processed is 50 tokens, regardless of language or segmentation method. A full hour of audio is therefore 180,000 tokens. Assuming that processing a million tokens costs 1.3 credits, and that Snowflake credits cost US $3 each, each hour of audio processed costs about US $0.702. This estimate is subject to change. For current pricing information, see the Snowflake Service Consumption Table.
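The arithmetic above can be reproduced with a short Python sketch. The rate of 50 tokens per second comes from the text; the 1.3 credits per million tokens and US $3 per credit are the same illustrative assumptions used in the paragraph, not current prices, and the function names are made up for this example.

```python
TOKENS_PER_SECOND = 50  # stated rate for transcription


def transcription_tokens(audio_seconds: float) -> float:
    """Tokens consumed for a given audio duration."""
    return audio_seconds * TOKENS_PER_SECOND


def estimated_cost_usd(
    audio_seconds: float,
    credits_per_million_tokens: float = 1.3,  # assumed rate from the text
    usd_per_credit: float = 3.0,              # assumed credit price from the text
) -> float:
    """Estimated cost in US dollars under the assumed rates."""
    credits = transcription_tokens(audio_seconds) / 1_000_000 * credits_per_million_tokens
    return credits * usd_per_credit


one_hour = 3600
print(transcription_tokens(one_hour))          # 180000
print(round(estimated_cost_usd(one_hour), 3))  # 0.702
```

Passing your own contract rates as arguments gives an estimate for your account; the Service Consumption Table remains the authoritative source.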
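Note
AI_TRANSCRIBE has a minimum billable duration of 1 minute. Files shorter than 1 minute are still processed but are billed as 1 minute. To process large numbers of short audio files efficiently, consider batching them into a single file and using timestamps to identify where each original file begins and ends in the resulting transcript.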
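To see why batching short files helps, the following Python sketch compares billed tokens for ten 12-second clips processed separately and as one concatenated file, and computes where each clip would start and end in the merged audio. It assumes that files of a minute or more are billed on their actual duration and that clips are joined back to back with no padding; the function names and clip lengths are hypothetical.

```python
TOKENS_PER_SECOND = 50    # stated rate for transcription
MIN_BILLED_SECONDS = 60   # stated minimum billable duration


def billed_tokens(duration_seconds: float) -> float:
    """Tokens billed for one file, applying the one-minute minimum."""
    return max(duration_seconds, MIN_BILLED_SECONDS) * TOKENS_PER_SECOND


def clip_offsets(durations: list[float]) -> list[tuple[float, float]]:
    """(start, end) of each clip inside a back-to-back concatenation."""
    offsets, start = [], 0
    for d in durations:
        offsets.append((start, start + d))
        start += d
    return offsets


clips = [12] * 10  # ten hypothetical 12-second clips

separate = sum(billed_tokens(d) for d in clips)  # each clip billed as 60 s
batched = billed_tokens(sum(clips))              # one 120 s file

print(separate, batched)        # 30000 6000
print(clip_offsets(clips)[:3])  # [(0, 12), (12, 24), (24, 36)]
```

The offsets can then be matched against word-level or speaker-level timestamps in the transcript to split the text back into per-clip pieces.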