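# Using Structured Outputs to Constrain OpenAI API Responses
Generating structured data is a core requirement in AI-driven applications. We usually want the model to extract information from unstructured input and convert it into a structured format for downstream processing. As the OpenAI API has evolved, it has gained a feature for exactly this: **Structured Outputs**.
# What Are Structured Outputs?
Structured Outputs is an OpenAI API feature that lets us supply a JSON Schema to constrain the model's output so that it conforms exactly to a predefined data structure.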
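To make the idea concrete, here is a minimal sketch of the kind of JSON Schema involved, written by hand for an object with three string fields. The `conforms` helper is hypothetical and deliberately simplified (required keys, string values, no extra keys). It only illustrates what "conforming to a schema" means and is not how the OpenAI API enforces the schema; the sample values are made up for illustration.

```python
import json

# Hand-written JSON Schema for an object with three required string fields.
BIOMARKER_SCHEMA = {
    "type": "object",
    "properties": {
        "Biomarker_Std": {"type": "string"},
        "GPTAnno": {"type": "string"},
        "Symbol": {"type": "string"},
    },
    "required": ["Biomarker_Std", "GPTAnno", "Symbol"],
    "additionalProperties": False,
}


def conforms(raw_json: str, schema: dict) -> bool:
    """Simplified check: a JSON object with all required keys, string values, no extras."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    allowed = set(schema["properties"])
    if set(schema["required"]) - set(data):
        return False  # a required key is missing
    if not schema.get("additionalProperties", True) and set(data) - allowed:
        return False  # an unexpected key is present
    return all(isinstance(data[key], str) for key in data if key in allowed)


# Illustrative inputs (made-up values, not real model output).
good = '{"Biomarker_Std": "Interleukin-6", "GPTAnno": "Pro-inflammatory cytokine.", "Symbol": "IL6"}'
bad = '{"Biomarker_Std": "Interleukin-6"}'
print(conforms(good, BIOMARKER_SCHEMA))
print(conforms(bad, BIOMARKER_SCHEMA))
```

A full validator would also handle nested objects, arrays, and other types; this sketch covers only the flat, all-string case.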
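# A Practical Application: Normalizing Biomarkers
Let's walk through an example of using Structured Outputs to constrain what the OpenAI API returns.
I have a table of aging-related biomarkers that were extracted by hand from the literature. Because the extraction was rough, all I have is their names, and those names may not be standardized. I now want GPT to normalize them and generate two new columns, GPTAnno and Symbol, with the following requirements:
- Standardize the biomarker name.
- Annotate the biomarker's function.
- If the biomarker is clearly a gene, give its gene SYMBOL; if it is a chemical substance or small molecule found in the body, give the directly related gene SYMBOL(s). Separate multiple SYMBOLs with a comma followed by a space. In all other cases, return "/".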
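The formatting rule for the Symbol column (either "/" or symbols joined by a comma and a space) is easy to check mechanically. Below is a minimal sketch of such a check. The function name `is_valid_symbol_field` and the token pattern (letters, digits, hyphens) are assumptions made for illustration; real gene nomenclature may need a looser or stricter pattern.

```python
import re

# One gene-symbol-like token: letters, digits, and hyphens (an assumed, simplified pattern).
_TOKEN = r"[A-Za-z0-9][A-Za-z0-9-]*"
# Either a lone "/" or one or more tokens joined by exactly ", ".
_SYMBOL_FIELD = re.compile(rf"(?:/|{_TOKEN}(?:, {_TOKEN})*)")


def is_valid_symbol_field(value: str) -> bool:
    """Return True if the whole string follows the Symbol formatting rule."""
    return _SYMBOL_FIELD.fullmatch(value) is not None


for candidate in ["IL6", "IL6, TNF", "/", "IL6,TNF", "IL6; TNF"]:
    print(repr(candidate), is_valid_symbol_field(candidate))
```

A check like this only validates the format; it cannot tell whether a symbol is an official gene symbol.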
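# Prompt Design
We need to design two parts: the role given to the model, and the prompt itself.
For example, the role can be defined as:

    You are a knowledgeable assistant specialized in aging-related biomarkers.

The prompt needs to be more detailed: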

    I have a biomarker related to aging.
    1. Please standardize the name of this biomarker if it's unclear, and provide a concise annotation describing its function in biological processes, especially related to aging.
    2. If this biomarker represents specific gene(s), provide the official SYMBOL(s) for the gene(s). If it's a chemical substance or small molecule, provide the directly related gene SYMBOL(s). If no relevant gene is associated, return "/".
    3. If there are multiple SYMBOLs, separate them with a comma followed by a space (", ").
    The output should be in JSON format with the following fields:
    {{
      "Biomarker_Std": "the standardized name of this biomarker",
      "GPTAnno": "standardized biomarker name and its biological function",
      "Symbol": "associated gene SYMBOL(s) or '/' if none"
    }}

    Here is the biomarker: '{biomarker_name}'
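The doubled braces in the prompt are there because it is a Python format template: `{{` and `}}` render as literal braces, while `{biomarker_name}` is replaced with the actual name. The sketch below shows this with a shortened version of the template. `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, not part of the original code.

```python
# A shortened template: "{{" / "}}" are escaped braces, {biomarker_name} is a placeholder.
PROMPT_TEMPLATE = """Return JSON with the following fields:
{{
  "Symbol": "associated gene SYMBOL(s) or '/' if none"
}}

Here is the biomarker: '{biomarker_name}'"""


def build_prompt(biomarker_name: str) -> str:
    # str.format turns the doubled braces into single literal braces
    # and substitutes the biomarker name.
    return PROMPT_TEMPLATE.format(biomarker_name=biomarker_name)


prompt = build_prompt("IL-6")
print(prompt)
```

The same escaping applies when the prompt is written as an f-string.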
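# Code Implementation
We use OpenAI's Structured Outputs feature with a custom schema so that a successfully parsed response always contains the standardized biomarker name, the annotation, and the related gene symbol(s). The implementation follows.
# 1. Load Environment Variables and Set the Proxy
We can create a `.env` file in the working directory to store environment variables:

    MODEL=gpt-4o-mini
    OPENAI_API_KEY=sk-xxxx
    HTTP_PROXY=http://127.0.0.1:7890
    HTTPS_PROXY=http://127.0.0.1:7890

We then use python-dotenv to load the API key, model name, and proxy settings from the `.env` file. This keeps sensitive values out of the source code and makes it easier to deploy in different environments.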

    import os

    from dotenv import load_dotenv

    # Load variables from the .env file into the process environment
    load_dotenv()

    # Set the proxy (if needed)
    def set_proxy():
        # Re-export the proxy variables so HTTP clients that read the environment pick them up
        http_proxy = os.getenv('HTTP_PROXY')
        https_proxy = os.getenv('HTTPS_PROXY')
        if http_proxy:
            os.environ['HTTP_PROXY'] = http_proxy
        if https_proxy:
            os.environ['HTTPS_PROXY'] = https_proxy
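Once the variables are in the environment, it can help to read them in one place and fail early if the API key is missing. The following is a minimal sketch under that assumption. `Settings` and `read_settings` are hypothetical helpers, and the fallback model name is simply the one from the `.env` example above.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    model: str
    api_key: str


def read_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the model name and API key, failing fast when the key is absent."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; check the .env file")
    # Fall back to the model named in the .env example if MODEL is not set.
    model = env.get("MODEL", "gpt-4o-mini")
    return Settings(model=model, api_key=api_key)


# Demonstration with an explicit mapping instead of the real environment.
settings = read_settings({"MODEL": "gpt-4o-mini", "OPENAI_API_KEY": "sk-xxxx"})
print(settings.model)
```

Passing a mapping explicitly, as in the demonstration, also makes the helper easy to test without touching real environment variables.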
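# 2. Define the Schema
To make sure the OpenAI API output meets our needs, we define a `BiomarkerAnnotation` data structure with three fields: `Biomarker_Std` (the standardized biomarker name), `GPTAnno` (the annotation), and `Symbol` (the gene symbol(s)).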

    from pydantic import BaseModel

    # Schema for a biomarker annotation
    class BiomarkerAnnotation(BaseModel):
        Biomarker_Std: str  # standardized biomarker name
        GPTAnno: str        # standardized name plus its biological function
        Symbol: str         # related gene SYMBOL(s), or "/"
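# 3. Generate the Normalized Annotation
The `annotate_biomarker` function calls the OpenAI API to produce structured output matching the schema above. Inside the function we build a prompt asking the model to standardize the biomarker name, write a functional annotation, and supply the related gene symbol(s). Passing the schema to the `client.beta.chat.completions.parse` method provided by the OpenAI SDK ensures that the parsed output conforms to the schema we defined.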

    from openai import OpenAI

    # Initialize the OpenAI client
    def init_openai_client(api_key: str):
        return OpenAI(api_key=api_key)

    # Annotate a single biomarker
    def annotate_biomarker(biomarker_name: str, model: str, api_key: str) -> BiomarkerAnnotation:
        prompt = f"""
        I have a biomarker related to aging.
        1. Please standardize the name of this biomarker if it's unclear, and provide a concise annotation describing its function in biological processes, especially related to aging.
        2. If this biomarker represents a specific gene, provide the official SYMBOL for the gene. If it's a chemical substance or small molecule, provide the directly related gene SYMBOL(s). If no relevant gene is associated, return "/".
        3. If there are multiple SYMBOLs, separate them with a comma followed by a space (", ").
        The output should be in JSON format with the following fields:
        {{ "Biomarker_Std": "the standardized name of this biomarker", "GPTAnno": "standardized biomarker name and its biological function", "Symbol": "associated gene SYMBOL(s) or '/' if none" }}
        Here is the biomarker: '{biomarker_name}'
        """
        # Send the request through the OpenAI client
        client = init_openai_client(api_key=api_key)
        completion = client.beta.chat.completions.parse(
            model=model,
            messages=[
                {"role": "system", "content": "You are a knowledgeable assistant specialized in aging-related biomarkers."},
                {"role": "user", "content": prompt},
            ],
            response_format=BiomarkerAnnotation,  # use the custom schema
        )
        # Return the parsed result. Raise on a refusal so that callers never
        # receive a plain string where a BiomarkerAnnotation is expected.
        message = completion.choices[0].message
        if message.parsed:
            return message.parsed
        raise ValueError(f"Model refused or returned no parsed output: {message.refusal}")
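The message returned by `parse` carries either a parsed object or a refusal, and callers usually want to treat the two cases differently. One way to keep them apart is to raise a dedicated exception for refusals. The sketch below shows that branching on stand-in objects built with `SimpleNamespace`, so it runs without calling the API. `RefusalError` and `unwrap_parsed` are hypothetical names, and the stand-ins only mimic the `.parsed` and `.refusal` attributes used in the code above.

```python
from types import SimpleNamespace


class RefusalError(Exception):
    """Hypothetical exception for a response that carries a refusal instead of parsed data."""


def unwrap_parsed(message):
    # `message` is expected to expose `.parsed` and `.refusal`.
    if message.parsed is not None:
        return message.parsed
    raise RefusalError(message.refusal or "no parsed output and no refusal text")


# Stand-in messages (not real API responses).
ok = SimpleNamespace(parsed={"Symbol": "IL6"}, refusal=None)
refused = SimpleNamespace(parsed=None, refusal="I can't help with that.")

print(unwrap_parsed(ok))
try:
    unwrap_parsed(refused)
except RefusalError as err:
    print(f"refused: {err}")
```

Raising here means a retry loop or error handler around the call sees refusals the same way it sees network errors, instead of receiving a string where an annotation object was expected.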
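# 4. Process Biomarkers in Batches
We need to process many biomarkers and save the results to a CSV file. The `process_biomarkers` function reads biomarkers from the input DataFrame, calls `annotate_biomarker` on each one to generate structured output, and stores the results in a new CSV file.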

    import time

    import pandas as pd

    # Process biomarkers in bulk and save the results
    def process_biomarkers(df: pd.DataFrame, model: str, api_key: str, output_file: str, batch_size: int = 10):
        columns = ['PMID', 'Biomarker', 'Biomarker_Std', 'GPTAnno', 'Symbol']
        rows = []  # collect result rows; building the DataFrame from a list avoids repeated pd.concat
        total = min(batch_size, len(df))  # batch_size caps how many rows are processed
        for i in range(total):
            # Pause briefly every 50 requests
            if i > 0 and i % 50 == 0:
                time.sleep(5)
            p = df.iloc[i]['PMID']
            biomarker = df.iloc[i]['Biomarker']
            print(f'Processing biomarker {i + 1}/{total}, PMID: {p}')
            response = None
            # Retry up to 5 times if the call raises
            for attempt in range(5):
                try:
                    response = annotate_biomarker(biomarker, model=model, api_key=api_key)
                    break
                except Exception as e:
                    print(f"Error on attempt {attempt + 1} for PMID {p}: {e}")
                    time.sleep(2)
            if response is None:
                print(f"Failed to process biomarker '{biomarker}' after 5 attempts.")
                continue
            # Save the result
            rows.append({
                'PMID': p,
                'Biomarker': biomarker,
                'Biomarker_Std': response.Biomarker_Std,
                'GPTAnno': response.GPTAnno,
                'Symbol': response.Symbol
            })
            print(f"Processed: {response}\n")
            # Write an intermediate checkpoint every 500 rows
            if (i + 1) % 500 == 0:
                pd.DataFrame(rows, columns=columns).to_csv(f'intermediate_result_{i+1}.csv', index=False, encoding='utf-8')
            time.sleep(1)
        pd.DataFrame(rows, columns=columns).to_csv(output_file, index=False, encoding='utf-8')
        print(f"All biomarkers processed and saved to {output_file}")
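The retry loop inside `process_biomarkers` waits a fixed 2 seconds between attempts. A possible variation, shown here only as a sketch, is to factor the loop into a helper that backs off exponentially. `call_with_retries` is a hypothetical name, and the `sleep` parameter is injectable so the example runs instantly.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], attempts: int = 5, base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call func until it succeeds or attempts run out; return None on total failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            print(f"Error on attempt {attempt + 1}: {exc}")
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a function that fails twice, then succeeds.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


recorded_delays = []
result = call_with_retries(flaky, sleep=recorded_delays.append)  # record delays instead of sleeping
print(result, recorded_delays)
```

With such a helper, the inner loop could be written as `response = call_with_retries(lambda: annotate_biomarker(biomarker, model=model, api_key=api_key))`.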