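Issue 10: Building Clinical Prediction Models

🎯 Skill Overview

Clinical prediction models are a popular direction in clinical research. Building one the traditional way can take 1-2 months of learning statistics, tuning parameters, and drawing figures. With AI assistance, the full workflow, from data preprocessing through model evaluation to nomogram generation, can be completed in about an hour.

This tutorial uses three tools, scikit-learn (machine learning), SHAP (model interpretability), and statsmodels (statistical modeling), to build a publication-grade clinical prediction model.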

🤖 scikit-learn 🔍 SHAP 📊 statsmodels

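💡 Use Cases

🏥 Disease risk prediction

Predict clinical outcomes such as 30-day mortality, readmission risk, and the probability of complications.

📋 Diagnostic support tools

Build diagnostic models from clinical indicators to improve the accuracy of disease identification.

📊 Prognostic stratification

Divide patients into high-risk and low-risk groups to guide individualized treatment.

📄 Figures for publication

Generate the figures a paper needs in one pass: ROC curves, calibration curves, DCA curves, and nomograms.

🔬 Model comparison studies

Compare the predictive performance of machine learning models with that of traditional statistical models.

💊 Treatment response prediction

Predict how a patient will respond to a specific treatment, in support of precision medicine.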

🛠️ Core Skill Calls

# Skill 1: scikit-learn - training machine learning models
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Split the data and train a model (X is the feature matrix, y the binary outcome)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Skill 2: SHAP - model interpretability analysis
import shap

# Create a SHAP explainer (TreeExplainer is meant for tree-based models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize feature importance
shap.summary_plot(shap_values, X_test)

# Skill 3: statsmodels - statistical modeling and nomograms
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Logistic regression with a full statistical report
logit_model = sm.Logit(y_train, sm.add_constant(X_train))
result = logit_model.fit()
print(result.summary())

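📖 Hands-on Example: 30-Day Mortality Prediction Model for Heart Failure Patients

Step 1: Data preparation and preprocessing

First load the heart failure dataset, then clean and preprocess it. The split comes before imputation and scaling so that test-set information does not leak into the preprocessing statistics:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load the data
data = pd.read_csv('heart_failure.csv')

# Inspect basic information (data.info() prints directly and returns None)
data.info()
print(data.describe())

# Separate features and label (this example assumes all features are numeric)
X = data.drop(columns=['death_30d'])
y = data['death_30d']

# Split into training and test sets first
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Impute missing values with the training-set median
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_raw)
X_test_imputed = imputer.transform(X_test_raw)

# Standardize continuous variables using training-set statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_imputed)
X_test = scaler.transform(X_test_imputed)

print(f"Training set: {X_train.shape}, test set: {X_test.shape}")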

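💡 Key points for data preprocessing:

• Missing values: impute continuous variables with the median and categorical variables with the mode
• Standardization: variables measured on different scales (such as age and creatinine level) need to be standardized, with the scaler fitted on the training set only
• Stratified sampling: the stratify parameter keeps the label distribution consistent between the training and test sets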
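The points above can be sketched as a single scikit-learn pipeline that treats continuous and categorical columns differently. This is a minimal illustration under stated assumptions: the toy data and the column names `age`, `creatinine`, and `sex` are hypothetical and are not taken from `heart_failure.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(70, 10, n),
    "creatinine": rng.normal(1.2, 0.4, n),
    "sex": rng.choice(["F", "M"], n),
    "death_30d": rng.integers(0, 2, n),
})
# Remove a few values to simulate missing data.
toy.loc[toy.index[::15], "creatinine"] = np.nan
toy.loc[toy.index[::20], "sex"] = np.nan

numeric_cols = ["age", "creatinine"]
categorical_cols = ["sex"]

preprocess = ColumnTransformer([
    # Continuous variables: median imputation, then standardization.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical variables: mode imputation, then one-hot encoding.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_toy = toy.drop(columns="death_30d")
y_toy = toy["death_30d"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Fitting the pipeline learns medians, modes, and scaling from the training rows only.
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print(proba[:5])
```

Because the preprocessing lives inside the pipeline, the same object can be passed to `cross_val_score`, and each fold then refits the imputers and scaler on its own training portion.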

Step 2: Training multiple models with cross-validation

Train several machine learning models and compare their performance with 5-fold cross-validation:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Define the candidate models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# 5-fold cross-validation on the training set, scored by AUC
results = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    results[name] = {
        'mean_auc': cv_scores.mean(),
        'std_auc': cv_scores.std()
    }
    print(f"{name}: AUC = {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Select the model with the highest mean cross-validated AUC
best_model_name = max(results, key=lambda x: results[x]['mean_auc'])
best_model = models[best_model_name]
print(f"\nBest model: {best_model_name}")

Step 3: Model evaluation (ROC, calibration curve, DCA)

Evaluate the best model thoroughly and generate the figures a paper needs:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
from sklearn.calibration import calibration_curve

# Refit on the full training set and predict on the test set
best_model.fit(X_train, y_train)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
y_pred = best_model.predict(X_test)  # class labels at the default 0.5 threshold

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='#8B5CF6', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')

# Calibration curve
prob_true, prob_pred = calibration_curve(y_test, y_pred_proba, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', color='#8B5CF6')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('Predicted Probability')
plt.ylabel('Observed Proportion')
plt.title('Calibration Curve')
plt.savefig('calibration_curve.png', dpi=300, bbox_inches='tight')

# Confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Performance metrics derived from the confusion matrix
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
ppv = cm[1, 1] / (cm[0, 1] + cm[1, 1])
npv = cm[0, 0] / (cm[0, 0] + cm[1, 0])

print(f"\nSensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Positive predictive value: {ppv:.3f}")
print(f"Negative predictive value: {npv:.3f}")
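The step title mentions DCA, but the code above stops at the confusion matrix. A minimal sketch of the net benefit calculation behind a decision curve follows, assuming the standard definition: net benefit = TP/n - FP/n × pt/(1 - pt) at threshold probability pt. The function names and the ten-patient demo arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def net_benefit(y_true, y_prob, thresholds):
    """Net benefit of treating patients whose predicted risk is >= each threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n = len(y_true)
    values = []
    for pt in thresholds:
        treated = y_prob >= pt
        tp = np.sum(treated & (y_true == 1))
        fp = np.sum(treated & (y_true == 0))
        # False positives are weighted by the odds of the threshold probability.
        values.append(tp / n - fp / n * pt / (1 - pt))
    return np.array(values)

def net_benefit_treat_all(y_true, thresholds):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = np.mean(y_true)
    thresholds = np.asarray(thresholds, dtype=float)
    return prevalence - (1 - prevalence) * thresholds / (1 - thresholds)

# Hypothetical demo data: 10 patients, 4 events.
y_demo = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
p_demo = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.2, 0.6, 0.1])

thresholds = np.arange(0.05, 0.95, 0.05)
nb_model = net_benefit(y_demo, p_demo, thresholds)
nb_all = net_benefit_treat_all(y_demo, thresholds)
nb_none = np.zeros_like(thresholds)  # treating nobody always has net benefit 0

for pt, m, a in zip(thresholds, nb_model, nb_all):
    print(f"pt={pt:.2f}  model={m:.3f}  treat-all={a:.3f}")
```

With the variables from this step, the same functions could be called as `net_benefit(y_test, y_pred_proba, thresholds)`. Plotting `nb_model`, `nb_all`, and `nb_none` against `thresholds` with `plt.plot` gives the decision curve.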

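Step 4: SHAP model interpretability analysis

Use SHAP to analyze feature importance and understand what drives the model's predictions:

import numpy as np
import shap
import matplotlib.pyplot as plt

shap_feature_names = list(X.columns)

# Create the SHAP explainer. TreeExplainer supports tree-based models only
# (e.g. Random Forest, Gradient Boosting); use another explainer otherwise.
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# Depending on the model and the SHAP version, the result may hold one set of
# values per class (a list or a 3-D array); keep the positive class (index 1).
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif np.ndim(shap_values) == 3:
    shap_values = shap_values[:, :, 1]
# expected_value may be a scalar or one value per class; the last entry is the positive class
base_value = np.ravel(explainer.expected_value)[-1]

# Global feature importance (show=False keeps the figure open so it can be saved)
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X_test, feature_names=shap_feature_names,
                  plot_type="bar", show=False)
plt.savefig('shap_feature_importance.png', dpi=300, bbox_inches='tight')
plt.close()

# SHAP summary plot (shows the direction of each feature's effect)
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X_test, feature_names=shap_feature_names, show=False)
plt.savefig('shap_summary.png', dpi=300, bbox_inches='tight')
plt.close()

# Explanation of a single patient's prediction (matplotlib=True renders a static figure)
shap.force_plot(base_value, shap_values[0], X_test[0],
                feature_names=shap_feature_names, matplotlib=True, show=False)
plt.savefig('shap_force_plot.png', dpi=300, bbox_inches='tight')
plt.close()

# Dependence plot for the first feature (feature interaction effects)
shap.dependence_plot(0, shap_values, X_test, feature_names=shap_feature_names, show=False)
plt.savefig('shap_dependence.png', dpi=300, bbox_inches='tight')
plt.close()

print("SHAP analysis complete. Figures saved.")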

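💡 Reading the SHAP plots:

• Feature importance plot: ranks features by mean absolute SHAP value, with the most important predictors at the top
• Summary plot: color encodes whether the feature value is high or low, and horizontal position shows the direction and size of its effect on the prediction
• Force plot: shows how the individual feature contributions add up to a single patient's prediction
• Dependence plot: shows how a feature's SHAP value changes with its value, colored by a second feature to reveal interaction effects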

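Step 5: Nomogram generation

A nomogram is built from a logistic regression model so that clinicians can use it directly for risk assessment. The code below fits that model and computes the per-patient risk and risk stratum that the nomogram represents; it does not draw the chart itself:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit the logistic regression model that underlies the nomogram
logit_model = LogisticRegression(max_iter=1000)
logit_model.fit(X_train, y_train)

# Extract the feature coefficients and the intercept
feature_names = X.columns
coefficients = logit_model.coef_[0]
intercept = logit_model.intercept_[0]

# Compute the predicted risk for one patient
# (patient_data must be on the same standardized scale as X_train)
def calculate_risk_score(patient_data):
    logit = intercept + np.dot(coefficients, patient_data)
    risk = 1 / (1 + np.exp(-logit))
    return risk

# Example: compute the risk for one patient
sample_patient = X_test[0]
risk = calculate_risk_score(sample_patient)
print(f"30-day mortality risk: {risk*100:.1f}%")

# Risk stratification table (cut-offs used in this example)
risk_categories = {
    'Low risk': (0, 0.2),
    'Intermediate risk': (0.2, 0.5),
    'High risk': (0.5, 1.0)
}

for category, (low, high) in risk_categories.items():
    # The last band also includes its upper bound, so a risk of exactly 1.0 is classified
    if low <= risk < high or (high == 1.0 and risk == 1.0):
        print(f"Risk stratum: {category}")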

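📌 Advantages of nomograms:

• Clinician-friendly: doctors can read a patient's risk straight off the chart, without calculation
• Widely published: a standard figure in clinical prediction model papers
• Visual: shows at a glance how much each feature contributes to the risk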
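To illustrate how a nomogram turns coefficients into a chart, the sketch below shows one common points convention: the predictor with the widest effect range spans 0 to 100 points, and the total points map back to a predicted risk. The coefficients, intercept, value ranges, and function names here are hypothetical and chosen only so the arithmetic can be followed by hand.

```python
import numpy as np

def nomogram_scale(coefs, x_min, x_max):
    """Return each predictor's reference value and the points-per-logit scale."""
    coefs = np.asarray(coefs, dtype=float)
    x_min = np.asarray(x_min, dtype=float)
    x_max = np.asarray(x_max, dtype=float)
    # Reference value: the end of the range with the lowest contribution (0 points).
    ref = np.where(coefs >= 0, x_min, x_max)
    # The predictor with the largest effect range is assigned 100 points.
    max_range = np.max(np.abs(coefs) * (x_max - x_min))
    return ref, 100.0 / max_range

def patient_points(x, coefs, ref, scale):
    """Points contributed by each predictor for one patient."""
    return np.asarray(coefs, dtype=float) * (np.asarray(x, dtype=float) - ref) * scale

def points_to_risk(total_points, intercept, coefs, ref, scale):
    """Map total points back to the predicted probability."""
    lp = intercept + np.dot(coefs, ref) + total_points / scale
    return 1 / (1 + np.exp(-lp))

# Hypothetical model: two predictors on their original scales.
coefs = np.array([0.05, 0.6])     # e.g. per year of age, per unit of creatinine
intercept = -5.0
x_min = np.array([40.0, 0.5])
x_max = np.array([90.0, 3.0])

ref, scale = nomogram_scale(coefs, x_min, x_max)

patient = np.array([70.0, 2.0])
points = patient_points(patient, coefs, ref, scale)
risk_from_points = points_to_risk(points.sum(), intercept, coefs, ref, scale)

# Cross-check against the logistic formula applied directly.
direct_risk = 1 / (1 + np.exp(-(intercept + np.dot(coefs, patient))))
print(points, risk_from_points, direct_risk)
```

Drawing the chart then amounts to plotting one labeled axis per predictor against the shared 0-100 points axis, plus a total-points axis aligned with the risk values from `points_to_risk`.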

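Step 6: Summarizing and exporting results

Collect all analysis results and generate a complete report:

import json
import numpy as np

# Rank features by absolute logistic regression coefficient from Step 5
# (comparable across features because they were standardized in Step 1)
top_idx = np.argsort(np.abs(coefficients))[::-1][:5]

# Summarize the results
report = {
    'model_performance': {
        'best_model': best_model_name,
        'auc': roc_auc,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'ppv': ppv,
        'npv': npv
    },
    'cv_results': results,
    'top_features': [feature_names[i] for i in top_idx]
}

# Save the report
with open('model_report.json', 'w') as f:
    json.dump(report, f, indent=2)

# Export the test-set predictions
predictions = pd.DataFrame({
    'actual': y_test.values,
    'predicted': y_pred,
    'probability': y_pred_proba
})
predictions.to_csv('predictions.csv', index=False)

print("Analysis complete. Files generated:")
print("- roc_curve.png")
print("- calibration_curve.png")
print("- shap_feature_importance.png")
print("- shap_summary.png")
print("- shap_force_plot.png")
print("- shap_dependence.png")
print("- model_report.json")
print("- predictions.csv")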

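📊 Essential Figures for a Clinical Prediction Model Paper

Figure	Purpose	Tool
ROC curve	Discrimination	scikit-learn
Calibration curve	Calibration	scikit-learn
DCA curve	Clinical net benefit	Custom function
Nomogram	Risk assessment tool	Logistic regression coefficients (scikit-learn or statsmodels)
SHAP plots	Feature importance	SHAP
Confusion matrix	Classification performance	scikit-learn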

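⚠️ Cautions

Overfitting risk

Evaluate model performance with cross-validation and validate on an independent test set to avoid overfitting the training data. High performance on the training set combined with low performance on the test set is the typical sign of overfitting.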
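The train/test gap described above can be seen on synthetic data. The sketch below is an illustration only: the dataset comes from `make_classification`, and a fully grown decision tree stands in for any overly flexible model. The helper name `train_test_auc` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so a perfect fit cannot generalize.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

def train_test_auc(model):
    """Fit on the training set and return (training AUC, test AUC)."""
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return train_auc, test_auc

# A fully grown tree memorizes the training data; a shallow tree is constrained.
deep = train_test_auc(DecisionTreeClassifier(random_state=0))
shallow = train_test_auc(DecisionTreeClassifier(max_depth=3, random_state=0))

print(f"Deep tree:    train AUC = {deep[0]:.3f}, test AUC = {deep[1]:.3f}")
print(f"Shallow tree: train AUC = {shallow[0]:.3f}, test AUC = {shallow[1]:.3f}")
```

A large difference between the two numbers for a given model is the warning sign; reporting only the training AUC would hide it.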

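Feature selection

Avoid feature selection bias. Feature selection should be performed inside the cross-validation loop, or on an independent validation set. Selecting features on the full dataset before validation leads to overly optimistic performance estimates.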
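One way to keep feature selection inside the cross-validation loop is to put the selector in a scikit-learn `Pipeline`, so it is refitted on the training portion of every fold. The sketch below uses synthetic data and `SelectKBest` with `k=5` as stand-ins; the selector and the value of `k` are assumptions made for illustration, not the tutorial's prescribed method.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 candidate predictors, only 5 of them informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=42
)

# Scaling, selection, and the model are fitted together within each fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_syn, y_syn, cv=cv, scoring="roc_auc")
print(f"AUC = {auc_scores.mean():.3f} (+/- {auc_scores.std():.3f})")
```

Running `SelectKBest` on all rows first and cross-validating only the classifier would let every fold's held-out data influence which features were kept, which is the bias described above.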

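Sample size requirements

As a rule of thumb, each candidate predictor needs at least 10-20 outcome events. For example, a model with 10 predictors needs at least 100-200 patients who experienced the event. Too few events make the model unstable.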
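The rule of thumb above reduces to simple arithmetic, sketched below. The function names and the 12.5% event rate are hypothetical and serve only as a worked example.

```python
import math

def max_predictors(n_events, epv=10):
    """Largest number of candidate predictors supported by the available events."""
    return n_events // epv

def required_sample_size(n_predictors, event_rate, epv=10):
    """Total patients needed so that the expected number of events meets the rule."""
    events_needed = n_predictors * epv
    return math.ceil(events_needed / event_rate)

# Hypothetical scenario: 10 predictors, 12.5% of patients experience the event.
print(required_sample_size(10, 0.125, epv=10))  # 100 events needed
print(required_sample_size(10, 0.125, epv=20))  # 200 events needed
print(max_predictors(120, epv=10))              # predictors supported by 120 events
```

The count that matters is the number of events, not the number of patients: a large cohort with a rare outcome can still support only a handful of predictors.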

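Model reporting standards

Report the study according to the TRIPOD statement (the reporting guideline for prediction model studies) to ensure transparency and reproducibility. This covers study design, participants, predictors, outcome, sample size, handling of missing data, model development, and more.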

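🔗 Related Skills

📊 Scientific data visualization

Previous issue: matplotlib and seaborn

🧬 Differential expression analysis

Next issue: PyDESeq2 + gget

📚 AI Research Skill Library

Back to the skill library home page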
