First, load the heart failure patient dataset, then clean and preprocess it:
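import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv('heart_failure.csv')

# Inspect the basic structure (info() prints directly and returns None)
data.info()
print(data.describe())

# Separate features and label
X = data.drop(columns=['death_30d'])
y = data['death_30d']

# Split first, so that imputation and scaling statistics
# are learned from the training set only (avoids test-set leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Impute missing values with the training-set median
# (this assumes every feature column is numeric)
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize using the training-set mean and standard deviation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"Training set: {X_train.shape}, test set: {X_test.shape}")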
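💡 Key points for data preprocessing:
- Missing values: impute continuous variables with the median and categorical variables with the mode. The code above applies the median to every column, which assumes all features are numeric.
- Standardization: variables measured on different scales (such as age and creatinine level) need to be standardized.
- Stratified sampling: the stratify parameter keeps the label distribution consistent between the training set and the test set.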
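The three points above can be combined in one preprocessing object. The following is a minimal sketch, not the original workflow: it uses a small synthetic table in place of heart_failure.csv, and the column names age, creatinine, and diabetes are hypothetical (only death_30d comes from the text). It imputes continuous columns with the median, a binary categorical column with the mode, scales only the continuous columns, and checks that the stratified split keeps the event rate similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; all column names except death_30d are hypothetical
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "age": rng.normal(70, 10, n),
    "creatinine": rng.normal(1.2, 0.4, n),
    "diabetes": rng.choice([0.0, 1.0], n),
    "death_30d": rng.choice([0, 1], n, p=[0.8, 0.2]),
})
# Blank out some values to simulate missing data
demo.loc[rng.choice(n, 20, replace=False), "creatinine"] = np.nan
demo.loc[rng.choice(n, 20, replace=False), "diabetes"] = np.nan

X = demo.drop(columns=["death_30d"])
y = demo["death_30d"]

# Stratified split: both sets keep roughly the same share of events
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

numeric_cols = ["age", "creatinine"]
categorical_cols = ["diabetes"]

preprocess = ColumnTransformer([
    # Continuous variables: median imputation, then standardization
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical variable: mode imputation, no scaling
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# Fit on the training set only, then apply the same transform to the test set
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)

print(f"Event rate: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
print(f"Shapes: train={X_train_prep.shape}, test={X_test_prep.shape}")
```

Fitting the transformer on the training set and only calling transform on the test set keeps test-set statistics out of the medians, modes, and scaling parameters.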