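Heart Failure Prediction with KNN

Contents:
I. The KNN Algorithm
II. Task Description
III. Implementation

I. The KNN Algorithm

KNN (K-Nearest Neighbor) is a supervised machine learning algorithm that can be used for both classification and regression. Its core idea is to assign a new sample to a class based on the classes of the K training samples closest to it.

In the accompanying figure, when K = 3 the three samples nearest to the query point contain more blue triangles than red circles, so the query is classified as a blue triangle. When K = 5, red circles are the majority among the five nearest samples, so KNN classifies the query as a red circle.

II. Task Description

Goal: build a machine learning model that predicts whether a patient has heart disease from clinical and physiological indicators.

Input features:
Basic information: age (Age), sex (Sex).
Vital signs and lab values: resting blood pressure (RestingBP), cholesterol (Cholesterol), maximum heart rate (MaxHR), fasting blood sugar (FastingBS).
Symptoms and ECG: chest pain type (ChestPainType), exercise-induced angina (ExerciseAngina), resting ECG (RestingECG), ST depression (Oldpeak), ST segment slope (ST_Slope).

III. Implementation

1️⃣ Data analysis

Start by loading the data and looking at the first rows:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("heart.csv")
    print(df.head())

Output:

       Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR ExerciseAngina  Oldpeak ST_Slope  HeartDisease
    0   40   M           ATA        140          289          0     Normal    172              N      0.0       Up             0
    1   49   F           NAP        160          180          0     Normal    156              N      1.0     Flat             1
    2   37   M           ATA        130          283          0         ST     98              N      0.0       Up             0
    3   48   F           ASY        138          214          0     Normal    108              Y      1.5     Flat             1
    4   54   M           NAP        150          195          0     Normal    122              N      0.0       Up             0

1.1 Plot the class distribution

Use seaborn's countplot with HeartDisease on the x axis to see how many records fall into each class:

    sns.countplot(x="HeartDisease", data=df, palette="magma")
    plt.title("Distribution of Heart Disease")
    plt.show()

1.2 Plot the age distribution

Use seaborn's histplot with Age on the x axis to draw a histogram with a kernel density estimate overlaid:

    sns.histplot(data=df, x="Age", kde=True, color="#3B0F70", alpha=0.7)
    plt.title("Age Distribution")
    plt.show()

1.3 Plot a heatmap of correlations between features

Use seaborn's heatmap. First select the numeric columns, because the Pearson correlation coefficient can only be computed for numeric data:

    numeric_only = df.select_dtypes(include=["number"])  # keep only numeric columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(numeric_only.corr(), annot=True, cmap="magma", fmt=".2f", linewidths=0.5)
    plt.title("Correlation Matrix")
    plt.show()

2️⃣ Data processing and training

2.1 Process the data

Use pd.get_dummies(df, drop_first=True) to one-hot encode the data, which converts categorical variables into a numeric form that machine learning models can work with. drop_first=True drops the first dummy column generated for each categorical variable in order to avoid multicollinearity.

    df = pd.get_dummies(df, drop_first=True)

Result (shown with the integer cast described below already applied):

       Age  RestingBP  Cholesterol  FastingBS  MaxHR  Oldpeak  HeartDisease  Sex_M
    0   40        140          289          0    172        0             0      1
    1   49        160          180          0    156        1             1      0
    2   37        130          283          0     98        0             0      1
    3   48        138          214          0    108        1             1      0
    4   54        150          195          0    122        0             0      1

       ChestPainType_ATA  ChestPainType_NAP  ChestPainType_TA  RestingECG_Normal  RestingECG_ST  ExerciseAngina_Y  ST_Slope_Flat  ST_Slope_Up
    0                  1                  0                 0                  1              0                 0              0            1
    1                  0                  1                 0                  1              0                 0              1            0
    2                  1                  0                 0                  0              1                 0              0            1
    3                  0                  0                 0                  1              0                 1              1            0
    4                  0                  1                 0                  1              0                 0              0            1

The original RestingECG column is encoded as two columns, RestingECG_Normal and RestingECG_ST (its first category, LVH, is the one dropped). Likewise, the original Sex column contained M and F; with drop_first only the 0/1 column Sex_M remains.

Next, cast all columns to int:

    df = df.astype(int)

Note that this cast truncates the fractional part of Oldpeak: the value 1.5 in row 3 becomes 1, as the table above shows.

2.2 Train the model

Split the data, using the HeartDisease column as the label and all remaining columns as features:

    from sklearn.model_selection import train_test_split

    X = df.drop("HeartDisease", axis=1)
    y = df["HeartDisease"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )

Use grid search to find the best hyperparameters for the KNN model:

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    knn_pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ])
    knn_param_grid = {
        "knn__n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
        "knn__weights": ["uniform", "distance"],
        "knn__metric": ["minkowski", "euclidean", "manhattan"],
    }
    knn_grid = GridSearchCV(
        estimator=knn_pipeline, param_grid=knn_param_grid, cv=5, scoring="accuracy"
    )

The Pipeline bundles preprocessing (standardization) and model training (K-nearest-neighbor classification) into a single workflow that can be fitted in one call. Hyperparameters to search are named using scikit-learn's double-underscore convention for pipelines: step name, two underscores, then the estimator's own parameter name. Here knn__n_neighbors lists the candidate values of K, knn__weights selects how neighbors' votes are weighted, and knn__metric selects the distance metric. With the default p=2, minkowski is the same as the Euclidean distance, so those two candidates are equivalent.

With the data prepared and the search configured, run the training:

    knn_grid.fit(X_train, y_train)
    print("Best KNN parameters:", knn_grid.best_params_)
    print("Best KNN cross validation accuracy:", knn_grid.best_score_)

This prints the best hyperparameters and the best mean cross-validation accuracy:

    Best KNN parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 21, 'knn__weights': 'uniform'}
    Best KNN cross validation accuracy: 0.8677245318946365

2.3 Predict on the test set

    from sklearn.metrics import accuracy_score

    knn_y_pred = knn_grid.predict(X_test)
    accuracy = accuracy_score(y_test, knn_y_pred)
    print("Test Accuracy:", accuracy)

Test result:

    Test Accuracy: 0.9

Complete code

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
    import warnings

    warnings.filterwarnings("ignore")

    # 1. Loading Data
    df = pd.read_csv("heart.csv")

    # 2. Exploratory Data Analysis (EDA)
    print(df.head())

    sns.countplot(x="HeartDisease", data=df, palette="magma")
    plt.title("Distribution of Heart Disease")
    plt.show()

    sns.histplot(data=df, x="Age", kde=True, color="#3B0F70", alpha=0.7)
    plt.title("Age Distribution")
    plt.show()

    numeric_only = df.select_dtypes(include=["number"])
    plt.figure(figsize=(12, 8))
    sns.heatmap(numeric_only.corr(), annot=True, cmap="magma", fmt=".2f", linewidths=0.5)
    plt.title("Correlation Matrix")
    plt.show()

    # 3. Data Preprocessing & Feature Engineering
    df = pd.get_dummies(df, drop_first=True)
    print(df)
    df = df.astype(int)
    print(df)

    X = df.drop("HeartDisease", axis=1)
    y = df["HeartDisease"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )

    # 4. Model Building & Hyperparameter Tuning
    knn_pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ])
    knn_param_grid = {
        "knn__n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
        "knn__weights": ["uniform", "distance"],
        "knn__metric": ["minkowski", "euclidean", "manhattan"],
    }
    knn_grid = GridSearchCV(
        estimator=knn_pipeline, param_grid=knn_param_grid, cv=5, scoring="accuracy"
    )
    knn_grid.fit(X_train, y_train)
    print("Best KNN parameters:", knn_grid.best_params_)
    print("Best KNN cross validation accuracy:", knn_grid.best_score_)

    # 5. Evaluation on the held-out test set
    knn_y_pred = knn_grid.predict(X_test)
    accuracy = accuracy_score(y_test, knn_y_pred)
    print("Test Accuracy:", accuracy)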
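
Appendix: the KNN voting rule in plain Python

To make the voting rule from the KNN section concrete, here is a minimal sketch that classifies a query point by majority vote among its K nearest neighbors. It is not how scikit-learn implements KNeighborsClassifier; the function name knn_predict and the six toy points are made up for illustration and do not come from heart.csv. The points are placed so that the prediction flips between K = 3 and K = 5, mirroring the triangle and circle example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for this illustration (hypothetical, not from heart.csv).
points = [(0.5, 0.0), (0.0, 0.6), (0.7, 0.0), (0.0, 1.0), (-1.1, 0.0), (3.0, 3.0)]
labels = ["triangle", "triangle", "circle", "circle", "circle", "triangle"]
query = (0.0, 0.0)

pred_k3 = knn_predict(points, labels, query, k=3)
pred_k5 = knn_predict(points, labels, query, k=5)
print("K=3:", pred_k3)  # two triangles vs one circle among the 3 nearest
print("K=5:", pred_k5)  # two triangles vs three circles among the 5 nearest
```

With K = 3 the nearest neighbors are two triangles and one circle, so the query is labeled triangle; with K = 5 two more circles enter the neighborhood and the vote flips to circle. This sensitivity to K is the reason the tutorial searches over several values of n_neighbors rather than fixing one.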