Authors: Natasha Latysheva; Charles Ravarani
Creating the dataset
```python
# e.g. make_moons generates crescent-shaped data
# Check out make_classification, which generates linearly-separable data
from sklearn.datasets import make_moons

X, y = make_moons(
    n_samples=500,  # the number of observations
    random_state=1,
    noise=0.3,
)
```

Take a peek at the first ten observations and their labels:
```
[[ 0.50316464  0.11135559]
 [ 1.06597837 -0.63035547]
 [ 0.95663377  0.58199637]
 [ 0.33961202  0.40713937]
 [ 2.17952333 -0.08488181]
 [ 2.00520942  0.7817976 ]
 [ 0.12531776 -0.14925731]
 [ 1.06990641  0.36447753]
 [-0.76391099 -0.6136396 ]
 [ 0.55678871  0.8810501 ]]
[1 1 0 0 1 1 1 0 0 0]
```

The dataset you just generated can be visualized as follows:
```python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Jupyter magic so the plots appear inline in notebooks
# (kept on its own line, since a trailing comment is passed to the magic as an argument)
%matplotlib inline

# Plot the first feature against the other, color by class
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="#EE3D34", marker="x")
plt.scatter(X[y == 0, 0], X[y == 0, 1], color="#4458A7", marker="o")
```
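```python
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1, test_size=0.5)
```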
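Use a K Nearest Neighbors (KNN) classifier to predict the class of each observation. Chapter 2 of the ISLR book gives a good introduction to the theory behind KNN (I am a big fan of that book). You can also have a look at our previous post.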
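If K is high (k=99), the model considers a large number of neighbors when deciding the class of an unseen data point. This means the model is quite constrained, because it has to take a lot of information into account when classifying an instance. In other words, a large k leads to fairly "rigid" model behavior.

Conversely, if k is low (k=1 or k=2), only a few neighbors are considered when making the classification decision. This is a very flexible and complex model that can fit the exact shape of the data perfectly. Its predictions therefore depend much more on local trends in the data (crucially, including noise).

Let's look at how KNN classifies the data with k=99 and with k=1. The green line is the decision boundary learned from the training data (the threshold in the algorithm that decides whether a data point belongs to the blue or the red class).

![](http://i.imgur.com/iUFQFPw.jpg)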
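The contrast between a rigid and a flexible KNN model can also be checked numerically. The following is a minimal sketch, assuming the same `make_moons` settings used when creating the dataset; the names `X_demo`, `y_demo` and `train_error` are illustrative and do not come from the original tutorial. It fits both models on the whole demo dataset and reports the training error only.

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Assumption: same generator settings as the dataset created earlier in the post
X_demo, y_demo = make_moons(n_samples=500, random_state=1, noise=0.3)

train_error = {}
for k in (1, 99):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_demo, y_demo)
    # score() returns the mean accuracy; the error is its complement
    train_error[k] = 1 - knn.score(X_demo, y_demo)
    print("k={}: training error = {:.3f}".format(k, train_error[k]))
```

With k=1 every training point is its own nearest neighbor, so the training error is zero no matter how noisy the data are. The k=99 model cannot bend around individual points and therefore leaves some training points misclassified.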
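So far you have only looked at the training data, but quantifying the training error is not very useful. You are not interested in how well the model can reproduce the training set it has just learned. Let's see how the models perform on the test set instead, since this gives a much better impression of how good they are. Try a few different values of K:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn99 = KNeighborsClassifier(n_neighbors=99)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)
# round after subtracting to avoid floating-point artifacts such as 0.15000000000000002
print("Overall Error of k=99 Model:", round(1 - metrics.accuracy_score(yTest, yPredK99), 2))

knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(XTrain, yTrain)
yPredK1 = knn1.predict(XTest)
print("Overall Error of k=1 Model:", round(1 - metrics.accuracy_score(yTest, yPredK1), 2))
```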
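```
Overall Error of k=99 Model: 0.15
Overall Error of k=1 Model: 0.15
```

In fact, the two models seem to perform about equally well on the test set. Below are the decision boundaries learned from the training set, applied to the test set. See if you can spot where the two models make wrong predictions.

![](http://i.imgur.com/IbDmgAf.jpg)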
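One way to follow up on that question is to list which test points each model gets wrong. This is only a sketch, assuming the same data and split as above; the dictionary `wrong` and the set `shared` are illustrative names, not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: same data and split as in the post
X, y = make_moons(n_samples=500, random_state=1, noise=0.3)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1, test_size=0.5)

wrong = {}
for k in (1, 99):
    model = KNeighborsClassifier(n_neighbors=k).fit(XTrain, yTrain)
    # indices of the test observations this model misclassifies
    wrong[k] = set(np.flatnonzero(model.predict(XTest) != yTest))

shared = wrong[1] & wrong[99]
print("k=1 misclassifies {} test points".format(len(wrong[1])))
print("k=99 misclassifies {} test points".format(len(wrong[99])))
print("{} points are misclassified by both models".format(len(shared)))
```

Two models can reach a similar overall error while failing on different observations, which is worth keeping in mind when comparing them by a single number.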
```python
knn50 = KNeighborsClassifier(n_neighbors=50)
knn50.fit(XTrain, yTrain)
yPredK50 = knn50.predict(XTest)
# round after subtracting to avoid floating-point artifacts in the printed value
print("Overall Error of k=50 Model:", round(1 - metrics.accuracy_score(yTest, yPredK50), 2))
```

```
Overall Error of k=50 Model: 0.11
```
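In general, when you train a machine learning algorithm on a dataset, what you care about is how the model performs on an independent set of data. Classifying the training set well is not enough. Essentially, you only care about building models that generalize: getting 100% accuracy on the training set is not impressive, it is merely an indicator of overfitting. Overfitting is the situation in which the model is fitted too tightly and is tuned to the noise instead of the signal.

To put it more clearly, you are not modelling the trends in the dataset. You are trying to model the real-world process that gave rise to the data. The specific dataset you happen to be working with is just a small sample of the underlying truth, and it contains noise and peculiarities of its own.

The following summary figure shows how underfitted (high bias, low variance), properly fitted, and overfitted (low bias, high variance) models perform on the training and test sets:

![](http://i.imgur.com/hLSW8aJ.jpg)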
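The "variance" in this picture refers to how much a model's predictions change when it is trained on a different sample from the same process. The sketch below makes that tangible under some assumptions: repeated calls to `make_moons` with different seeds stand in for fresh samples of the real-world process, and the helper `prediction_variance` and the evaluation set `X_eval` are hypothetical names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Fixed evaluation points, drawn once (assumption: same noise level as in the post)
X_eval, _ = make_moons(n_samples=200, random_state=0, noise=0.3)

def prediction_variance(k, n_repeats=20):
    """Mean variance of the predicted label at each evaluation point,
    across models trained on independently drawn training sets."""
    preds = []
    for seed in range(1, n_repeats + 1):
        X_tr, y_tr = make_moons(n_samples=250, random_state=seed, noise=0.3)
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        preds.append(model.predict(X_eval))
    preds = np.array(preds)          # shape: (n_repeats, n_eval_points)
    return preds.var(axis=0).mean()  # variance over repeats, averaged over points

var_k1 = prediction_variance(1)
var_k99 = prediction_variance(99)
print("mean prediction variance, k=1:  {:.3f}".format(var_k1))
print("mean prediction variance, k=99: {:.3f}".format(var_k99))
```

The flexible k=1 model should change its mind about individual points far more often than the rigid k=99 model, which is the high-variance behavior associated with overfitting.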
Note: in practice, when scanning a parameter like this, testing the model on the training set is a bad idea. In the same way, you should not use the test set repeatedly to scan a parameter (once per parameter value). The calculations below are done this way only as an example. In practice, only K-fold cross-validation is a safe approach!

```python
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split, cross_val_score

knn = KNeighborsClassifier()

# the range of number of neighbors you want to test
n_neighbors = np.arange(1, 141, 2)

# here you store the errors for each dataset used
train_scores = list()
test_scores = list()
cv_scores = list()

# loop through possible n_neighbors and try them out
for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    # this will over-estimate the accuracy
    train_scores.append(1 - metrics.accuracy_score(yTrain, knn.predict(XTrain)))
    test_scores.append(1 - metrics.accuracy_score(yTest, knn.predict(XTest)))
    # you take the mean of the CV scores
    cv_scores.append(1 - cross_val_score(knn, XTrain, yTrain, cv=10).mean())

# what do these different datasets think is the best value of k?
# np.argmin returns the first minimum, i.e. the smallest k among ties
print(
    'The best values of k are: \n'
    '{} according to the Training Set\n'
    '{} according to the Test Set and\n'
    '{} according to Cross-Validation'.format(
        n_neighbors[np.argmin(train_scores)],
        n_neighbors[np.argmin(test_scores)],
        n_neighbors[np.argmin(cv_scores)],
    )
)
```
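```
The best values of k are: 
1 according to the Training Set
23 according to the Test Set and
11 according to Cross-Validation
```

Rather than only collecting the best k, you should also look at the prediction error across the whole range of K values tested.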
```python
# let's plot the error you get with different values of k
plt.figure(figsize=(10, 7.5))
plt.plot(n_neighbors, train_scores, c="black", label="Training Set")
plt.plot(n_neighbors, test_scores, c="black", linestyle="--", label="Test Set")
plt.plot(n_neighbors, cv_scores, c="green", label="Cross-Validation")
plt.xlabel('Number of K Nearest Neighbors')
plt.ylabel('Classification Error')
plt.gca().invert_xaxis()
plt.legend(loc="lower left")
plt.show()
```
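The note above recommends K-fold cross-validation as the safe way to scan a parameter. A minimal sketch of that workflow follows, assuming the same data and split as in this post and using scikit-learn's `GridSearchCV`, which the original tutorial does not use; it is swapped in here only to illustrate the idea. The value of k is chosen from the training set alone, and the test set is evaluated once at the end.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: same data and split as in the post
X, y = make_moons(n_samples=500, random_state=1, noise=0.3)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1, test_size=0.5)

# Scan k with 10-fold cross-validation on the training set only
param_grid = {"n_neighbors": list(range(1, 141, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(XTrain, yTrain)

best_k = search.best_params_["n_neighbors"]
cv_error = 1 - search.best_score_

# The test set is touched exactly once, after k has been chosen.
# With the default refit=True, score() uses the best model refit on all of XTrain.
test_error = 1 - search.score(XTest, yTest)

print("best k by cross-validation:", best_k)
print("cross-validation error: {:.3f}".format(cv_error))
print("test error at that k: {:.3f}".format(test_error))
```

Because the test set plays no part in choosing k, the final test error remains an honest estimate of how the selected model generalizes.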
The helper functions below cover the dataset splitting, algorithm fitting, and testing steps of the workflow, together with the plotting of the decision boundaries.

```python
def detect_plot_dimension(X, h=0.02, b=0.05):
    x_min, x_max = X[:, 0].min() - b, X[:, 0].max() + b
    y_min, y_max = X[:, 1].min() - b, X[:, 1].max() + b
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    dimension = xx, yy
    return dimension


def detect_decision_boundary(dimension, model):
    xx, yy = dimension  # unpack the dimensions
    boundary = model.predict(np.c_[xx.ravel(), yy.ravel()])
    boundary = boundary.reshape(xx.shape)  # Put the result into a color plot
    return boundary


def plot_decision_boundary(panel, dimension, boundary, colors=['#DADDED', '#FBD8D8']):
    xx, yy = dimension  # unpack the dimensions
    panel.contourf(xx, yy, boundary, cmap=ListedColormap(colors), alpha=1)
    # the decision boundary in green
    panel.contour(xx, yy, boundary, colors="g", alpha=1, linewidths=0.5)


def plot_dataset(panel, X, y, colors=["#EE3D34", "#4458A7"], markers=["x", "o"]):
    panel.scatter(X[y == 1, 0], X[y == 1, 1], color=colors[0], marker=markers[0])
    panel.scatter(X[y == 0, 0], X[y == 0, 1], color=colors[1], marker=markers[1])


def calculate_prediction_error(model, X, y):
    yPred = model.predict(X)
    score = 1 - round(metrics.accuracy_score(y, yPred), 2)
    return score


def plot_prediction_error(panel, dimension, score, b=.3):
    xx, yy = dimension  # unpack the dimensions
    panel.text(xx.max() - b, yy.min() + b, ('%.2f' % score).lstrip('0'),
               size=15, horizontalalignment='right')


def explore_fitting_boundaries(model, n_neighbors, datasets, width):
    # determine the height of the plot given the aspect ratio of each panel should be equal
    height = float(width) / len(n_neighbors) * len(datasets.keys())

    nrows = len(datasets.keys())
    ncols = len(n_neighbors)

    # set up the plot
    figure, axes = plt.subplots(nrows, ncols, figsize=(width, height),
                                sharex=True, sharey=True)

    # the dimension of each subplot, based on the full dataset X defined earlier
    dimension = detect_plot_dimension(X, h=0.02)

    # Plotting the dataset and decision boundaries
    i = 0
    for n in n_neighbors:
        model.n_neighbors = n
        model.fit(datasets["Training Set"][0], datasets["Training Set"][1])
        boundary = detect_decision_boundary(dimension, model)
        j = 0
        for d in datasets.keys():
            try:
                panel = axes[j, i]
            except (TypeError, IndexError):
                if (nrows * ncols) == 1:
                    panel = axes
                elif nrows == 1:  # if you only have one dataset
                    panel = axes[i]
                elif ncols == 1:  # if you only try one number of neighbors
                    panel = axes[j]
            plot_decision_boundary(panel, dimension, boundary)  # plot the decision boundary
            plot_dataset(panel, X=datasets[d][0], y=datasets[d][1])  # plot the observations
            score = calculate_prediction_error(model, X=datasets[d][0], y=datasets[d][1])
            plot_prediction_error(panel, dimension, score, b=0.2)  # plot the score

            # make compacted layout
            panel.set_frame_on(False)
            panel.set_xticks([])
            panel.set_yticks([])

            # format the axis labels
            if i == 0:
                panel.set_ylabel(d)
            if j == 0:
                panel.set_title('k={}'.format(n))
            j += 1
        i += 1
    plt.subplots_adjust(hspace=0, wspace=0)  # make compacted layout
```
```python
# specify the model and settings
model = KNeighborsClassifier()
n_neighbors = [200, 99, 50, 23, 11, 1]
datasets = {
    "Training Set": [XTrain, yTrain],
    "Test Set": [XTest, yTest],
}
width = 20

# explore_fitting_boundaries(model, n_neighbors, datasets, width)
explore_fitting_boundaries(model=model, n_neighbors=n_neighbors, datasets=datasets, width=width)
```