predict and predict_proba in sklearn

About predict() and predict_proba() in scikit-learn

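Explanation

predict() returns the predicted class label of each sample.

predict_proba() returns, for each sample, the predicted probability of belonging to each class. The columns are ordered as in the classifier's classes_ attribute, which holds the class labels in sorted order.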
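To make the relationship between the two methods concrete, here is a minimal sketch in plain Python. It assumes the classifier is LogisticRegression, where the predicted label is the class with the highest probability. The probability rows are copied from the predict_proba() output in the session in the Code section, and `labels_from_proba` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# Class labels in sorted order, which is how scikit-learn stores them in clf.classes_
classes = [1, 2, 3]

# Probability rows copied from the predict_proba() output in the session
proba = [
    [0.14674168, 0.07374653, 0.77951179],
    [0.19197241, 0.12085576, 0.68717183],
]


def labels_from_proba(proba_rows, class_labels):
    """Hypothetical helper: pick the label whose column has the highest probability."""
    predicted = []
    for row in proba_rows:
        # Index of the largest probability in this row
        best_column = max(range(len(row)), key=row.__getitem__)
        predicted.append(class_labels[best_column])
    return predicted


predicted = labels_from_proba(proba, classes)
print(predicted)  # [3, 3]
```

Both rows have their largest value in the third column, which corresponds to class 3, so the result agrees with what predict() returns in the session.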

Code

In [1]: import numpy as np
In [2]: from sklearn.linear_model import LogisticRegression
In [3]: from sklearn.model_selection import train_test_split
# Generate random data, 10 x 3
In [4]: data = np.round(np.random.randn(10, 3), 2)
In [5]: data
Out[5]:
array([[-0.03, -0.31,  0.1 ],
       [-0.88,  0.24, -1.11],
       [-0.97,  2.23,  0.37],
       [ 0.2 , -0.49,  0.32],
       [-0.35,  1.57,  0.49],
       [-0.05,  2.82, -0.06],
       [-1.37, -1.01, -0.42],
       [-1.23, -0.63,  0.28],
       [-0.31,  0.22, -0.11],
       [-1.15,  0.42,  0.94]])
# Split into a training set and a test set
In [6]: x_train, x_test = train_test_split(data, test_size=0.2)
In [7]: x_train
Out[7]:
array([[-0.97,  2.23,  0.37],
       [-1.15,  0.42,  0.94],
       [-0.31,  0.22, -0.11],
       [-1.37, -1.01, -0.42],
       [-0.88,  0.24, -1.11],
       [-1.23, -0.63,  0.28],
       [ 0.2 , -0.49,  0.32],
       [-0.03, -0.31,  0.1 ]])
In [8]: x_test
Out[8]:
array([[-0.05,  2.82, -0.06],
       [-0.35,  1.57,  0.49]])
# Simulate class labels for the 8 training samples
In [9]: y_train = np.array([3, 3, 3, 2, 2, 2, 1, 1])
In [10]: clf = LogisticRegression()
In [11]: clf.fit(x_train, y_train)
Out[11]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# predict() gives the predicted class of each test sample
In [12]: clf.predict(x_test)
Out[12]: array([3, 3])
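# predict_proba() gives, for each test sample, the probability of each class. The columns follow clf.classes_, which is sorted, so here they are classes 1, 2, 3 (not 3, 2, 1)
# For example, the first row says the first test sample has probability 0.14674168 of being class 1, 0.07374653 of being class 2, and 0.77951179 of being class 3; the second row gives the probabilities of classes 1, 2, 3 for the second test sample as 0.19197241, 0.12085576, 0.68717183. The largest value in each row is in the class 3 column, which matches the predict() output above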
In [13]: clf.predict_proba(x_test)
Out[13]:
array([[0.14674168, 0.07374653, 0.77951179],
       [0.19197241, 0.12085576, 0.68717183]])
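The column order can be checked directly rather than guessed. The sketch below reuses the training and test arrays printed in the session and inspects clf.classes_. It assumes a recent scikit-learn release, where the default solver differs from the liblinear one-vs-rest setup shown in the session, so the exact probabilities may not match the numbers above. The variable name `labels_from_argmax` is introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training and test data copied from the session above
x_train = np.array([
    [-0.97, 2.23, 0.37],
    [-1.15, 0.42, 0.94],
    [-0.31, 0.22, -0.11],
    [-1.37, -1.01, -0.42],
    [-0.88, 0.24, -1.11],
    [-1.23, -0.63, 0.28],
    [0.2, -0.49, 0.32],
    [-0.03, -0.31, 0.1],
])
y_train = np.array([3, 3, 3, 2, 2, 2, 1, 1])
x_test = np.array([
    [-0.05, 2.82, -0.06],
    [-0.35, 1.57, 0.49],
])

clf = LogisticRegression()
clf.fit(x_train, y_train)

# Column i of predict_proba() corresponds to clf.classes_[i]
print(clf.classes_)  # [1 2 3]

proba = clf.predict_proba(x_test)

# Map the highest-probability column of each row back to its class label
labels_from_argmax = clf.classes_[np.argmax(proba, axis=1)]
print(labels_from_argmax)
print(clf.predict(x_test))
```

The last two printed arrays should be identical, since for LogisticRegression the predicted label is the class with the highest probability.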