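The runtime is an Anaconda-managed TensorFlow environment, which is easy to manage and will be reused later for TensorFlow RNNs (recurrent neural networks). It also uses LIBSVM, already configured in the previous post.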

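Step 1: Preparation and module installation

1. Installing modules

After activating the environment, install the mpl_finance module:

(python36) $ pip install https://github.com/matplotlib/mpl_finance/archive/master.zip

See the mpl_finance documentation for the functions the module provides.

Install the tushare module:

(python36) $ pip install tushare

This install failed for me. The error message named the missing modules, so I installed each one as prompted:

(python36) $ pip3 install module_name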

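2. Visualizing the data

Use the tushare interface get_h_data() to fetch price-adjusted stock data for a date range. I am working in Jupyter, where lines starting with '%' are magic commands.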

import numpy as np
import tushare as ts
import matplotlib.pyplot as plt
# render figures inline in the notebook
%matplotlib inline
# emit figures as SVG vector graphics
%config InlineBackend.figure_format = 'svg'
data = ts.get_h_data('002337', start='2015-01-01', end='2015-12-16')  # forward-adjusted prices between the two dates
plt.plot(data['close'])

[Figure: line chart of the closing price]

Besides the line chart, let's add a candlestick (K-line) chart, moving averages, and volume. First we need talib to compute the moving averages.

(python36) $ pip install talib

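The installation failed with the following error:
[Screenshot: talib installation error]
Solution: https://stackoverflow.com/questions/49648391/how-to-install-ta-lib-in-google-colab
Importing matplotlib.pyplot also raised an error:
[Screenshot: matplotlib.pyplot import error]
Solution:
https://stackoverflow.com/questions/31373163/anaconda-runtime-error-python-is-not-installed-as-a-framework/41433353
Calls to tushare's get_k_data often fail with connection timeouts because the default retry_count is too low, so we call get_k_data(…, retry_count=10).
The complete code: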

import numpy as np
import tushare as ts
import matplotlib.pyplot as plt
import mpl_finance as mpf
import talib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
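# retry_count=10 guards against connection timeouts; the end date is 06-30 because June has only 30 days
data = ts.get_k_data('399300', index=True, start='2017-01-01', end='2017-06-30', retry_count=10)
# compute the 10-day and 30-day moving averages
sma_10 = talib.SMA(np.array(data['close']), 10)
sma_30 = talib.SMA(np.array(data['close']), 30)
# create a figure with two subplots sharing the x-axis; returns one Figure and two Axes objects
fig, (ax, ax2) = plt.subplots(2, 1, sharex=True)
# optionally set the proportions of the two axes; otherwise the candlestick and volume subplots are the same size
# ax = fig.add_axes([0,0.2,1,0.5])
# ax2 = fig.add_axes([0,0,1,0.2])
# first axis ax: candlesticks plus the 10-day and 30-day moving averages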
mpf.candlestick2_ochl(ax, data['open'], data['close'], data['high'], data['low'], 
                      width=0.5, colorup='r', colordown='g', alpha=0.6)
ax.set_xticklabels(data['date'][::10])
ax.plot(sma_10, label='10 MA')
ax.plot(sma_30, label='30 MA') #Moving Average
ax.legend(loc='upper left')
ax.grid(True)
# second axis ax2: volume bars
mpf.volume_overlay(ax2, data['open'], data['close'], data['volume'], colorup='r', colordown='g', width=0.5, alpha=0.8)
ax2.set_xticks(range(0, len(data['date']), 10))
ax2.set_xticklabels(data['date'][::10], rotation=30)
ax2.grid(True)
plt.subplots_adjust(hspace=0)
plt.show()

[Figure: candlestick chart with 10-day and 30-day moving averages, and volume below]
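
If talib refuses to install, the simple moving average itself is easy to reproduce. Below is a minimal pure-Python sketch of what talib.SMA computes for a given window. The function name `simple_moving_average` and the sample prices are made up for illustration. Like talib.SMA, it leaves the first window-1 positions as NaN.

```python
import math

def simple_moving_average(values, window):
    """Return a list as long as `values`; the first window-1 entries are NaN."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v                      # add the newest value
        if i >= window:
            running -= values[i - window] # drop the value that left the window
        out.append(running / window if i >= window - 1 else math.nan)
    return out

closes = [10.0, 11.0, 12.0, 13.0, 14.0]   # invented closing prices
sma_3 = simple_moving_average(closes, 3)
print(sma_3)
```

The result can be passed to ax.plot() in the same way as the talib output.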

Step 2: Data processing

3. Fetch two years of data for one stock through tushare and save it locally.

df = ts.get_h_data('002337', start='2015-01-01', end='2017-01-01')
df.to_csv('/Users/wangxiaobin/Documents/git&hexo_blog/source/_posts/股价绘图以及SVM-SVR策略/002337.csv')

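4. Process the data with pandas: extract features, then label each row.

A word on feature extraction: I picked the features pretty much by gut feeling (￣▽￣): amt, volume, and amt/volume. A row is labeled 1 when (high of the third day - low of the first day) / open of the first day >= 0.02, and -1 otherwise. Note that the code below is a rougher variant of this rule: it takes the high from the adjacent row, divides by the close, and uses a threshold of 0.03. The guiding idea comes from William O'Neil, the stock and futures investor: a stock's trading volume measures the balance between supply and demand for it. When the price starts heading for a recent high, ready to step up a level, volume should be at least 50% above the average daily volume of the past few months.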

import pandas as pd
import os

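def loadDataSetByPandas(filesname):
	'''Read CSV files into a single DataFrame.
	Args:
		filesname: list of CSV file paths
	Returns:
		one DataFrame with the rows of all files, excluding rows whose amount or volume is 0
	'''
	# header=0 takes the column names from the first row; ignore_index=True renumbers the merged rows.
	# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
	frames = [pd.read_csv(f, header=0) for f in filesname]
	dataFrame = pd.concat(frames, ignore_index=True)
	# drop rows where amount or volume is 0 (treated as missing data); copy so later column assignments are safe
	dataFrame = dataFrame[(dataFrame['amount'] > 0) & (dataFrame['volume'] > 0)].copy()
	return dataFrame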

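def listcsvFiles(_filepath):
	'''List the .csv files under a path.
	Args:
		_filepath: the directory to search (subdirectories included)
	Returns:
		a list of paths to the .csv files found under the directory
	'''
	filecsv_list = []
	os.chdir(_filepath)    # side effect: later relative paths resolve against this directory
	for root, _dirs, files in os.walk(_filepath):
		for f in files:
			if os.path.splitext(f)[1] == '.csv':
				# join with root so files in subdirectories can still be opened
				filecsv_list.append(os.path.join(root, f))
	return filecsv_list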


def count(_dataFrame):
	'''Count the rows labeled -1 and 1 (for good classification the two counts should be roughly equal).
	Args:
		_dataFrame: DataFrame with a 'label' column
	Returns:
		tuple (number of -1 rows, number of 1 rows), e.g. (1414, 1514)
	'''
	minus1 = _dataFrame[_dataFrame['label'] == -1]
	plus1 = _dataFrame[_dataFrame['label'] == 1]
	return len(minus1), len(plus1)

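def add_label(_dataFrame):
	df = _dataFrame.copy()    # work on a copy so the caller's DataFrame is not modified
	df['Amt_div_Vol'] = df['amount'] / df['volume']    # add the Amt_div_Vol column
	# shift(1) takes 'high' from the adjacent row; the first row has no neighbour, so its value is NaN
	df['label'] = (df['high'].shift(1) - df['low']) / df['close']    # add the label column
	df['label'] = df['label'].apply(lambda x: 1 if x >= 0.03 else -1)    # NaN compares False, so that row gets -1
	# use amount, volume and Amt_div_Vol as the three features; label is the class
	df = df[['label','amount','volume','Amt_div_Vol']]
	return df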

if __name__ == '__main__':
	root_dir = '/Users/wangxiaobin/Documents/git&hexo_blog/source/_posts/股价绘图以及SVM-SVR策略'
	svm_train_dir = '/Users/wangxiaobin/Documents/git&hexo_blog/source/_posts/股价绘图以及SVM-SVR策略'
	# svm_test_dir = '/Users/wangxiaobin/Desktop/大二下创新实践/作业4/作业4 Libsvm-股票数据分析/svm_test'

	filecsv_list = listcsvFiles(root_dir)
	df = loadDataSetByPandas(filecsv_list)
	df = add_label(df)
	df.to_csv('data_by_pandas.csv',header=None,index=None)
	# print the number of negative and positive samples
	print(count(df))
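
The labeling rule as described in the text (third-day high versus first-day low, relative to the first-day open) can also be sketched without pandas. This is only an illustration under assumptions: the rows are ordered oldest first, the helper name `label_rows` and the sample prices are invented, and the last two rows get no label because they have no third day.

```python
from collections import Counter

def label_rows(rows, horizon=2, threshold=0.02):
    """Label row i as 1 if (high of row i+horizon - low of row i) / open of row i
    reaches the threshold, else -1. Rows must be in chronological order."""
    labels = []
    for i in range(len(rows) - horizon):
        first, third = rows[i], rows[i + horizon]
        gain = (third['high'] - first['low']) / first['open']
        labels.append(1 if gain >= threshold else -1)
    return labels

# Invented daily bars, oldest first.
rows = [
    {'open': 10.0, 'high': 10.2, 'low': 9.9},
    {'open': 10.1, 'high': 10.3, 'low': 10.0},
    {'open': 10.2, 'high': 10.4, 'low': 10.1},
    {'open': 10.0, 'high': 10.05, 'low': 9.95},
]
labels = label_rows(rows)
balance = Counter(labels)   # how many -1 and 1 labels were produced
print(labels, dict(balance))
```

Checking the class balance this way serves the same purpose as count() above: a heavily lopsided split is a warning sign before any training starts.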

5. Convert the extracted features to the LIBSVM format

LIBSVM expects each sample in this form:
label 1:attr1 2:attr2 3:attr3 ···
We use the convert binary compiled from convert.c to turn the data into that format.

convert data_by_pandas.csv > /Users/wangxiaobin/Documents/git\&hexo_blog/source/_posts/股价绘图以及SVM-SVR策略/converted
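
If convert.c is not at hand, the same conversion takes a few lines of Python. This is a minimal sketch, assuming a headerless CSV with the column order label, amount, volume, Amt_div_Vol as written in step 4. `to_libsvm_line` and `convert_csv_text` are hypothetical names and the sample rows are invented.

```python
import csv
import io

def to_libsvm_line(label, features):
    """Format one sample as 'label 1:v1 2:v2 ...' (feature indices start at 1)."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1)]
    return " ".join([str(label)] + pairs)

def convert_csv_text(csv_text):
    """Convert headerless CSV text (label first, then features) to LIBSVM lines."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:          # skip blank lines
            continue
        lines.append(to_libsvm_line(row[0], row[1:]))
    return lines

sample = "1,1200.5,300,4.0\n-1,800.0,250,3.2\n"   # invented rows
converted = convert_csv_text(sample)
for line in converted:
    print(line)
```

To process a real file, read its contents into a string first and write the returned lines to the output file.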

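6. Scaling the data

Features with large numeric ranges would otherwise dominate those with small ranges, so to give every feature comparable weight we must scale the data. We use svm-scale, the tool bundled with LIBSVM.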

/usr/local/lib/libsvm-3.22/svm-scale /Users/wangxiaobin/Documents/git\&hexo_blog/source/_posts/股价绘图以及SVM-SVR策略/converted > /Users/wangxiaobin/Documents/git\&hexo_blog/source/_posts/股价绘图以及SVM-SVR策略/scaled
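
By default svm-scale maps each feature linearly onto [-1, 1] using that feature's minimum and maximum. The sketch below shows that arithmetic in plain Python. `scale_columns` and the sample values are illustrative, and mapping a constant column to 0.0 is my own simplification rather than svm-scale's behavior.

```python
def scale_columns(rows, lower=-1.0, upper=1.0):
    """Min-max scale every column of `rows` (a list of equal-length lists)."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled_row = []
        for v, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                scaled_row.append(0.0)   # constant column: simplification, see lead-in
            else:
                scaled_row.append(lower + (upper - lower) * (v - lo) / (hi - lo))
        scaled.append(scaled_row)
    return scaled

features = [[100.0, 2.0], [200.0, 4.0], [300.0, 3.0]]   # invented values
scaled = scale_columns(features)
print(scaled)
```

When the data is later split into training and test parts, it is usually safer to take the minimum and maximum from the training rows only and reuse them for the test rows. svm-scale supports this through its -s (save) and -r (restore) options.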

Step 3: Applying the SVM

7. Training and prediction

from svmutil import *
# read the scaled file, train on the first 400 samples, predict the remaining 88
y,x = svm_read_problem('scaled')
m = svm_train(y[:400],x[:400],'-c 4')
p_label,p_acc,p_val = svm_predict(y[400:],x[400:],m)

The result:

➜  股价绘图以及SVM-SVR策略 git:(master) ✗ python svmtrain.py 
*..
WARNING: using -h 0 may be faster
*
optimization finished, #iter = 1152
nu = 0.685000
obj = -1095.381574, rho = -0.927679
nSV = 299, nBSV = 262
Total nSV = 299
Accuracy = 73.8636% (65/88) (classification)

8. Predicting after tuning the C and gamma parameters

➜ 股价绘图以及SVM-SVR策略 git:(master) ✗ python /usr/local/lib/libsvm-3.22/tools/grid.py scaled

[local] 5 -7 67.2131 (best c=32.0, g=0.0078125, rate=67.2131)
[local] -1 -7 67.2131 (best c=0.5, g=0.0078125, rate=67.2131)
...
[local] 13 -9 67.2131 (best c=128.0, g=8.0, rate=69.4672)
[local] 13 -3 67.2131 (best c=128.0, g=8.0, rate=69.4672)
128.0 8.0 69.4672

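Before running grid.py, check whether the paths hard-coded in its source match your installation and correct them if they do not. Otherwise it raises a file-not-found error.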
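
grid.py tries values of c and g on a base-2 exponential grid and scores each pair by cross-validation. In each output line, the two numbers after [local] are log2(c) and log2(g). The sketch below only enumerates such a grid. The ranges are what I believe to be grid.py's defaults (log2c from -5 to 15 in steps of 2, log2g from 3 down to -15 in steps of -2), so treat them as an assumption and check your own copy of grid.py.

```python
def exponent_range(start, stop, step):
    """Inclusive range of exponents that works for positive and negative steps."""
    values = []
    e = start
    while (step > 0 and e <= stop) or (step < 0 and e >= stop):
        values.append(e)
        e += step
    return values

log2c_values = exponent_range(-5, 15, 2)     # assumed default range for c
log2g_values = exponent_range(3, -15, -2)    # assumed default range for g

# Every (c, g) pair the search would evaluate.
grid = [(2.0 ** lc, 2.0 ** lg) for lc in log2c_values for lg in log2g_values]
print(len(grid))
print((128.0, 8.0) in grid)
```

Under these assumed ranges, g=8 sits on the edge of the grid, so a better value may lie beyond the range that was searched.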

This gives the best parameters c=128, g=8. Adding them to the training call from step 7 yields:

➜  股价绘图以及SVM-SVR策略 git:(master) ✗ python svmtrain.py
.................*..................................*
optimization finished, #iter = 20659
nu = 0.642750
obj = -32444.879084, rho = -0.564842
nSV = 273, nBSV = 246
Total nSV = 273
Accuracy = 68.1818% (60/88) (classification)

You read that right: the accuracy went down (ﾟДﾟ), and I have no idea why yet.

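9. Finding the problem through the confusion matrix and the ROC curve

Accuracy is only one indicator of model quality, and on its own it cannot evaluate a model comprehensively. To judge the model more objectively we need other metrics: precision, recall, and the ROC curve, which plots the true positive rate (recall) against the false positive rate. I won't go into their exact definitions here (I'll write a separate post when I have time). Python's sklearn package conveniently provides functions for all the formulas I need, so I used sklearn for this part.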

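# load the data; this file needs a header row with the column names used below
# (the CSV written in step 4 was saved with header=None)
import pandas as pd
df = pd.read_csv('data_after_pandas.csv')

# choose the feature set X and the target y
features_cols = ['amount','volume','Amt_div_Vol']
X = df[features_cols]
# standardize the features (X)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)  # create and fit a scaler object
X = scaler.transform(X)
y = df.label

# split the data into training and test sets
# 'from sklearn.cross_validation import train_test_split' is deprecated
from sklearn.model_selection import train_test_split
# passing random_state=0 would make the split reproducible; stratify=y keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.25,stratify=y)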

from sklearn import svm
clf = svm.SVC()
# train the model
clf.fit(X_train,y_train)
# print(clf.best_params_)
# best_clf = clf.best_estimator_
# predict on the test set
predictions = clf.predict(X_test)
print('predictions:',predictions)
# compute accuracy
from sklearn import metrics
accuracy_score = metrics.accuracy_score(y_test, predictions)
print('accuracy_score:',accuracy_score)
# compute recall
recall_score = metrics.recall_score(y_test,predictions)
print('recall_score:',recall_score)
# compute precision
precision_score = metrics.precision_score(y_test,predictions)
print('precision_score',precision_score)
# compute the decision-function scores, needed later to draw the ROC curve
y_scores = clf.decision_function(X_test)
print('y_scores:',y_scores)

predictions: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1]
    accuracy_score: 0.6721311475409836
	 recall_score: 1.0
    precision_score: 0.6721311475409836
    y_scores: [1.   1.   0.95 1.   1.   1.   1.02 0.91 1.   0.81 1.   1.   1.   1.
     1.   1.   1.   1.   1.   1.17 1.01 1.   1.   1.08 0.96 1.   1.   1.
     1.   1.   1.   1.   0.99 0.84 1.   1.   1.   0.98 1.   0.94 1.02 1.
     0.98 0.96 1.   1.   1.01 1.04 0.97 1.   1.   1.   1.   1.   1.   0.99
     1.   0.99 1.   1.   0.99 1.   0.26 0.97 1.   1.   0.94 0.97 1.   1.
     1.   1.   1.   0.98 1.   1.   1.   1.   1.   1.   1.   1.   1.02 0.9
     1.   1.   1.   1.   0.99 1.   1.   0.99 1.   1.   0.94 1.   1.26 1.
     1.   0.99 1.   1.   1.   1.   0.9  0.96 1.   1.   1.04 1.   1.   1.
     0.94 1.   0.94 1.   1.   1.   1.   1.   1.   1.03]

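Looking at predictions, every prediction is 1, so the classifier clearly is not separating the classes. Recall is 1, yet accuracy_score and precision_score are both only 0.672. The gap between the metrics is large, and a recall of exactly 1 is highly suspicious, so let's plot the confusion matrix to understand where it comes from.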

# confusion matrix
import itertools
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """        
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, predictions)
np.set_printoptions(precision=2)

# confusion_matrix orders rows and columns by sorted label, so -1 comes first and 1 second
class_names = ['bad (-1)', 'good (1)']

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

Confusion matrix, without normalization
[[ 0 40]
 [ 0 82]]
Normalized confusion matrix
[[0. 1.]
 [0. 1.]]

Now it starts to make sense. By the recall formula $R = \frac{TP}{TP+FN}$, with $TP = 82$ and $FN = 0$, recall is of course 1.
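
The numbers printed earlier can be checked by hand from the confusion matrix. A small sketch, assuming the rows are true labels and the columns are predicted labels in the order (-1, 1), which is how sklearn's confusion_matrix lays them out. The variable names are mine.

```python
cm = [[0, 40],   # true -1: predicted -1, predicted 1
      [0, 82]]   # true  1: predicted -1, predicted 1

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of correct predictions
precision = tp / (tp + fp)            # of the predicted positives, how many are real
recall = tp / (tp + fn)               # of the real positives, how many were found
positive_share = (tp + fn) / total    # share of positive samples in the test set

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"share of positive samples={positive_share:.4f}")
```

Accuracy and precision both equal the share of positive samples, which is exactly what a classifier that always answers 1 would score.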
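In summary, this classifier is a failure: it assigns every instance to the positive class. What went wrong? A classmate made a very fair point when we talked it over: feature selection matters most. Without well-chosen features the classifier cannot do its job, yet feature selection is also the hard part. These features were hand-picked with no way of knowing how good they are, so it is understandable that the result falls short. For this experiment, the three features I chose (volume, amount, and amount/volume) are probably not well suited. I should look for more representative features.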
