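Study of Stock Price Trend Forecast Based on the Random Forests Algorithm: A Case Study of the Consumer Industry

Master's Thesis, China University of Political Science and Law

Abstract

Predicting stock price trends has always been the pursuit, even the dream, of investors, and calling it a dream says much about how difficult the task is. In the past, people judged the future performance of a stock mainly by analyzing a listed company's production and operations, financial condition, stock trading indicators, and even investor behavior, which gave rise to schools of thought large and small. With the steady progress of computer technology, artificial intelligence has become a reality and advanced rapidly, affecting every aspect of production and daily life. Machine learning in particular has opened up a new way of analyzing and studying stock price trends.

The random forests algorithm is a machine learning method. It uses existing samples to train a large number of different decision trees and then combines those trees to classify newly input samples. Based on the random forests algorithm, this thesis attempts to predict the stock price trends of domestic listed companies. Data from listed companies within the same industry are more comparable, and the domestic consumer industry contains a large number of listed companies, so taking it as the research object lets the random forests algorithm perform better. Industries are classified under the Wind industry classification standard, and the sample covers food and beverages, tobacco, apparel and textiles, household goods, hotels and restaurants, entertainment, automobiles and components, luxury goods, and others. The raw data are the publicly disclosed financial indicators and trading technical indicators of consumer-industry companies listed in Shanghai and Shenzhen from 2015 to 2018, together with each company's price change over the following year. The difference between each of a company's indicators and the industry median serves as a feature variable of the random forests model. The output is whether the following-year price change beats the industry median, encoded as 0 or 1 to keep the output intuitive and uniform. Samples from 2015 to 2017 form the training set, and samples from 2018 form the test set. After the random forests model is built, its performance is improved by tuning the ntree and mtry values. Feature importance is computed from the mean decrease in the Gini index, and the least important features are removed to optimize the RF model further. This approach is expected to remove the effect of systematic and industry-wide fluctuations on individual stocks. If the model holds, the same method can be applied to other industries.

Machine learning makes stock analysis and investment decisions more objective and rational. Applying the random forests algorithm to stock analysis makes good use of the strengths of machine learning, overcomes the limits of subjective experience and the interference of emotional swings, and screens out high-quality investment targets, giving investors a reference.

Keywords: random forests; feature selection; listed companies; stock prices

STUDY OF STOCK PRICE TREND FORECAST BASED ON RANDOM FORESTS ALGORITHM: A CASE STUDY OF CONSUMER INDUSTRY

ABSTRACT

Predicting stock price trends has always been the pursuit, even the dream, of investors, and calling it a dream says much about how difficult the task is. In the past, people judged the future performance of stock prices mainly by analyzing the production and operations of listed companies, their financial condition, stock trading indicators, and even investor behavior. With the continuous advancement of computer technology, artificial intelligence has become a reality and developed rapidly, affecting every aspect of production and daily life. The emergence of machine learning in particular has opened up new ways to analyze and study stock price trends.

The random forests algorithm is a machine learning method.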
It uses existing samples to train a large number of different decision trees and then combines those trees to classify newly input samples. Based on the random forests algorithm, this thesis attempts to predict the stock price trends of domestic listed companies. Data from listed companies within the same industry are more comparable, and the domestic consumer industry contains a large number of listed companies, so taking it as the research object lets the random forests algorithm perform better. Industries are classified under the Wind industry classification standard, and the sample covers food and beverages, tobacco, apparel and textiles, household goods, hotels and restaurants, entertainment, automobiles and components, luxury goods, and others. The raw data are the publicly disclosed financial indicators and trading technical indicators of consumer-industry companies listed in Shanghai and Shenzhen from 2015 to 2018, together with each company's price change over the following year. The difference between each of a company's indicators and the industry median serves as a feature variable of the random forests model. The output is whether the company's following-year price change beats the industry median, encoded as 0 or 1 to keep the output intuitive and uniform. Samples from 2015 to 2017 form the training set, and samples from 2018 form the test set. After the random forests model is built, its performance is improved by tuning ntree (the number of trees) and mtry (the number of candidate features tried at each split). Feature importance is computed from the mean decrease in the Gini index, and the least important features are removed to optimize the RF model further. This approach is expected to remove the effect of systematic and industry-wide fluctuations on individual stocks.
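
The sample construction described above can be sketched in Python. This is a minimal illustration, not the thesis's implementation: the record fields (`roe`, `turnover`, `next_ret`), the toy numbers, and the helper `build_samples` are hypothetical. It also assumes that industry medians are taken within each year and that a return exactly equal to the median is labeled 0.

```python
from statistics import median

# Hypothetical toy records for one industry; all values are made up.
# "next_ret" stands for the company's price change over the following year.
RECORDS = [
    {"year": 2016, "ticker": "A", "roe": 0.12, "turnover": 1.5, "next_ret": 0.20},
    {"year": 2016, "ticker": "B", "roe": 0.08, "turnover": 2.5, "next_ret": -0.10},
    {"year": 2016, "ticker": "C", "roe": 0.10, "turnover": 2.0, "next_ret": 0.05},
    {"year": 2018, "ticker": "A", "roe": 0.15, "turnover": 1.0, "next_ret": 0.30},
    {"year": 2018, "ticker": "B", "roe": 0.05, "turnover": 3.0, "next_ret": 0.10},
    {"year": 2018, "ticker": "C", "roe": 0.09, "turnover": 2.0, "next_ret": -0.20},
]
FEATURES = ["roe", "turnover"]  # hypothetical indicator names


def build_samples(records, features):
    """Return a list of (year, x, y) samples.

    x: each indicator minus the same-year industry median (assumption:
       medians are computed per year).
    y: 1 if the following-year return is strictly above the same-year
       industry median, else 0 (assumption: ties are labeled 0).
    """
    by_year = {}
    for r in records:
        by_year.setdefault(r["year"], []).append(r)

    samples = []
    for year, rows in sorted(by_year.items()):
        feature_median = {f: median(r[f] for r in rows) for f in features}
        return_median = median(r["next_ret"] for r in rows)
        for r in rows:
            x = [r[f] - feature_median[f] for f in features]
            y = 1 if r["next_ret"] > return_median else 0
            samples.append((year, x, y))
    return samples


samples = build_samples(RECORDS, FEATURES)

# Split by year as in the abstract: 2015-2017 for training, 2018 for testing.
train = [(x, y) for year, x, y in samples if 2015 <= year <= 2017]
test = [(x, y) for year, x, y in samples if year == 2018]
```

The resulting `train` and `test` lists could then be passed to any random forest implementation. The parameter names ntree and mtry used in the abstract match those of the R `randomForest` package, but the choice of library here is left open.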
If the model holds, the same method can be applied to other industries. Machine learning makes stock analysis and investment decisions more objective and rational. Applying the random forests algorithm to stock analysis makes good use of the strengths of machine learning, overcomes the limits of subjective experience and the interference of emotional swings, and screens out high-quality investment targets, giving investors a reference.

KEY WORDS: Random Forests; feature selection; listed company; stock price

Table of Contents

Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Significance
1.1.1 Research Background
1.1.2 Research Significance
1.2 Research Status at Home and Abroad
1.2.1 Research Status Abroad
1.2.2 Research Status in China
1.3 Research Content and Methods
1.3.1 Research Content and Thesis Structure
1.3.2 Research Methods and Approach
1.4 Innovations of This Thesis
Chapter 2 Overview of Random Forest Theory
2.1 Decision Tree Method
2.2 Bagging Method
2.3 Random Forests
2.3.1 The Random Forests Algorithm
2.3.2 Generalization Error and OOB Estimation
2.3.3 Feature Importance Evaluation
Chapter 3 Analysis of Domestic Stock Price Volatility and Industry Classification
3.1 Characteristics of Domestic Stock Market Volatility