Decision Trees and the ID3 Algorithm
A decision tree is one of the most important and useful structures in machine learning. It is especially valuable for supervised learning, where the program learns to classify a labelled data set with respect to a given target and then applies that knowledge to new data. We can think of it as learning with a teacher: unlike unsupervised learning, you provide labelled data to the program.
Components of a Decision Tree
In a decision tree, each branch represents a different outcome. We can think of the tree as a flowchart: each decision takes us one level deeper and classifies the data further.
At each node we place an attribute and ask a question about it. The answers to that question determine which way (which edge) we go. So the nodes correspond to attributes, and the edges correspond to the possible values of those attributes.
In this case, the attribute is “Hungry”, the question is “Are you hungry?”, and each edge represents a possible answer. Each answer takes us to a new layer, where the program creates a new node with a new attribute and applies the same classification method recursively.
Finally, the leaves represent the outcomes.
In this case, the leaves are the results of the decision: if we are hungry we eat now, and if we are not we eat later.
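To make the node/edge/leaf terminology concrete, here is a minimal Python sketch of how such a tree might be represented; the `DecisionNode` class and the attribute names are purely illustrative, not part of any particular library.

```python
# A minimal, illustrative representation of the "Are you hungry?" tree.
class DecisionNode:
    def __init__(self, attribute=None, branches=None, outcome=None):
        self.attribute = attribute      # question asked at this node, e.g. "Hungry"
        self.branches = branches or {}  # edge value -> child node
        self.outcome = outcome          # set only on leaves

    def decide(self, sample):
        if self.outcome is not None:        # leaf: return the outcome
            return self.outcome
        answer = sample[self.attribute]     # follow the edge matching this answer
        return self.branches[answer].decide(sample)

tree = DecisionNode(
    attribute="Hungry",
    branches={
        "yes": DecisionNode(outcome="eat now"),
        "no": DecisionNode(outcome="eat later"),
    },
)

print(tree.decide({"Hungry": "yes"}))  # -> eat now
```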
As you can see, the main goal of a decision tree is to split the data, find the most suitable classification at each step, and repeat the process until every subset of the data is homogeneous. To picture a full decision tree, here is a more visual example:
In this example we start with a data set containing shapes of different colours. We first classify the data by shape and then by colour. Here the first split happens to be on shape, but it could just as well have been on colour. So how does the program choose the order of classification? To answer that, we first need to learn a few concepts.
Entropy and Information Gain
Entropy: In the machine learning sense, entropy measures the homogeneity of a data set. If the entropy is 0, all the examples in the set belong to the same class. The mathematical formula of entropy is:

Entropy(S) = −Σ p_i · log2(p_i), summed over i = 1, …, c
In this formula, S is the data set, c is the number of different classes, and p_i is the proportion of the data belonging to class i.
For instance, suppose we have a data table about eating dinner out, with variables such as “Am I hungry?”, “Is the weather good?”, “Do I want to stay at home?”, and “Do I want to order something?”. These variables lead to two possible outcomes, yes or no, so we have two classes: positive and negative.
So if our table contained 7 positive outcomes and 4 negative outcomes, we would calculate the entropy as follows:

Entropy = −(7/11) · log2(7/11) − (4/11) · log2(4/11) ≈ 0.946
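As a quick sanity check on the numbers above, a short Python snippet (the `entropy` function name is just illustrative) reproduces the same value:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

# 7 positive and 4 negative outcomes, as in the example above
outcomes = ["yes"] * 7 + ["no"] * 4
print(round(entropy(outcomes), 3))  # ~0.946
```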
Information Gain: Information gain measures how effective an attribute is at classifying the data. The better an attribute separates the examples, the higher its information gain. Intuitively, as we go deeper into the tree we want the data to become more and more separated with respect to its label.
We always want to maximize information gain, so we need the entropies of the partitioned subsets to be as low as possible.
Gain(T, X) = Entropy(T) − Σ (|T_v| / |T|) · Entropy(T_v), summed over the values v of attribute X

This formula represents the information gain: T is the given set, X is the attribute whose gain we want to calculate, and T_v is the subset of T in which X takes the value v. So the answer to the earlier question is that when our program chooses the classification order, it first calculates the information gain of each attribute and splits on the one with the highest gain.
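Building on the same idea, here is a sketch of how Gain(T, X) could be computed for a single attribute; the data layout (rows as tuples, an attribute given by its column index) and all names are assumptions made for this example:

```python
from math import log2

def entropy(labels):
    # -sum(p * log2(p)) over the classes present in labels
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def information_gain(rows, labels, attribute_index):
    """Gain(T, X) = Entropy(T) - sum over values v of X of (|T_v|/|T|) * Entropy(T_v)."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute_index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data: columns = ("hungry", "good weather"), label = eat outside?
rows   = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # "hungry" splits the labels perfectly -> 1.0
```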
The ID3 Algorithm
We have now covered the decision tree basics and understand how a decision tree splits the data. Let's see how the ID3 algorithm accomplishes that.
ID3 is an algorithm that generates a decision tree from a given labelled data set. It is used in machine learning and natural language processing. ID3 follows a top-down greedy approach: we build the tree from the top down, and at each iteration we try to choose the best split. The ID3 algorithm is all about finding the attribute that returns the highest information gain.
Summary of the algorithm:
- Take the data set S
- Determine the best attribute with respect to information gain
- Split S into subsets according to the best attribute
- Create a decision tree node whose root is the best attribute
- Apply the same steps recursively to each new subset. If there are no more attributes to split on, label the node with the most popular class.
The algorithm keeps running until all the data has been classified/split.
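The summary above maps almost directly onto a recursive function. Below is a simplified, illustrative ID3 sketch that reuses the `entropy` and `information_gain` helpers from the earlier snippets; it omits noise handling and other refinements found in real implementations:

```python
from collections import Counter

# Assumes the entropy() / information_gain() helpers from the earlier sketch.
def id3(rows, labels, attributes):
    """attributes maps an attribute name to its column index in each row (a tuple)."""
    # 1. All examples share one label -> return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # 2. No attributes left to split on -> return the most popular label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # 3. Pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, attributes[a]))
    index = attributes[best]
    remaining = {a: i for a, i in attributes.items() if a != best}

    # 4. Split the data on that attribute and recurse on each subset.
    tree = {best: {}}
    for value in set(row[index] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[index] == value]
        sub_rows, sub_labels = zip(*pairs)
        tree[best][value] = id3(list(sub_rows), list(sub_labels), remaining)
    return tree
```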
But what if two examples have exactly the same attribute values yet different labels? In other words, what if our training data contains some noise?
Overfitting and Underfitting Problems
Overfitting happens when your program models the training data perfectly (learning its noise along the way), so that when it tries to classify new data it lacks the ability to generalize. In other words, your program has memorized the training data in every detail.
Underfitting, on the other hand, means your program neither models the training data nor generalizes to new data; in short, it has failed to learn from the training data at all.
We can overcome the overfitting problem with a few methods:
- Cross-Validation: In this method, you split your training data into smaller training and testing parts. This makes it easy to tune your program.
- Pruning: You can reduce the size of the tree by removing redundant parts that do not help classify the data. In other words, you remove the unnecessary, noisy parts of the tree. A small sketch of both ideas follows this list.
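For reference, both ideas are readily available in scikit-learn (whose trees are CART-based rather than ID3, but the overfitting remedies are the same). This is a minimal sketch; the data set and the parameter values are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Limiting depth (pre-pruning) and cost-complexity pruning (ccp_alpha) both
# shrink the tree; the exact values here are arbitrary, for illustration only.
model = DecisionTreeClassifier(criterion="entropy", max_depth=3, ccp_alpha=0.01)

# 5-fold cross-validation: train on 4/5 of the data, test on the rest, repeat.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```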
Tip: Try to start with simple models. With a more basic model it is easier to trace what the program is doing (Occam's razor).
Since the root of the underfitting problem is a failure to learn, you can address it by increasing the size of the training set, the training time, or the complexity of your model.
In this article, I wanted to briefly explain decision trees and the ID3 algorithm, which I have just learned about myself. What I covered only scratches the surface of these topics; if you want to learn more, I strongly recommend visiting the sources in the references section.
Translated from: https://medium.com/analytics-vidhya/decision-trees-and-id3-algorithm-62b40c50e371