Decision Trees Presentation

On GitHub: rcliao/decision-trees

Why Am I Here?

For fun.

To present another way to predict Titanic survival.

Review

Types of Learning

  • Supervised Learning

    Teacher provides labeled examples

  • Unsupervised Learning

    No teacher to provide examples

  • Reinforcement Learning

    Agent only receives a reward for taking an action

Learning Algorithm

  • Batch Learning Algorithm

    Receives a large set of data all at once and selects a single hypothesis

  • Incremental Learning Algorithm

    Receives one example at a time and continually modifies its hypothesis

Decision Trees

Data structures that provide an agent with a means of classifying examples

Trees as Rules

A decision tree is just a compiled set of rules:

    if name == 'Daniel Soto':
        idiot
    else:
        not idiot
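
More realistically, a tree learned from the Titanic data might compile to rules like these (the attribute names and outcomes below are hypothetical, for illustration only):

    # Hypothetical compiled rules for Titanic survival -- not the actual
    # tree learned later in this presentation.
    def predict_survival(sex, pclass):
        if sex == 'female':
            return 'survived'
        elif pclass == 1:
            return 'survived'
        else:
            return 'did not survive'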

Examples

Here we will use the Titanic data set as an example

Recall: The Titanic Data Set

Challenges

Choosing Useful Test Attribute

We would like attributes that "split" the training set into nearly pure subsets

We can construct a decision tree recursively:

Base cases:

  • If all examples are positive or all are negative, we are done.

  • If there are no examples left, then we have not seen an instance of this classification, so we use the majority classification of the parent.

  • If there are no attributes left to test, then we have instances with the same description but different classifications (an insufficient description, noisy data, or a nondeterministic domain); use a majority vote.

Recursive step:

  • Otherwise, pick the best attribute to split on and recursively construct the subtrees.

Choosing Useful Attribute

How do we know which test attribute provides the most information?

Information Theory

You can compute the information content by the following formula:

$$\sum_{i=1}^{n} -P(v_i)\,\log_2 P(v_i)$$

Example of a coin flip: $-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$

Sometimes Information Content is also called Entropy
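
As a sanity check on the formula, here is a minimal Python sketch (the entropy function is my own illustration, not code from the presentation):

    import math

    def entropy(probabilities):
        """Information content of a distribution, in bits."""
        return sum(-p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # fair coin flip -> 1.0
    print(entropy([1.0]))       # certain outcome -> 0.0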

Using Information Theory

We want to know how valuable each possible test is

  • We can estimate the probabilities of the possible answers from the training set.
  • Usually, a single test will not be enough.
  • So we also need to measure how much better off we are after asking another question.
  • That measure is called information gain.

Information Gain

If the training set has $p$ positive examples and $n$ negative examples, the information in a correct answer is:

$$I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

The remainder after testing attribute $A$ is the weighted sum of the information in each of the resulting subsets:

$$\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$$

Information Gain

The difference between the original information (before the test) and the remaining information (after the test):

$$\mathrm{Gain}(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{Remainder}(A)$$

Choose the test attribute with the largest information gain.
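
To make these formulas concrete, here is a small Python sketch; the info and gain helpers and the example counts are invented for illustration:

    import math

    def info(p, n):
        """I(p/(p+n), n/(p+n)): information in a correct answer, in bits."""
        result = 0.0
        for count in (p, n):
            if count > 0:
                q = count / (p + n)
                result -= q * math.log2(q)
        return result

    def gain(p, n, subsets):
        """Information gain of a test that splits (p, n) examples into
        subsets, given as a list of (p_i, n_i) pairs."""
        remainder = sum((pi + ni) / (p + n) * info(pi, ni)
                        for pi, ni in subsets)
        return info(p, n) - remainder

    # Hypothetical test on 10 positive / 10 negative examples that sends
    # (8, 2) down one branch and (2, 8) down the other:
    print(gain(10, 10, [(8, 2), (2, 8)]))  # about 0.278 bits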

Pseudocode

    def makeTree(dataset):
        if all examples are in the same class:
            return a single Node with that classification
        if there are no attributes left to test:
            return a single Node with the majority classification
        else:
            select the attribute that produces the largest information gain
            split the dataset according to the values of this attribute to
                create $v$ smaller datasets
            create a new Node - each child is created by calling makeTree
                with one of the $v$ subsets
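
For reference, here is a minimal runnable sketch of this pseudocode; the nested-dict tree representation and the toy data are my own choices, and the empty-subset base case is omitted because the sketch only splits on attribute values that actually occur:

    import math
    from collections import Counter

    def entropy(labels):
        """Information content of a list of class labels, in bits."""
        total = len(labels)
        return sum(-c / total * math.log2(c / total)
                   for c in Counter(labels).values())

    def make_tree(examples, attributes):
        """examples: list of (features_dict, label) pairs. Returns a nested
        dict {attribute: {value: subtree}}, with bare labels at the leaves."""
        labels = [label for _, label in examples]
        if len(set(labels)) == 1:            # all in the same class
            return labels[0]
        if not attributes:                   # no attributes left to test
            return Counter(labels).most_common(1)[0][0]  # majority vote

        def gain(attr):
            remainder = 0.0
            for value in {f[attr] for f, _ in examples}:
                subset = [lab for f, lab in examples if f[attr] == value]
                remainder += len(subset) / len(examples) * entropy(subset)
            return entropy(labels) - remainder

        best = max(attributes, key=gain)     # largest information gain
        node = {best: {}}
        for value in {f[best] for f, _ in examples}:
            subset = [(f, lab) for f, lab in examples if f[best] == value]
            node[best][value] = make_tree(
                subset, [a for a in attributes if a != best])
        return node

    # Toy data in the spirit of the Titanic example:
    data = [({'sex': 'female', 'pclass': 1}, 'survived'),
            ({'sex': 'female', 'pclass': 3}, 'survived'),
            ({'sex': 'male',   'pclass': 1}, 'survived'),
            ({'sex': 'male',   'pclass': 3}, 'died')]
    print(make_tree(data, ['sex', 'pclass']))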

link to python file

TL;DR

    from sklearn import tree

    header, data = create_numpy_array_from_csv(csv_file_path='train.csv')

    S = data[0::, 2:]  # features: cut out the survived column
    # tons of processing in between
    Y = data[0::, 0]   # labels: the survived column

    decision_tree_classifier = tree.DecisionTreeClassifier(max_depth=6)
    decision_tree_classifier = decision_tree_classifier.fit(S, Y)
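
Once fitted, the classifier can report its accuracy on the training set or predict survival for new passengers; test_data below is a hypothetical array that would have to be preprocessed the same way as S:

    # Accuracy on the training set
    print(decision_tree_classifier.score(S, Y))

    # Predicting for new passengers (test_data is hypothetical and must be
    # preprocessed exactly like S):
    # predictions = decision_tree_classifier.predict(test_data)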

Overfitting

In decision trees, overfitting is dealt with through pruning.

Pruning

If a node does not provide any information, then remove the node and move the examples to the parent.
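
One concrete option, if using scikit-learn 0.22 or later, is minimal cost-complexity pruning via the ccp_alpha parameter. This is a different mechanism from the node-removal rule above, shown here only as a sketch:

    # Minimal cost-complexity pruning sketch (scikit-learn 0.22+).
    # Larger ccp_alpha values prune more aggressively; picking the
    # second-largest alpha here is just an arbitrary example.
    path = decision_tree_classifier.cost_complexity_pruning_path(S, Y)
    pruned = tree.DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2])
    pruned = pruned.fit(S, Y)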

Features of Decision Trees

  • Work well with symbolic data
  • Use information gain to determine which questions are most effective at classifying
  • Produce a human-understandable hypothesis
  • Need extensions to deal with noise or missing data
  • Can represent all Boolean functions