# Learning From Data

The recommended textbook covers 14 of the 18 lectures; the rest is covered by online material that is freely available to the book's readers. The book, together with this specially prepared online material, provides a complete introduction to Machine Learning.

The book website AMLbook.com contains supporting material for instructors and readers.

How does the bin model relate to the learning problem? Take any single hypothesis h in H and compare it to f on each point x in X: color the point red if h(x) differs from f(x) and green if they agree. The color that each point gets is not known to us, since f is unknown, but if the inputs x1, ..., xN in D are picked independently according to a probability distribution P over X, the training examples play the role of a random sample from the bin. P can be unknown to us as well; the Hoeffding Inequality does not depend on it. If the sample was not randomly selected but picked in a particular way, the equivalence to the bin is lost, and so is the guarantee.

With this equivalence, the learning problem is reduced to a bin problem. The out-of-sample error Eout(h) is the probability that h disagrees with f, based on the distribution P over X which is used to sample the data points x; the in-sample error Ein(h) is the error rate within the sample. We have made explicit the dependency of Ein on the particular h that we are considering. If Ein(h) happens to be close to zero, we can conclude, with high probability, that Eout(h) is close to zero as well. If we have only one hypothesis to begin with, this is all we need; in real learning, however, we explore an entire hypothesis set. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning.
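The single-bin guarantee can be checked numerically. Below is a minimal sketch (the bin probability `mu`, sample size, and tolerance are illustrative values, not from the text): we repeatedly draw a sample from a bin whose red-marble probability is `mu`, and compare the empirical deviation frequency against the Hoeffding bound 2·exp(−2ε²N), which does not depend on `mu`.

```python
import random
import math

def sample_nu(mu, n, rng):
    """Fraction of 'red' marbles in a sample of n draws from a bin with red-probability mu."""
    return sum(rng.random() < mu for _ in range(n)) / n

rng = random.Random(0)
mu, n, eps, trials = 0.6, 100, 0.1, 2000

# Empirical probability that the sample frequency nu deviates from mu by more than eps.
deviations = sum(abs(sample_nu(mu, n, rng) - mu) > eps for _ in range(trials)) / trials

# Hoeffding bound: P[|nu - mu| > eps] <= 2 exp(-2 eps^2 N), independent of mu.
bound = 2 * math.exp(-2 * eps**2 * n)

print(deviations, bound)
```

The empirical deviation frequency comes out well under the bound, which is loose by design: it holds for every `mu` and every distribution of the marbles.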

To do that, we take one 'bin' per hypothesis; each bin still represents the input space X, but the coloring depends on the hypothesis. Let us consider an entire hypothesis set H = {h1, ..., hM} instead of just one hypothesis h. The probability of red marbles in the mth bin is Eout(hm) and the fraction of red marbles in the mth sample is Ein(hm). With multiple hypotheses in H, the simple Hoeffding guarantee no longer applies to the final hypothesis g. Why is that? The hypothesis g is not fixed ahead of time before generating the data: if you are allowed to change h after you generate the data set, the assumptions behind the Hoeffding Inequality are violated.

The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm. Run a computer simulation for flipping 1,000 fair coins; flip each coin independently 10 times. Let's focus on 3 coins as follows: c1 is the first coin flipped, crand is a coin you choose at random, and cmin is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie). Let v1, vrand and vmin be the fraction of heads you obtain for the respective three coins, and plot the histograms of the distributions of v1, vrand and vmin.
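A minimal sketch of the coin experiment described above (histograms omitted; the averages already make the point):

```python
import random

def coin_experiment(num_coins=1000, flips=10, rng=None):
    """Flip num_coins fair coins `flips` times each; return nu_1, nu_rand, nu_min."""
    rng = rng or random.Random(0)
    heads = [sum(rng.random() < 0.5 for _ in range(flips)) for _ in range(num_coins)]
    nu_1 = heads[0] / flips              # the first coin: fixed before the data
    nu_rand = rng.choice(heads) / flips  # a coin chosen at random
    nu_min = min(heads) / flips          # the coin picked AFTER seeing the data
    return nu_1, nu_rand, nu_min

# Averaged over many runs, nu_1 and nu_rand hover near 0.5, but nu_min sits far
# below it: c_min is selected after looking at the data, like g chosen from H.
runs = 200
rng = random.Random(1)
avgs = [0.0, 0.0, 0.0]
for _ in range(runs):
    for i, v in enumerate(coin_experiment(rng=rng)):
        avgs[i] += v / runs
print(avgs)
```

The fixed coin and the randomly chosen coin obey the single-bin Hoeffding bound; the minimum-frequency coin does not, which is exactly why the bound must be modified when g is selected from multiple hypotheses.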

Let us pin down what we mean by the feasibility of learning, and reconcile two arguments. One argument says that we cannot learn anything outside of D; the other says that D does tell us something about f outside the sample. The question of whether D tells us anything outside of D that we didn't know before thus has two different answers: if we insist on a deterministic answer, the answer is no; if we accept a probabilistic answer, the answer is yes.

There is a simple but crude way of dealing with multiple hypotheses. We now apply two basic rules in probability: if event B1 implies event B2, then P[B1] <= P[B2]; and if B1, B2, ..., BM are any events, then P[B1 or B2 or ... or BM] <= P[B1] + P[B2] + ... + P[BM] (the union bound). Since g has to be one of the hm's regardless of the algorithm and the sample, putting the two rules together yields

P[|Ein(g) - Eout(g)| > epsilon] <= 2M e^(-2 epsilon^2 N).

This bound is crude; we will improve on it in Chapter 2. What enabled all of this is the Hoeffding Inequality: we don't insist on using any particular probability distribution, and that's what makes the Hoeffding Inequality applicable. We have thus traded the condition Eout(g) ~ 0, which we cannot ascertain, for the condition Eout(g) ~ Ein(g), which we can. Remember that Eout(g) is an unknown quantity; what we get instead is a guarantee that it tracks the known quantity Ein(g). We cannot guarantee that we will find a hypothesis that achieves Ein(g) ~ 0; if learning is to be successful, we still have to make Ein(g) ~ 0 in order to conclude that Eout(g) ~ 0.
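The union-bound version of the inequality is easy to tabulate. A small sketch (the values of M, N and epsilon are illustrative): the bound degrades only linearly in the number of hypotheses M but improves exponentially in N, so more data can restore a guarantee that a larger hypothesis set destroyed.

```python
import math

def union_hoeffding_bound(M, N, eps):
    """P[|Ein(g) - Eout(g)| > eps] <= 2 M exp(-2 eps^2 N) for g chosen among M hypotheses."""
    return 2 * M * math.exp(-2 * eps**2 * N)

print(union_hoeffding_bound(M=1, N=1000, eps=0.05))    # single fixed hypothesis
print(union_hoeffding_bound(M=100, N=1000, eps=0.05))  # 100 hypotheses: 100x looser, vacuous
print(union_hoeffding_bound(M=100, N=5000, eps=0.05))  # more data restores the guarantee
```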

By adopting the probabilistic view, we get a positive answer to the feasibility of learning without paying too much of a price. Exercise: Assume, in the probabilistic view, that there is a probability distribution on X. We consider two learning algorithms, S (smart) and C (crazy). Is it possible that the hypothesis that C produces turns out to be better than the hypothesis that S produces? Of course, this ideal situation may not always happen in practice.

The feasibility of learning is thus split into two questions: (1) Can we make sure that Eout(g) is close enough to Ein(g)? (2) Can we make Ein(g) small enough? The Hoeffding Inequality addresses the first question; the second question is answered after we run the learning algorithm on the actual data and see how small we can get Ein to be. If the number of hypotheses M goes up, the answer to the first question gets worse.

Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play. One such insight has to do with the 'complexity' of these components. Even when we cannot learn a particular f well, learning can still be useful. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error; all we hope for is a forecast that gets it right more often than not. This means that a hypothesis that has Ein(g) somewhat below 0.5 can still be valuable: if we can make sure that Eout(g) is close enough to Ein(g), the hypothesis will beat chance out of sample.

Exercise: A client is willing to pay you to solve her problem and produce for her a g which approximates f. If you do return a hypothesis g, what is the best that you can promise her?

If we want an affirmative answer to the first question, we need a hypothesis set that is not too complex, since the bound gets worse as the number of hypotheses grows; if we want an affirmative answer to the second question, a more complex hypothesis set helps us fit the data better. If we fix the hypothesis set and the number of training examples, we face a tradeoff between the two.

What about the complexity of the target function? A close look at the inequality reveals that the complexity of f does not appear in the bound at all, so the first question is unaffected. However, a complex target function is harder to fit: this means that we will get a worse value for Ein(g) when f is complex. We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower Ein(g), but then the first question suffers. Either way we look at it, a complex f is harder to learn; in the extreme case, learning becomes impossible. This is obviously a practical observation, not a mathematical statement.

Two notions now need to be made precise. The first notion is what approximation means when we say that our hypothesis approximates the target function well: the choice of an error measure affects the outcome of the learning process. The second notion is about the nature of the target function: in many situations, the target is 'noisy'. What are the ramifications of having such a noisy target on the learning problem?

An error measure quantifies how well each hypothesis h in the model approximates the target function f: Error = E(h, f). One may view E(h, f) as the 'cost' of using h when you should use f; this cost depends on what h is used for. What are the criteria for choosing one error measure over another? We address this question here. In an ideal world E(h, f) would be zero; in reality, the final hypothesis g is only an approximation of f, and different error measures may lead to different choices of the final hypothesis. While E(h, f) is based on the entirety of h and f, it is almost universally defined in terms of the errors on individual input points. Here is a case in point. Example: Consider the problem of verifying that a fingerprint belongs to a particular person.

If we define a pointwise error measure e(h(x), f(x)), the overall error will be the average of the pointwise errors. The error measure should be user-specified: the same learning task in different contexts may warrant the use of different error measures. For the fingerprint problem, we need to specify the error values for a false accept (h accepts an intruder) and for a false reject (h rejects the right person); if the right person is accepted or an intruder is rejected, the error is zero. The costs of the different types of errors can be tabulated in a matrix, and the right values depend on the application.

Consider two potential clients of this fingerprint system. One is a supermarket that will use it to verify customers for discounts; the other is the CIA, who will use it at the entrance to a secure facility to verify that you are authorized to enter that facility. For the supermarket, a false reject is costly: an annoyed customer may take her business elsewhere, and all future revenue from this annoyed customer is lost. A false accept is minor: you just gave away a discount to someone who didn't deserve it. For the CIA, on the other hand, a false accept is a disaster: an unauthorized person will gain access to a highly sensitive facility. This should be reflected in a much higher cost for the false accept. A false reject is tolerable, since the system is used by authorized personnel: the inconvenience of retrying when rejected is just part of the job. The moral of this example is that the choice of the error measure depends on how the system is going to be used.

There are practical reasons why the error measure may not be handed to us. One is that the user may not provide an error specification; the other is that the weighted cost may be a difficult objective function for optimizers to work with. In such cases we settle for an error measure that we can work with; we have already seen an example of this with the simple binary error used in this chapter.
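The risk-matrix idea can be sketched in a few lines. The cost values below are hypothetical, chosen only to illustrate the contrast between the two clients; the text does not fix specific numbers. The same classifier outcomes score very differently under the two matrices.

```python
# Hypothetical cost matrices: keys are (true_label, predicted_label); correct
# decisions cost 0. For the supermarket a false reject is the expensive error,
# for the CIA a false accept is.
SUPERMARKET = {("+1", "-1"): 10, ("-1", "+1"): 1}
CIA         = {("+1", "-1"): 1,  ("-1", "+1"): 1000}

def weighted_error(costs, outcomes):
    """Average cost of a list of (true_label, predicted_label) pairs."""
    return sum(costs.get((t, p), 0) for t, p in outcomes) / len(outcomes)

# 90 correct decisions, 5 false rejects, 5 false accepts:
outcomes = [("+1", "+1")] * 90 + [("+1", "-1")] * 5 + [("-1", "+1")] * 5
print(weighted_error(SUPERMARKET, outcomes))  # dominated by the false rejects
print(weighted_error(CIA, outcomes))          # dominated by the false accepts
```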

In many practical problems, the target is not a deterministic function of the input: two identical x's can lead to different y's. The general supervised learning problem accommodates this by replacing the target function with a target distribution P(y | x). Assume the y's are picked according to the distribution P(y | x) over the entire input space X; a data point (x, y) is then generated by the joint distribution P(x) P(y | x). This situation can be readily modeled within the same framework that we have, with the same learning model: a realization of P(y | x) is effectively a target function. While both distributions model probabilistic aspects of x and y, their roles differ: P(y | x) is what we are trying to learn, while P(x) only quantifies the relative importance of the points x.

One can think of a noisy target as a deterministic target plus added noise. If y is real-valued, for example, take the deterministic part to be f(x) = E[y | x], so that we use the same h to approximate a noisy version of f given by y = f(x) + noise. This view suggests that a deterministic target function can be considered a special case of a noisy target, with the noise identically zero. Remember the two questions of learning? Our entire analysis of the feasibility of learning applies to noisy target functions as well: Eout may be as close to Ein in the noisy case as it is in the deterministic case. This does not mean that learning a noisy target is as easy as learning a deterministic one: in the extreme, the noisy target will look completely random, and no hypothesis will achieve a small Ein.
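The decomposition y = f(x) + noise, with f(x) = E[y | x], can be illustrated with a toy target. Everything here is an assumption for illustration: the linear f, the Gaussian noise, and its standard deviation are arbitrary choices, not from the text.

```python
import random

rng = random.Random(0)

def f(x):
    """Deterministic part of the target: f(x) = E[y | x] (a hypothetical linear target)."""
    return 2.0 * x + 1.0

def noisy_y(x):
    """A noisy target: y = f(x) + noise, with zero-mean Gaussian noise."""
    return f(x) + rng.gauss(0.0, 0.5)

# Averaging many y's at the same x recovers f(x), because the noise has zero
# mean; the deterministic target is the special case of zero noise variance.
x = 0.3
avg_y = sum(noisy_y(x) for _ in range(20000)) / 20000
print(avg_y, f(x))
```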


Problems. One problem guides you through a proof of PLA convergence; the following steps will guide you through the proof (use induction).

Another is a probability warm-up: one bag has 2 black balls and the other has a black and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball, it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black? (Use Bayes' Theorem.)

Another problem asks you to run PLA experimentally: report the number of updates that the algorithm takes before converging, plot a histogram of the number of updates over repeated experiments, comment on whether f is close to g, and compare your results across the different settings. In practice, PLA converges more quickly than the theoretical bound suggests.
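The two-bag problem can be settled by exact enumeration rather than by formula; this is a small sketch that conditions on "the first ball seen is black" and counts the cases where the other ball in the same bag is also black.

```python
from fractions import Fraction

# Enumerate all equally likely outcomes: pick a bag, then pick a ball from it.
bags = [("black", "black"), ("black", "white")]
outcomes = [(bag, i) for bag in bags for i in range(2)]  # 4 equally likely outcomes

# Condition on the event "the ball we looked at is black".
black_first = [(bag, i) for bag, i in outcomes if bag[i] == "black"]

# Among those, count the cases where the OTHER ball in the same bag is also black.
both_black = [(bag, i) for bag, i in black_first if bag[1 - i] == "black"]

p = Fraction(len(both_black), len(black_first))
print(p)  # 2/3, not 1/2: seeing a black ball favors the two-black bag
```

This matches the Bayes' Theorem calculation: the two-black bag accounts for two of the three equally likely ways to have drawn a black ball first.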

A further problem explores the algorithm with data sets of different sizes and dimensions: plot the training data set, being sure to mark the examples from different classes differently; to get g, run the algorithm, then generate a separate test data set and report the error on it. The algorithm in question is a variant of the so-called Adaline (Adaptive Linear Neuron) algorithm for perceptron learning, which in each iteration t updates the weights based on the signal w'xn for a training example (xn, yn).

Another problem examines the law of large numbers. One of the simplest forms of that law is the Chebyshev Inequality. For a given coin, the probability of obtaining k heads in N tosses is given by the binomial distribution; for a single coin, the tosses u1, ..., uN are iid random variables, and we may assume a number of coins that generate different samples independently. On the same plot, show the bound that would be obtained using the Hoeffding Inequality.
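The Chebyshev-versus-Hoeffding comparison asked for above is easy to tabulate for a fair coin, where the per-toss variance is 1/4. The tolerance value is an illustrative choice.

```python
import math

def chebyshev_bound(N, eps):
    """P[|nu - 0.5| >= eps] <= Var/(N eps^2); one fair coin flip has variance 1/4."""
    return 0.25 / (N * eps**2)

def hoeffding_bound(N, eps):
    """P[|nu - 0.5| > eps] <= 2 exp(-2 eps^2 N)."""
    return 2 * math.exp(-2 * eps**2 * N)

# Chebyshev decays only polynomially in N while Hoeffding decays exponentially,
# so Hoeffding wins for large N, though Chebyshev can be slightly tighter at
# moderate N (compare the bounds at N = 100 below).
for N in (10, 100, 1000):
    print(N, chebyshev_bound(N, 0.1), hoeffding_bound(N, 0.1))
```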

Other problems: Evaluate Us as a function of s, focusing on the simple case of flipping a fair coin. For a fixed data set V of size N, argue that the same conclusion holds for any two deterministic algorithms A1 and A2; you have now proved that, in a noiseless setting, no algorithm outperforms another when averaged over all possible targets (similar results can be proved for more general settings). Define an in-sample error that weights the different types of errors based on the risk matrix. Finally, suppose you have N data points y1, ..., yN and wish to estimate a 'representative' value: derive the estimators hmean and hmed, and examine what happens to your two estimators when an outlier is introduced.

For the two risk matrices in the fingerprint example, the weighted in-sample error can be minimized directly.

Chapter 2: Training versus Testing. Before the final exam, a professor may hand out practice problems and solutions. Although these problems are not the exact ones that will appear on the exam, studying them will help you do better; they are the 'training set' in your learning. If the exam problems were known ahead of time, your performance on them would not accurately gauge how well you have learned: such performance has the benefit of looking at the solutions and adjusting accordingly, and it expressly measures training performance. Doing well in the exam is not the goal in and of itself; the exam is merely a way to gauge how well you have learned the material. If the professor's goal were simply to help you do well in the exam, he could hand out the exam problems themselves, but then the exam would no longer measure learning. The goal is for you to learn the course material. We will also discuss the conceptual and practical implications of the contrast between training and testing.

We began the analysis of in-sample error in Chapter 1. The same distinction between training and testing happens in learning from data: the in-sample error Ein measures performance on the training examples, while Eout is based on the performance over the entire input space X. The mathematical results of this chapter provide fundamental insights into learning from data; to make it easier on the not-so-mathematically inclined, the heavier derivations are marked so that they can be skipped on a first reading. We will also make the contrast between a training set and a test set more precise.

Generalization is a key issue in learning (sometimes 'generalization error' is used as another name for Eout). We have already discussed how the value of Ein does not always generalize to a similar value of Eout. Pick a tolerance level delta. The Hoeffding Inequality with M hypotheses can be rephrased as follows: with probability at least 1 - delta,

Eout(g) <= Ein(g) + sqrt((1/(2N)) ln(2M/delta)).

Notice that the other side of |Eout - Ein| <= epsilon also holds, i.e., Eout(h) >= Ein(h) - epsilon for all h in H. Not only do we want to know that the hypothesis g that we choose (say, the one with the best training error) will continue to do well out of sample, i.e., that Eout(g) is close to Ein(g); the Eout(h) >= Ein(h) - epsilon direction of the bound assures us that we couldn't do much better, because every hypothesis with a higher Ein than the g we have chosen will have a comparably higher Eout.

We would like to replace M with something manageable, since 2M e^(-2 epsilon^2 N) is useless when H is an infinite set, as in the perceptron model for instance. In deriving the bound, we over-estimated the probability using the union bound; once we properly account for the overlaps of the different hypotheses, we can replace M with an effective number of hypotheses that is finite even when H is infinite.

The mathematical theory of generalization hinges on this observation. The union bound says that the total probability of the events B1, ..., BM is at most the sum of the individual probabilities, which is a gross over-estimate when the events overlap heavily; in a typical learning model, if h1 is very similar to h2 for instance, the two 'bad' events for h1 and h2 largely coincide. To account for this, we count behaviors rather than hypotheses.

Let x1, ..., xN be points in X. Each h in H generates a dichotomy on x1, ..., xN: the N-tuple (h(x1), ..., h(xN)). Such an N-tuple is called a dichotomy since it splits x1, ..., xN into two groups, those labeled +1 and those labeled -1. The dichotomies generated by H on these points are denoted H(x1, ..., xN); a larger H(x1, ..., xN) means H is more expressive on this particular sample. We will focus on binary target functions for the purpose of this analysis.

Definition: the growth function is defined for a hypothesis set H by mH(N) = the maximum over all choices of x1, ..., xN of |H(x1, ..., xN)|, where | · | denotes the cardinality (number of elements) of a set. For any H, mH(N) <= 2^N. If H is capable of generating all 2^N possible dichotomies on some x1, ..., xN, we say that H can 'shatter' those points; this signifies that H is as diverse as can be on this particular sample. The definition of the growth function is based on the number of different dichotomies that H can implement, and the steps that follow — defining the growth function, bounding it, and substituting it for M — will yield the generalization bound that we need.

Let us now illustrate how to compute mH(N) for some simple hypothesis sets; these examples will confirm the intuition that mH(N) grows faster when the hypothesis set H becomes more complex.

Two-dimensional perceptron: Figure 2.1 illustrates the growth function for a two-dimensional perceptron. The dichotomy of red versus blue on the 3 colinear points in part (a) cannot be generated by a perceptron, but 3 points in general position can be shattered. In the case of 4 points, one can verify that there are no 4 points that the perceptron can shatter: the most a perceptron can do on any 4 points is 14 dichotomies out of the possible 16.

Positive rays: each hypothesis labels the points above a threshold a as +1 and the rest as -1. As we vary a, the dichotomy we get on the points is decided by which region (between consecutive points) contains the value a, giving mH(N) = N + 1.

Positive intervals: each hypothesis is specified by the two end values of an interval, labeling +1 inside and -1 outside. The dichotomy we get is decided by which two regions contain the end values of the interval; if both end values fall in the same region, the dichotomy is all -1 regardless of the region. Adding up these possibilities, mH(N) = N(N+1)/2 + 1, which grows as the square of N, faster than the 'simpler' positive ray case.

Convex sets: if you place the points on a circle and connect the +1 points with a polygon, any dichotomy can be generated, so mH(N) = 2^N. Since mH is defined based on the maximum over samples, this is all that is needed.

Definition: if no data set of size k can be shattered by H, then k is said to be a break point for H. For the two-dimensional perceptron, k = 4 is a break point. It is not practical to try to compute mH(N) for every hypothesis set we use; instead, we now use the break point k to derive a bound on the growth function mH(N) for all values of N.
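The growth-function formulas for the one-dimensional cases can be verified by brute-force enumeration. This sketch places N distinct points on the line and counts the distinct dichotomies each hypothesis set actually generates.

```python
def dichotomies_positive_rays(N):
    """Hypotheses h(x) = +1 iff x is above a threshold a."""
    points = list(range(N))  # N distinct points on the line
    ds = set()
    for a in range(N + 1):   # one threshold per region between consecutive points
        ds.add(tuple(+1 if x >= a else -1 for x in points))
    return len(ds)

def dichotomies_positive_intervals(N):
    """Hypotheses: +1 inside an interval [l, r), -1 outside."""
    points = list(range(N))
    ds = set()
    for l in range(N + 1):
        for r in range(l, N + 1):
            ds.add(tuple(+1 if l <= x < r else -1 for x in points))
    return len(ds)

for N in range(1, 8):
    assert dichotomies_positive_rays(N) == N + 1
    assert dichotomies_positive_intervals(N) == N * (N + 1) // 2 + 1
print("growth functions match the formulas")
```

The enumeration confirms mH(N) = N + 1 for positive rays and mH(N) = N(N+1)/2 + 1 for positive intervals.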

Getting a good bound on mH(N) will prove much easier than computing mH(N) itself; since the bound uses only the break point, it will apply to any H. If you trust our math, you may skip the derivation that follows; a similar green box will tell you when to rejoin. The fact that the bound is polynomial is crucial: if mH(N) replaced M in the generalization bound, a polynomial mH(N) means that we will generalize well given a sufficient number of examples. Absent a break point (as is the case in the convex hypothesis example), mH(N) = 2^N and no such guarantee is possible. We will exploit this idea to get a significant bound on mH(N) in general.

To do this, define the purely combinatorial quantity B(N, k): the maximum number of dichotomies on N points such that no subset of k points can be shattered by these dichotomies. The notation B comes from 'Binomial', and the reason will become clear shortly. Since B(N, k) is a maximum over all possible situations with break point k, it bounds mH(N) for any H with that break point.

To evaluate B(N, k), start with the boundary cases. B(N, 1) = 1: a second, different dichotomy would have to differ on at least one point, and then that subset of size 1 would be shattered. For the recursion, assume N >= 2 and k >= 2, and consider the dichotomies on x1, ..., xN listed in a table whose columns x1, ..., xN are labels for the N points (we choose a convenient order in which to list the dichotomies). Collect in S1 the dichotomies whose pattern on the first N - 1 points appears exactly once, say a of them; the remaining dichotomies on the first N - 1 points appear twice, once ending in +1 and once in -1, and are collected in the set S2, which can be divided into two equal parts S2+ and S2-. Since the total number of rows in the table is B(N, k), and since no subset of k - 1 of the first N - 1 points can be shattered by the S2 patterns (if there existed such a subset, then together with xN a subset of size k would be shattered), counting rows yields

B(N, k) <= B(N - 1, k) + B(N - 1, k - 1).

Lemma (Sauer). The proof is by induction on N: assume the statement is true for all smaller N and all k, and use the recursion to obtain

B(N, k) <= C(N,0) + C(N,1) + ... + C(N,k-1).

It turns out that equality actually holds. The right-hand side is polynomial in N of degree k - 1. Theorem: for a hypothesis set with break point k, mH(N) <= B(N, k) <= C(N,0) + ... + C(N,k-1). Those who skipped the derivation are now rejoining us.
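The recursion and the binomial-sum bound can be checked computationally. A small sketch: compute B(N, k) from the recursion and its boundary cases, and compare with the binomial sum (which, as noted, is attained with equality).

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    """Max number of dichotomies on N points with no subset of k points shattered."""
    if k == 1:
        return 1   # a second distinct dichotomy would shatter some single point
    if N == 1:
        return 2   # one point cannot contain a shattered subset of size k >= 2
    return B(N - 1, k) + B(N - 1, k - 1)  # the recursion from the S1/S2 table argument

# Sauer-style bound: B(N, k) = sum_{i < k} C(N, i), polynomial in N of degree k - 1.
for N in range(1, 12):
    for k in range(1, 6):
        assert B(N, k) == sum(comb(N, i) for i in range(k))
print("B(N, k) equals the binomial sum")
```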

The smaller the break point, the smaller the bound on the growth function; this is also the best we can do using this line of reasoning. Because of its significant role, the smallest break point deserves a name. The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted dvc, is the largest value of N for which mH(N) = 2^N. It is easy to see that no break point smaller than dvc + 1 exists, since H can shatter dvc points, and dvc + 1 itself is a break point. The form of the polynomial bound can be further simplified to make the dependency on dvc more salient; a useful form is mH(N) <= N^dvc + 1.

If we were to directly replace M by mH(N) in the generalization bound, then for any finite value of dvc, Ein would be close to Eout given a sufficient number of examples; the smaller dvc is, the fewer examples we need. One implication of this discussion is that there is a division of models into two classes. The 'good models' have finite dvc, and with enough data they generalize; the 'bad models' have infinite dvc, and with a bad model no amount of data, by this analysis, yields a generalization guarantee.

One way to gain insight about dvc is to try to compute it for learning models that we are familiar with. Perceptrons are one case where we can compute dvc exactly: the VC dimension of the perceptron with d + 1 parameters w0, ..., wd is d + 1. This is done in two steps, and there is a logical difference between the two. To show dvc >= d + 1, represent each point as a vector of length d + 1 (including the constant coordinate), construct a nonsingular (d+1) x (d+1) matrix whose rows represent d + 1 points, and solve for weights that realize any desired dichotomy: any such set of d + 1 points can be shattered. To show dvc < d + 2, note that any d + 2 vectors of length d + 1 are linearly dependent, which means that some vector is a linear combination of all the other vectors; conclude that there is some dichotomy that cannot be implemented, so no set of d + 2 points can be shattered.

Note the logical difference: establishing dvc >= k requires only that there is some set of k points that can be shattered by H, while establishing dvc < k requires that no set of k points can be shattered by H. Based only on partial information about H, we may only be able to bound dvc.

The perceptron case provides a nice intuition about the VC dimension: it equals the number of parameters. One can view the VC dimension as measuring the 'effective' number of parameters, or 'degrees of freedom', that enable the model to express a diverse set of hypotheses; the more parameters a model has, the more diverse its hypothesis set tends to be. In other models, the effective parameters may differ from the raw parameter count. Diversity is not necessarily a good thing in the context of generalization.
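The lower-bound step for the perceptron can be made concrete. A sketch under a specific choice of points (one of many that work): take x0 = (1, 0, ..., 0) and xi = (1, ei) for i = 1..d, whose matrix of rows is nonsingular, and solve for weights realizing each dichotomy by substitution.

```python
from itertools import product

def perceptron_weights_for(d, y):
    """Weights w (length d+1, including bias w0) with sign(w . x_n) = y_n on the
    d+1 points x_0 = (1, 0, ..., 0) and x_i = (1, e_i), i = 1..d."""
    w = [0.0] * (d + 1)
    w[0] = y[0]            # row x_0 gives w . x_0 = w0 = y_0
    for i in range(1, d + 1):
        w[i] = y[i] - y[0]  # row x_i gives w0 + w_i = y_i
    return w

def sign(v):
    return 1 if v > 0 else -1

d = 3
points = [[1.0] + [0.0] * d]
for i in range(1, d + 1):
    points.append([1.0] + [1.0 if j == i - 1 else 0.0 for j in range(d)])

# Every one of the 2^(d+1) dichotomies on these d+1 points is realized,
# so dvc >= d + 1 for the perceptron (it is in fact exactly d + 1).
all_realized = all(
    all(sign(sum(wi * xi for wi, xi in zip(perceptron_weights_for(d, list(y)), x))) == yn
        for x, yn in zip(points, y))
    for y in product([-1, 1], repeat=d + 1)
)
print(all_realized)
```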

The VC generalization bound is the most important mathematical result in the theory of learning: it establishes the feasibility of learning with infinite hypothesis sets. The key is that the effective number of hypotheses, as captured by the growth function, replaces the actual number of hypotheses. Since the formal proof is somewhat lengthy and technical, we give only a sketch here; the quantities in the informal statement need to be technically modified to get the correct bound, but the essence of the argument survives.

Sketch of the proof. Consider the space of all possible data sets. The data set D is the source of randomization in the original Hoeffding Inequality, and its probability is determined by which xn's in X happen to be in that particular D. Let us think of this space as a 'canvas', with each D a point on that canvas, and think of probabilities of different events as areas on the canvas. For a particular hypothesis h, what the basic Hoeffding Inequality tells us is that the colored area of the canvas — the 'bad' data sets on which Ein(h) deviates from Eout(h) — is small.

The argument goes as follows. If we keep throwing in a new colored area for each h in H, then even if each h contributes very little, the total colored area can keep growing; in the worst case that the union bound considers, none of the areas overlap, and the area covered by all the points we colored is the sum of the individual areas. This was the problem with using the union bound in the Hoeffding-based argument of Chapter 1. In reality the areas overlap heavily, and the bulk of the VC proof deals with how to account for these overlaps.

Here is the idea, as illustrated in the figure for the proof of the VC bound. Suppose you were told that the hypotheses in H are such that each point on the canvas that is colored will be colored many times over because of different h's; then the effective colored area is correspondingly smaller than the union bound suggests. Any statement based on D alone will be simultaneously true or simultaneously false for all the hypotheses that look the same on that particular D, so hypotheses can be grouped by their behavior on the sample. What the growth function enables us to do is to account for this kind of hypothesis redundancy in a precise way, by counting dichotomies rather than hypotheses. This is the essence of the VC bound.

The reason mH(2N) appears in the VC bound instead of mH(N) is that the proof uses a sample of 2N points instead of N points; it accounts for the total size of the two samples D and D'. Why do we need 2N points? Eout is defined over the entire input space, not a finite sample, and conditioning on behavior over all of X breaks the main premise of grouping h's based on their behavior on a finite set of points; the proof therefore compares Ein on D with the error on a second sample D' of size N. When you put all this together, if the number of dichotomies is only polynomial, the bound goes to zero as N grows, even when H is infinite. This is the essence of the proof of the VC theorem; given the generality of the result, it can be extended to other types of target functions as well.

The slack in the bound can be attributed to a number of technical factors. Among them: the basic Hoeffding Inequality used in the proof already has a slack, since the inequality gives the same bound whether Eout is close to 0 or close to 0.5; using mH(N) to quantify the number of dichotomies on N points treats every sample x1, ..., xN as if it attained the maximum |H(x1, ..., xN)|; and bounding mH(N) by a simple polynomial of order dvc adds further slack. Why did we bother to go through the analysis then? Two reasons.


The reality is that the VC line of analysis leads to a very loose bound; this is an observation from practical experience. Some effort could be put into tightening the VC bound, but if the bound is used literally to estimate the number of examples needed, the estimate will be ridiculous. With this understanding, the bound is best used as a guide rather than a literal prescription.

We can use the VC bound to estimate the sample complexity for a given learning model. How big a data set do we need? The performance is specified by two parameters, epsilon and delta: the error tolerance epsilon determines the allowed generalization error, and delta the allowed probability of failure. How fast N grows as epsilon and delta become smaller indicates how much data is needed to get good generalization. (The term 'complexity' comes from a similar metaphor in computational complexity.) If we replace mH(2N) in the bound by its polynomial bound based on dvc, we get an implicit bound for the sample complexity N: N appears on both sides of the inequality, but we can obtain a numerical value for N using simple iterative methods.
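The iterative solution can be sketched as a fixed-point iteration. The specific implicit form used below, N >= (8/epsilon^2) ln(4((2N)^dvc + 1)/delta), is one standard rearrangement of the VC bound (an assumption here, since the equation numbers did not survive extraction), and the parameter values are illustrative.

```python
from math import log

def sample_complexity(dvc, eps, delta, n0=1000.0, iters=100):
    """Approximate N satisfying N >= (8/eps^2) ln(4((2N)^dvc + 1)/delta),
    found by fixed-point iteration on the implicit VC sample-complexity bound."""
    n = n0
    for _ in range(iters):
        n = (8 / eps**2) * log(4 * ((2 * n) ** dvc + 1) / delta)
    return n

# With dvc = 3, eps = 0.1, delta = 0.1 the iteration settles near 30,000 examples.
# The bound's constant is very loose; in practice far fewer examples suffice.
n = sample_complexity(dvc=3, eps=0.1, delta=0.1)
print(round(n))
```

The iteration converges quickly because N appears only inside a logarithm on the right-hand side.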

Let us look more closely at the two parts that make up the bound on Eout. The first part is Ein; the second part is a penalty for model complexity, which we may denote Omega(N, H, delta). One way to think of Omega(N, H, delta) is as an error bar: with probability at least 1 - delta, Eout <= Ein + Omega(N, H, delta). When we use a more complex learning model, Ein goes down but the penalty goes up; the optimal model is a compromise that minimizes a combination of the two terms. If someone manages to fit a simpler model with the same training error, the resulting bound on Eout is better. In most practical situations, the bound itself is too loose to use literally: we could ask what error bar we can offer with a given confidence, and Eout may still be close to 1 even when the bound holds.

An alternative approach that we alluded to in the beginning of this chapter is to estimate Eout by using a test set: a data set that was not involved in the training process. The final hypothesis g is evaluated on the test set, and the result, which we call Etest, serves as our estimate of Eout. Etest is just a sample estimate like Ein, but when we report Etest as our estimate of Eout, the estimate can be trusted; this is useful, for instance, if you are developing a system for a customer and need to quote its performance.

To properly test the performance of the final hypothesis, the test set must not affect the outcome of our learning process: the final hypothesis would not change if we used a different test set, as it would if we used a different training set. There is only one hypothesis as far as the test set is concerned, so the simple Hoeffding Inequality applies to Etest(g), with no union bound and no growth function. Had the choice of g been affected by the test set in any shape or form, this would no longer be true. The resulting bound is much tighter than the VC bound: the bigger the test set you use, the more accurate Etest will be as an estimate of Eout, and notice that only the size N of the test set affects the bound, not the complexity of the model.

Another aspect that distinguishes the test set from the training set is that the test set is not biased. Etest just has straight finite-sample variance, whereas Ein has an optimistic bias: the hypothesis was chosen, in part, because it makes Ein small. There is a price to be paid for having a test set, however. The test set does not affect the outcome of our learning process; it just tells us how well we did, and every example reserved for testing is an example we cannot use for training. We thus have access to two estimates of Eout: Ein(g), which is biased but leaves all the data for training, and Etest(g), which is unbiased but costs training examples.
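The test-set error bar can be made explicit. A sketch: since g is a single, fixed hypothesis with respect to the test set, the plain Hoeffding Inequality gives, with probability at least 1 - delta, |Eout(g) - Etest(g)| <= sqrt(ln(2/delta) / (2 N_test)). The delta and test-set sizes below are illustrative.

```python
from math import sqrt, log

def error_bar(n_test, delta):
    """Hoeffding error bar on Etest for a single fixed hypothesis:
    with probability >= 1 - delta, |Eout - Etest| <= sqrt(ln(2/delta) / (2 n_test))."""
    return sqrt(log(2 / delta) / (2 * n_test))

# The bar shrinks like 1/sqrt(n_test), with no dependence on the hypothesis set:
for n in (100, 1000, 10000):
    print(n, round(error_bar(n, delta=0.05), 4))
```

At delta = 0.05, a test set of 1,000 examples pins Eout down to within about +/- 0.043 of Etest, regardless of how complex the model that produced g was.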

If not. How does the bin model relate to the learning problem? P can be unknown to us as well. Take any single hypothesis h E 'H and compare it to f on each point x E X.

If the sample was not randomly selected but picked in a particular way. If v happens to be close to zero. If we have only one hypothesis to begin with. The learning problem is now reduced to a bin problem. The two situations can be connected. With this equivalence. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning. If the inputs xi.

In the same way. The error rate within the sample. The probability is based on the distribution P over X which is used to sample the data points x. Probability added to the basic learning setup To do that. We have made explicit the dependency of Ein on the particular h that we are considering.

The out-of-sample error Eout. The in-sample error Ein. If you are allowed to change h after you generate the data set. Each bin still represents the input space X. Why is that? Let us consider an entire hypothesis set H instead of just one hypothesis h. The probability of red marbles in the mth bin is Eout hm and the fraction of red marbles in the mth sample is Ein hm. With multiple hypotheses in H.

Let v1. Vrand and a nd plot the histograms of the distributions of v1. Crand is a coin you choose at random. The hypothesis g is not fixed ahead o f time before generating the data.

Since g has to be one of the hm 's regardless of the algorithm and the sample. Let's focus on 3 coins as follows: Vrand a n d Vmin be the fraction of heads you obtai n for the respective three coi ns. Flip each coi n independently times. Cmin is the coi n that had the m i n i m u m frequency of heads pick the earlier one in case of a tie. R u n a computer sim u lation for flipping 1. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm.

There is a simple but crude way of doing that. B2 means that event B1 implies event B2. We will improve on that in Chapter 2. The question of whether V tells us anything outside of V that we didn't know before has two different answers. We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible: BM are any events.

One argument says that we cannot learn anything outside of V. If we accept a probabilistic answer. If we insist on a deterministic answer. We now apply two basic rules in probability. Let us reconcile the two arguments. Putting the two rules together. That's what makes the Hoeffding Inequality applicable. We don't insist on using any particular probability distribution. Let us pin down what we mean by the feasibility of learning.

We still have to make Ein g Rj 0 in order to conclude that Eout g Rj 0. We cannot guarantee that we will find a hypothesis that achieves Ein g Rj 0. Of course this ideal situation may not always happen in practice.

S smart a n d crazy. What enabled this is the Hoeffding Inequality 1. Assume i n t h e probabilistic view that there i s a probability distribution on X. What we get instead is Eout g Rj Ein g. We consider two learning a lgorithms. Is it possible that the hypothesis that produces turns out to be better than the hypothesis that S produces? If learning is successful. We have thus traded the condition Eout g Rj 0.

Remember that Eout g is an unknown quantity. By adopting the probabilistic view. She is wil ling to pay you to solve her problem a n d produce for her a g which a pproximates f. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error.

Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play.

All we hope for is a forecast that gets it right more often than not; a hypothesis that has Ein(g) somewhat below 0.5 will give a forecast that is somewhat better than random. If you do return a hypothesis g, the feasibility of learning is thus split into two questions: (1) can we make sure that Eout(g) is close enough to Ein(g)? (2) can we make Ein(g) small enough? If the number of hypotheses M goes up, the first question becomes harder to answer affirmatively. The second question is answered after we run the learning algorithm on the actual data and see how small we can get Ein to be.

The Hoeffding Inequality (1.6) addresses the first question; if we also make Ein(g) small enough, we are done. Even when we cannot learn a particular f, the exercise asks: what is the best that you can promise the client among the given alternatives?

In many situations, a close look at Inequality (1.6) reveals a practical observation: in the extreme case, a complex target leaves more of its behavior unconstrained by the data. This means that we will get a worse value for Ein(g) when f is complex. The second notion is about the nature of the target function: the complexity of f. What are the ramifications of having a 'noisy' target on the learning problem?

The first notion is what approximation means when we say that our hypothesis approximates the target function well. If we want an affirmative answer to the first question. Let us examine if this can be inferred from the two questions above. We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower Ein g. If we fix the hypothesis set and the number of training examples.

Either way we look at it, if the target function is complex, learning is harder. The final hypothesis g is only an approximation of f. If we define a pointwise error measure e(h(x), f(x)), one may view the overall error E(h, f) as the 'cost' of using h when you should use f.

An error measure quantifies how well each hypothesis h in the model approximates the target function f. This cost depends on what h is used for. What are the criteria for choosing one error measure over another? We address this question here. The choice of an error measure affects the outcome of the learning process, and here is a case in point.

In an ideal world, the error measure e(h(x), f(x)) should be user-specified. The same learning task in different contexts may warrant the use of different error measures, and different error measures may lead to different choices of the final hypothesis. Consider the problem of verifying that a fingerprint belongs to a particular person.

In the supermarket scenario, a false reject annoys a customer, and all future revenue from this annoyed customer is lost, while a false accept merely gives away a discount to someone who didn't deserve it. In the CIA scenario, a false accept means an unauthorized person will gain access to a highly sensitive facility, while for a false reject the inconvenience of retrying is just part of the job. The right cost values depend on the application.

The costs of the different types of errors can be tabulated in a matrix.
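Such a risk matrix folds directly into a weighted in-sample error. A minimal sketch, with hypothetical cost values in the spirit of the supermarket and CIA examples (the sample labels and function name are mine):

```python
def weighted_error(h_outputs, y_true, cost_false_accept, cost_false_reject):
    """In-sample error that weights the two error types by a risk matrix.
    h_outputs, y_true: sequences of +1 (authorized/accept) or -1 labels."""
    total = 0.0
    for h, y in zip(h_outputs, y_true):
        if h == +1 and y == -1:      # false accept: intruder accepted
            total += cost_false_accept
        elif h == -1 and y == +1:    # false reject: right person rejected
            total += cost_false_reject
    return total / len(y_true)

# Hypothetical predictions on a small sample: one false reject, one false accept
y = [+1, +1, -1, -1, +1]
h = [+1, -1, +1, -1, +1]
supermarket = weighted_error(h, y, cost_false_accept=1, cost_false_reject=10)
cia = weighted_error(h, y, cost_false_accept=1000, cost_false_reject=1)
```

The same predictions score very differently under the two risk matrices, which is the moral of the fingerprint example: the error measure, and hence the chosen hypothesis, depends on the client.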


Consider two potential clients of this fingerprint system. One is a supermarket that will use it to verify customers who get discounts; the other is the CIA, who will use it at the entrance to a secure facility to verify that you are authorized to enter that facility. We need to specify the error values for a false accept and for a false reject. For the CIA, an intruder gaining access is catastrophic, and this should be reflected in a much higher cost for the false accept. The moral of this example is that the choice of the error measure depends on how the system is going to be used.

If the right person is accepted or an intruder is rejected, there is no cost. In practice there are two caveats: one is that the user may not provide an error specification; the other is that the weighted cost may be a difficult objective function for optimizers to work with. We have already seen an example of this with the simple binary error used in this chapter.

In the general supervised learning problem, the output for a data point x may be generated by a conditional distribution P(y | x) rather than a deterministic rule; a particular realization of P(y | x) is effectively a target function. This view suggests that a deterministic target function can be considered a special case of a noisy target. In the extreme, the noisy target may look completely random. Our entire analysis of the feasibility of learning applies to noisy target functions as well, including the case where we use h to approximate a noisy version of f given by y = f(x) + noise.

This situation can be readily modeled within the same framework that we have; if y is real-valued, for example, one can think of a noisy target as a deterministic target plus added noise. While both distributions model probabilistic aspects of x and y, Eout may be as close to Ein in the noisy case as in the deterministic case. This does not mean that learning a noisy target is as easy as learning a deterministic one.

Assume we randomly picked all the y's according to the distribution P(y | x) over the entire input space X. Remember the two questions of learning? With the same learning model, both questions apply to a noisy target, and we return to them in Chapter 2. (The following problems explore the perceptron further; for simplicity some are stated in two dimensions, and the given steps will guide you through the proofs.)

Problem 1.1: One bag has 2 black balls and the other has a black and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball, it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black? Use Bayes' Theorem.
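A Monte Carlo check of this problem; the exact answer via Bayes' Theorem is 2/3, since seeing a black ball makes the two-black bag twice as likely as the mixed bag (function name mine):

```python
import random

def second_black_given_first_black(n_trials=100000, seed=2):
    """Estimate P[second ball black | first ball black] by simulation."""
    rng = random.Random(seed)
    first_black = 0
    both_black = 0
    for _ in range(n_trials):
        bag = rng.choice([["black", "black"], ["black", "white"]])
        rng.shuffle(bag)                 # random draw order within the bag
        if bag[0] == "black":            # condition on first ball black
            first_black += 1
            if bag[1] == "black":
                both_black += 1
    return both_black / first_black

p = second_black_given_first_black()
```

The estimate lands near 2/3, not the tempting 1/2: conditioning on the observed black ball shifts the posterior toward the all-black bag.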

This problem leads you to explore the perceptron algorithm further with data sets of different sizes and dimensions. Plot a histogram for the number of updates that the algorithm takes to converge, comment on whether f is close to g, and compare your results with (b). Use induction where needed.

Be sure to mark the examples from different classes differently. In practice, PLA converges more quickly than the bound suggests.
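The PLA experiments in these problems can be sketched as follows; the target weights, the margin filter (which keeps the run short by guaranteeing separability with room to spare), and the data generation are my own choices:

```python
import random

def run_pla(n=20, seed=0, margin=0.1):
    """Generate n linearly separable points in [-1, 1]^2 with a margin
    around a fixed target line, then run the perceptron learning
    algorithm; return the number of updates to convergence."""
    rng = random.Random(seed)
    w_target = [0.1, 1.0, -1.0]          # arbitrary target weights (bias first)
    sign = lambda s: 1 if s > 0 else -1
    dot = lambda w, x: sum(wi * xi for wi, xi in zip(w, x))
    xs = []
    while len(xs) < n:
        x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
        if abs(dot(w_target, x)) > margin:   # keep points off the boundary
            xs.append(x)
    ys = [sign(dot(w_target, x)) for x in xs]
    w, updates = [0.0, 0.0, 0.0], 0
    while True:
        mis = [i for i in range(n) if sign(dot(w, xs[i])) != ys[i]]
        if not mis:
            return updates
        i = rng.choice(mis)                  # update on a random misclassified point
        w = [wi + ys[i] * xi for wi, xi in zip(w, xs[i])]
        updates += 1

counts = [run_pla(seed=s) for s in range(5)]
```

Collecting `counts` over many seeds and plotting a histogram reproduces the experiment described in the problem; with the 0.1 margin, the classical convergence bound caps the updates at a few hundred.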


In the iterations of each experiment, report the number of updates that the algorithm takes before converging. How many updates does the algorithm take to converge? Compare your results with (b).

The algorithm in this problem is a variant of the so-called Adaline (Adaptive Linear Neuron) algorithm for perceptron learning; in each iteration t, the weights are updated using the current example. Plot the training data set, run the algorithm to get g, and report the error on the test set.

In each iteration, generate a test data set whose examples are iid random variables. Assume we have a number of coins that generate different samples independently. For a given coin, the probability of obtaining k heads in N tosses is given by the binomial distribution. Remember that the Hoeffding Inequality applies to a single coin; on the same plot, show the bound that would be obtained using the Hoeffding Inequality.

One of the simplest forms of the law of large numbers is the Chebyshev Inequality; we focus on the simple case of flipping a fair coin, and evaluate the bound as a function of its parameter. A related problem asks you to argue that, for a fixed D of size N, any two deterministic algorithms A1 and A2 have the same expected off-training-set performance; you will then have proved that in a noiseless setting, no algorithm is universally superior. A further problem defines an in-sample error that weights the different types of errors based on the risk matrix.
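For a fair coin, the exact binomial tail can be compared with the Hoeffding bound directly; a sketch with parameter values of my own choosing:

```python
from math import comb, exp

def binom_tail(N, eps, p=0.5):
    """Exact P[|nu - p| > eps] for nu = (#heads)/N, heads ~ Binomial(N, p)."""
    total = 0.0
    for k in range(N + 1):
        if abs(k / N - p) > eps:
            total += comb(N, k) * p**k * (1 - p)**(N - k)
    return total

def hoeffding_bound(N, eps):
    """Hoeffding's two-sided bound 2 * exp(-2 * eps^2 * N)."""
    return 2.0 * exp(-2.0 * eps**2 * N)

N, eps = 100, 0.1
exact = binom_tail(N, eps)       # about 0.035
bound = hoeffding_bound(N, eps)  # 2 * e^{-2}, about 0.27
```

Repeating this for a range of eps values produces the plot asked for in the problem: the bound always sits above the exact tail, with considerable slack.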

For the two risk matrices in Example 1.1, derive the corresponding weighted in-sample errors. In another problem, you have N data points y1, ..., yN and wish to estimate a 'representative' value; similar results can be proved for more general settings.

What happens to your two estimators hmean and hmed when an outlier is added? This concludes the problems of Chapter 1.

Chapter 2: Training versus Testing. We began the analysis of in-sample error in Chapter 1. The in-sample error Ein expressly measures training performance, while Eout is based on the performance over the entire input space X. We will also discuss the conceptual and practical implications of the contrast between training and testing.

Consider the analogy of a final exam. Before the exam, the professor gives practice problems; they are the 'training set' in your learning. Although these problems are not the exact ones that will appear on the exam, studying them helps. If the exam problems are known ahead of time, your performance on them would not reflect how well you learned: such performance has the benefit of looking at the solutions and adjusting accordingly. If the professor's goal is to help you do better in the exam, that goal is served; but doing well in the exam is not the goal in and of itself. The goal is for you to learn the course material, and the exam is merely a way to gauge how well you have learned it. The same distinction between training and testing happens in learning from data. We will make the contrast between a training set and a test set more precise. A word of warning: if H is an infinite set, the analysis so far does not apply directly, and this chapter shows how to fix that. Not only do we want to know that the hypothesis g that we choose (say, the one with the best training error) will continue to do well out of sample, i.e., generalize; this can be rephrased as follows.

We have already discussed how the value of Ein does not always generalize to a similar value of Eout. This is important for learning: we would like |Eout(h) − Ein(h)| ≤ ε for all h ∈ H, and we would like to replace M by something smaller. (Sometimes 'generalization error' is used as another name for Eout.) The Eout(g) ≥ Ein(g) − ε direction of the bound assures us that we couldn't do much better, because every hypothesis with a higher Ein than the g we have chosen will have a comparably higher Eout.

Notice that the other side of |Eout − Ein| ≤ ε also holds. To see that the Hoeffding Inequality implies a generalization bound, pick a tolerance level δ, set 2M e^(−2Nε²) = δ, and solve for ε: with probability at least 1 − δ, Eout(g) ≤ Ein(g) + sqrt((1/2N) ln(2M/δ)). Generalization is a key issue in learning, and this error bound makes the tradeoff explicit; the mathematical results provide fundamental insights into learning from data.
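The error bar sqrt((1/2N) ln(2M/δ)) can be computed directly; a minimal sketch (function name and parameter values mine):

```python
from math import log, sqrt

def error_bar(N, M, delta):
    """Generalization error bar for M hypotheses at confidence 1 - delta:
    with probability >= 1 - delta, |Eout - Ein| <= sqrt(ln(2M/delta)/(2N))."""
    return sqrt(log(2 * M / delta) / (2 * N))

eps_single = error_bar(N=1000, M=1, delta=0.05)     # about 0.043
eps_many = error_bar(N=1000, M=1000, delta=0.05)    # about 0.073
```

Going from one hypothesis to a thousand widens the bar only modestly because M sits inside a logarithm, but for M truly enormous (or infinite) the bound is useless, which motivates replacing M by the growth function.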

The bound used the union bound over all M hypotheses, but in a typical learning model many hypotheses are very similar. If h1 is very similar to h2 (for instance, two perceptrons with slightly different weights), the events "|Ein(h1) − Eout(h1)| > ε" and "|Ein(h2) − Eout(h2)| > ε" largely overlap.


The union bound says that the total area covered by the events is at most the sum of the individual areas. If the events B1, ..., BM strongly overlap, we over-estimated the probability by using the union bound. Once we properly account for the overlaps of the different hypotheses, we can replace M with a much smaller quantity; the mathematical theory of generalization hinges on this observation. To do this, we study how H behaves on a finite set of points: the dichotomies generated by H on x1, ..., xN are defined by H(x1, ..., xN) = {(h(x1), ..., h(xN)) : h ∈ H}.

Such an N-tuple is called a dichotomy since it splits x1, ..., xN into two groups. Each h ∈ H generates a dichotomy on x1, ..., xN, and a larger H(x1, ..., xN) means H is more diverse on this sample. If H is capable of generating all possible dichotomies on x1, ..., xN, we say that H shatters these points. We will focus on binary target functions for the purpose of this analysis; three steps will yield the generalization bound that we need.

The definition of the growth function is based on the number of different dichotomies that H can implement. The growth function is defined for a hypothesis set H by mH(N) = max over x1, ..., xN of |H(x1, ..., xN)|, where | · | denotes the cardinality (number of elements) of a set. For any H, mH(N) ≤ 2^N; equality on a sample signifies that H is as diverse as can be on that particular sample.

Example 2.1: let us find a formula for mH(N) in each of the following cases, where h ∈ H is applied to a finite sample x1, ..., xN. For the two-dimensional perceptron, one can verify that there are no 4 points that the perceptron can shatter; the most a perceptron can do on any 4 points is 14 dichotomies out of the possible 16.

Let us now illustrate how to compute mH(N) for some simple hypothesis sets. These examples will confirm the intuition that mH(N) grows faster when the hypothesis set H becomes more complex. In the case of 4 points (Figure 2.1), at most 14 out of the possible 16 dichotomies can be generated; for instance, the dichotomy of red versus blue on the 3 collinear points in part (a) cannot be generated by a perceptron.

Positive rays: to compute mH(N) in this case, note that the dichotomy we get on the points is decided by which of the N + 1 regions between the points contains the value a. Positive intervals: each hypothesis is specified by the two end values of an interval, and the dichotomy we get is decided by which two regions contain those end values.

If both end values fall in the same region, the dichotomy is all −1 regardless of which region it is. Convex sets: if you place the points on a circle and connect the +1 points with a polygon, any dichotomy can be realized, so the growth function attains the maximum; this counts since mH(N) is defined based on the maximum over samples.

Adding up these possibilities as we vary the end values gives mH(N) = C(N+1, 2) + 1 for positive intervals. Notice that this mH(N) grows as the square of N, compared with the linear growth N + 1 of the 'simpler' positive ray case. For convex sets, mH(N) = 2^N.
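The positive-ray and positive-interval growth functions can be verified by brute-force enumeration on points on a line (function names mine):

```python
def growth_positive_rays(N):
    """Count dichotomies of N points x_1 < ... < x_N by h(x) = sign(x - a)."""
    xs = list(range(N))
    cuts = [x - 0.5 for x in xs] + [xs[-1] + 0.5]  # one a per region
    return len({tuple(1 if x > a else -1 for x in xs) for a in cuts})

def growth_positive_intervals(N):
    """Count dichotomies by h(x) = +1 iff x lies inside the interval (l, r)."""
    xs = list(range(N))
    cuts = [x - 0.5 for x in xs] + [xs[-1] + 0.5]
    dichotomies = set()
    for l in cuts:
        for r in cuts:
            dichotomies.add(tuple(1 if l < x < r else -1 for x in xs))
    return len(dichotomies)
```

Enumeration confirms the formulas: N + 1 for rays, and N(N+1)/2 + 1 for intervals (all pairs of distinct end-value regions, plus the all −1 dichotomy).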

It is not practical to try to compute mH(N) for every hypothesis set we use; getting a good bound on mH(N) will prove much easier than computing mH(N) itself. Definition 2.3: if no data set of size k can be shattered by H, then k is called a break point for H. (An exercise asks you to verify the break points for the preceding examples.) We now use the break point k to derive a bound on the growth function mH(N) for all values of N.

If k is a break point, we can bound mH(N) by a polynomial in N; if mH(N) could replace M in Equation (2.1), this would mean that we generalize well given a sufficient number of examples. To prove the polynomial bound, we define B(N, k) as the maximum number of dichotomies on N points such that no subset of k points is shattered. Since B(N, k) is defined without reference to any particular H, the bound will apply to any H with break point k. The notation B comes from 'Binomial', and the reason will become clear shortly. (If you trust our math, you may skip the proof; a green box in the book tells you when to rejoin.)

The fact that the bound is polynomial is crucial; absent a break point (as is the case in the convex hypothesis example), mH(N) = 2^N and no such bound follows. To set up the recursion, assume N ≥ 2 and consider the B(N, k) dichotomies listed in a table whose columns x1, ..., xN are labels for the N points. We have chosen a convenient order for the rows. Some dichotomies on the first N − 1 points appear only once; we collect these in the set S1. The remaining dichotomies on the first N − 1 points appear twice, once extended by +1 and once by −1; we collect these in the set S2, which can be divided into two equal parts.

Call the two parts S2+ and S2−. Let S1 have α rows, and S2+ and S2− have β rows each; since the total number of rows in the table is B(N, k), we have α + 2β = B(N, k). Now consider the dichotomies restricted to x1, ..., x_{N−1}. (For the boundary case B(N, 1) = 1: a second, different dichotomy must differ on at least one point, and then that subset of size 1 would be shattered.)

Since no subset of k of these first N − 1 points can be shattered (if there existed such a subset, appending x_N would shatter a set of size k in the full table), we get α + β ≤ B(N−1, k), and a similar argument gives β ≤ B(N−1, k−1). Hence B(N, k) ≤ B(N−1, k) + B(N−1, k−1). Lemma: using this recursion and induction on N (assume the statement is true up to N − 1 for all k), we obtain Theorem 2.4: B(N, k) ≤ sum over i = 0, ..., k−1 of C(N, i). The RHS is polynomial in N of degree k − 1, so the smaller the break point, the better the bound; it turns out that B(N, k) actually equals this sum. This leads to the Vapnik-Chervonenkis dimension of a hypothesis set H. (Those who skipped the proof are now rejoining us.)
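The recursion and the binomial-sum bound can be checked against each other numerically; the recursion below uses the boundary cases B(N, 1) = 1 and B(1, k) = 2 for k > 1, for which the inequality is tight:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def B(N, k):
    """Bound on the number of dichotomies on N points such that no subset
    of k points is shattered, computed via the recursion
    B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    if k == 1:
        return 1          # a second dichotomy would shatter one point
    if N == 1:
        return 2          # both dichotomies on a single point are allowed
    return B(N - 1, k) + B(N - 1, k - 1)

def binom_sum(N, k):
    """The closed form: sum of C(N, i) for i = 0, ..., k-1."""
    return sum(comb(N, i) for i in range(k))
```

Running both over a grid of (N, k) confirms that the recursion reproduces the binomial sum exactly, matching the theorem's claim that the bound is tight.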

The Vapnik-Chervonenkis dimension dvc of a hypothesis set H is the largest N for which mH(N) = 2^N. If dvc is the VC dimension of H, then k = dvc + 1 is a break point, and it is easy to see that no smaller break point exists since H can shatter dvc points. The polynomial bound is the best we can do using this line of reasoning, and its form can be further simplified to make the dependency on dvc more salient; we state a useful form here.

One implication of this discussion is a division of models into two classes. The 'good models' have finite dvc: with enough data, Ein will be close to Eout. The 'bad models' have infinite dvc: with a bad model, no amount of data can provide this guarantee. If we manage to fit the data with a good model, we can claim something about generalization.

The smaller dvc is, the better the generalization. One way to gain insight about dvc is to try to compute it for learning models that we are familiar with; perceptrons are one case where we can compute dvc exactly. This is done in two steps, since there is a logical difference between arguing that dvc is at least a certain value and arguing that it is at most that value.

Because of its significant role, dvc deserves careful computation; for any finite value of dvc, Ein will be close to Eout given enough data, which is consistent with Figure 2.3. If we were to directly replace M by mH(N) in (2.1), the bound would already be meaningful. For the perceptron, represent each point as a vector of length d + 1 (including the bias coordinate 1); the lower bound requires exhibiting a set of points that can be shattered, and the upper bound requires that no set of N points can be shattered by H.

In the case of perceptrons, the exercise is to show that the VC dimension of the perceptron, which has d + 1 parameters, is exactly d + 1. For the lower bound, construct a nonsingular (d+1) × (d+1) matrix whose rows represent d + 1 points; since the matrix is invertible, any desired dichotomy can be realized by solving for the weights. For the upper bound, there is a set of N = d + 2 points that cannot be shattered, and in fact no such set can be. One can view the VC dimension as measuring the 'effective' number of parameters.
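The lower-bound construction can be verified exhaustively for small d. The code below uses the points (1, 0, ..., 0) and (1, e_i) whose matrix is nonsingular, and solves the resulting triangular system X w = y in closed form for every dichotomy (function name mine):

```python
from itertools import product

def perceptron_shatters(d):
    """Verify that d + 1 points x_1 = (1, 0, ..., 0), x_{i+1} = (1, e_i)
    in R^{d+1} (bias coordinate included) are shattered by perceptrons."""
    points = [[1.0] + [0.0] * d]
    for i in range(d):
        e = [0.0] * d
        e[i] = 1.0
        points.append([1.0] + e)
    sign = lambda s: 1 if s > 0 else -1
    for y in product([-1, 1], repeat=d + 1):
        # Solve X w = y: w_0 = y_1, and w_i = y_{i+1} - y_1 for i >= 1.
        w = [float(y[0])] + [float(y[i + 1] - y[0]) for i in range(d)]
        if any(sign(sum(wi * xi for wi, xi in zip(w, x))) != yi
               for x, yi in zip(points, y)):
            return False
    return True
```

All 2^(d+1) dichotomies are realized, so dvc ≥ d + 1; combined with the linear-dependence argument for d + 2 points, this pins down dvc = d + 1.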

The more parameters a model has, the larger dvc tendsds to be, though the correspondence is not exact; based only on the parameter count, one cannot always read off dvc. For the upper bound, we must show that no set of d + 2 points can be shattered by H.

Given d + 2 points in R^{d+1}, some vector is a linear combination of all the other vectors; from this, conclude that there is some dichotomy that cannot be implemented, so dvc ≤ d + 1. The VC dimension thus measures the effective parameters w0, ..., wd, or 'degrees of freedom', that enable the model to express a diverse set of hypotheses. In other models the effective parameters are harder to identify, and diversity is not necessarily a good thing in the context of generalization.

The perceptron case provides a nice intuition about the VC dimension. We now turn to the VC generalization bound on Eout(g), the most important mathematical result in the theory of learning. To sketch its proof, think of the space of all possible data sets as a 'canvas': each D is a point on that canvas, and probabilities of different events are areas on that canvas.

The data set D is the source of randomization in the original Hoeffding Inequality. Certain quantities in the naive statement need to be technically modified to make the bound a true statement.

There are two parts to the proof; since the formal proof is somewhat lengthy and technical, we give a sketch. The bound establishes the feasibility of learning with infinite hypothesis sets: the key is that the effective number of hypotheses, captured by the growth function, replaces M. Consider the space of all possible data sets as the canvas; the probability of a point is determined by which xn's in X happen to be in that particular D. (Comparing the modified items with the naive statement shows what the technicalities buy us; the correct bound involves mH(2N).)

The argument goes as follows. For a given hypothesis h ∈ H, color every point D on the canvas for which Ein(h) deviates from Eout(h) by more than ε. What the basic Hoeffding Inequality tells us is that the colored area for a single h is small. Even if each h contributed very little, throwing in all h ∈ H could color the whole canvas; but if each colored point ends up colored many times because of different h's, the total colored area stays small. This is the essence of the VC bound, and the bulk of the VC proof deals with how to account for these overlaps.

For a second hypothesis, let us paint its bad points with a different color. The area covered by all the points we colored will be at most the sum of the two individual areas, with equality only if they do not overlap. If we keep throwing in a new colored area for each h ∈ H and simply add up the areas, we get the union bound; this was the problem with using the union bound in the Hoeffding Inequality (1.6).

Non-overlapping areas are the worst case that the union bound considers (illustrated in the figure on the proof of the VC bound). The key observation: any statement based on D alone will be simultaneously true or simultaneously false for all the hypotheses that look the same on that particular D, so hypotheses can be grouped by their dichotomies. Given the generality of the result, this grouping is remarkably powerful.


This is the essence of the proof of Theorem 2.5. Why do we need 2N points? Grouping h's based on their behavior on D alone breaks down when we also need to speak about Eout, which depends on points outside D. The proof therefore introduces a second sample D', and the growth function accounts for the total size of the two samples D and D'; this is where the 2N comes from. If it happens that the number of dichotomies on 2N points is only a polynomial, the bound is meaningful.

The reason mH(2N) appears in the VC bound instead of mH(N) is that the proof uses a sample of 2N points instead of N points. When H is infinite, the growth function still provides a finite count of dichotomies, and when you put all this together, you get the VC bound. It can be extended to other types of target functions as well.

What the growth function enables us to do is to account for this kind of hypothesis redundancy in a precise way. With this understanding. Some effort could be put into tightening the VC bound.

The reality is that the VC line of analysis leads to a very loose bound; if we plug in actual numbers, the estimate of the required data can be ridiculous. This is an observation from practical experience. The slack can be attributed to a number of technical factors: using mH(N), the worst case over all samples x1, ..., xN, to quantify the number of dichotomies even though only one data set is given; bounding mH(N) by a simple polynomial of order dvc; and the fact that the inequality gives the same bound whether Eout is close to 0 or close to 0.5. Why did we bother to go through the analysis then? Two reasons, discussed next.

The basic Hoeffding Inequality used in the proof already has a slack. In real applications, a central question is: how big a data set do we need? Using (2.12), the desired performance is specified by two parameters, the error tolerance ε and the confidence parameter δ.

The error tolerance ε determines the allowed generalization error, and δ the allowed probability of failure. We can use the VC bound to estimate the sample complexity for a given learning model: from Equation (2.12), fixing ε and δ gives an implicit bound for the sample complexity N, namely N ≥ (8/ε²) ln(4((2N)^dvc + 1)/δ) once we replace mH(2N) by its polynomial bound. Since N appears on both sides, we can obtain a numerical value for N using simple iterative methods. How fast N grows as ε and δ become smaller indicates how much data is needed to get good generalization; the constant of proportionality the bound suggests is about 10,000 examples per unit of VC dimension.
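Since N appears on both sides, a fixed-point iteration recovers the numerical estimate; with dvc = 3, ε = 0.1 and δ = 0.05 it lands near 30,000 (function name mine):

```python
from math import log

def sample_complexity(dvc, eps, delta, n0=1000.0, iters=50):
    """Iterate N = (8/eps^2) * ln(4 * ((2N)^dvc + 1) / delta) to a fixed
    point, starting from the initial guess n0."""
    N = float(n0)
    for _ in range(iters):
        N = (8.0 / eps**2) * log(4.0 * ((2.0 * N)**dvc + 1.0) / delta)
    return N

N = sample_complexity(dvc=3, eps=0.1, delta=0.05)  # roughly 30,000
```

The iteration converges quickly because N enters the right-hand side only through a logarithm; the resulting figure illustrates how loose the bound is compared with the practical rule of thumb of about 10 examples per VC dimension.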

(The term 'sample complexity' comes from a similar metaphor in computational complexity.) If dvc were 4, a similar calculation would give N ≈ 40,000. In most practical situations far fewer examples suffice.

Let us look at the bound in the form Eout ≤ Ein + Ω(N, H, δ). One way to think of Ω(N, H, δ) is as an error bar: given N, H, and the confidence δ, we could ask what error bar we can offer with that confidence. The first part of the bound is Ein. If someone manages to fit a simpler model with the same training error, the bound on Eout is smaller; note, though, that even with a small error bar, Eout may still be close to 1 if Ein is large. (Here we use the polynomial bound based on dvc instead of mH(2N).)

Let us look more closely at the two parts that make up the bound on Eout: the in-sample error Ein and the model complexity term Ω. (We will also see that Etest is just a sample estimate, like Ein.)

When we use a more complex learning model, Ein goes down but Ω goes up; the optimal model is a compromise that minimizes a combination of the two terms. Although Ω(N, H, δ) can be useful as a guideline for the training process, an alternative approach that we alluded to at the beginning of this chapter is to estimate Eout by using a test set. The final hypothesis g is evaluated on the test set, and when we report Etest as our estimate of Eout, for example if you are developing a system for a customer, the estimate is trustworthy.

Let us call the error we get on the test set Etest. We wish to estimate Eout(g); both the training set and the test set are finite samples that are bound to have some variance due to sample size. How do we know, then, that Etest is a better estimate?

Another aspect that distinguishes the test set from the training set is that the test set is not biased: both are sets of examples generated by the target, but the test set was not used to choose g. The learning curves summarize the behavior of Ein and Eout as the data set size varies; the mathematical results provide fundamental insights into learning from data. Related topics include model selection and data contamination, and a cautionary example shows a hypothesis set with only one parameter a that nevertheless 'enjoys' an infinite VC dimension.

For instance, this happens when we have streaming data that the algorithm has to process 'on the run'.