What if I told you that you could achieve near 0% generalization error in practice using a model that has 20% generalization error? You’d say, “You’re out of your fucking mind,” right? I promise I’m not, and given the right kind of business problem, I’m about to show you how it’s not only achievable, but very, very easy.
In many business problems, we are concerned with the behavior of an entire class of entities, rather than individual entities. Example: We may be more concerned with the conversion rate for customers in Los Angeles vs. San Francisco than we are with the actual conversion likelihood for individual people living in either city. But using aggregated data necessarily means we have fewer data points to work with; I always try to get the most granular data possible, leaving it up to me to determine how far up I can aggregate without sacrificing reliability of output. Even when the prediction we actually want is for an aggregated class, it’s usually better to train the model on individual data and arrive at the class-level prediction later.
In the most straightforward case of this, we train on individual data, treated as class members during training. Then we predict for the class. For example, given 10 individuals each from Los Angeles and San Francisco, we might scrub individual features from the data and retain only city-level features, which simply take the same value for all individuals in each group. Then, when we go to predict, we put in the city-level predictors and get a group-level prediction out at the end.
Here, we’re going to consider a less straightforward case: we need a group-level prediction, but only have individual-level features to work with. Because these features may differ between individuals in the same group, we have to make individual-level predictions and develop a group-level prediction from them. For simplicity (and because it’s what was most suitable for my specific problem), we’ll limit ourselves to binary classification and say that a group is predicted as “yes” if and only if at least $k$ members of the group are predicted as “yes,” for some threshold $k$ of our choosing.
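The decision rule above can be sketched in a few lines. This is only an illustration, not the pipeline I actually used: the function name `group_predictions`, the `(group_id, is_yes)` input format, and the building IDs are all hypothetical, and `k` stands for the threshold number of individual “yes” predictions.

```python
from collections import defaultdict


def group_predictions(individual_preds, k):
    """Turn individual yes/no predictions into group-level predictions.

    individual_preds: iterable of (group_id, is_yes) pairs (hypothetical format).
    k: minimum number of individual "yes" predictions needed to call a group "yes".
    """
    yes_counts = defaultdict(int)
    for group_id, is_yes in individual_preds:
        # Adding 0 still registers the group, so all-"no" groups show up too.
        yes_counts[group_id] += int(bool(is_yes))
    return {group_id: count >= k for group_id, count in yes_counts.items()}


# Made-up example: two buildings, a handful of listings each.
preds = [
    ("bldg_a", True), ("bldg_a", True), ("bldg_a", False),
    ("bldg_b", False), ("bldg_b", True),
]
print(group_predictions(preds, k=2))
```

With `k=2`, `bldg_a` has two individual “yes” calls and is predicted “yes,” while `bldg_b` has only one and is predicted “no.”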
We are left with the problem of quantifying the uncertainty in our group-level prediction. The general problem here is a difficult one; propagation of uncertainty isn’t easy. But for the special case where we know that if one member of a class has a given characteristic, then they all do, it can be demonstrated that aggregating individual-level predictions can greatly improve the group-level error rate. Group-level error rates can be reduced to negligibly close to 0%, even in the presence of unacceptably high individual-level error rates. The city-wide conversion rate example we started with, unfortunately, is not part of this special case. But my own business problem, using real estate listings for individual condos to predict whether a condo building has an on-site gym, is, so this special case isn’t so special that we never encounter it in the wild.
Without loss of generality, let’s assume our group has $n$ members and a threshold of $k = 40$, and take our individual false positive rate (fpr) to be $p$, set at a level I’d generally consider unacceptably high in most business contexts. We’d classify the group as a yes if and only if at least 40 of the members are classified as yes; in our special case, this would be a false positive if and only if none of the members are truly yes cases. (Either they are all yes cases or they are all no cases.) That means a group-level false positive requires at least 40 individual false positive calls.
If the individual errors are independent, the number of false positive calls in a group of $n$ predictions follows a binomial distribution with probability of success equal to the individual false positive rate $p$. So the probability of at least 40 false positive calls, i.e. the probability of a single false positive call at the group level, is the complement of the binomial cumulative distribution function evaluated at 39: $P(X \ge 40) = 1 - F(39; n, p) = \sum_{i=40}^{n} \binom{n}{i} p^i (1-p)^{n-i}$. When the threshold sits well above the expected number of individual false positives, $np$, this probability is vanishingly small! Under this setup, we have the freedom to choose whatever threshold $k$ we like in place of 40 to target a specific group-level false positive rate, in the event we want to accept more false positives to mitigate false negatives. The same reasoning can be employed for any evaluation metric that represents the probability of some sort of error.
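To make the binomial argument concrete, here is a minimal sketch of the tail calculation using only the standard library. The threshold of 40 comes from the text; the group size `n = 100` and the individual false positive rate `p = 0.2` are illustrative assumptions (0.2 borrows the 20% figure from the opening), and the sketch assumes individual errors are independent.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the complement of the CDF at k - 1."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative values only: n and p are assumptions, k = 40 is from the text.
n, k, p = 100, 40, 0.2
group_fpr = binom_tail(n, k, p)
print(f"individual fpr: {p:.0%}, group-level fpr: {group_fpr:.2e}")
```

Under these assumed values the group-level false positive rate comes out many orders of magnitude below the individual rate, which is the whole point of the approach.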
These are the advantages I see to approaching the problem in this way:
- We can reach near zero error very easily.
- We don’t need to spend a whole lot of time chasing after the “best” model, because even very low-performing models can yield incredibly powerful results.
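- Through the threshold $k$ (the minimum number of individual “yes” predictions needed to call the group a “yes”), we have a computationally inexpensive and precise lever we can pull to manage the tradeoff between false positives and false negatives.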
- Going into your CEO’s office and saying “We were able to achieve an error rate of 1 in a million” is sure to win some brownie points.
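The threshold lever from the list above can be sketched as a small search: find the smallest threshold that keeps the group-level false positive rate under a target, then check what that threshold costs in group-level false negatives. This is a hedged illustration, not my production code. The names `smallest_threshold` and `group_fnr`, the group size of 100, the 20% false positive rate, the 80% true positive rate, and the target of 1 in a million are all assumptions, and individual errors are assumed independent.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def smallest_threshold(n, fpr, target):
    """Smallest k with group-level fpr P(X >= k) <= target, or None if no k works."""
    for k in range(n + 1):
        if binom_tail(n, k, fpr) <= target:
            return k
    return None


def group_fnr(n, k, tpr):
    """Group-level false negative rate when every member is truly a "yes".

    The group is missed when fewer than k members are predicted "yes",
    and each member is predicted "yes" with probability tpr.
    """
    return 1.0 - binom_tail(n, k, tpr)


# Illustrative values only (all assumed, not taken from the post).
n, fpr, tpr, target = 100, 0.2, 0.8, 1e-6
k = smallest_threshold(n, fpr, target)
print(f"threshold k = {k}")
print(f"group-level fpr = {binom_tail(n, k, fpr):.2e}")
print(f"group-level fnr = {group_fnr(n, k, tpr):.2e}")
```

Lowering the threshold trades a higher group-level false positive rate for a lower group-level false negative rate, and raising it does the reverse, so the search can just as easily be run against a false negative target.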