Welcome back to our series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideAI News feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a data science question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com. This week’s question is from a reader who asks for an explanation of data leakage.
Q: What is data leakage?
A: As a data scientist, you should always be aware of circumstances that may cause your machine learning algorithms to under-estimate their generalization error, as this may render them useless in the solution of real-world problems. One such potential problem is called data leakage – when the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict. It is undesirable on many levels: it is a source of poor generalization and of over-estimation of expected performance. Data leakage often occurs subtly and inadvertently and may result in overfitting. A leading text in the field cites data leakage as one of the top ten machine learning mistakes.[1]
Data leakage can manifest in many ways including:
- Leaking data from the test set into the training set (a minimal sketch of this case appears just after this list).
- Leaking the correct prediction or ground truth into the test data.
- Leaking of information from the future into the past.
- Reversing the obfuscation, randomization or anonymization that was intentionally applied to the data.
- Using information from data samples outside the scope of the algorithm’s intended use.
- Any of the above existing in external data coupled with the training set.
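To make the first item on that list concrete, here is a minimal sketch – illustrative only, assuming scikit-learn and synthetic data, with none of the names drawn from a real project – of one of the most common ways test data leaks into training: fitting a preprocessing step on the full dataset before splitting.

```python
# Sketch of test-to-train leakage via preprocessing (assumes scikit-learn).
# Fitting the scaler on ALL rows lets test-set statistics influence the
# training features; fitting it on the training rows alone does not.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: the mean/std are computed from training AND test rows.
leaky_scaler = StandardScaler().fit(X)            # sees the test set
X_train_leaky = leaky_scaler.transform(X_train)

# Correct: statistics come from the training rows only.
clean_scaler = StandardScaler().fit(X_train)      # test set untouched
X_train_clean = clean_scaler.transform(X_train)
```

For plain standardization the numerical effect is small, but the same pattern applied to target encoding, imputation or feature selection can inflate evaluation scores dramatically; keeping every preprocessing step inside the cross-validation folds (for example via a scikit-learn Pipeline) is the usual way to rule this pattern out.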
In general, data leakage comes from two sources in a machine learning algorithm – the feature variables and the training set. A trivial example of data leakage would be a model that uses the response variable itself as a predictor, thus concluding for example that “it is sunny on sunny days.”
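As a hedged, self-contained illustration of that trivial case – synthetic data and scikit-learn, with nothing below taken from a real system – appending the label itself as a column drives a cross-validated score on pure noise from chance level to near perfection:

```python
# "It is sunny on sunny days": the response variable sneaks in as a feature,
# so cross-validation reports a near-perfect score on labels that are noise.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.integers(0, 2, size=500)          # labels are pure noise

X_leaky = np.column_stack([X, y])         # the response leaks in as a column

print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())        # ~0.5
print(cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean())  # ~1.0
```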
As a more concrete example, consider the use of a “customer service rep name” feature variable in a SaaS company churn prediction algorithm. Using the name of the rep who interviewed a customer when they churned might seem innocent enough until you find out that a specific rep was assigned to take over customer accounts where customers had already indicated they intended to churn. In this case, the resulting algorithm would be highly predictive of whether the customer had churned but would be useless for making predictions on new customers. This is an extreme example – many more instances of data leakage occur in subtle and hard-to-detect ways. There are war stories of algorithms with data leakage running in production systems for years before the bugs in the data creation or training scripts were detected.
Identifying data leakage beforehand and correcting for it is an important part of improving the definition of a machine learning problem. Many forms of leakage are subtle and are best detected by trying to extract features and train state-of-the-art algorithms on the problem. Here are several strategies to find and eliminate data leakage:
- Exploratory data analysis (EDA) can be a powerful tool for identifying data leakage. EDA allows you to become more familiar with the raw data by examining it through statistical and visualization tools, and this kind of examination can reveal data leakage as patterns in the data that are surprisingly strong. A simple per-feature scan along these lines is sketched after this list.
- If the performance of your algorithm is too good to be true, data leakage may be the reason. Weigh your results against prior or competing documented results for the problem at hand; a substantial divergence from that expected level of performance merits testing the algorithm more closely to establish legitimacy.
- Perform early in-the-field testing of algorithms. Any significant data leakage will show up as a gap between estimated and realized out-of-sample performance. This is perhaps the best approach to identifying data leakage, but it is also the most expensive to implement, and it can be challenging to isolate data leakage as the cause of such a performance discrepancy, since the cause could also be classical over-fitting, sampling bias, etc. A rough offline proxy for this estimated-versus-realized comparison is also sketched below.
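As a sketch of the EDA-style check from the first bullet above – assuming pandas and scikit-learn, a DataFrame with a binary target and no missing values, and a threshold that is purely illustrative – one simple scan ranks each numeric feature by how well it alone orders the target. Any feature with a near-perfect single-feature AUC deserves suspicion:

```python
# Rank numeric features by single-feature AUC against a binary target:
# a crude but useful leak detector for EDA.
import pandas as pd
from sklearn.metrics import roc_auc_score

def scan_for_leaks(df: pd.DataFrame, target: str, threshold: float = 0.95) -> pd.Series:
    """Return per-feature AUC scores, most suspicious first."""
    y = df[target]
    scores = {}
    for col in df.drop(columns=[target]).select_dtypes("number").columns:
        auc = roc_auc_score(y, df[col])
        scores[col] = max(auc, 1 - auc)   # direction of the ranking does not matter
    scores = pd.Series(scores).sort_values(ascending=False)
    suspects = scores[scores > threshold]
    if not suspects.empty:
        print("Possible leaks:", list(suspects.index))
    return scores
```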
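And as a rough offline proxy for the estimated-versus-realized comparison in the last bullet – again assuming scikit-learn and a dataset already sorted by time; the helper and its parameters are hypothetical – one can contrast a shuffled cross-validation estimate with a strictly chronological holdout. A large gap is a hint that information from the future is leaking into the past:

```python
# Compare a shuffled cross-validation estimate with a train-on-the-past,
# score-on-the-future holdout (assumes X, y are ordered by time).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

def leakage_gap(X, y, holdout_frac=0.2, random_state=0):
    model = RandomForestClassifier(random_state=random_state)
    cutoff = int(len(X) * (1 - holdout_frac))

    # "Estimated" performance: shuffled folds let later rows help predict earlier ones.
    cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
    estimated = cross_val_score(model, X[:cutoff], y[:cutoff], cv=cv).mean()

    # "Realized" performance: train strictly on the past, score on the future.
    model.fit(X[:cutoff], y[:cutoff])
    realized = model.score(X[cutoff:], y[cutoff:])

    return estimated, realized, estimated - realized
```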
Once data leakage has been identified, the next step is to figure out how to fix it (or even whether you want to try). For some problems, living with data leakage without attempting to fix it may be acceptable. But if you decide to fix the leakage, care must be taken not to make matters worse. Usually, when there is one leaking feature variable, there are others. Removing the obvious leaks that are detected may exacerbate the effect of undetected ones, and modifying features in an attempt to plug obvious leaks can create new ones. The idea is to work out the legitimacy of specific observations and/or feature variables, plug the leak, and hopefully seal it completely. Rectifying data leakage is an active field of research that will likely yield effective results in the near future.
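A hedged sketch of that plug-and-re-measure step – assuming pandas and scikit-learn, all-numeric features, and a `suspect_cols` list produced by an earlier scan; the helper name is made up – is to compare cross-validated performance with and without the suspect columns. If most of the lift disappears along with them, that lift was probably coming from the leak rather than from legitimate signal:

```python
# Measure how much of the model's score depends on the suspected leaky columns.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def lift_from_suspects(df: pd.DataFrame, target: str, suspect_cols: list) -> float:
    y = df[target]
    X_full = df.drop(columns=[target])
    X_plugged = X_full.drop(columns=suspect_cols)

    model = GradientBoostingClassifier(random_state=0)
    with_suspects = cross_val_score(model, X_full, y, cv=5).mean()
    without_suspects = cross_val_score(model, X_plugged, y, cv=5).mean()
    return with_suspects - without_suspects
```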
[1] Nisbet, R., Elder, J. and Miner, G. 2009. Handbook of Statistical Analysis and Data Mining Applications. Academic Press.
If you have a question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com.
Data Scientist: Daniel D. Gutierrez – Managing Editor, insideAI News
[READER COMMENT] I have a spatial model that uses the response variable itself as a predictor, but it would be useful for making predictions on a new, unseen dataset. So, is there data leakage in my (spatial) model?
[READER COMMENT] I was reading your 2014 article “Ask a Data Scientist: Data Leakage” for insideBIGDATA. I know it’s quite an old article, but I was wondering if you were able to give me more pointers about this sentence: “There are war stories of algorithms with data leakage running in production systems for years before the bugs in the data creation or training scripts were detected.”
I’m Denis, a PhD student at the ENS in Paris. Lately, I’ve been focused on formally proving the absence/presence of data leakage in machine learning code – specifically in the data preparation phase. I would love to receive some suggestions/pointers to code snippets where data leakage happened in production! I don’t have a working tool yet; I’m still in the exploration phase, trying to understand what ML scientists need when it comes to data leakage.