Compute F1 score to assess data labeling quality
One simple way to monitor the “noise” in your ML system as it periodically ingests new data is to apply the F1 score to data labeling. The F1 score is the harmonic mean of recall (capturing all the relevant data) and precision (including only the data that is really relevant), so it rewards a balance between the two.
When we standardized the data labeling process in one of our ML systems, using true and false positives and true and false negatives and keeping everything else the same, we saw RMS fall by about 365 points. The challenge lies in rigorously identifying the true and false subsets, and that requires a deeper understanding of the system being modeled.
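As a rough illustration of how the true and false positives and negatives turn into an F1 score, here is a minimal Python sketch. It assumes you have a small, carefully reviewed "gold" set of labels to compare the new labels against. The function names, the binary 0/1 labels, and the sample data are all hypothetical, not taken from our system.

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[int], labeled: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) for new labels against reviewed gold labels."""
    if len(gold) != len(labeled):
        raise ValueError("gold and labeled must have the same length")
    tp = fp = fn = tn = 0
    for g, l in zip(gold, labeled):
        if l == positive and g == positive:
            tp += 1          # labeled positive, really positive
        elif l == positive:
            fp += 1          # labeled positive, really negative
        elif g == positive:
            fn += 1          # labeled negative, really positive
        else:
            tn += 1          # labeled negative, really negative
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample: six items, reviewed labels vs. newly assigned labels.
gold = [1, 1, 1, 0, 0, 1]
labeled = [1, 0, 1, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(gold, labeled)
print(f1_score(tp, fp, fn))
```

Note that true negatives do not enter the F1 formula. They are counted here only because the labeling process tracks all four subsets.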
One of the first checks we now do before feeding new data into our model is to compare the new F1 score with those from previous iterations. If it is not within a similar range, there is a call to be made on whether or not to train the same model on the new data set. At the very least, the sources of “noise” in the new data must be identified. And if the data source is “different”, treat it as different enough to train the model afresh.
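The comparison against previous iterations could be sketched as follows. This is only one way to define "a similar range": it assumes the baseline is the mean of earlier F1 scores and that a fixed tolerance of 0.05 is acceptable. Both choices, along with the function name and the sample scores, are hypothetical and should be tuned to your own system.

```python
from statistics import mean
from typing import Sequence


def f1_within_range(new_f1: float, previous_f1s: Sequence[float],
                    tolerance: float = 0.05) -> bool:
    """Return True if new_f1 is within `tolerance` of the mean of earlier scores.

    The mean baseline and the default tolerance are assumptions for illustration.
    """
    if not previous_f1s:
        raise ValueError("need at least one previous F1 score to compare against")
    baseline = mean(previous_f1s)
    return abs(new_f1 - baseline) <= tolerance


# Hypothetical F1 scores from earlier labeling batches.
history = [0.82, 0.81, 0.83]

for batch_f1 in (0.80, 0.60):
    if f1_within_range(batch_f1, history):
        print(batch_f1, "is in range: proceed with the existing model")
    else:
        print(batch_f1, "is out of range: investigate noise before training")
```

A failed check does not decide anything by itself. It only flags the batch so that someone looks at where the noise is coming from before choosing between retraining on the new data and starting afresh.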
So tracking the F1 score while labeling new data is more useful than we realize. Read this excellent primer by Alegion on measuring data labeling quality.