Aligning AI Metrics with Business Goals in Healthcare

Nov 01, 2023

Nick Pogrebnyakov, PhD

Healthcare companies, like their cousins in other industries, increasingly explore AI to power their products. Many companies create specialized AI teams, separate from the business or product teams. However, while both the AI and the business teams use metrics to evaluate the quality of AI models and track progress, they often use different metrics. It is crucial that both teams understand the technical and business implications of what the metrics show.

A larger difference often lies in the objectives of the AI and business teams. When building models, AI teams typically strive to improve key metrics, often composites such as F1, and they can calculate a broad spectrum of others. Business teams, meanwhile, ask different questions about model performance: how well the models satisfy business objectives, whether they are ready for release, and how the models are likely to behave with real users.

These differences shouldn’t be insurmountable. The leaders of the business and AI teams should jointly discuss the company’s business objectives, and then select metrics that reflect how well the models meet these objectives. Consider the following factors when prioritizing AI metrics against business objectives.

Company size
- Small companies, especially startups, may want to prioritize recall / sensitivity to ensure they don’t miss any positive cases. This is vital for establishing credibility and effectiveness early on.
- Larger organizations, especially those catering to several markets, may want to emphasize precision / PPV and specificity to reduce false positives. This becomes very important for conditions where the cost of wrongly flagging a healthy person is high.
Target market
- Niche markets value correct predictions. Emphasize precision / PPV and specificity to track occurrences of false positives.
- By contrast, broad markets imply solutions that appeal to multiple subgroups. Here, recall / sensitivity is important.
Prevalence of condition in the population
- If the targeted condition is very rare or very common in a community, training and, importantly, test datasets will be imbalanced. Metrics like F1 or the Matthews correlation coefficient (MCC) are more informative than plain accuracy, which can look high even when the model simply predicts the majority class.
- Medium prevalence leads to balanced datasets. Use area under the ROC curve (ROC-AUC) or accuracy.
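
To illustrate why accuracy can mislead on imbalanced data, here is a minimal Python sketch. The dataset, the two "models" and the function names are hypothetical and made up for this example; a real project would more likely use a library such as scikit-learn.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 means 'has the condition'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    # F1 ignores true negatives; defined here as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; defined here as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical test set: 5 of 100 patients have the condition.
y_true = [1] * 5 + [0] * 95

# A useless model that labels everyone as healthy.
always_healthy = [0] * 100

# A model that finds 4 of the 5 sick patients and raises 3 false alarms.
finds_most = [1] * 4 + [0] * 1 + [1] * 3 + [0] * 92

for name, y_pred in [("always healthy", always_healthy), ("finds most", finds_most)]:
    counts = confusion_counts(y_true, y_pred)
    print(name,
          "accuracy:", round(accuracy(*counts), 2),
          "F1:", round(f1_score(*counts), 2),
          "MCC:", round(mcc(*counts), 2))
```

On this made-up data the two models are almost indistinguishable by accuracy (0.95 versus 0.96), while F1 and MCC are 0 for the model that labels everyone healthy and roughly 0.67 and 0.66 for the one that actually finds sick patients.
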
Cost of false positives or false negatives
- The cost of a false negative is high when it’s crucial that the model doesn’t label people who actually have the condition as healthy. People the model flags as having the condition can then be sent for follow-up tests to confirm. Detecting as many true cases as possible is essential here, and recall / sensitivity is a good metric to emphasize.
- In other instances it’s more important that the model doesn’t mistakenly identify the condition in people who don’t have it: a false positive. This calls for positive predictions that can be trusted. Highlight the precision / PPV metric.
Importance of outliers
- Some AI models output raw numbers instead of probabilities. A model that predicts blood pressure is a good example. Extremely high or low values, or outliers, may or may not be important in interpreting model output.
- If outliers are important, use RMSE (root mean squared error) or MSE (mean squared error), which penalize larger errors more because they square the difference between true and predicted values.
- When outliers are not essential, use MAE (mean absolute error), as it is less sensitive to outliers.
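
The difference between RMSE and MAE can be seen in a small Python sketch. The blood pressure readings and both sets of predictions below are hypothetical values chosen for illustration, not data from a real model.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical systolic blood pressure readings in mmHg; the last one is an outlier.
actual = [120, 125, 130, 135, 190]

# Model A is off by 5 mmHg on every patient.
steady_errors = [125, 130, 135, 140, 195]

# Model B is exact on four patients but misses the outlier by 25 mmHg.
one_big_miss = [120, 125, 130, 135, 165]

for name, predicted in [("steady errors", steady_errors), ("one big miss", one_big_miss)]:
    print(name,
          "MAE:", round(mae(actual, predicted), 2),
          "RMSE:", round(rmse(actual, predicted), 2))
```

Both hypothetical models have the same MAE of 5 mmHg, but the RMSE of the second model is more than twice as large because its single 25 mmHg miss is squared. If missing an extreme reading is the costly mistake, RMSE surfaces it; if it is not, MAE treats the two models as equivalent.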

Metrics like precision / PPV, recall / sensitivity and specificity are calculated from “hard” labels such as “healthy” or “sick”, while many AI models output probabilities. Converting a probability into a label requires setting a decision threshold. Set the threshold at 0.6, and all patients with a predicted probability of disease greater than 0.6 are assigned the label “sick”, while those below 0.6 are “healthy”. This threshold is a “knob” that the model’s user can adjust. Lower the threshold, and more patients are flagged as sick, increasing the chance of false positives. Raise it, and fewer patients are flagged, but this increases the chance of incorrectly labeling sick patients as healthy: false negatives. Decide which type of error matters more to you depending on the business requirements.
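
The effect of turning this knob can be sketched in a few lines of Python. The labels, the predicted probabilities and the helper names are hypothetical, and the sketch assumes that a probability exactly equal to the threshold is labeled healthy.

```python
def label_at_threshold(probabilities, threshold):
    """Assign 1 ('sick') when the probability is strictly greater than the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def sensitivity_specificity_ppv(y_true, y_pred):
    """Return (recall / sensitivity, specificity, precision / PPV) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity, ppv

# Hypothetical data: 4 sick patients (1) and 6 healthy patients (0),
# with a made-up predicted probability of disease for each.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.95, 0.80, 0.65, 0.40, 0.70, 0.45, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.35, 0.60, 0.75):
    y_pred = label_at_threshold(probs, threshold)
    sens, spec, ppv = sensitivity_specificity_ppv(y_true, y_pred)
    print(f"threshold={threshold:.2f} "
          f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```

On this made-up data, the lowest threshold catches every sick patient at the price of two false alarms, while the highest threshold raises no false alarms but misses half of the sick patients. Printing the same table for a real validation set gives the AI and business teams a shared view of the trade-off when they choose the threshold.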

The key to successful collaboration between the AI and business teams is mutual understanding. To achieve it, have the two teams communicate regularly and educate each other on the business implications, trade-offs and usefulness of the metrics they use.

Nick Pogrebnyakov is head of AI and Data Science at Sparrow BioAcoustics. Prior to Sparrow, Nick was an AI leader at Twitter, an early NLP innovator at Thomson Reuters, and a visiting researcher at Stanford University.
