So, how can we rely on speech analytics as a window into brain health when the way a person speaks is influenced by all sorts of mundane, benign factors? For example:
- You just woke up and your voice is creaky
- You have a head cold and your speech sounds stuffy
- You are stressed out and so you speak quickly
- You have an accent
How can a person’s voice, which can be so messy and variable, be a reliable source of brain health information?
To be fair, the challenge is a formidable one. Below we provide an overview of the different sources of variability in speech and how we account for them.
Cross-sectional Differences in Speech
The way one speaks is a function of many variables that aren’t necessarily related to one’s neurological health. These variables include such things as gender, age, height, regional dialect, race, socio-economic status, mood, health, emotion, communication circumstances and even the environment.
Why is this a problem? Well, if we are building models to classify between two groups of individuals – say, those with and those without Parkinson’s disease – then we need to match the two groups on the other variables that could otherwise account for differences between them. Even if we use a stratified sampling scheme that balances across some of these variables, there are variables we couldn’t possibly match for (e.g. each individual’s specific physio-anatomy).
Here’s a simple experiment to demonstrate this. Take two groups of healthy individuals, sampled so that they are matched on age, gender, and regional dialect. We ask each of them to produce 10-15 seconds of speech by reading standard sentences in a noise-free environment, using exactly the same recording equipment and setup. We can iteratively increase the sample size in each group, extract features commonly used in machine learning (say, mel-frequency cepstral coefficients (MFCCs)), and ask a simple question: at what sample size do these two groups become statistically indistinguishable? The answer is surprising. You need ~300 individuals in each group before there is no statistically significant difference between them. If this is the case under the most benign conditions (no noise, same recording equipment, the same read sentences, matched demographics), then what are the sample size requirements for other cases?
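A minimal sketch of this kind of experiment, with synthetic per-speaker feature vectors standing in for real MFCCs (the distributions, function names, and sample sizes here are illustrative only, not our production pipeline):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N_COEFFS = 13  # one mean-MFCC-like vector per speaker

def sample_group(n_speakers, shift=0.0):
    """Synthetic per-speaker feature vectors drawn from one population."""
    return rng.normal(loc=shift, scale=1.0, size=(n_speakers, N_COEFFS))

def any_significant_difference(group_a, group_b, alpha=0.05):
    """Per-coefficient two-sample t-test, Bonferroni-corrected across coefficients."""
    pvals = stats.ttest_ind(group_a, group_b).pvalue
    return bool((pvals < alpha / N_COEFFS).any())

# Iteratively grow two groups drawn from the SAME healthy population
# and check whether they look "different" at each sample size.
for n in (10, 50, 100, 300):
    print(n, any_significant_difference(sample_group(n), sample_group(n)))
```

With real speech, the between-speaker variability is far richer than a single Gaussian, which is exactly why the required sample sizes climb so high.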
To make things even more complicated, many of the disorders we are interested in have different prevalence and incidence rates across demographic groups. For example, men are 1.5 times more likely to develop Parkinson’s disease than women. Many recent AI-based papers that aim to predict disease directly from speech acoustics are really just using the acoustic features as proxies for demographic variables. In other words, the AI is picking up on “nuisance variables” related to gender instead of the variable of interest (the presence or absence of Parkinson’s disease).
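This confound can be demonstrated in a few lines. In the hedged sketch below, a “speech feature” is generated from gender alone (the disease rates and pitch values are invented and exaggerated for illustration), yet it still scores above chance for the disease label, purely because prevalence differs by gender:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Gender: True = male. Disease prevalence differs by gender
# (rates exaggerated here purely for illustration).
male = rng.random(n) < 0.5
disease = rng.random(n) < np.where(male, 0.15, 0.10)

# A "speech feature" (mean pitch, Hz) driven by gender ONLY --
# it carries no disease information within either gender.
pitch = np.where(male, 120.0, 210.0) + rng.normal(0.0, 20.0, n)

def auc(scores, labels):
    """Rank-based AUC: P(random positive scores higher than random negative)."""
    ranks = scores.argsort().argsort() + 1.0
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Lower pitch (male) co-occurs with higher prevalence, so score = -pitch.
print(auc(-pitch, disease))  # above 0.5: the feature "predicts" disease via gender
```

A model scoring above chance here has learned nothing about Parkinson’s disease; it has rediscovered the gender difference in prevalence.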
Accounting for Sources of Variation
The features we develop are hypothesis-driven and based on the neurophysiology of speech production: Standardized feature sets (MFCCs, filterbanks, PLPs) capture the majority of the content in the signal indiscriminately, regardless of whether it is clinically meaningful. In some sense, these standardized feature sets have been optimized over the years for speech recognition applications, although they have been used in many other contexts as well. We take an alternative approach in which we rely on the large body of work that studies the acoustic speech features that change when there are neurological disturbances. This not only results in a smaller set of features that are sensitive to changes in neurological function, but also ensures that the resulting models are clinically interpretable.
The features we develop are maximally invariant to these nuisance parameters: Where possible, we control for nuisance variables during feature engineering through normalization. Where it’s not, we collect normative data that characterize the distribution of each feature by age, gender, dialect, etc.
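As a toy illustration of the normative-data approach (the table values and the feature itself are invented for this sketch; they are not our actual norms):

```python
# Hypothetical normative table: (mean, SD) of a speaking-rate feature
# (syllables/second) by gender and age band. All numbers are invented.
NORMS = {
    ("female", "50-59"): (4.3, 0.5),
    ("female", "60-69"): (4.1, 0.5),
    ("male",   "50-59"): (4.2, 0.5),
    ("male",   "60-69"): (4.0, 0.5),
}

def demographic_z(value, gender, age_band):
    """Express a raw feature value as a z-score relative to demographic peers."""
    mu, sd = NORMS[(gender, age_band)]
    return (value - mu) / sd
```

The same raw value can be entirely typical for one demographic group and a full standard deviation away from the norm for another, which is why the raw feature alone is not interpretable.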
Within-subject Changes in Speech
Every person’s speech changes from moment to moment, and day to day, for perfectly innocuous reasons that have nothing to do with neurological health. Understanding and quantifying this range of normal variability on a given speaking task is critical for identifying abnormalities indicative of neurological impairment. This is where two large bodies of literature come into play: 1) acoustic correlates of healthy speech and 2) acoustic correlates of speech within disease states. We focus on specific attributes of speech that are known to be abnormal in disease and track their measures over time. These include measures associated with respiratory support; the quality and dynamics of phonation; the rate and clarity of articulation; the balance of oral-nasal resonance; the complexity of language used; and the coherence of ideas conveyed.
Accounting for Day-to-Day Differences in Speech
We collect speech at frequent intervals to characterize within-subject variability: Sampling must be frequent enough to capture the typical variability in our measures and to identify trends indicative of changes in neurological health. Determining the optimal frequency requires considering the rate of disease progression as well as the magnitude and effect size of changes in the acoustic measures. This is one of the major benefits of app-based speech collection, where samples can be captured and recorded without the burden of a clinic visit.
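One simple way frequent sampling pays off is that each new measurement can be compared against the speaker’s own recent baseline rather than a population norm. A hedged sketch (the threshold k, function name, and numbers are illustrative only):

```python
import statistics

def outside_typical_range(history, new_value, k=2.0):
    """Flag a measurement more than k SDs from the speaker's own baseline.

    `history` holds prior values of one speech metric for ONE speaker,
    collected at frequent intervals (e.g. via app-based recording).
    """
    mu = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(new_value - mu) > k * sd

baseline = [4.9, 5.1, 5.0, 4.95, 5.05]        # speaker's normal day-to-day range
print(outside_typical_range(baseline, 5.02))  # within normal variability
print(outside_typical_range(baseline, 4.2))   # a change worth a closer look
```

Real change detection would also model trends over time, but the core idea is the same: the speaker's own variability defines what counts as "abnormal."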
Our speech analytics and speech elicitations are jointly optimized: Speech tasks must be optimized to exert pressure on the speech subsystems expected to be affected by disease. For example, if respiratory support is expected to be compromised, the elicitation tasks should include the opportunity to “run out of air” while speaking or phonating so that this can be captured in the metrics.
Our speech features are reliable and validated relative to existing gold standards: We don’t deploy metrics into our suite of analytics until they have passed rigorous testing and met pre-defined benchmarks for reliability and clinical accuracy. Our test for reliability quantifies a metric’s stability by asking the question, “If I were to re-record my voice several times, how much would my measurements change?” Test-retest reliability is computed from the correlation between pairs of equivalent measurements, taken from a database with tens of thousands of recordings. We verify that reliability is as strong for healthy speakers as it is for speakers with neurodegenerative disease, and that measurements are not biased by gender, age, language, or disease severity. Once a metric reaches an acceptable level of reliability, we then verify that it corresponds closely to ratings provided by clinicians. For example, nasal resonance metric values should be similar to perceptual ratings of hypernasality. Our clinical ratings are provided by trained clinicians and must meet high standards of inter- and intra-rater reliability.
Heterogeneous endpoints are absolute killers for clinical trials. Variability in the endpoint is directly related to the required sample size. The “holy grail” in digital biomarkers is a repeatable metric capable of detecting subtle, clinically-relevant changes in neurological health. As we outlined here, our speech endpoints are based on clinically-relevant features (not general speech features) that are sensitive to subtle change and have high test-retest reliability. The result? Robust metrics that have the potential to identify change in neurological health in cases where traditional metrics show no signal.
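The link between endpoint variability and trial size is concrete. The standard two-sample approximation puts n = 2·(z₁₋α/₂ + z₁₋β)²·(σ/δ)² participants per arm, where σ is the endpoint’s standard deviation and δ the difference to detect (a textbook formula, not a substitute for a real power analysis):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Participants per arm to detect a mean difference `delta` in an
    endpoint with SD `sigma` (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2)

# Halving the endpoint's variability quarters the required sample size.
print(n_per_arm(sigma=1.0, delta=0.5))  # 63 per arm
print(n_per_arm(sigma=0.5, delta=0.5))  # 16 per arm
```

This quadratic dependence on σ is why a low-variability speech endpoint translates directly into smaller, cheaper trials.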
We use the term nuisance variables to denote changes in speech that aren’t related to the clinical application of interest. These include age, gender, background noise, microphone type, etc.