A team of researchers from Stanford University and the University of Toronto has developed a model that can detect alcohol intoxication from voice clips with roughly 98 percent accuracy. The study built the model from one-second clips of tongue twisters recorded over a span of seven hours from participants who were given a weight-based dose of alcohol.
The original goal of the study, though, was not to develop such a model. Instead, the controlled laboratory study aimed to test the effects of high doses of alcohol on blood and toxicology biomarkers. Participants were told the exact amount of alcohol they would be drinking, and they did not have to ingest the entire dose. Researchers took a baseline measurement of breath alcohol concentration (BrAC) and repeated it every half hour; at each measurement, participants read a randomly generated tongue twister.
Researchers used support vector machine models to detect alcohol intoxication relative to each participant's baseline, evaluated with leave-one-participant-out cross-validation: the model was trained on speech from all but one participant and tested on the held-out participant's speech, repeating this for each participant to estimate the model's accuracy on unseen speakers.
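The evaluation scheme above can be sketched with scikit-learn. This is a minimal, hypothetical illustration: the feature values are synthetic random numbers, and the study's actual features and model settings are not described in the article.

```python
# Sketch of leave-one-participant-out cross-validation with an SVM.
# Features and labels here are synthetic stand-ins, not study data.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_participants, clips_per_participant, n_features = 18, 40, 16

# One row per one-second clip; label 1 = intoxicated, 0 = baseline.
X = rng.normal(size=(n_participants * clips_per_participant, n_features))
y = rng.integers(0, 2, size=len(X))
groups = np.repeat(np.arange(n_participants), clips_per_participant)

# Each fold trains on 17 participants and tests on the held-out one,
# so the score reflects generalization to a speaker never seen in training.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
print(len(scores))  # one accuracy score per held-out participant
```

On these random labels the mean accuracy hovers near chance; the point is the fold structure, which prevents a model from scoring well merely by memorizing a speaker's voice.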
The model the researchers produced identified alcohol intoxication with an accuracy of 97.5 percent, far outperforming the only comparable voice-recording alcohol corpus the researchers were aware of, a German-language corpus on which models reached about 70 percent accuracy.
The researchers attributed the improved accuracy to several potential causes. First, the tongue twisters served as a "phonetic stress test" that taxed speech production and increased the sensitivity of the models. Second, the researchers used a standard list of tongue twisters rather than free speech, which reduced variability across samples. Third, they used specific procedures to extract spectral, or frequency-based, features, meaning their models focus on the frequency content and pitch of participants' voices rather than time-based features like phonemes (speech sounds) and prosody (rhythm, stress, and intonation).
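To make "spectral, or frequency-based, features" concrete, here is a minimal sketch of extracting such features from a one-second clip using a short-time Fourier transform. The signal, sample rate, and feature choices are illustrative assumptions; the study's exact extraction procedure is not described in the article.

```python
# Illustrative spectral feature extraction from a one-second clip.
# The clip is a synthetic 220 Hz tone, not real speech.
import numpy as np
from scipy.signal import stft

sr = 16_000                          # assumed sample rate
t = np.arange(sr) / sr               # one second of audio
clip = np.sin(2 * np.pi * 220 * t)

# Short-time Fourier transform: frequency content per time frame.
freqs, times, Z = stft(clip, fs=sr, nperseg=512)
power = np.abs(Z) ** 2

# Two common frequency-based summaries:
centroid = (freqs[:, None] * power).sum(0) / power.sum(0)  # spectral centroid per frame
peak_freq = freqs[power.mean(axis=1).argmax()]             # dominant frequency overall

print(float(peak_freq))  # close to 220 Hz for this pure tone
```

Features like these describe *where* the energy sits in frequency, as opposed to time-based features that describe *when* speech sounds occur.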
The study does note its limitations. Though it included both men and women, it had recordings from only 18 participants (20 were originally recruited, but two were excluded because of technical difficulties). All participants were white and non-Hispanic, and the study did not identify which specific features of speech, such as speech volume, are sensitive to alcohol.
Despite its high accuracy, the model was trained only on one-second voice spectrographic signatures from the participants. In their conclusions, the researchers state that expanding on the model will require more varied voice samples for validation, since they had audio recordings from only 18 participants.
The researchers also consider the real-world limitations of their model. For instance, environmental background noise and competing speech would require pre-processing and filtering before the model could be useful in practice. Additionally, only the effects of alcohol were examined; other factors that impair speech, such as sleepiness, are also likely to affect model accuracy. The study further notes that some people might see programs that process speech samples as intrusive, so it is unclear whether the model could or would actually be used in the real world.
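The kind of pre-processing the researchers mention could take many forms; one simple example is band-pass filtering to keep the typical speech band and suppress out-of-band noise. This is a hedged sketch under assumed cutoff frequencies, not the study's pipeline.

```python
# Illustrative pre-processing: band-pass filter keeping ~300-3400 Hz
# (a common speech band) to suppress low-frequency rumble.
import numpy as np
from scipy.signal import butter, sosfiltfilt

sr = 16_000
t = np.arange(sr) / sr
speech_like = np.sin(2 * np.pi * 1000 * t)   # in-band component
rumble = np.sin(2 * np.pi * 50 * t)          # out-of-band noise
noisy = speech_like + rumble

# 4th-order Butterworth band-pass, applied forward and backward
# (sosfiltfilt) for zero phase distortion.
sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
clean = sosfiltfilt(sos, noisy)
# The 50 Hz rumble is strongly attenuated; the 1 kHz tone survives.
```

Real recordings would also face competing speakers and non-stationary noise, which simple filtering alone cannot remove; that is part of why the researchers flag deployment as an open problem.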
In conclusion, the researchers characterize this as a proof-of-concept laboratory study using brief English speech samples. They call for future studies that collect more varied voice samples, before and during both the ascending and descending curves of alcohol intoxication, to build a more broadly applicable model. They also note that further research is needed to understand the acceptability of different remote monitoring approaches so that such models can eventually be used in the real world.