AN OVERVIEW OF TECHNIQUES FOR BIOMARKER DISCOVERY IN …

AN OVERVIEW OF TECHNIQUES FOR BIOMARKER DISCOVERY IN VOICE SIGNAL

Rita Singh, Ankit Shah*, Hira Dhamyal*

Carnegie Mellon UniversityPittsburgh, PA, USA

ABSTRACT

This paper reflects on the effect of several categoriesof medical conditions on human voice, focusing on thosethat may be hypothesized to have effects on voice, butfor which the changes themselves may be subtle enoughto have eluded observation in standard analytical exami-nations of the voice signal. It presents three categoriesof techniques that can potentially uncover such elusivebiomarkers and allow them to be measured and used forpredictive and diagnostic purposes. These approachesinclude proxy techniques, model-based analytical tech-niques and data-driven AI techniques.

Index Terms— Biomarker discovery, Voice profiling,Feature engineering, Voice analytics, AI-based diagnosticaids

1. INTRODUCTION

Based on their effects on human voice, medical condi-tions that are known to affect humans can be dividedinto four clear categories. Of these, one category of dis-eases comprises those that have absolutely no effect onvoice, such as certain dermatological conditions, hair-related conditions etc. In contrast to this, the categoryof conditions that is expected to have the most obviouseffects on voice includes diseases that directly affect thestructures of the vocal tract – vocal folds, larynx, glottis,respiratory tract, articulators etc. Examples of such dis-eases are otolayngological diseases of various etiology.A third category is that of diseases that indirectly affectthe processes that drive voice production – includingcognitive, neuromuscular, biomechanical and auditoryfeedback processes. These conditions cause varied ef-fects on voice, ranging from fairly intense and obviouseffects, to very subtle or almost imperceptile ones. Thiscategory includes diseases such as those listed in Table1, and syndromes caused by secondary effects of drugs,intoxicants and other harmful substances. The fourthcategory – and the focus of the mechanisms presented inthis paper – comprises diseases for which the existence

* Second and third author contributed equally

of voice changes is hypothesized, but these may not beevident through standard analytical examinations of thevoice signal. These include disease subcategories thataffect intellectual abilities, visual function, temperament,personality, behavior etc. For these diseases, biomark-ers in voice may be hypothesized to be present, but areelusive and must be searched for – in effect designed orcreated – for use in data-driven applications that attemptto detect the presence of these diseases from voice.

The changes in voice that are alluded to above arein fact biomarker patterns, or biomarkers. The termbiomarker in this context refers to specific patterns ofchange(s) in the voice signal in the voice signal, thatcarry information about the health conditions that causethem. Such changes may be thought of as perturbationsor deviations from a hypothetical “normal” voice signal,within its frequency, duration, intensity, amplitude andother entities that characterize it. The changes may bewide-raging and coarse in nature, or temporally tran-sient, occurring within micro-durations of the signal. Inother words, biomarker patterns may range from be-ing highly perceptible, to completely imperceptible – tothe extent that they may be undetectable even withinstandard analytical representations of the speech signal.

The rest of this paper introduces three categoriesof techniques that can potentially uncover such elu-sive biomarkers. Such techniques can be thought of asbiomarker discovery or feature engineering techniquesfocused on the design or re-design of biomarker fea-tures in voice. The approaches described below includeproxy techniques, model-based analytical techniquesand data-driven AI techniques.

“© 20XX IEEE. Personal use of this material is per-mitted. Permission from IEEE must be obtained for allother uses, in any current or future media, includingreprinting/republishing this material for advertising orpromotional purposes, creating new collective works, forresale or redistribution to servers or lists, or reuse of anycopyrighted component of this work in other works.”

arX

iv:2

110.

0467

8v1

[cs

.SD

] 1

0 O

ct 2

021

Health condition How it affects voiceAttention Deficit HyperactivityDisorder (ADHD)

Prosodic variations in loudness and fundamental frequency [1]

Amyotrophic Lateral Sclerosis(ALS)

Voice tremor, flutter [2], incomplete vocal fold closure, dysarthria [3]; Dystonia,dysarthria [4]; Low-frequency (< 4 hz) tremor [5]

Alzheimer’s Disease (Demen-tia)

Abnormal fundamental frequency, pause and voice-break patterns, reduction in vocalrange [6, 7]; Dysphonia [8]

Arthritis: Fibromyalgia Changes in Jitter, shimmer, harmonic-to-noise ratio, and phonation time [9]Cerebral Palsy Dysphonia [10]; Breathiness, Asthenia, Roughness, Strain [11]Cholera Husky voice [12]; high-pitched, asthenia [13]; “Cholera voice” [14]Congenital Heart Defects Dystonia, dysarthria [4]Coronavirus Disease 2019(COVID-19)

Abnormal acoustic measures of pitch, harmonics, jitter, shimmer [15]; Asymmet-ric/abnormal vocal fold vibrations [16, 17]

Diabetes Roughness, asthenia, breathiness, and strain in high glycemic index subgroups [18]Down Syndrome Reduced pitch range, higher mean fundamental frequency, reduced jitter [19]Epilepsy (Temporal Lobe) Abnormal voice pitch regulation [20]Huntington’s disease Low-frequency (< 4 hz) tremor [5]Hypertension Noisy breathing, voice production difficulty, vocal fatigue [21]Hyperthermia (Extreme Heat) Pressured speech [22]; Dysarthria [23]Hypothermia (Extreme Cold) Slurred speech [24]Lung Cancer Hoarseness, GRBAS symptoms, dysphonia [25]Myasthenia gravis Dystonia, dysarthria [4]Multiple sclerosis Low-frequency (< 4 hz) tremor [5]Myxoedema low-pitched, husky, and nasal voice, unusual diction [26]Parkinson’s disease Dystonia, dysarthria [4]; Low-frequency ( < 4 hz) tremor, Medium tremor (4-7 Hz) [5]Pneumonia Wet, gurgly voice [27]Pulmonary Hypertension Hoarseness [28]Stress Breathiness, softness [29], Multiple changes [30]Tobacco Use Multiple voice quality changes, high jitter [31]

Table 1: Qualitative observations of the effect of some diseases on voice

2. BIOMARKERS AND THEIR MEASUREMENTTHROUGH PROXY TECHNIQUES

Proxy techniques may be used for the changes in voicethat are human observable, but not easily measurable.For example, many of the correlations established be-tween various diseases and voice changes in the medicalliterature refer to changes in voice quality. The changes,although subjective, comprise biomarkers that could po-tentially be used in machine learning systems for predic-tion of the corresponding diseases from voice. However,the problem is that for the most part, the entities that con-stitute the set of voice qualities are subjectively specified.For many, methodologies for objective measurement donot exist.

For example, the voice sub-qualities of nasality (ornasalence), roughness, breathiness, asthenia etc. have noobjective measures associated with them. They are in factrated by human experts on standardized clinical ratingscales, such as Voice Rating scale (VRS) [32], Voice Dis-ability Coping Questionnaire (VDCQ) [33], ConsensusAuditory-Perceptual Evaluation of Voice (CAPE-V) [34]etc. Examples of such subjective correlations of various

diseases to voice changes are listed in Table 1.

2.1. Proxy features

Features that are subjectively rated and do not have spe-cific methods to objectively measure them can be mea-sured through proxy. The term “proxy” here refers to theact of using (or creating) replacement features that canbe measured instead of the original ones. For this to beviable, the proxy features must be highly correlated tothe subjective features that we desire to measure. Thereare two potential mechanisms to create such proxy fea-tures: the use of physical models of voice productionto produce signals that can be more easily measured,and the use of AI mechanisms for transfer learning thatcan generate measurable features that exhibit the samepatterns as the subjective features they proxy for.

2.1.1. Proxy features from models of voice production: mea-surement through emulation

As an example, we consider physical models that cangenerate just one specific aspect of a continuous speechsignal – the set of phonated sounds embedded in it. The

(a) Normal voice (b) Covid-19 (> 7days after +ve test)

(c) Covid-19 (< 7days after +ve test)

Fig. 1: Phase space trajectories of the 1-mass model (forthe extended vowel /aa/) estimated to emulate the bi-lateral vocal fold vibrations of people with normal voiceand Covid-19.

idea is to use such models to generate or approximatethe actual motion of the vocal folds during the processof phonation, producing a glottal flow signal that hascharacteristics (or specific voice sub-qualities) that aresimilar to a give recorded speech signal. Once this isachieved, the parameters of the model used can proxy forthe speech quality characteristics of the original signal,that may not have been directly measurable. We canalso call this process “measurement through emulation.”Examples of physical models of phonation that can beused include the 1-mass, mass-spring model [35], the2-mass, mass-spring model [36] etc.

For a given recording, the parameters of these modelsare derivable through the ADLES algorithm, that min-imizes the squared error between a glottal flow signalestimated through inverse filtering and the glottal flowsignal generated through the model. The discriminabil-ity of such features is evident from the highly character-istic patterns exhibited by the corresponding model inits phase space. For example, while we know that thereare changes in voice in response to various diseases ofthe vocal folds, and may have observed changes in voiceas a result of covid, it is hard to characterize, identify ormeasure the exact changes in spectrographic and othersignal representations. However, we can also see thatthe vocal fold movements (phonation process) would beaffected by these diseases, and be highly correlated tothe actual vocal fold oscillations as an affected personspeaks. This is borne out by the phase space patternsexhibited by a 1-mass model, as shown in Fig. 1. We cantherefore use the corresponding parameter values (andeven other measurements that pertain to the phase spacetrajectories shown) to build predictors for the underlyingconditions. The model parameters are then the proxyfeatures we are looking for.

2.1.2. Proxy features from neural networks: measurementthrough correlation

Proxy features can also be derived using neural networksand other classifiers that learn to perform classificationtasks on the signals within which we desire to identify ormeasure biomarkers. When trained to perform auxiliarytasks that are equivalent to those that must be actuallyperformed using objective measures of the biomarker,the scores generated through the auxiliary classifiers actas proxy features for the biomarkers in question. Forexample, it is easy to identify, but difficult to measurechanges in the “nasality” of speech signals. However, itis relatively easy to identify and isolate nasal and non-nasal phonemes in speech, and build classifiers to dis-criminate between them. When properly trained, anaccurate classifier would generate scores that are dis-criminative of nasality characteristics. If this were notso, the classifier would not be able to accurately performthe task of discriminating between nasal and non-nasalphonemes. Once trained, the classifier can be used to gen-erate scores for training newer predictors of underlyingconditions based on nasality. that we wish to measure.

3. AI SYSTEMS FOR BIOMARKER DISCOVERY

With the advancements in the voice profiling techniques,it has become evident that for a large set of diseases forwhich correlations to voice were not known to exist, thepresence of such correlations can in fact be hypothesizedand scientifically supported. Since for these, we clearlyknow that biomarkers are not perceptible or evidentwithin standard analytical representations of the voicesignal, we must devise techniques to discover them orcreate them in appropriate mathematical spaces withinwhich they can be shaped and measured.

One example of such a biomarker discovery sys-tem/framework is illustrated in Fig. 2. This representsa generic setup which we call the ABCDE framework(Autoencoder based Biomarker Creation and DiscoveryEngine). Its exact formulation can vary from applicationto application. In this framework, a speech signal is firstconverted into a numerical representation that is hypoth-esized to contain the biomarker related information thatwe seek to uncover, or extract. The representation isthen “projected” into a neural kernel space, where wecan impose objective criteria on it that are tailored to thespecific biomarker.

Viewing Fig.2 from the left, from a given voice sig-nal S, several types of signal representations are ob-tained using digital signal processing algorithms (DSP).Thereafter, one of those (SPi) is chosen as the substratefor the target feature to be created (carrying the rele-vant biomarker). The other representations (SP1, SP2,...)

Fig. 2: Schematic diagram of ABCDE neural model forfeature discovery. The yellow boxes represent appropri-ately architectured neural networks.

serve as “control” representations, and are placed withinthe DSP stack shown in the figure 2.

The cuboid for SPi represents a stack of the same“type” of DSP features, e.g. a stack of spectrograms ob-tained at different time-frequency resolutions, or a cor-relogram. These are input into a neural network E (e.g. aconvolutional neural network), which can be viewed asan “encoder” that transforms them into a kernel space,yielding a latent feature representation Z. Within thisspace, different constraints (including those based onprior knowledge) can be imposed on Z, so that it hasthe desired properties of the biomarker for the targetedhealth condition. One obvious property it must possessis that it must be discriminative for the underlying medi-cal condition it is expected to encode.

To ensure that the information present in the inputrepresentation preserved in the process of transforma-tion into the kernel space, a decoder D is trained to re-construct the original representation from Z, while mini-mizing the loss between the reconstructed and originalrepresentation. All loss functions are represented by reddouble-lines in Fig. 2. In the training process, which isthat of parameter optimization of the aggregate neuralframework through gradient descent, such a loss func-tion would be minimized.

To ensure that only the information in the original in-put representation is engineered for the feature at hand,the same latent representation Z is input into a genera-tor G that can recreate the original speech signal fromit (as G(Z)). Alternatively, the G can act on the outputof the decoder, SPo (as G(SPo)) to generate the voicesignal. Time-domain signals with same information con-tent can be different in voice acoustics and thus, morerobust losses are defined based on comparisons of DSPfeatures from the original signal, to those predicted fromthe voice signal generated by G. Alternatively, the DSP

features can be “simulated” by a neural stack, as shownin the figure. An additional level of detail that allowsvoice quality features to play a role in this process ofdiscovery is introduced by specifying rough functionalrelationships between DSP features and various voicequalities V. Losses based on direct comparisons betweenthese can also play a part in the learning process.

The entire framework is simultaneously optimized,but it is conceivable to express this process of discoverywithin more sophisticated AI learning frameworks thatprioritize interpretability and can be trained in parts.Here, Z represents the “discovered” feature carrying thebiomarker whose existence was hypothesized.

4. CONCLUSIONS

The biomarker discovery mechanisms given above aregeneric ones, and may have varied formulations in dif-ferent settings. In contrast to traditional features derivedfrom audio signals for use with machine learning algo-rithms, these features are likely to perform better withlesser training data, since they are designed to be rela-tively more discriminating and less ambiguous for thespecific disease for which they are designed. Such fea-tures are useful in many ways. For example, they canbe used in applications that serve as diagnostic aids inclinical settings, to build tools for early-detection of cer-tain diseases, for self-monitoring of health by disabled,elderly and under-resourced people, etc.

5. ACKNOWLEDGEMENT

This material is based upon work supported by the De-fence Science and Technology Agency, Singapore undercontract number A025959. Its content does not reflect theposition or policy of DSTA and no official endorsementshould be inferred.

6. REFERENCES

[1] Georg G von Polier, Eike Ahlers, Julia Amunts, JoergLangner, Kaustubh R Patil, Simon B Eickhoff, FlorianHelmhold, and Daina Langner, “Predicting adult at-tention deficit hyperactivity disorder (adhd) using vocalacoustic features,” medRxiv, 2021.

[2] Arnold E Aronson, William S Winholtz, Lorraine OlsonRamig, and Sandra R Silber, “Rapid voice tremor, or “flut-ter,” in amyotrophic lateral sclerosis,” Annals of Otology,Rhinology & Laryngology, vol. 101, no. 6, pp. 511–518, 1992.

[3] Anton Chen and C Gaelyn Garrett, “Otolaryngologicpresentations of amyotrophic lateral sclerosis,” Otolaryn-gology—Head and Neck Surgery, vol. 132, no. 3, pp. 500–504,2005.

[4] Chester Griffiths and I David Bough Jr, “Neurologic dis-eases and their effect on voice,” Journal of Voice, vol. 3, no.2, pp. 148–156, 1989.

[5] Jan Hlavnicka, Tereza Tykalova, Olga Ulmanova, PetrDusek, Dana Horakova, Evzen Ruzicka, Jirı Klempır, andJan Rusz, “Characterizing vocal tremor in progressiveneurological diseases via automated acoustic analyses,”Clinical Neurophysiology, vol. 131, no. 5, pp. 1155–1165,2020.

[6] Juan Jose G Meilan, Francisco Martınez-Sanchez, JuanCarro, Dolores E Lopez, Lymarie Millian-Morell, andJose M Arana, “Speech in alzheimer’s disease: can tem-poral and acoustic parameters discriminate dementia?,”Dementia and Geriatric Cognitive Disorders, vol. 37, no. 5-6,pp. 327–334, 2014.

[7] Israel Martınez-Nicolas, Thide E Llorente, FranciscoMartınez-Sanchez, and Juan Jose G Meilan, “Ten years ofresearch on automatic voice and speech analysis of peoplewith alzheimer’s disease and mild cognitive impairment:A systematic review article,” Frontiers in Psychology, vol.12, pp. 645, 2021.

[8] Peak Woo, Janina Casper, Raymond Colton, and DavidBrewer, “Dysphonia in the aging: physiology versusdisease,” The Laryngoscope, vol. 102, no. 2, pp. 139–144,1992.

[9] Levent Gurbuzler, Ahmet Inanir, Kursat Yelken, SemaKoc, Ahmet Eyibilen, and Ismail Onder Uysal, “Voice dis-order in patients with fibromyalgia,” Auris Nasus Larynx,vol. 40, no. 6, pp. 554–557, 2013.

[10] Kay Coombes, “Voice in people with cerebral palsy,” inVoice Disorders and their Management, pp. 202–237. Springer,1991.

[11] Nick Miller, Lindsay Pennington, Sheila Robson, Ella Roe-lant, Nick Steen, and Eftychia Lombardo, “Changes invoice quality after speech-language therapy interventionin older children with cerebral palsy,” Folia Phoniatrica etLogopaedica, vol. 65, no. 4, pp. 200–207, 2013.

[12] T Narrain Sawmy Naidoo, “Notes of cases of choleratreated by sulphurous acid,” The Indian medical gazette,vol. 12, no. 8, pp. 219, 1877.

[13] CC Carpenter, “Cholera: diagnosis and treatment.,” Bul-letin of the New York Academy of Medicine, vol. 47, no. 10,pp. 1192, 1971.

[14] Frederick Humphreys, The Cholera and Its HomoeopathicTreatment, Radde, 1849.

[15] Maral Asiaee, Amir Vahedian-Azimi, Seyed ShahabAtashi, Abdalsamad Keramatfar, and Mandana Nour-bakhsh, “Voice quality evaluation in patients with covid-19: An acoustic analysis,” Journal of Voice, 2020.

[16] Mahmoud Al Ismail, Soham Deshmukh, and Rita Singh,“Detection of covid-19 through the analysis of vocal foldoscillations,” in ICASSP 2021-2021 IEEE International Con-ference on Acoustics, Speech and Signal Processing (ICASSP).IEEE, 2021, pp. 1035–1039.

[17] Soham Deshmukh, Mahmoud Al Ismail, and Rita Singh,“Interpreting glottal flow dynamics for detecting covid-19from voice,” in ICASSP 2021-2021 IEEE International Con-ference on Acoustics, Speech and Signal Processing (ICASSP).IEEE, 2021, pp. 1055–1059.

[18] Abdul-latif Hamdan, Jad Jabbour, Jihad Nassar, Iyad Da-houk, and Sami T Azar, “Vocal characteristics in patientswith type 2 diabetes mellitus,” European Archives of Oto-Rhino-Laryngology, vol. 269, no. 5, pp. 1489–1495, 2012.

[19] Mary T Lee, Jude Thorpe, and Jo Verhoeven, “Intonationand phonation in young adults with down syndrome,”Journal of Voice, vol. 23, no. 1, pp. 82–87, 2009.

[20] Weifeng Li, Ziyi Chen, Nan Yan, Jeffery A Jones, ZhiqiangGuo, Xiyan Huang, Shaozhen Chen, Peng Liu, and Han-jun Liu, “Temporal lobe epilepsy alters auditory-motorintegration for voice control,” Scientific reports, vol. 6, no.1, pp. 1–13, 2016.

[21] HT Anil, N Lasya Raj, and Nikitha Pillai, “A study onetiopathogenesis of vocal cord paresis and palsy in a ter-tiary centre,” Indian Journal of Otolaryngology and Head &Neck Surgery, vol. 71, no. 3, pp. 383–389, 2019.

[22] Nazila Jamshidi and Andrew Dawson, “The hot patient:acute drug-induced hyperthermia,” Australian prescriber,vol. 42, no. 1, pp. 24, 2019.

[23] Megan E Musselman and Suprat Saely, “Diagnosis andtreatment of drug-induced hyperthermia,” American Jour-nal of Health-System Pharmacy, vol. 70, no. 1, pp. 34–42,2013.

[24] Ahmed Faraz Aslam, Ahmad Kamal Aslam, Balendu CVasavada, and Ijaz A Khan, “Hypothermia: evaluation,electrocardiographic manifestations, and management,”The American journal of medicine, vol. 119, no. 4, pp. 297–301, 2006.

[25] Clare F Lee, Paul N Carding, and Mike Fletcher, “Thenature and severity of voice disorders in lung cancer pa-tients,” Logopedics Phoniatrics Vocology, vol. 33, no. 2, pp.93–103, 2008.

[26] WH Lloyd, “Value of the voice in diagnosis of myx-oedema in the elderly,” British medical journal, vol. 1, no.5131, pp. 1208, 1959.

[27] Paul E Marik and Danielle Kaplan, “Aspiration pneumo-nia and dysphagia in the elderly,” Chest, vol. 124, no. 1,pp. 328–336, 2003.

[28] KD Shah, KH Ayyer, and UK Shah, “Hoarseness ofvoice—a presenting manifestation of primary pulmonaryhypertension,” Indian Journal of Otolaryngology, vol. 32, no.2, pp. 35–36, 1980.

[29] Keith W Godin and John HL Hansen, “Physical task stressand speaker variability in voice quality,” EURASIP Journalon Audio, Speech, and Music Processing, vol. 2015, no. 1, pp.1–13, 2015.

[30] Klaus R Scherer, “Voice, stress, and emotion,” in Dynamicsof stress, pp. 157–179. Springer, 1986.

[31] Isabel Guimaraes and Evelyn Abberton, “Health andvoice quality in smokers: an exploratory investigation,”Logopedics Phoniatrics Vocology, vol. 30, no. 3-4, pp. 185–191, 2005.

[32] Shin-Woong Cho, Chang Shik Yin, Young-Bae Park, andYoung-Jae Park, “Differences in self-rated, perceived, andacoustic voice qualities between high-and low-fatiguegroups,” Journal of Voice, vol. 25, no. 5, pp. 544–552, 2011.

[33] Ruth Epstein, Shashivadan P Hirani, Jan Stygall, and Stan-ton P Newman, “How do individuals cope with voicedisorders? introducing the voice disability coping ques-tionnaire,” Journal of Voice, vol. 23, no. 2, pp. 209–217,2009.

[34] Gail B Kempster, Bruce R Gerratt, Katherine VerdoliniAbbott, Julie Barkmeier-Kraemer, and Robert E Hillman,“Consensus auditory-perceptual evaluation of voice: de-velopment of a standardized clinical protocol,” AmericanJournal of Speech-Language Pathology (AJSLP), vol. 18, no. 2,pp. 124–132, 2009.

[35] Jorge C Lucero and Jean Schoentgen, “Modeling vocalfold asymmetries with coupled van der pol oscillators,”in Proceedings of Meetings on Acoustics ICA2013. AcousticalSociety of America, 2013, vol. 19, p. 060165.

[36] Jorge C Lucero, “Dynamics of the two-mass model ofthe vocal folds: Equilibria, bifurcations, and oscillationregion,” The Journal of the Acoustical Society of America, vol.94, no. 6, pp. 3104–3111, 1993.

AN OVERVIEW OF TECHNIQUES FOR BIOMARKER DISCOVERY IN …

Documents

Transcript of AN OVERVIEW OF TECHNIQUES FOR BIOMARKER DISCOVERY IN …