DeepFry: deep neural network algorithms for identifying Vocal Fry


Written by Roni Chernyak with the help of Eleanor Chodroff, Jennifer S. Cole, Talia Ben Simon, Yael Segal, and Jeremy Steffman. Originally published on Medium on September 2, 2022. This post is about our paper DeepFry, which was accepted for publication at Interspeech 2022.

What do Britney Spears, Zooey Deschanel, Scarlett Johansson, and Kim Kardashian all have in common? They all use the tonal quality of vocal fry, a type of creaky sound that occurs when the voice drops to its lowest register. And men use it too. Vocal fry has drawn a lot of attention recently: how celebrities use it, how Kim Kardashian and rappers in particular use it to sound sexier, and how it has become a language fad among young women. A worried grandmother even wrote a letter to the Chicago Tribune complaining that her 8-year-old granddaughter “is now emulating her teacher’s voice and, not only has her beautiful singing voice suffered, it’s distressing to me that her strong, clear speaking voice may be forever lost.” Apparently, that won’t happen.

In this post, we will explain what vocal fry is, describe the problems it poses for various signal processing and machine learning algorithms, and introduce an improved deep learning algorithm for detecting it.

Characteristics: A Low and Irregular Pitch

Air coming from the lungs pushes against the vocal folds, forcing an opening that starts at the bottom of the folds and moves upward. As the vocal folds separate at the upper edge, the lower portion closes, just as an elastic band snaps back after being stretched. With continued upward air pressure, the vocal folds are again forced open at the bottom and again close shut, in a repeating pattern. The frequency of this vibration (the number of open-shut cycles per second, also termed the fundamental frequency or f0) determines the perceived pitch of the voiced sound. During vocal fry, the vibrations are irregular, because the vocal folds are somewhat relaxed during closure and the airflow from the lungs decreases.

This results in slower and irregular vibrations, which we hear as a creaky voice quality that is also lower in pitch.

Speech with vocal fry (right) vs. normal speech (left). The distinctive pattern difference (e.g., irregular pitch and irregular periodicity) is visible in both the spectrogram and the acoustic waveform.

Breaking long-used algorithms
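Classical pitch trackers and spectral front ends assume the signal is locally periodic and stationary within each analysis window. The slow, irregular pulses of vocal fry violate exactly that assumption, so f0 estimates become unreliable in the very regions where creak occurs. A minimal sketch of the failure mode, using a simple autocorrelation-based f0 estimator with illustrative parameters (not any specific tracker from the paper):

```python
import numpy as np

def autocorr_f0(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate f0 of a frame from the strongest autocorrelation peak.

    This works only if the frame is locally periodic: the peak then
    sits at the true period. fmin/fmax bound the lag search range.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(int(0.05 * sr)) / sr  # one 50 ms frame

# Modal voice: a regular 120 Hz tone -> clean, correct estimate.
modal = np.sin(2 * np.pi * 120 * t)
print("modal estimate:", autocorr_f0(modal, sr))  # close to 120 Hz

# Creaky voice: irregular glottal pulses (periods jittered around
# 60 Hz) break the periodicity assumption, so the estimate is
# unstable and often far from the underlying pulse rate.
rng = np.random.default_rng(0)
creaky = np.zeros_like(t)
i = 0.0
while i < len(t):
    creaky[int(i)] = 1.0
    i += (sr / 60.0) * rng.uniform(0.7, 1.4)  # jittered periods
print("creaky estimate:", autocorr_f0(creaky, sr))
```

The modal estimate lands on the true f0, while the creaky estimate depends on whichever jittered pulse pair happens to dominate the autocorrelation, which is why standard trackers tend to report missing or spurious pitch during creak.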

A new approach — DeepFry


Operating on the raw waveform

Vocal fry can span varying durations: it can be very short or longer than half a second. To capture sounds defined over both shorter and longer periods, including vocal fry, and to avoid the problems introduced by pre-processing, the input to the network is the raw acoustic waveform, split into frames of varying durations without any pre-processing. This bypasses the ‘stationarity’ assumption and the windowing required by MFCCs and STFTs, allowing more flexibility to uncover regions of creak.
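The input pipeline can be sketched as plain overlapping slices of raw samples. Frame and hop sizes below are illustrative, not the paper's exact values; the point is that no pre-emphasis, windowing, or spectral transform is applied:

```python
import numpy as np

def frame_raw(wave, sr, frame_ms=20, hop_ms=5):
    """Split a raw waveform into overlapping frames of raw samples.

    No pre-emphasis, no Hann window, no spectral transform: each row
    is raw audio, so the model is free to learn features that do not
    rely on local stationarity. frame_ms/hop_ms are illustrative.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(wave) - frame_len) // hop)
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return wave[idx]

sr = 16000
wave = np.random.randn(sr)        # 1 s of stand-in audio
frames = frame_raw(wave, sr)
print(frames.shape)               # (197, 320): 197 frames of 320 samples
```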

Large receptive field

As mentioned above, vocal fry can have varying durations. We wanted to ensure that our encoder has a receptive field large enough to capture the varying periodicity of vocal fry and learn a good latent representation of the signal. The standard way to achieve such a representation is with an encoder; ours is a fully convolutional neural network whose filters (kernels) are large compared to those typically used in speech processing.
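The effect of large kernels on the receptive field can be checked with simple arithmetic. The layer sizes below are an illustrative large-kernel stack, not the paper's published architecture:

```python
def receptive_field(layers):
    """Receptive field (in input samples) of stacked 1-D convolutions.

    `layers` is a list of (kernel_size, stride) pairs. Each layer
    widens the receptive field by (kernel - 1) times the cumulative
    stride ("jump") of all layers before it.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Illustrative stack: a large first kernel over the raw waveform,
# followed by strided convolutions with kernels still far larger
# than the 3-5 sample kernels common in speech models.
layers = [(80, 4), (32, 4), (32, 4), (32, 4)]
rf = receptive_field(layers)
print(rf, "samples =", 1000 * rf / 16000, "ms at 16 kHz")
```

With these (assumed) sizes each output frame sees 2,684 samples, roughly 168 ms at 16 kHz, which is wide enough to cover the slow, irregular pulses of a creaky stretch rather than a single glottal cycle.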

Multi-task learning framework

We trained our model to predict three tasks simultaneously: vocal fry (creak), voicing, and pitch. Creak detection is the main task; the auxiliary tasks are:

  1. Pitch — because detecting pitch and detecting vocal fry are correlated tasks, we added the task of predicting whether a given frame is pitched.

  2. Voicing — predicting whether a given frame is voiced, i.e., whether the vocal folds are vibrating.
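A multi-task objective of this shape can be sketched as a weighted sum of per-task, frame-wise binary cross-entropy losses. The equal task weights and the toy numbers below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def bce(p, y):
    """Frame-wise binary cross-entropy between probabilities and labels."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def multitask_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task BCE losses for creak, voicing, and pitch.

    Equal weights are an illustrative choice; in practice the weights
    are hyperparameters.
    """
    tasks = ("creak", "voicing", "pitch")
    return sum(w * bce(pred[k], target[k]) for w, k in zip(weights, tasks))

# Made-up per-frame probabilities and labels for 4 frames per task.
pred = {"creak":   np.array([0.9, 0.2, 0.1, 0.8]),
        "voicing": np.array([0.95, 0.9, 0.1, 0.9]),
        "pitch":   np.array([0.9, 0.85, 0.05, 0.1])}
target = {"creak":   np.array([1.0, 0.0, 0.0, 1.0]),
          "voicing": np.array([1.0, 1.0, 0.0, 1.0]),
          "pitch":   np.array([1.0, 1.0, 0.0, 0.0])}
print(multitask_loss(pred, target))
```

Because the three heads share the encoder, gradients from the voicing and pitch tasks shape the same latent representation used for creak detection, which is the point of the auxiliary tasks.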

Final notes

In this post, we introduced vocal fry, explained why it is difficult to detect, and proposed two methods to detect it. However, there is still work to be done. For instance, we trained only on labeled instances of vocal fry, but very little speech data has vocal fry explicitly labeled; such labeling requires phonetic expertise and is difficult and time-consuming. Detection could be further improved by using the trained model to label additional segments in a semi-supervised fashion and retraining on them.