Auto-Tune
In music production, Auto-Tune is an effect plug-in that corrects a pitchy, out-of-tune melody to the proper musical notes. This process consists of analyzing a melody from an audio signal, identifying the melody's pitches and the closest musical notes they should land on, and altering the original signal to those notes. Our project explores this process by following these same steps, allowing us to apply our digital signal processing knowledge and better understand how Auto-Tune works.
PROCESS
01
Get and Filter Audio
Data
Our data consists of pre-recorded mono audio signals that are in .wav file format. The samples include vocals singing major scales, both with and without solfège, guitar, and vocals over guitar.
Filtering Techniques
Our filtering process consists of a bandpass filter and a noise gate. Both steps remove the harmonic content contributed by noise, making it easier to detect the given audio signal's pitches in the next step.
The bandpass filter is a 7th-order Butterworth filter with a bandpass range of 300 Hz to 10 kHz. This bandpass filter maintains the given audio signal's timbre while cutting out unwanted muddy sub-frequencies and high-frequency mechanical and electrical noise.
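A rough MATLAB sketch of this stage follows; the file name is a placeholder and the use of zero-phase filtfilt is our assumption, not necessarily the project's exact code.

    % Sketch of the bandpass stage. Note that butter() builds a bandpass from a
    % lowpass prototype of the given order, so requesting order 7 yields a
    % 14th-order bandpass; the exact design used in the project may differ.
    [x, fs] = audioread('sample.wav');    % mono .wav input (placeholder file name)
    Wn = [300 10000] / (fs/2);            % passband edges normalized to Nyquist
    [b, a] = butter(7, Wn, 'bandpass');   % Butterworth bandpass, 300 Hz - 10 kHz
    xFiltered = filtfilt(b, a, x);        % zero-phase filtering (no phase smearing)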
The noise gate adapts to the given audio signal's dynamic range by detecting the largest dynamic leap and assuming that this is when the performer has begun singing or playing. The algorithm then attenuates anything below the decibel level measured prior to this leap, removing unwanted background noise while the performer is not singing or playing.
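A hedged sketch of the gate's logic, continuing from xFiltered above; the 20 ms frame size and the use of frame RMS levels are illustrative assumptions.

    frameLen = round(0.02 * fs);                  % 20 ms frames (assumed size)
    nFrames  = floor(length(xFiltered) / frameLen);
    levelDb  = zeros(nFrames, 1);
    for k = 1:nFrames
        frame = xFiltered((k-1)*frameLen+1 : k*frameLen);
        levelDb(k) = 20*log10(sqrt(mean(frame.^2)) + eps);  % frame level in dB
    end
    [~, jumpIdx] = max(diff(levelDb));            % largest dynamic leap = performance start
    threshDb = levelDb(jumpIdx);                  % threshold: level just before the leap
    gated = xFiltered;
    for k = find(levelDb < threshDb)'             % silence frames below the threshold
        gated((k-1)*frameLen+1 : k*frameLen) = 0;
    end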
02
Detect Pitch
Detection Algorithm
In order to detect the pitches of a given signal, we determine the fundamental frequency of each pitch through spectrogram analysis. However, when detecting fundamental frequencies in vocal samples, we must be mindful of the harmonics produced by the human voice: certain vowels have more energy in the harmonics than in the fundamental note. In the following image of a scale being sung, notice that there is more energy in the second harmonic for most of the notes.

In order to properly shift the sample, we need to find the fundamental pitch. For this, we use the Harmonic Product Spectrum (HPS). This technique multiplies each spectrogram window, in the frequency domain, by downsampled versions of itself, sample by sample; because a note's harmonics fall at integer multiples of its fundamental, the products reinforce the fundamental so that it grows larger than any single harmonic. The following image demonstrates this technique.

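A minimal sketch of HPS on one analysis window, continuing from the gated signal above; the window length and the use of four downsampled copies are example values, not the project's parameters.

    winLen = 4096;                        % analysis window length (example value)
    frame  = gated(1:winLen) .* rectwin(winLen);  % rectangular window (see planner below)
    mag    = abs(fft(frame));
    mag    = mag(1:floor(winLen/2));      % keep the positive frequencies
    hps    = mag;
    for r = 2:4                           % multiply by downsampled copies of the spectrum
        ds = mag(1:r:end);
        hps(1:numel(ds)) = hps(1:numel(ds)) .* ds;
    end
    [~, binIdx] = max(hps);
    f0 = (binIdx - 1) * fs / winLen;      % fundamental-frequency estimate in Hz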
Planner Function
In order to get an optimal spectrogram for analysis, parameters like the window length and overlap must be carefully chosen. Signals that change a lot in a short period of time require smaller windows to properly capture the fundamental pitches, and the overlap between windows should be chosen accordingly. To automate the choice of these parameters, we created a planner function that looks at the transients in a signal and decides the proper window length and overlap. All windows are rectangular, as this proved to be the most effective window for this application.
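The planner's actual heuristic is specific to our implementation; the sketch below only illustrates the idea of a transient count driving the window choice, and every threshold and bound in it is invented for the example.

    env    = abs(hilbert(gated));                 % amplitude envelope of the signal
    trans  = sum(abs(diff(env)) > 0.05);          % crude transient count (invented threshold)
    rate   = trans / (length(gated) / fs);        % transients per second
    winLen = max(512, min(8192, 2^nextpow2(round(fs / max(rate, 1)))));  % busier -> shorter
    noverlap = round(0.75 * winLen);              % overlap scaled with the window (assumed 75%)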
Quantizing the Pitches
Once all the fundamental frequencies are identified in all the windows, the program must find the closest corresponding musical note for each fundamental frequency in order to properly shift the given signal. To find the closest note, we quantize the fundamental frequencies to a grid of known notes from Western music theory, based on A4 being equal to 440 Hz. Knowing that a one-octave difference corresponds to doubling the frequency, dividing the octave into twelve equal semitones makes the twelfth root of two the factor that, when multiplied by a frequency, yields the note one semitone above it. From this, the following formula is derived:
n = 12 · log2(f2 / f1)
Here, n is the number of semitones f2 is from f1. Taking f1 to be 440 Hz, the formula outputs how many semitones f2 is from A4; if n is negative, f2 has a lower frequency than f1. In most instances, n is not an integer, so we round n to the nearest integer to find how many semitones the nearest note is from A4. Then, n is added to the index of A4 in the notes array, and the new index gives the quantized pitch that the given signal will be shifted to. Here is an example of the melody output from a given signal.

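As a concrete sketch of this quantization step (the note-array layout and the printout are our own, not the project's exact code):

    noteNames = {'A','A#','B','C','C#','D','D#','E','F','F#','G','G#'};
    f0   = 452.3;                          % example detected fundamental (Hz)
    n    = 12 * log2(f0 / 440);            % semitones from A4 (fractional)
    nInt = round(n);                       % quantize to the nearest semitone
    fQ   = 440 * 2^(nInt / 12);            % quantized target frequency in Hz
    name   = noteNames{mod(nInt, 12) + 1}; % note name relative to A
    octave = 4 + floor((nInt + 9) / 12);   % octave number, with A4 as reference
    fprintf('%.1f Hz -> %s%d (%.1f Hz)\n', f0, name, octave, fQ);

Running this prints "452.3 Hz -> A4 (440.0 Hz)", since 452.3 Hz is less than half a semitone above A4.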
03
Shift Pitch
Pitch Synchronous Overlap-Add
Audio signals are composed of repeating patterns within pitch periods, which determine attributes like tone and vowel identity. To alter the given audio signal after determining the correct notes, without distorting these important attributes, we use an algorithm called Pitch Synchronous Overlap-Add (PSOLA). The PSOLA algorithm involves “stretching” or “squashing” a signal in time to change its frequency. Once the correct notes of our audio signal are identified, PSOLA allows us to tune the given signal without distorting the characteristics that influence if and how words are understood by a listener.
Identifying the Pitch Period
As mentioned above, each audio signal can be split into repeating patterns; the distance between them is defined as the pitch period. This is commonly calculated as the distance between the local maximum values of consecutive repetitions. To calculate this in MATLAB, we windowed our audio sample using the fundamental frequency found by our pitch detection algorithm (window length = 1/pitch). Then, we determined the maximum peak within each window and calculated the pitch periods as the time between peaks. The following image showcases these repeating patterns, with the pitch period measured as the time between maximum values.

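A small sketch of this windowed peak search, using a synthetic note as a stand-in for the real sample; fs and f0 here stand for the sample rate and the pitch detected in the previous step.

    fs = 44100;  f0 = 220;                 % assumed sample rate and detected pitch
    x  = sin(2*pi*f0*(0:fs/2)'/fs);        % synthetic half-second note as a stand-in
    P  = round(fs / f0);                   % window length = 1/pitch, in samples
    nWin = floor(length(x) / P);
    peakIdx = zeros(nWin, 1);
    for k = 1:nWin
        win = (k-1)*P + (1:P);             % one pitch-period-long window
        [~, m] = max(x(win));              % maximum peak within the window
        peakIdx(k) = win(m);
    end
    periods = diff(peakIdx);               % pitch periods = samples between peaks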
Extracting the Pitch Period
To extract each pitch period, a Hamming window spanning two pitch periods is applied at every previously identified pitch period peak. Pitch periods are stored separately as rows in a matrix for processing.

Shifting the Pitch Period
To simplify the PSOLA implementation, we assume the audio signal is a single note with a given pitch. With the pitch periods identified and extracted, the new pitch period is calculated as the measured pitch period times the ratio of the measured and ideal pitches. Each extracted pitch period is shifted in time to match the new pitch period and padded with zeros so that no data is lost. The shifted pitch periods are then summed to form the new, tuned audio signal.
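A greatly simplified sketch of these steps for one sustained note; it idealizes the pitch-period peaks as evenly spaced rather than measured, and leaves out the resizing described next.

    fs = 44100;
    f0meas   = 225;                            % detected pitch (Hz; a sharp A3)
    f0target = 220;                            % nearest quantized note (A3)
    x = sin(2*pi*f0meas*(0:fs-1)'/fs);         % stand-in for one note of the signal
    Pold = round(fs / f0meas);                 % measured pitch period in samples
    Pnew = round(Pold * f0meas / f0target);    % new period = old period * pitch ratio
    peaks = Pold+1 : Pold : length(x) - Pold;  % idealized pitch-period peak locations
    w = hamming(2*Pold + 1);                   % window spanning two pitch periods
    y = zeros((numel(peaks)-1)*Pnew + 2*Pold + 1, 1);
    for k = 1:numel(peaks)
        seg = x(peaks(k)-Pold : peaks(k)+Pold) .* w;   % extract one windowed period
        c = 1 + (k-1)*Pnew + Pold;                     % re-centre at the new spacing
        y(c-Pold : c+Pold) = y(c-Pold : c+Pold) + seg; % overlap-add into the output
    end
    soundsc(y, fs);                            % audition the re-tuned note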
Note that each shifted audio signal must be resized to match the length of the given audio signal to avoid distortion. If the shifted signal is shorter (i.e., the pitch was raised), groupings of 15 pitch periods are duplicated at the beginning and end of the signal until its size matches the original.

DEMO
Original Samples
Auto-Tuned Samples
(Samples labeled with "Multiplier" indicate that the pitch was multiplied by that value before Auto-Tuning)
ANALYSIS
This project was extremely beneficial to the development of our digital signal processing knowledge. Because we are all interested in music and audio, we were excited to further explore how audio can be altered with digital signal processing. Within this context, we were able to better understand how to implement skills such as filtering, windowing, and spectrogram analysis, while also teaching ourselves new digital signal processing tools. By choosing a specific music production effect to mimic, we came to understand how algorithms like filtering and tuning might actually be implemented in the industry.
In terms of improving our Auto-Tune, finer resolution in the tuning process might mitigate the audible fluttering heard in the affected samples. We could also enhance the project by supporting live audio input and by creating a graphical user interface for users to interact with.
Overall, we are happy with the results of our project, as we were able to successfully implement a working Auto-Tune algorithm.
DSP TOOLS
1
Spectrogram/STFT
In Class
The Short-Time Fourier Transform (STFT) is the basis for the spectrogram, which we used primarily for pitch identification. The result of the STFT was processed with the Harmonic Product Spectrum to find the fundamental frequency in each window of an audio signal.
2
Windowing
In Class
During pitch detection, windows were applied in the time domain to split the signal into chunks for frequency-domain analysis. Additionally, while tuning with the PSOLA algorithm, Hamming windows were applied to split the signal into its individual pitch periods for shifting.
3
Filtering
In Class
For noisy input signals, a bandpass filter isolating the frequency range of human vocals was applied.
4
Pitch Synchronous Overlap-Add
Outside Class
The PSOLA algorithm involves “stretching” or “squashing” a signal in time to change its frequency. The signal is decomposed into unique repeating vocal patterns, called pitch periods, which are shifted in time to change their frequency. The PSOLA algorithm allowed us to tune the signal without distorting other important characteristics, such as vowel identity, that influence if and how words are understood by a listener.
5
Harmonic Product Spectrum
Outside Class
HPS is a frequency-domain technique for identifying the fundamental frequency of an audio signal with many harmonics. Because the voice often carries more energy in the harmonics than in the fundamental, the highest-magnitude frequency is often not the fundamental frequency of an audio signal. HPS multiplies the spectrum by downsampled copies of itself so that harmonic energy reinforces the fundamental; the resulting highest-power peak is the true pitch of the audio signal.