
Signal Processing Basics

Song Scope is designed to make it easy for field biologists and others to automate the process of finding specific kinds of vocalizations in long field recordings. While Song Scope implements sophisticated algorithms for efficient signal detection and pattern recognition, you don't need a strong background in mathematics, statistics, or digital signal processing to use it. That said, there are a few fundamentals you should understand.

The Nature of Sound

Sound is a wave of pressure vibrations, created by vibrating objects, that propagates through air, water, and other materials. The speed of sound through air at sea level is approximately 340.29 meters (1116.4 feet) per second at 15 °C. In open air, sound propagates away from the vibrating source in all directions, creating an expanding spherical wave of air pressure vibrations.

As the wave expands away from the source, its energy is spread over an ever larger sphere, so the sound intensity (power per unit area) decreases with the square of the distance. In other words, a sound from 10 feet away will have 1/100th the intensity of the sound from only 1 foot away, and a sound from 100 feet away will have 1/10000th the intensity of the sound from only 1 foot away. (The pressure amplitude itself falls off in proportion to the distance: 1/10th at 10 feet and 1/100th at 100 feet.)

One of the remarkable qualities of our ears and brain is that our perception of volume is roughly "logarithmic" with changes in intensity. A sound from 10 feet away is 20 decibels quieter than the same sound from 1 foot away, and we perceive it as only a few times quieter, even though its intensity is only 1/100th as much. This enables us to hear quiet/distant sounds as well as loud/close sounds. Or to put it another way, the human ear has a very large dynamic range.
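The falloff with distance and the logarithmic scale described above can be checked with a few lines of arithmetic. The following Python sketch is an illustration only and is not part of Song Scope. The function names are hypothetical, and it assumes the inverse-square ratio is treated as a power (intensity) ratio, which is why the decibel conversion uses 10 times the base-10 logarithm.

```python
import math

def intensity_ratio(distance, reference_distance=1.0):
    """Relative intensity at `distance` compared with `reference_distance`.

    Assumption: free-field spreading, so intensity follows the inverse-square law.
    """
    return (reference_distance / distance) ** 2

def ratio_to_decibels(power_ratio):
    """Express a power (intensity) ratio on the logarithmic decibel scale."""
    return 10.0 * math.log10(power_ratio)

for feet in (1, 10, 100):
    ratio = intensity_ratio(feet)
    print(f"{feet:>4} ft: ratio = {ratio:.4f}, level = {ratio_to_decibels(ratio):.1f} dB")
```

Each tenfold increase in distance lowers the level by 20 decibels, so a ratio as extreme as 1/10000 is expressed as a drop of only 40 decibels.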

A microphone measures changes in air pressure and produces an electric voltage in proportion to the pressure levels. Unlike the human ear, microphones have a limited dynamic range. Very sensitive microphones can pick up small changes in pressure (e.g. from distant or quiet sound sources), but would be overwhelmed by a close/loud sound.

Frequency

Frequency represents the "pitch" of a sound and is measured as the rate at which the air pressure vibrates. Faster vibrations are higher pitched. For example, middle C on a musical scale has a vibration frequency of around 262 cycles per second (or Hertz, Hz). An octave above middle C will have twice the frequency (524Hz), and an octave below middle C will have half the frequency (131Hz). Some very high-pitched birds sing at frequencies of 8,000 to 10,000Hz.
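The octave relationship is simple to express in code. This is a small illustrative sketch with hypothetical names, using the approximate 262Hz value for middle C given above.

```python
MIDDLE_C_HZ = 262.0  # approximate value quoted in the text

def shift_octaves(frequency_hz, octaves):
    """Return the frequency `octaves` above (positive) or below (negative) the input."""
    return frequency_hz * 2.0 ** octaves

def period_ms(frequency_hz):
    """Duration of one vibration cycle in milliseconds."""
    return 1000.0 / frequency_hz

for octaves in (-1, 0, 1):
    hz = shift_octaves(MIDDLE_C_HZ, octaves)
    print(f"{octaves:+d} octave(s): {hz:.0f} Hz, one cycle every {period_ms(hz):.2f} ms")
```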

Digitizing Sound

To digitize sound, the signal pressure level of a sound wave is measured (sampled) thousands of times per second resulting in a sequence of numerical values corresponding to the pressure level as it changes through time. The rate at which sound is sampled is called the sample rate, usually measured in samples per second, or Hertz (Hz).

In order to detect high frequencies, a faster sample rate must be used. In fact, the sample rate must be at least twice the highest frequency to be detected (this minimum is known as the Nyquist rate). In other words, a sample rate of at least 524Hz would be needed (and in reality, just a little more) in order to detect middle C (262Hz), and a sample rate of at least 20,000Hz would be needed to detect some very high-pitched birds singing at 10,000Hz. Music CDs are sampled at 44,100Hz.
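The factor-of-two rule can be sketched as follows. The example also shows what happens when the rule is broken: a tone above half the sample rate does not simply disappear but shows up at a lower, false frequency, an effect called aliasing. The function names are hypothetical and this is an illustration, not Song Scope code.

```python
def minimum_sample_rate(highest_frequency_hz):
    """Lowest sample rate that can still capture the given frequency."""
    return 2.0 * highest_frequency_hz

def apparent_frequency(signal_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling.

    Tones above half the sample rate are folded back below it (aliasing).
    """
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(minimum_sample_rate(262))            # middle C
print(minimum_sample_rate(10000))          # a very high-pitched bird
print(apparent_frequency(10000, 44100))    # CD rate: tone is captured correctly
print(apparent_frequency(10000, 16000))    # rate too low: tone appears at a false frequency
```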

Song Scope uses 16 bits to represent each sample of sound as an integer between -32,768 and +32,767. These values are proportional to the changes in sound pressure levels measured by a microphone.
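As a rough illustration of the 16-bit range, the sketch below converts a normalized pressure level between -1.0 and +1.0 into a 16-bit integer sample and estimates the dynamic range that 16 bits can represent. The scaling factor of 32,767 and the function name are assumptions made for this example; the text does not say how Song Scope scales its input.

```python
import math

INT16_MIN = -32768
INT16_MAX = 32767

def to_int16(level):
    """Convert a normalized level in [-1.0, 1.0] to a 16-bit sample.

    Assumption: full scale maps to 32,767 and out-of-range input is clipped.
    """
    scaled = int(round(level * INT16_MAX))
    return max(INT16_MIN, min(INT16_MAX, scaled))

# Ratio between the largest magnitude and the smallest non-zero step, in decibels.
dynamic_range_db = 20.0 * math.log10(2 ** 15)

print(to_int16(0.25), to_int16(1.0), to_int16(1.5))
print(f"approximate dynamic range: {dynamic_range_db:.1f} dB")
```

A level of 1.5 is clipped to the largest representable value, which is the digital counterpart of a microphone being overwhelmed by a loud sound.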

Time Domain

The Waveform Plot is used to display sound waves in the "time domain" and shows the actual sample values corresponding to changing sound pressure levels through time.

Frequency Domain

The Spectrogram Plot is used to display sound waves in the "frequency domain" and shows the individual frequency components that make up sound as the sound changes through time. For example, a sustained middle C would show up as a horizontal line through time at 262Hz.

This simple visualization is calculated from the time domain sound data using an algorithm called the Fast Fourier Transform, or FFT. The FFT is a fundamental concept in digital signal processing.

Fast Fourier Transform (FFT)

The FFT transforms a time domain signal into the frequency domain. The input to the FFT is a "window" of N time domain samples, where N is a power of two. The output of the FFT is used to compute the power spectrum of the window represented as N/2 "frequency bins". The bins are evenly spaced and represent frequencies between zero and half the sample rate.
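To make the input and output of the FFT concrete, here is a minimal pure-Python sketch of a radix-2 FFT and the power spectrum computed from it. It is a textbook version written for readability, not the implementation used by Song Scope. The sample rate, tone frequency, and function names are hypothetical values chosen so that the test tone lands exactly on one bin.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("window length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def power_spectrum(window):
    """Power in the first N/2 frequency bins of one window of N samples."""
    spectrum = fft(window)
    return [abs(value) ** 2 for value in spectrum[: len(window) // 2]]

SAMPLE_RATE = 8192  # hypothetical rate, chosen so bins fall on whole numbers of Hz
N = 256             # window size, a power of two
TONE_HZ = 1024      # hypothetical test tone

window = [math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE) for i in range(N)]
power = power_spectrum(window)
peak_bin = max(range(len(power)), key=power.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N

print(f"{len(power)} bins, strongest bin {peak_bin} at {peak_hz:.0f} Hz")
```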

For example, consider a time domain signal sampled at 44,100 samples per second and an FFT window size N of 256 (2 to the power of 8). The output of the FFT can be used to calculate 128 frequency bins between 0Hz and 22,050Hz (with a resolution of 22,050/128 = 172.27Hz between each bin). The relative power level in each bin is typically represented on a log scale measured in decibels. Notice that a single FFT window of 256 samples divided by 44,100 samples per second represents a slice of time equal to 256/44,100 = 5.80 milliseconds.
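The arithmetic in this example can be reproduced directly. The sketch below uses a hypothetical helper function and the same values as the text (44,100 samples per second and N = 256).

```python
def fft_parameters(sample_rate_hz, window_size):
    """Return (number of bins, highest frequency, bin spacing in Hz, window length in ms)."""
    bins = window_size // 2
    highest_hz = sample_rate_hz / 2
    bin_spacing_hz = highest_hz / bins
    window_ms = 1000.0 * window_size / sample_rate_hz
    return bins, highest_hz, bin_spacing_hz, window_ms

bins, highest_hz, bin_spacing_hz, window_ms = fft_parameters(44100, 256)
print(f"{bins} bins from 0 to {highest_hz:.0f} Hz")
print(f"bin spacing: {bin_spacing_hz:.2f} Hz")
print(f"window length: {window_ms:.2f} ms")
```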

A spectrogram can be generated by computing a series of FFTs. The output of each FFT represents the frequency power levels over a narrow slice of time. The series of slices can be used to show how the frequency components of a signal change through time. It is also common practice for these FFT windows to be overlapping and averaged together to smooth out edge transitions. Using the example above, if there is a 50% overlap in FFT windows, then there will be twice as many slices in the spectrogram with a new slice every 2.90 milliseconds. If there is a 75% overlap in FFT windows, then there will be four times as many slices in the spectrogram with a new slice every 1.45 milliseconds.
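The effect of overlap on the number of slices can be worked out as follows. This sketch only computes where each FFT window starts; it does not perform the FFTs or the averaging, and the function names are hypothetical.

```python
def hop_size(window_size, overlap):
    """Number of samples between the starts of consecutive windows."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be at least 0 and less than 1")
    return round(window_size * (1.0 - overlap))

def slice_interval_ms(sample_rate_hz, window_size, overlap):
    """Time between consecutive spectrogram slices in milliseconds."""
    return 1000.0 * hop_size(window_size, overlap) / sample_rate_hz

def window_starts(total_samples, window_size, overlap):
    """Start index of every complete window that fits in the recording."""
    hop = hop_size(window_size, overlap)
    return list(range(0, total_samples - window_size + 1, hop))

SAMPLE_RATE = 44100
N = 256
for overlap in (0.0, 0.5, 0.75):
    starts = window_starts(SAMPLE_RATE, N, overlap)  # one second of audio
    interval = slice_interval_ms(SAMPLE_RATE, N, overlap)
    print(f"overlap {overlap:.0%}: new slice every {interval:.2f} ms, "
          f"{len(starts)} slices per second")
```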

Notice that there is a trade-off between the resolution of frequency and time. Larger FFT windows can resolve more frequencies (more frequency bins), but are wider in time (more samples), and thus have lower resolution in time. Smaller FFT windows are shorter in time (fewer samples) and can therefore observe faster changes in time, but can't resolve as many frequencies.
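The trade-off can be made visible by tabulating a few window sizes. In the sketch below (hypothetical function name, 44,100Hz sample rate taken from the earlier example), the bin spacing and the window duration always multiply to 1, so improving one necessarily worsens the other.

```python
def resolution(sample_rate_hz, window_size):
    """Return (frequency bin spacing in Hz, window duration in seconds)."""
    # sample_rate / N is the same as (sample_rate / 2) / (N / 2).
    bin_spacing_hz = sample_rate_hz / window_size
    duration_s = window_size / sample_rate_hz
    return bin_spacing_hz, duration_s

SAMPLE_RATE = 44100
for n in (64, 256, 1024, 4096):
    spacing, duration = resolution(SAMPLE_RATE, n)
    print(f"N={n:>4}: {n // 2:>4} bins, {spacing:7.2f} Hz apart, "
          f"{1000 * duration:6.2f} ms per window")
```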

Beam Forming

An array of multiple microphones can be used to focus on sounds from a particular direction. Depending on direction, a sound wave will propagate across each of the microphones at different times. The greater the distance between microphones, the longer it takes for the sound to travel from the nearest microphone to the farthest microphone.
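The difference in arrival time can be estimated with a little geometry. The sketch below is an illustration under stated assumptions, not Song Scope code: it assumes two microphones and a source far enough away that the wave front is effectively flat, and the function names and the angle convention are hypothetical.

```python
import math

SPEED_OF_SOUND_M_PER_S = 340.29  # sea-level value quoted earlier in this document

def arrival_delay_seconds(spacing_m, angle_deg):
    """Extra travel time to the farther of two microphones.

    angle_deg is measured from straight ahead of the pair:
    0 means the source faces both microphones equally,
    90 means the source is in line with the two microphones.
    """
    return spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_PER_S

def delay_in_samples(spacing_m, angle_deg, sample_rate_hz):
    """The same delay expressed as a number of samples (possibly fractional)."""
    return arrival_delay_seconds(spacing_m, angle_deg) * sample_rate_hz

for spacing in (0.1, 0.5, 1.0):
    delay = delay_in_samples(spacing, 90, 44100)
    print(f"{spacing:.1f} m apart: up to {delay:.1f} samples of delay at 44,100Hz")
```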

If a sound wave arrives at two microphones at exactly the same time, and these waves are added together, the combination of waves is said to be "constructive" because the sound waves are in phase with each other. The amplitude of such a combined pair of waves will be twice the amplitude of either of the individual waves.

On the other hand, if sound waves arrive at two microphones at different times and are added together, the combination of these out-of-phase waves results in "destructive interference". If the waves are off by just the right amount (half a wavelength), they cancel each other out.
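Both cases can be demonstrated numerically by adding a sine wave to a delayed copy of itself and measuring the peak of the result. This is a minimal sketch with a hypothetical function name and a hypothetical 1,000Hz test tone.

```python
import math

def summed_peak(frequency_hz, delay_s, sample_rate_hz=44100, n=4410):
    """Peak amplitude of a unit sine wave added to a delayed copy of itself."""
    peak = 0.0
    for i in range(n):
        t = i / sample_rate_hz
        original = math.sin(2 * math.pi * frequency_hz * t)
        delayed = math.sin(2 * math.pi * frequency_hz * (t - delay_s))
        peak = max(peak, abs(original + delayed))
    return peak

TONE_HZ = 1000.0
PERIOD_S = 1.0 / TONE_HZ

print(f"in phase:       {summed_peak(TONE_HZ, 0.0):.3f}")
print(f"quarter period: {summed_peak(TONE_HZ, PERIOD_S / 4):.3f}")
print(f"half period:    {summed_peak(TONE_HZ, PERIOD_S / 2):.3f}")
```

Delays between these extremes give partial cancellation, which is why sounds from off-axis directions are reduced rather than always removed entirely.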

Beam forming is a technique in which the signals received from two or more microphones are delayed by different amounts and then added together. The delays are chosen so that the audio signals from a particular direction are all in phase and combine constructively, resulting in a stronger signal, while sounds from other directions experience destructive interference and are reduced in amplitude.

Song Scope's beam-forming algorithms will automatically adjust the delays of multiple audio channels to maximize a selected signal's amplitude.
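A delay-and-sum beam former for two channels can be sketched as follows. This is a simplified illustration of the general technique, not Song Scope's actual algorithm: it assumes whole-sample delays, a pure test tone, and a brute-force search over candidate delays, and every name and value in it is hypothetical.

```python
import math

SAMPLE_RATE = 44100
TONE_HZ = 3000
N = 2048
TRUE_DELAY = 5  # samples by which the sound reaches microphone B later than A

def tone(delay_samples=0):
    """Test tone as recorded by a microphone that hears it `delay_samples` late."""
    return [math.sin(2 * math.pi * TONE_HZ * (i - delay_samples) / SAMPLE_RATE)
            for i in range(N)]

def shift(signal, delay_samples):
    """Delay (positive) or advance (negative) a signal, padding with zeros."""
    if delay_samples >= 0:
        return [0.0] * delay_samples + signal[:len(signal) - delay_samples]
    return signal[-delay_samples:] + [0.0] * (-delay_samples)

def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def delay_and_sum(channel_a, channel_b, delay_a):
    """Delay channel A by `delay_a` samples, then add it to channel B."""
    delayed_a = shift(channel_a, delay_a)
    return [a + b for a, b in zip(delayed_a, channel_b)]

def best_delay(channel_a, channel_b, max_delay):
    """Try every whole-sample delay and keep the one giving the strongest sum."""
    candidates = range(-max_delay, max_delay + 1)
    return max(candidates, key=lambda d: rms(delay_and_sum(channel_a, channel_b, d)))

mic_a = tone()
mic_b = tone(TRUE_DELAY)
found = best_delay(mic_a, mic_b, max_delay=7)
steered = delay_and_sum(mic_a, mic_b, found)
unsteered = delay_and_sum(mic_a, mic_b, 0)

print(f"best delay for channel A: {found} samples")
print(f"level with steering: {rms(steered):.3f}, without: {rms(unsteered):.3f}")
```

The search range is kept shorter than one period of the test tone because a pure tone lines up again after every full period. Real recordings contain many frequencies, so this ambiguity is much less of a concern in practice.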
