Digital pattern recognition by machine is extremely difficult and inefficient when compared with the human brain (and brains from several other species, for that matter). It is easy for us to take for granted our ability to instantaneously recognize sights, sounds, and smells with extremely high accuracy, even in difficult conditions (e.g. in the presence of noise and low light). Digital computers have a very long way to go, despite decades of research and investment in the fields of speech recognition and computer vision.
Computers are good with 1's and 0's, and can easily compare two data sets bit by bit to tell you whether they are an exact match. But pattern matching is never an exact match in the real world. The animals we study will never produce exactly the same sound twice (at least, not when compared bit by bit). In addition, the signals we must analyze are corrupted by random noise (from the wind blowing, other animals vocalizing, sound waves bending around trees, etc.). The problem of pattern recognition is not a simple binary 1 or 0 problem. Rather, it is "fuzzy".
When considering the analysis of animal vocalizations, another significant factor is the degree to which vocalizations vary. There can be considerable individual and regional variations within a particular species. And even a single individual may be capable of producing a wide range of vocalizations.
Consider a machine that does a bit-by-bit comparison of two vocalizations. Only an exact duplicate would be considered a match, which will never happen given real world variation and random processes. So instead, imagine a machine that "blurs" the data. Two blurred patterns that are similar can now be recognized as a match even though a bit-by-bit comparison of the originals would fail. This is one of the techniques used by most successful digital pattern recognition systems and is known as "feature reduction". The idea here is that some information in the signal is important for identification, and the rest is not. If we could reduce the signal by removing all the elements that did not contribute to identification, then it would be easy to compare the remaining features and determine if they matched. This is easier said than done, however, because the set of features that helps identify one vocalization may be different from the set that helps identify another.
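The difference between exact comparison and "blurred" comparison can be illustrated with a small sketch. It is not how Song Scope is implemented; the moving-average blur, the function names, and the toy sequences are all assumptions made up for illustration.

```python
def exact_match(a, b):
    """Bit-by-bit style comparison: only identical sequences match."""
    return a == b

def blur(signal, width=3):
    """Smooth a sequence with a moving average (window is truncated at the edges)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def blurred_distance(a, b, width=3):
    """Mean absolute difference between the blurred versions of two sequences."""
    ba, bb = blur(a, width), blur(b, width)
    return sum(abs(x - y) for x, y in zip(ba, bb)) / len(ba)

# Toy "vocalizations" (invented values): call_2 is a slightly varied call_1.
call_1 = [0, 0, 5, 9, 5, 0, 0, 0]
call_2 = [0, 0, 4, 9, 6, 0, 1, 0]
other = [7, 0, 0, 0, 0, 0, 8, 7]

print(exact_match(call_1, call_2))       # the two calls are not identical
print(blurred_distance(call_1, call_2))  # small distance: similar shape
print(blurred_distance(call_1, other))   # large distance: different shape
```

After blurring, the two similar calls sit close together while the unrelated pattern stays far away, so a tolerance on the distance can stand in for the exact match that never occurs in practice.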
Now let's consider the degree to which a given vocalization may vary. If a vocalization has wide variation among individuals, then pattern matching requires elimination of more features and broader acceptance of patterns. But if we go too far, then many incorrect vocalizations may be falsely identified, resulting in undesirable "false positives".
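The trade-off can be made concrete with a toy sketch. The distances below are invented for illustration (smaller means more similar to the model), and `count_errors` is a hypothetical helper, not part of Song Scope.

```python
def count_errors(distances, is_target, threshold):
    """Return (false_positives, missed_detections) at a given acceptance threshold."""
    false_positives = 0
    missed = 0
    for distance, target in zip(distances, is_target):
        accepted = distance <= threshold
        if accepted and not target:
            false_positives += 1
        elif target and not accepted:
            missed += 1
    return false_positives, missed

# Invented example: three target vocalizations and three other sounds.
distances = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
is_target = [True, True, True, False, False, False]

for threshold in (0.2, 0.5, 0.8):
    print(threshold, count_errors(distances, is_target, threshold))
```

As the threshold loosens, fewer target vocalizations are missed but more unrelated sounds are accepted.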
In the face of real world noise, the problem becomes even more difficult. Once again, the human brain effortlessly separates simultaneous sounds received in our ears from all directions. But to the digital machine, the competing sounds merge to form an ambiguous jumble that is much more difficult to separate.
Finally, consider mimics. There are species of birds, like the Northern Mockingbird, that have so much song-to-song variation that it is impossible to rely on any direct pattern matching to detect their vocalizations (though humans can easily detect the variation and repetition of syllables to quickly identify the mockingbird).
For all the reasons cited above, it is impossible to build a perfect classifier capable of identifying each and every occurrence of a particular vocalization with 100% accuracy. That being said, in the face of these challenges, the Wildlife Acoustics algorithms perform remarkably well, especially with the careful selection of training data and parameter settings.
As described above, feature reduction is an important aspect of digital pattern matching. Song Scope incorporates a number of feature reduction techniques that can be adjusted for particular vocalizations as follows:
Song Scope uses a background noise filter to reduce background noise levels. The filter also sharpens the vocalization spectrogram by reducing the smearing effect of echoes. Since background information has nothing to do with the signal of interest, reduction or elimination of background noise is a good place to start in feature reduction.
Most vocalizations can be described as occurring within a finite range of frequencies. (At faster sampling rates, many harmonics of the fundamental frequency may be detected, but these harmonics generally only add redundant information to the underlying signal and can therefore be eliminated as part of feature reduction.) The frequency range can be specified using the Frequency Minimum and Frequency Range controls in the Spectrogram Control panel. The signal detector also makes use of frequency band limiting for finding candidate vocalizations.
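This page does not describe how the background noise filter works internally. As a generic illustration only, one common approach is to estimate a steady noise floor for each frequency bin and subtract it. The sketch below assumes a per-bin median over time as the floor; the function name and the toy spectrogram are hypothetical, and echo reduction is not covered.

```python
from statistics import median

def subtract_noise_floor(spectrogram):
    """Subtract a per-bin median noise floor from a spectrogram.

    spectrogram: list of time slices, each a list of per-bin power values.
    Assumes steady background noise and a signal present in fewer than
    half of the slices, so the median reflects the background.
    """
    n_bins = len(spectrogram[0])
    floor = [median(s[k] for s in spectrogram) for k in range(n_bins)]
    return [[max(s[k] - floor[k], 0.0) for k in range(n_bins)]
            for s in spectrogram]

# Invented example: constant hum in bins 0 and 1, a brief call in bin 2.
noisy = [
    [5, 1, 0],
    [5, 1, 9],
    [5, 1, 0],
    [5, 1, 0],
    [5, 1, 0],
]
cleaned = subtract_noise_floor(noisy)
print(cleaned[1])  # only the brief call in bin 2 remains in this slice
```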
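In terms of FFT bins, band limiting amounts to keeping only the bins that fall inside the chosen range. The sketch below assumes the band runs from the Frequency Minimum to the Frequency Minimum plus the Frequency Range; the function names and example numbers are illustrative, not Song Scope internals.

```python
import math

def band_limited_bins(sample_rate_hz, fft_size, freq_min_hz, freq_range_hz):
    """Return (first_bin, last_bin) of the FFT bins inside the band.

    Bin k of an FFT of length fft_size is centered at k * sample_rate / fft_size.
    """
    bin_width = sample_rate_hz / fft_size
    first = math.ceil(freq_min_hz / bin_width)
    last = math.floor((freq_min_hz + freq_range_hz) / bin_width)
    last = min(last, fft_size // 2)  # cannot exceed the Nyquist bin
    return first, last

def band_limit(power_slice, first, last):
    """Keep only the bins inside the band (inclusive)."""
    return power_slice[first:last + 1]

# Example: 16 kHz audio, 256-point FFT, band from 2000 Hz to 6000 Hz.
first, last = band_limited_bins(16000, 256, 2000, 4000)
full_slice = [0.0] * 129  # bins 0..128 of a 256-point FFT
print(first, last, len(band_limit(full_slice, first, last)))
```

In this example roughly half of the 129 bins are discarded before any comparison takes place.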
Related to the frequency band limiting above, a sampling rate of twice the maximum frequency is all that is needed to resolve the band-limited frequency range. Higher sampling rates will only add additional processing overhead. The sample rate can be set by using the Sample Rate Control on the Mixer Control panel.
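As a quick worked example of this rule, the sketch below computes the lowest sufficient sample rate for a band. It assumes the maximum frequency is the Frequency Minimum plus the Frequency Range; the function name is hypothetical.

```python
def minimum_sample_rate(freq_min_hz, freq_range_hz):
    """Lowest sample rate (Hz) that resolves the band: twice its top frequency."""
    max_freq_hz = freq_min_hz + freq_range_hz
    return 2 * max_freq_hz

# A band starting at 2000 Hz and spanning 4000 Hz tops out at 6000 Hz.
print(minimum_sample_rate(2000, 4000))
```

In practice, the nearest available sample rate at or above this value would be chosen.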
Song Scope uses a log frequency scale in recognizers because the log frequency scale compresses high frequency information. It is common for the high frequencies to contain less or redundant information (such as harmonics of lower frequencies) compared with lower frequencies. The effect of using a log frequency scale can be observed visually by viewing the spectrogram plot on a log frequency scale.
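The compression can be sketched by grouping linear FFT bins into bands whose edges are spaced by a constant ratio, so that each higher band covers more hertz. This is a generic illustration; the band count, averaging rule, and function names are assumptions rather than Song Scope's actual method.

```python
def log_band_edges(f_low_hz, f_high_hz, n_bands):
    """Band edges spaced by a constant frequency ratio between f_low and f_high."""
    ratio = (f_high_hz / f_low_hz) ** (1.0 / n_bands)
    return [f_low_hz * ratio ** i for i in range(n_bands + 1)]

def to_log_bands(power_slice, bin_width_hz, edges):
    """Average the linear bins that fall inside each log-spaced band."""
    bands = []
    for lo, hi in zip(edges, edges[1:]):
        members = [p for k, p in enumerate(power_slice)
                   if lo <= k * bin_width_hz < hi]
        bands.append(sum(members) / len(members) if members else 0.0)
    return bands

# Three bands between 1 kHz and 8 kHz: each band is twice as wide as the last.
edges = log_band_edges(1000, 8000, 3)
print(edges)

# 17 linear bins, 500 Hz apart (0 Hz to 8000 Hz), reduced to 3 values.
linear_slice = [1.0] * 17
print(to_log_bands(linear_slice, 500, edges))
```

The highest band absorbs several times as many linear bins as the lowest, which is where the reduction in high frequency detail comes from.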
Song Scope normalizes the power spectrum in each FFT slice to reduce spectral features to a small dynamic range. This has the added benefit of eliminating noise and competing audio signals that are not within the dynamic range. The dynamic range can be configured by the Dynamic Range control on the Detector Control Panel. The effect of normalizing power can be observed visually by viewing the spectrogram plot on a log frequency scale with power normalization.
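One plausible reading of this step is sketched below: express each bin in decibels relative to the strongest bin in the slice, discard anything further below the peak than the dynamic range, and rescale to the interval 0 to 1. The exact formula Song Scope uses is not given here, so the function and its default value are assumptions.

```python
import math

def normalize_slice(power_slice, dynamic_range_db=20.0):
    """Normalize one FFT slice relative to its own peak.

    Returns values in [0, 1]: 1.0 at the peak, 0.0 for anything at or
    below (peak - dynamic_range_db).
    """
    peak = max(power_slice)
    if peak <= 0:
        return [0.0] * len(power_slice)
    out = []
    for p in power_slice:
        db_below_peak = 10.0 * math.log10(p / peak) if p > 0 else -math.inf
        clipped = max(db_below_peak, -dynamic_range_db)
        out.append(1.0 + clipped / dynamic_range_db)
    return out

# Invented slice: a strong peak, a weaker component, and faint background.
print(normalize_slice([100.0, 10.0, 1.0, 0.001, 0.0]))
```

Because each slice is scaled by its own peak, a loud recording and a quiet recording of the same call produce the same normalized values, and anything faint relative to the peak drops to zero.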
Song Scope reduces the frequency bins of each FFT slice further to a "feature vector" consisting of a small number of dimensions. The size of the feature vector can be configured by the Maximum Resolution control in the recognizer control panel.
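This page does not say which transform produces the feature vector. As a stand-in, the sketch below uses a discrete cosine transform (DCT-II) and keeps only the first few coefficients, a technique widely used in speech recognition to summarize the overall shape of a spectrum. Treat the choice of transform, the function name, and the parameter `n_features` as assumptions.

```python
import math

def dct_features(slice_values, n_features):
    """Reduce a slice to its first n_features DCT-II coefficients (unnormalized)."""
    n = len(slice_values)
    features = []
    for k in range(n_features):
        coefficient = sum(
            value * math.cos(math.pi * k * (i + 0.5) / n)
            for i, value in enumerate(slice_values)
        )
        features.append(coefficient)
    return features

# Invented slice of 8 normalized bins reduced to a 3-dimensional feature vector.
slice_values = [0.0, 0.2, 0.9, 1.0, 0.8, 0.3, 0.0, 0.0]
print(dct_features(slice_values, 3))
```

Low-order coefficients describe the broad outline of the spectrum, while the discarded high-order coefficients describe fine detail, so a smaller vector gives a coarser, "blurrier" description of each slice.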
One of the most important factors in building an accurate recognizer is selecting the training data. For vocalizations that are particularly consistent with little variation, less training data will do the job. But for vocalizations with significant variation, training data will be needed that covers a range of these variations. Occasionally, a new variation may be encountered that doesn't fit the model and will fail to be recognized. The good news is that Song Scope makes it easy to then incorporate this new variation into the model so that it will be more easily recognized the next time a similar variation is encountered.
In pattern recognition, there is a phenomenon known as "over training". When there is very little training data available, statistical models will tend to retain too much detail such that the small amount of training data fits perfectly, but even small variations are rejected. To prevent over training, Song Scope builds and tests several different models and uses the annotation id to identify the different variations available in the training data. For each annotation id, Song Scope will build a model with all the training data excluding those specific vocalizations marked with the id, and then tests the model against the excluded vocalizations. In this way, Song Scope attempts to maximize the generalization of the model to cope with previously unobserved data. For this to work, it is important to have at least a few examples of different variations and mark the annotation id field appropriately.
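The hold-out procedure described above can be sketched as a simple data split. Only the splitting is shown; building and scoring a model are left out, and the function name and example labels are hypothetical.

```python
from collections import defaultdict

def leave_one_id_out(examples):
    """Yield (held_out_id, training, testing) for each annotation id.

    examples: list of (annotation_id, vocalization) pairs.
    training holds every vocalization NOT marked with held_out_id;
    testing holds the vocalizations that are marked with it.
    """
    by_id = defaultdict(list)
    for annotation_id, vocalization in examples:
        by_id[annotation_id].append(vocalization)
    for held_out in by_id:
        training = [v for other_id, group in by_id.items()
                    if other_id != held_out for v in group]
        testing = list(by_id[held_out])
        yield held_out, training, testing

# Invented training set: three variations, identified by annotation id.
examples = [
    ("A", "call_1"), ("A", "call_2"),
    ("B", "call_3"),
    ("C", "call_4"), ("C", "call_5"),
]
folds = list(leave_one_id_out(examples))
for held_out, training, testing in folds:
    print(held_out, training, testing)
```

Each fold checks whether a model built without one variation can still recognize it, which is why the annotation ids need to reflect the actual variations in the training data.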
When different vocalization subclasses are distinct and not at all alike (e.g. the difference between nasal call notes and whistled songs in many species), it is better to divide these into different recognizers (i.e. one recognizer per subclass) in order to avoid overly complicating a model with two or more completely unrelated sets of data. On the other hand, if different vocalization subclasses share many common elements with each other, it may be better to combine them into a single recognizer model in order to have more training data available to capture the variation accurately. For a given vocalization of a given species, some combinations may produce better results than others. Song Scope makes it easy to build different recognizers with different permutations of training data so you can easily and quickly try different combinations until satisfactory results are achieved.