Wildlife Acoustics | Bioacoustic monitoring systems for research, science, industry and governments.

Explaining Maximum Likelihood Estimators (MLE) and P-values used in Kaleidoscope

The U.S. Fish & Wildlife Service Indiana Bat Summer Survey Guidance describes the use of approved software programs. As part of their software testing criteria:

"As species identifications are never perfect, all analysis programs must utilize a maximum-likelihood estimator approach to determine species presence at the site rather than relying on a single sequence. Post-hoc maximum-likelihood estimator p-values will be used to determine acceptance thresholds for final identification determination."

The maximum-likelihood estimator used by Kaleidoscope Pro is based on a 2002 paper by Britzke, Murray, Heywood, and Robbins, "Acoustic Identification".

The method described takes two inputs. First, there are the classification results: how many detections of each bat species did the classifier find? Second, there is the confusion matrix representing the known error rates across all the classifiers. For example, 70% of MYLU (little brown bat) calls are correctly classified as MYLU, while 3% of MYLU calls are misclassified as MYSO (Indiana bat), and so on.
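As a rough illustration of these two inputs, the sketch below stores them as plain Python dictionaries and shows how a confusion matrix turns true calls into expected classification counts. Only the 70% and 3% figures come from the text above. The "NoID" label, the MYSO row, and the name `expected_label_counts` are assumptions made for this example.

```python
# Input 2: confusion[true_species][label] is the fraction of calls from
# true_species that receive that label. Only 0.70 and 0.03 come from the
# text; the NoID column and the MYSO row are assumed for illustration.
confusion = {
    "MYLU": {"MYLU": 0.70, "MYSO": 0.03, "NoID": 0.27},
    "MYSO": {"MYLU": 0.05, "MYSO": 0.65, "NoID": 0.30},
}

# Input 1: how many detections of each label the classifier reported
# (hypothetical numbers).
observed = {"MYLU": 70, "MYSO": 3, "NoID": 27}


def expected_label_counts(true_calls, confusion):
    """Expected number of detections per label for a given true mix of calls."""
    expected = {}
    for species, n_calls in true_calls.items():
        for label, rate in confusion[species].items():
            expected[label] = expected.get(label, 0.0) + n_calls * rate
    return expected


# 100 genuine MYLU calls and no MYSO calls at all.
expected = expected_label_counts({"MYLU": 100}, confusion)
for label in observed:
    print(f"{label}: observed {observed[label]}, expected {expected[label]:.1f}")
```

Under these assumed rates, a site with no MYSO at all would still be expected to produce a few MYSO detections, which is why raw detection counts alone cannot settle presence.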

The maximum likelihood estimator determines the most likely distribution of species that would result in the observed classifications, given the classifier error rates. Then, to calculate the P-value for a given species, that species is clamped as absent and the most likely distribution is recalculated. The likelihood of the clamped solution divided by the likelihood of the original solution is the P-value.
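One way to write this down, assuming a multinomial model, is sketched here; the text above does not spell out the formulas, so the notation is an assumption. Let n_j be the number of detections labeled j, C_ij the confusion-matrix probability that a call from species i is labeled j, and p_i the unknown proportion of calls coming from species i.

```latex
\begin{align}
q_j(p) &= \sum_i p_i \, C_{ij}, \qquad p_i \ge 0, \quad \sum_i p_i = 1 \\
\ell(p) &= \sum_j n_j \log q_j(p) + \text{const} \\
\hat{p} &= \arg\max_{p} \, \ell(p), \qquad
\hat{p}^{(k)} = \arg\max_{p \,:\, p_k = 0} \ell(p) \\
P_k &= \frac{L\big(\hat{p}^{(k)}\big)}{L\big(\hat{p}\big)}
     = \exp\!\Big( \ell\big(\hat{p}^{(k)}\big) - \ell\big(\hat{p}\big) \Big)
\end{align}
```

Because the clamped fit is a restricted version of the original fit, this ratio always lies between 0 and 1. It is close to 1 when removing species k costs almost nothing, and close to 0 when the data are hard to explain without species k.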

In layman's terms, if we run an automated classifier on a data set, we end up with a number of classifications for each species found in the data. From this, we want to judge presence or absence by calculating the P-value corresponding to the null hypothesis of absence. A low (near zero) P-value would therefore suggest presence.

For example...

Suppose we have a classification result with 70 MYLU detections and 3 MYSO detections. Given the error rate between MYLU and MYSO, the 3 MYSO detections are easily explained away as false positives from actual MYLU calls, so the P-value for MYSO in this case would be expected to be very high (unlikely present). On the other hand, if we have 70 MYLU detections and 20 MYSO detections, it is harder to explain away all 20 MYSO detections as false positives from actual MYLU calls, so the P-value for MYSO in this case would be expected to be very low (likely present).
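The two scenarios can be tried numerically with a minimal sketch. It assumes a multinomial model fitted by expectation-maximization, which is one common way to find the maximum, not necessarily what Kaleidoscope does internally. Only the 70% and 3% rates come from this article. The MYSO row, the "NoID" label and its count of 27, and the names `fit_mle` and `presence_p_value` are hypothetical.

```python
import math

# Hypothetical confusion matrix: CONFUSION[true_species][label].
# Only 0.70 and 0.03 come from the article; everything else is assumed.
CONFUSION = {
    "MYLU": {"MYLU": 0.70, "MYSO": 0.03, "NoID": 0.27},
    "MYSO": {"MYLU": 0.05, "MYSO": 0.65, "NoID": 0.30},
}


def fit_mle(counts, confusion, absent=None, iterations=2000):
    """Return (proportions, log_likelihood) of the best-fitting species mix.

    Uses expectation-maximization; `absent` clamps one species to zero.
    """
    active = [s for s in confusion if s != absent]
    total = sum(counts.values())
    p = {s: 1.0 / len(active) for s in active}  # start from a uniform mix

    def label_probs(mix):
        # Probability of each label under the current species mix.
        return {lab: sum(mix[s] * confusion[s][lab] for s in active)
                for lab in counts}

    for _ in range(iterations):
        q = label_probs(p)
        # Reassign each detection to species in proportion to how well
        # each species explains its label, then renormalize.
        p = {
            s: p[s] * sum(n * confusion[s][lab] / q[lab]
                          for lab, n in counts.items() if n > 0) / total
            for s in active
        }
    q = label_probs(p)
    log_lik = sum(n * math.log(q[lab]) for lab, n in counts.items() if n > 0)
    return p, log_lik


def presence_p_value(counts, confusion, species):
    """Clamped likelihood divided by original likelihood for `species`."""
    _, ll_full = fit_mle(counts, confusion)
    _, ll_clamped = fit_mle(counts, confusion, absent=species)
    # Clamping can never truly improve the fit, so cap numerical noise at 1.
    return min(1.0, math.exp(ll_clamped - ll_full))


few = presence_p_value({"MYLU": 70, "MYSO": 3, "NoID": 27}, CONFUSION, "MYSO")
many = presence_p_value({"MYLU": 70, "MYSO": 20, "NoID": 27}, CONFUSION, "MYSO")
print(f"3 MYSO detections:  P = {few:.3g}")
print(f"20 MYSO detections: P = {many:.3g}")
```

With these assumed error rates, the first case yields a P-value near 1 and the second a P-value very close to zero, matching the reasoning above. Different assumed rates would shift both numbers.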

There are some important caveats:

First, remember that an important input to the calculation is the known confusion matrix of the classifier. Unfortunately, there is no such thing. The error rates of a bat classifier will vary from one site to the next, because the bats will produce different calls in different habitats with different levels of clutter. In a high clutter environment, for example, you might expect to see a higher error rate than in a low clutter environment. It is also exceedingly difficult to measure the error rate without a significant amount of independently collected and verified data. For Kaleidoscope, we split our data in half, using one half to train our classifiers and the other half to measure the error rates. This is as good an estimate of the average confusion matrix as we can measure, but it is not going to be the actual confusion matrix for any particular deployment. Therefore, the P-values cannot be determined perfectly.
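The split-and-measure idea can be sketched as follows. This is an illustration only, not Kaleidoscope's actual procedure. The helper names `split_in_half` and `estimate_confusion` and the toy labels are hypothetical, and the classifier training step is left out.

```python
import random
from collections import Counter, defaultdict


def split_in_half(recordings, seed=0):
    """Shuffle verified recordings and split them into two halves."""
    shuffled = list(recordings)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]  # (training half, held-out half)


def estimate_confusion(true_labels, predicted_labels):
    """Row-normalized confusion matrix: rates[true][predicted]."""
    tallies = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        tallies[truth][predicted] += 1
    return {
        truth: {label: n / sum(row.values()) for label, n in row.items()}
        for truth, row in tallies.items()
    }


# Toy held-out half: verified species versus what a classifier reported.
truth = ["MYLU"] * 10 + ["MYSO"] * 5
predicted = (["MYLU"] * 7 + ["MYSO"] * 1 + ["NoID"] * 2   # the 10 MYLU calls
             + ["MYSO"] * 4 + ["MYLU"] * 1)               # the 5 MYSO calls
rates = estimate_confusion(truth, predicted)
for species, row in rates.items():
    print(species, row)
```

Rates measured this way describe the held-out recordings on average. They carry no information about the clutter or habitat at any particular deployment, which is the limitation described above.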

Second, remember that while the P-value is perhaps the best statistical tool we have to work with, it is not perfect. A high P-value is not proof of absence; it simply means that there is not sufficient statistical evidence of presence. And a low P-value is not proof of presence; it simply means that the data are not well explained by the null hypothesis of absence. A low P-value might suggest that an alternate hypothesis is more likely. That could be presence, but it could also be that the classification error matrix was not a good fit for the data.

MLE P-values are a convenient way to aggregate a lot of data into a useful statistic for estimating the presence or absence of species. But the P-value is an imperfect statistic and should not be relied upon in a vacuum.