Prediction with pre-trained CNNs

This notebook contains all the code you need to use a pre-trained OpenSoundscape convolutional neural network (CNN) model to make predictions on your own data. Before attempting this tutorial, install OpenSoundscape by following the instructions on the OpenSoundscape website. More detailed tutorials about data preprocessing, training CNNs, and customizing prediction methods can also be found on this site.

Load required packages

We will load several imports from OpenSoundscape. First, load the AudioToSpectrogramPreprocessor class from the preprocess.preprocessors module. Preprocessor classes are used to load, transform, and augment audio samples for use in a machine learning model.

from opensoundscape.preprocess.preprocessors import AudioToSpectrogramPreprocessor

Second, the cnn module provides classes for training and prediction with various structures of CNNs. For this example, load the Resnet18Binary class, used for models made with the Resnet18 architecture for predicting the presence or absence of a species (a “binary” classifier).

# The cnn module provides classes for training/predicting with various types of CNNs
from opensoundscape.torch.models.cnn import Resnet18Binary

Finally, load some additional packages and perform some setup for the Jupyter notebook.

# Other utilities and packages
import torch
from pathlib import Path
import numpy as np
import pandas as pd
from glob import glob
import subprocess
#set up plotting
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for large visuals
%config InlineBackend.figure_format = 'retina'

Prepare audio data for prediction

To run predictions on your audio data, you will need to have your audio already split up into the clip lengths that the model expects to receive. If your audio data are not already split, see the demonstration of the Audio.split() method in the audio_and_spectrogram notebook.

You can check the clip length that the model expects to receive in the model’s notes when you download it. This is often, but not always, 5.0 seconds.
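As a quick sanity check before prediction, it can help to confirm that every clip has the duration the model expects. The sketch below uses only Python's standard wave module rather than OpenSoundscape. The helper names wav_duration_seconds and find_wrong_length_clips and the 0.01 second tolerance are assumptions made for illustration, and the approach only handles uncompressed WAV files.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_wrong_length_clips(folder, expected_seconds=5.0, tolerance=0.01):
    """List (path, duration) pairs for WAV clips whose length differs from the expected one."""
    mismatches = []
    for path in sorted(Path(folder).glob("*.wav")):
        duration = wav_duration_seconds(path)
        if abs(duration - expected_seconds) > tolerance:
            mismatches.append((path, duration))
    return mismatches
```

Calling find_wrong_length_clips on a folder of clips returns an empty list when every clip is within the tolerance of the expected length. Set expected_seconds to the value given in the model's notes.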

Download audio files

The Kitzes Lab has created a small labeled dataset of short clips of American Woodcock vocalizations. You have two options for obtaining the folder of data, called woodcock_labeled_data:

  1. Run the following cell to download this small dataset. These commands require you to have curl and tar installed on your computer, as they will download and unzip a compressed file in .tar.gz format.
  2. OR download a .zip version of the files by clicking here. You will have to unzip this folder and place the unzipped folder in the same folder that this notebook is in.

Note: Once you have the data, you do not need to run this cell again.

subprocess.run(['curl','','-L', '-o','woodcock_labeled_data.tar.gz']) # Download the data
subprocess.run(["tar","-xzf", "woodcock_labeled_data.tar.gz"]) # Unzip the downloaded tar.gz file
subprocess.run(["rm", "woodcock_labeled_data.tar.gz"]) # Remove the file after its contents are unzipped
CompletedProcess(args=['rm', 'woodcock_labeled_data.tar.gz'], returncode=0)

Generate a Preprocessor object

In addition to having audio clips of the correct length, you will need to create a Preprocessor object that loads audio samples for the CNN.

First, generate a Pandas DataFrame with the index containing the paths to each file, as shown below.

# collect a list of audio files
file_list = glob('./woodcock_labeled_data/*.wav')

# create a DataFrame with the audio files as the index
audio_file_df = pd.DataFrame(index=file_list)

Next, use that DataFrame to create a Preprocessor object suitable for your application. Use the argument return_labels=False, as our audio to predict on does not have labels.

If the model was trained with any special preprocessor settings, you should apply those settings here. For pretrained models created by the Kitzes Lab, see the model’s notes on its download page for the exact code to use here.

# create a Preprocessor object
# we use the option "return_labels=False" because our audio to predict on does not have labels
from opensoundscape.preprocess.preprocessors import AudioToSpectrogramPreprocessor
prediction_dataset = AudioToSpectrogramPreprocessor(audio_file_df, return_labels=False)

Models trained with OpenSoundscape v0.5.x

Check the model notes page for the appropriate model class to use and import the correct class from the cnn module.

from opensoundscape.torch.models.cnn import Resnet18Binary

For the purpose of demonstration, let’s generate a new Resnet18 model for binary prediction and save it to our local folder. This is a dummy model that will not be trained using any data and will thus not make meaningful predictions.

If you download a pre-trained model, you can skip this cell.

model = Resnet18Binary(classes=['absent','present'])
model.save('./demo.model')
created PytorchModel model object with 2 classes
Saving to demo.model
from opensoundscape.torch.models.cnn import PytorchModel

Next, provide the model class’s from_checkpoint() method with the path to your downloaded model.

# load the model into the appropriate model class
model = Resnet18Binary.from_checkpoint('./demo.model')
created PytorchModel model object with 2 classes
loading weights from saved object

Generate predictions as follows. The predict method returns three values: scores, thresholded predictions, and labels. For unthresholded prediction on unlabeled data, only the first one is relevant, so we can discard the other two by unpacking the result as scores, _, _.

# call model.predict() with the Preprocessor to generate predictions
scores, _, _ = model.predict(prediction_dataset)
(29, 2)

Look at the scores of the first 5 samples. These scores may be anything from negative to positive infinity.

#look at the scores of the first 5 samples
scores.head()
absent present
./woodcock_labeled_data/6a83b011665c482c1f260d8e111aa81c.wav 0.484657 0.382294
./woodcock_labeled_data/0d043e9954d9d80ca2c3e86055e94487.wav 0.593123 0.313045
./woodcock_labeled_data/78654b6f687d7635f50fba3546c7bdfa.wav 0.616126 0.423048
./woodcock_labeled_data/863095c237c52ec51cff7395d70cee41.wav 0.287727 0.352490
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav 0.442259 0.363902

Options for prediction

The code above returns the raw predictions of the model without any post-processing (such as a softmax layer or a sigmoid layer).

For details on how to use the predict() function for post-processing of predictions and to generate binary 0/1 predictions of class presence, see the “Basic training and prediction with CNNs” tutorial notebook. As a quick example, let’s add a softmax layer to make the prediction scores for both classes sum to 1. We can also use the binary_preds argument to generate 0/1 predictions for each sample and class. For presence/absence models, use the option binary_preds='single_target'. For multi-class models, think about whether each clip should be labeled with only one class (binary_preds='single_target') or whether each clip could contain multiple classes (binary_preds='multi_target').

scores, binary_predictions, _ = model.predict(
    prediction_dataset,
    activation_layer='softmax',
    binary_preds='single_target'
)
(29, 2)

As before, the scores are continuous variables, but now have been softmaxed:

absent present
./woodcock_labeled_data/6a83b011665c482c1f260d8e111aa81c.wav 0.525568 0.474432
./woodcock_labeled_data/0d043e9954d9d80ca2c3e86055e94487.wav 0.569565 0.430435
./woodcock_labeled_data/78654b6f687d7635f50fba3546c7bdfa.wav 0.548120 0.451880
./woodcock_labeled_data/863095c237c52ec51cff7395d70cee41.wav 0.483815 0.516185
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav 0.519579 0.480421

We also have an additional output, the binary 0/1 (“absent” vs “present”) predictions generated by the model:

absent present
./woodcock_labeled_data/6a83b011665c482c1f260d8e111aa81c.wav 1.0 0.0
./woodcock_labeled_data/0d043e9954d9d80ca2c3e86055e94487.wav 1.0 0.0
./woodcock_labeled_data/78654b6f687d7635f50fba3546c7bdfa.wav 1.0 0.0
./woodcock_labeled_data/863095c237c52ec51cff7395d70cee41.wav 0.0 1.0
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav 1.0 0.0
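To see how the raw scores, the softmaxed scores, and the 0/1 predictions relate to each other, here is a minimal sketch in plain Python. It is not OpenSoundscape's implementation: the functions softmax and single_target are written here for illustration, and the assumption that 'single_target' marks only the highest-scoring class is inferred from the tables above.

```python
import math


def softmax(scores):
    """Convert raw scores to values in (0, 1) that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def single_target(scores):
    """Return a 0/1 list with a 1 only for the highest-scoring class (assumed behavior)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


# raw [absent, present] scores of the first and fourth clips shown earlier
for raw in ([0.484657, 0.382294], [0.287727, 0.352490]):
    probs = softmax(raw)
    print([round(p, 4) for p in probs], single_target(probs))
```

For these two clips the sketch agrees with the tables above: the first clip is assigned to "absent" and the fourth to "present".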

It is often helpful to look at a histogram of the scores for the positive class. Because this dummy model had random weights, we would expect this histogram to center somewhere around 0.5.

_ = plt.hist(scores['present'],bins=20)
_ = plt.xlabel('softmax score for positive class')

Prediction on long (un-split) audio files

It’s also possible to run predictions on long audio files. In this case, OpenSoundscape will internally split the audio into short segments during prediction. The input and output of prediction are slightly different in this case:

  - Input is similar to before: a DataFrame with the index containing the paths to audio files.
  - Output is still a DataFrame, but it will have three “index” columns. The first matches the index of the input, and contains the audio file paths. The second and third index columns contain the “begin” and “end” time of clips relative to the start of the audio file. The remaining columns, as usual, contain the names of each class and the scores or predictions for each class for that row’s audio clip.

Let’s look at an example. We’ll use the 1-minute audio file contained within OpenSoundscape’s test folder as a “long” audio file. In practice, you can predict on files that are multiple hours long. The limiting factor is your computer’s memory (“RAM”), which must be able to hold the entire audio file.
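To make the clip bookkeeping concrete, the following sketch shows one way the start and end times of consecutive clips could be derived from a file duration, a clip length, and an overlap. The function clip_times is hypothetical and is not part of OpenSoundscape. It assumes that a trailing clip shorter than clip_length is simply dropped, which may differ from how OpenSoundscape treats the final clip.

```python
def clip_times(audio_duration, clip_length, clip_overlap=0.0):
    """Return (start, end) times in seconds for consecutive full-length clips."""
    if clip_overlap >= clip_length:
        raise ValueError("clip_overlap must be smaller than clip_length")
    step = clip_length - clip_overlap  # distance between consecutive clip starts
    times = []
    n = 0
    # multiply by the clip index instead of accumulating to avoid float drift
    while n * step + clip_length <= audio_duration:
        start = n * step
        times.append((start, start + clip_length))
        n += 1
    return times


boundaries = clip_times(audio_duration=60.0, clip_length=5.0, clip_overlap=0.0)
print(len(boundaries), boundaries[:3])
```

With a 60 second file, 5 second clips, and no overlap, this yields 12 clips starting at 0.0, 5.0, 10.0 and so on. A nonzero overlap shortens the step between clip starts and therefore produces more clips.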

import opensoundscape
from opensoundscape.preprocess.preprocessors import LongAudioPreprocessor

#get audio path from opensoundscape's tests folder
audio_1m_path = Path(opensoundscape.__file__).parent.parent.joinpath('tests/audio/1min.wav')
long_audio_prediction_df = pd.DataFrame(index=[audio_1m_path])
img_shape = [224,224]

#the audio will be split during prediction. choose the clip length and overlap of sequential clips (0 for no overlap)
clip_length = 5.0
clip_overlap = 0.0
long_audio_prediction_ds = LongAudioPreprocessor(

In addition to the scores (and, optionally, binary predictions), the function returns a list of “unsafe” samples that caused errors during preprocessing.

score_df, pred_df, unsafe_samples = model.split_and_predict(
absent present
file start_time end_time
/home/louisfh/Development/opensoundscape/tests/audio/1min.wav 0.0 5.0 0.784139 0.027767
5.0 10.0 0.402279 -0.030964
10.0 15.0 0.854899 0.172253
15.0 20.0 0.626625 0.354499
20.0 25.0 0.757906 0.259789

Models trained with OpenSoundscape 0.4.x

One set of our publicly available binary models for 500 species was created with an older version of OpenSoundscape. These models require a little bit of manipulation to load into OpenSoundscape 0.5.x and onward.

First, let’s download one of these models (it’s stored in a .tar format) and save it to the same directory as this notebook in a file called opso_04_model_acanthis-flammea.tar.

subprocess.run(['curl', '', '-L', '-o', 'opso_04_model_acanthis-flammea.tar'])
CompletedProcess(args=['curl', '', '-L', '-o', 'opso_04_model_acanthis-flammea.tar'], returncode=0)

Next, load the weights from that model into an OpenSoundscape model object with the following code:

from opensoundscape.torch.models.cnn import PytorchModel
from opensoundscape.torch.architectures.cnn_architectures import resnet18
import torch

# load the tar file into a dictionary
# (you could change this to the location of any .tar file on your computer)
opso_04_model_tar_path = "./opso_04_model_acanthis-flammea.tar"
opso_04_model_dict = torch.load(opso_04_model_tar_path)

# create a resnet18 binary model
# (all models created with Opensoundscape 0.4.x are 2-class resnet18 architectures)
architecture = resnet18(num_classes=2,use_pretrained=False)
model = PytorchModel(classes=['negative','positive'],architecture=architecture)

# load the model weights into our model object
# now, our model is equivalent to the trained model we downloaded
model.network.load_state_dict(opso_04_model_dict['model_state_dict'])
created PytorchModel model object with 2 classes
<All keys matched successfully>

Now, we can use the model as normal to create predictions on audio. We’ll use the same prediction_dataset from above (which does not contain any Common Redpoll vocalizations).

Remember to choose the activation_layer you desire. In this example, we’ll assume we just want to generate scores, not binary predictions. We’ll apply a softmax layer to the scores using the activation_layer='softmax' option, so that the two scores for each clip sum to 1. To additionally apply the logit transform after the softmax, use the activation_layer="softmax_and_logit" option instead; this generates the type of scores that are useful for plotting score histograms, among other things.
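For readers unfamiliar with the logit transform, the sketch below shows what applying a softmax followed by a logit does to a pair of scores. It is written in plain Python and is an assumption about the math behind the option described above, not a copy of OpenSoundscape's code. The raw scores are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores to values in (0, 1) that sum to 1."""
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def logit(p):
    """Map a probability in (0, 1) onto the whole real line: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


raw = [0.287727, 0.352490]        # hypothetical raw [negative, positive] scores
positive_prob = softmax(raw)[1]   # bounded between 0 and 1
positive_logit = logit(positive_prob)

# for a two-class model, the result equals the difference of the two raw scores
print(round(positive_logit, 6), round(raw[1] - raw[0], 6))
```

Because the logit spreads values near 0 and 1 back out over an unbounded range, scores transformed this way tend not to pile up at the edges of a histogram the way softmax scores do.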

# generate predictions on our dataset
prediction_scores_df,_,_ = model.predict(prediction_dataset, activation_layer='softmax')
(29, 2)
negative positive
./woodcock_labeled_data/6a83b011665c482c1f260d8e111aa81c.wav 0.998181 0.001819
./woodcock_labeled_data/0d043e9954d9d80ca2c3e86055e94487.wav 0.999854 0.000145
./woodcock_labeled_data/78654b6f687d7635f50fba3546c7bdfa.wav 0.998935 0.001065
./woodcock_labeled_data/863095c237c52ec51cff7395d70cee41.wav 0.999816 0.000184
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav 0.995471 0.004529

Remove the downloaded files to clean up.

folder = Path('./woodcock_labeled_data')
[p.unlink() for p in folder.glob("*")]
for p in Path('.').glob('*.model'):
    p.unlink()
for p in Path('.').glob('*.tar'):
    p.unlink()