Beginner-friendly training and prediction with CNNs
Convolutional Neural Networks (CNNs) are a popular tool for developing automated machine learning classifiers on images or image-like samples. By converting audio into a two-dimensional frequency vs. time representation such as a spectrogram, we can generate image-like samples suitable for training CNNs. This tutorial demonstrates the basic use of OpenSoundscape's preprocessors and cnn modules for training CNNs and making predictions with them.
Under the hood, OpenSoundscape uses PyTorch for machine learning tasks. By using the class opensoundscape.ml.cnn.CNN, you can train and predict with PyTorch's powerful CNN architectures in just a few lines of code.
First, let’s import some utilities.
[1]:
# the cnn module provides classes for training/predicting with various types of CNNs
from opensoundscape import CNN

# other utilities and packages
import torch
import pandas as pd
from pathlib import Path
import numpy as np
import random
import subprocess
#set up plotting
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for large visuals
%config InlineBackend.figure_format = 'retina'
Set manual seeds for PyTorch and Python. These ensure the training results are reproducible. You probably don't want to do this when you actually train your model, but it's useful for debugging.
[2]:
torch.manual_seed(0)
random.seed(0)
np.random.seed(0)
Prepare audio data
Download labeled audio files
Training a machine learning model requires some pre-labeled data. These data, in the form of audio recordings or spectrograms, are labeled with whether or not they contain the sound of the species of interest. These data can be obtained from online databases such as Xeno-Canto.org, or by labeling one’s own ARU data using a program like Cornell’s Raven sound analysis software.
The Kitzes Lab has created a small labeled dataset of short clips of American Woodcock vocalizations. You have two options for obtaining the folder of data, called woodcock_labeled_data:
1. Run the following cell to download this small dataset. These commands require you to have tar installed on your computer, as they will download and unzip a compressed file in .tar.gz format.
2. Download a .zip version of the files by clicking here. You will have to unzip this folder and place the unzipped folder in the same folder that this notebook is in.
Note: Once you have the data, you do not need to run this cell again.
[3]:
subprocess.run(['curl','https://drive.google.com/uc?export=download&id=1Ly2M--dKzpx331cfUFdVuiP96QKGJz_P','-L', '-o','woodcock_labeled_data.tar.gz']) # Download the data
subprocess.run(["tar","-xzf", "woodcock_labeled_data.tar.gz"]) # Unzip the downloaded tar.gz file
subprocess.run(["rm", "woodcock_labeled_data.tar.gz"]) # Remove the file after its contents are unzipped
[3]:
CompletedProcess(args=['rm', 'woodcock_labeled_data.tar.gz'], returncode=0)
Generate one-hot encoded labels
The folder contains 2-second-long audio clips taken from an autonomous recording unit. It also contains a file woodcock_labels.csv which contains the names of each file and its corresponding label information, created using a program called Specky.
[4]:
#load Specky output: a table of labeled audio files
specky_table = pd.read_csv(Path("woodcock_labeled_data/woodcock_labels.csv"))
specky_table.head()
[4]:
|   | filename | woodcock | sound_type |
|---|----------|----------|------------|
| 0 | d4c40b6066b489518f8da83af1ee4984.wav | present | song |
| 1 | e84a4b60a4f2d049d73162ee99a7ead8.wav | absent | na |
| 2 | 79678c979ebb880d5ed6d56f26ba69ff.wav | present | song |
| 3 | 49890077267b569e142440fa39b3041c.wav | present | song |
| 4 | 0c453a87185d8c7ce05c5c5ac5d525dc.wav | present | song |
This table must provide an accurate path to the files of interest. For this self-contained tutorial, we can use relative paths (starting with a dot and referring to files in the same folder), but you may want to use absolute paths for your training.
[5]:
#update the paths to the audio files
specky_table.filename = ['./woodcock_labeled_data/'+f for f in specky_table.filename]
specky_table.head()
[5]:
|   | filename | woodcock | sound_type |
|---|----------|----------|------------|
| 0 | ./woodcock_labeled_data/d4c40b6066b489518f8da8... | present | song |
| 1 | ./woodcock_labeled_data/e84a4b60a4f2d049d73162... | absent | na |
| 2 | ./woodcock_labeled_data/79678c979ebb880d5ed6d5... | present | song |
| 3 | ./woodcock_labeled_data/49890077267b569e142440... | present | song |
| 4 | ./woodcock_labeled_data/0c453a87185d8c7ce05c5c... | present | song |
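If you prefer absolute paths (as suggested above), a minimal sketch of the same step, assuming the woodcock_labeled_data folder sits next to this notebook:

# optional: build absolute paths instead of relative ones
data_dir = Path('woodcock_labeled_data').resolve()
specky_table.filename = [str(data_dir / Path(f).name) for f in specky_table.filename]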
We then modify these labels, replacing present with 1 and absent with 0. Ones and zeros are how presences and absences are represented in a machine learning model.
[6]:
# create a new dataframe with the filenames from the previous table as the index
labels = pd.DataFrame(index=specky_table['filename'])
#convert 'present' to 1 and 'absent' to 0
labels['woodcock']=[1 if l=='present' else 0 for l in specky_table['woodcock']]
#look at the first rows
labels.head(3)
[6]:
| filename | woodcock |
|----------|----------|
| ./woodcock_labeled_data/d4c40b6066b489518f8da83af1ee4984.wav | 1 |
| ./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav | 0 |
| ./woodcock_labeled_data/79678c979ebb880d5ed6d56f26ba69ff.wav | 1 |
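As a side note, the same present/absent conversion could be written with pandas' map; a small equivalent sketch:

# equivalent conversion using a mapping dictionary
labels['woodcock'] = specky_table['woodcock'].map({'present': 1, 'absent': 0}).values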
Split into training and validation sets
We use a utility from sklearn to randomly divide the labeled samples into two sets. The first set, train_df, will be used to train the CNN, while the second set, validation_df, will be used to test how well the model can predict the classes of samples that it was not trained on.
During the training process, the CNN will go through all of the samples once every “epoch” for several (sometimes hundreds of) epochs. Each epoch usually consists of a “learning” step and a “validation” step. In the learning step, the CNN iterates through all of the training samples while the computer program is modifying the weights of the convolutional neural network. In the validation step, the program performs prediction on all of the validation samples and prints out metrics to assess how well the classifier generalizes to unseen data.
Note: using the random_state argument with a fixed number means that the "random" split will be exactly the same each time we run it. This is useful for reproducible results; to get a different split each time, omit the random_state argument.
[7]:
from sklearn.model_selection import train_test_split
train_df,validation_df = train_test_split(labels,test_size=0.2,random_state=1)
Create and train a model
Now, we create a convolutional neural network model object and train it on train_df, using validation_df for validation.
The purpose of this model is to predict the presence or absence of a single species, so it has one class, "woodcock". It's also possible to train models to recognize multiple species; we call these "multi-class models", and each category of sounds the model learns to recognize is a "class".
The model object should be initialized with a list of class names that matches the class names in the training dataset. Here we’ll use the resnet18 architecture, a popular and powerful architecture that makes a good starting point. For more details on other CNN architectures, see the “Advanced CNN Training” tutorial.
[8]:
# Create model object
classes = train_df.columns #in this case, there's just one class: ["woodcock"]
model = CNN('resnet18',classes=classes,sample_duration=2.0)
CAVEAT: the default audio preprocessing in this class bandpasses spectrograms to 0-11025 Hz. If your audio has a sample rate of less than 22050 Hz, the preprocessing will raise an error because the spectrogram will not contain the expected frequencies. In this case you could change the parameters of the bandpass action, or simply disable the bandpass action:
model.preprocessor.pipeline.bandpass.bypass=True
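Alternatively, you could keep the bandpass action but adjust its frequency range. A hedged sketch; the .set() call and the min_f/max_f parameter names are assumptions that may differ between OpenSoundscape versions:

# adjust the bandpass range rather than bypassing it (parameter names are an assumption)
model.preprocessor.pipeline.bandpass.set(min_f=0, max_f=8000)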
Inspect training images
Before creating a machine learning algorithm, we strongly recommend making sure the images coming out of the preprocessor look like you expect them to. Here we generate images for a few samples.
[9]:
#helper functions to visualize processed samples
from opensoundscape.preprocess.utils import show_tensor_grid, show_tensor
from opensoundscape.ml.datasets import AudioFileDataset
Now, let's check what the samples generated by our model look like:
[10]:
#pick some random samples from the training set
sample_of_4 = train_df.sample(n=4)
#generate a dataset with the samples we wish to generate and the model's preprocessor
inspection_dataset = AudioFileDataset(sample_of_4, model.preprocessor)
#generate the samples using the dataset
samples = [sample.data for sample in inspection_dataset]
labels = [list(sample.labels[sample.labels>0].index) for sample in inspection_dataset]
#display the samples
_ = show_tensor_grid(samples,4,labels=labels)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

The dataset allows you to turn all augmentation off or on as desired. Inspect the unaugmented images as well:
[11]:
#turn augmentation off for the dataset
inspection_dataset.bypass_augmentations = True
#generate the samples without augmentation
samples = [sample.data for sample in inspection_dataset]
labels = [list(sample.labels[sample.labels>0].index) for sample in inspection_dataset]
#display the samples
_ = show_tensor_grid(samples,4,labels=labels)

Initialize a Weights & Biases logging session
Weights and Biases (WandB) is a powerful and beautiful tool for logging and visualizing the machine learning model training process.
OpenSoundscape supports WandB integration, making it easy to visualize preprocessed samples and progress during training.
To use WandB, first create an account at https://wandb.ai/. The first time you use it on any computer, you’ll need to run wandb.login() either in the command line or in Python, and enter the API key from your settings page. The “Entity” or team option allows runs and projects to be shared across members in a group, making it easy to collaborate and see progress and results of other team members’ runs.
[12]:
try:
    import wandb

    wandb.login()

    wandb_session = wandb.init(
        entity='kitzeslab',
        project='opensoundscape_tutorials',
        name='cnn training tutorial'
    )
except Exception:
    wandb_session = None
wandb: Currently logged in as: samlapp (kitzeslab). Use `wandb login --relogin` to force relogin
/Users/SML161/opensoundscape/docs/tutorials/wandb/run-20230622_075934-2myt8w7y
Train the model
Depending on the speed of your computer, training the CNN may take a few minutes.
We'll only train for 10 epochs on this small dataset as a demonstration, but you'll probably need to train for tens (or hundreds) of epochs on hundreds (or thousands) of training files to create a useful model.
Batch size refers to the number of samples that are simultaneously processed by the model. In practice, using larger batch sizes (64+) improves stability and generalizability of training, particularly for architectures (such as ResNet) that contain a ‘batch norm’ layer. Here we use a small batch size to keep the computational requirements for this tutorial low.
[13]:
model.train(
train_df=train_df,
validation_df=validation_df,
save_path='./binary_train/', #where to save the trained model
epochs=10,
batch_size=8,
save_interval=5, #save model every 5 epochs (the best model is always saved in addition)
    num_workers=0, # number of parallel CPU workers for preprocessing; 0 means only the root process
wandb_session=wandb_session
)
# let wandb know that we finished training successfully
wandb.unwatch(model.network)
wandb.finish()
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
Training Epoch 0
Epoch: 0 [batch 0/3, 0.00%]
DistLoss: 0.694
Metrics:
Metrics:
MAP: 0.637
Validation.
Metrics:
MAP: 1.000
Training Epoch 1
Epoch: 1 [batch 0/3, 0.00%]
DistLoss: 0.619
Metrics:
Metrics:
MAP: 0.882
Validation.
Metrics:
MAP: 1.000
Training Epoch 2
Epoch: 2 [batch 0/3, 0.00%]
DistLoss: 0.749
Metrics:
Metrics:
MAP: 0.862
Validation.
Metrics:
MAP: 1.000
Training Epoch 3
Epoch: 3 [batch 0/3, 0.00%]
DistLoss: 0.405
Metrics:
Metrics:
MAP: 0.929
Validation.
Metrics:
MAP: 1.000
Training Epoch 4
Epoch: 4 [batch 0/3, 0.00%]
DistLoss: 0.279
Metrics:
Metrics:
MAP: 0.945
Validation.
Metrics:
MAP: 0.967
Training Epoch 5
Epoch: 5 [batch 0/3, 0.00%]
DistLoss: 0.619
Metrics:
Metrics:
MAP: 0.948
Validation.
Metrics:
MAP: 0.967
Training Epoch 6
Epoch: 6 [batch 0/3, 0.00%]
DistLoss: 0.672
Metrics:
Metrics:
MAP: 0.955
Validation.
Metrics:
MAP: 0.967
Training Epoch 7
Epoch: 7 [batch 0/3, 0.00%]
DistLoss: 0.160
Metrics:
Metrics:
MAP: 0.957
Validation.
Metrics:
MAP: 0.927
Training Epoch 8
Epoch: 8 [batch 0/3, 0.00%]
DistLoss: 0.427
Metrics:
Metrics:
MAP: 0.956
Validation.
Metrics:
MAP: 1.000
Training Epoch 9
Epoch: 9 [batch 0/3, 0.00%]
DistLoss: 0.178
Metrics:
Metrics:
MAP: 1.000
Validation.
Metrics:
MAP: 1.000
Best Model Appears at Epoch 0 with Validation score 1.000.
Run history:
epoch | ▁▂▃▃▄▅▆▆▇█ |
loss | █▄▅▄▅▅▃▃▅▁ |
Run summary:
epoch | 9 |
loss | 0.1809 |
Synced 6 W&B file(s), 4 media file(s), 44 artifact file(s) and 0 other file(s)
./wandb/run-20230622_075934-2myt8w7y/logs
Inspect metrics and samples using wandb
To see the wandb webpage generated by the original run of this notebook, visit https://wandb.ai/kitzeslab/opensoundscape_tutorials/runs/rv67reaq?workspace=user-samlapp
[14]:
wandb_session
[14]:
Visualize gradient activation maps
Using gradient activation maps such as GradCAM, we can check which parts of the sample the model is paying attention to when it correctly (or incorrectly) labels a sample with a class.
Here, we can see that the American Woodcock vocalization in the sample activates the network.
[15]:
samples = model.generate_cams(samples=train_df.head(1))
samples[0].cam.plot()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

[15]:
(<Figure size 1500x500 with 1 Axes>,
<AxesSubplot:title={'center':'activation for class None'}>)
You can also use generate_cams(..., backprop=True) and cam.plot(..., mode='backprop') or mode='backprop_and_activation' to see pixel-level gradient activations. This will work best with a model trained more thoroughly than the toy model in this example.
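For example, a hedged sketch of that workflow; the keyword is written here as guided_backprop, which is an assumption about the full name abbreviated above, so check the documentation for your OpenSoundscape version:

# sketch: pixel-level gradient visualization (keyword name is an assumption)
samples = model.generate_cams(samples=train_df.head(1), guided_backprop=True)
samples[0].cam.plot(mode='backprop')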
Plot the loss history
We can plot the loss from each epoch to check that our loss is declining. Loss should decline as the model learns, but may have ups and downs along the way.
[16]:
plt.scatter(model.loss_hist.keys(),model.loss_hist.values())
plt.xlabel('epoch')
plt.ylabel('loss')
[16]:
Text(0, 0.5, 'loss')

Printing and Logging outputs
We can log the outputs of the training process to a file, and/or print them to the screen. We can independently control how much content is logged and printed with the model's attributes model.verbose and model.logging_level. Content increases from level 0 (nothing) to 1 (standard), 2, 3, and so on. For instance, let's train for an epoch with lots of logged content but no printed output:
[17]:
model.logging_level = 3 #request lots of logged content
model.log_file = './binary_train/training_log.txt' #specify a file to log output to
Path(model.log_file).parent.mkdir(parents=True,exist_ok=True) #make the folder ./binary_train
model.verbose = 0 #don't print anything to the screen during training
model.train(
train_df=train_df,
validation_df=validation_df,
save_path='./binary_train/', #where to save the trained model
epochs=1,
batch_size=8,
save_interval=5, #save model every 5 epochs (the best model is always saved in addition)
    num_workers=0, # number of parallel CPU workers for preprocessing; 0 means only the root process
)
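To confirm the logging worked, a quick sketch of reading back the start of the log file written above:

# inspect the beginning of the training log written to disk
print(Path(model.log_file).read_text()[:1000])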
Prediction
We haven't actually trained a useful model in these few epochs, but we can use the trained model to demonstrate how prediction works and show several of the settings useful for prediction.
We will run prediction on two one-minute clips of field data recorded by an AudioMoth acoustic recorder. The two files are located in woodcock_labeled_data/field_data.
Predict on the field data
To run prediction, also known as "inference", with a CNN, we simply call the model's predict method and pass it a list of file paths (or a dataframe with file paths in the index).
The predict function will internally split audio files into clips of the appropriate length for prediction and generate prediction scores for each clip.
* By default, there is no overlap between these clips, but we can specify a fraction of overlap between consecutive clips with the overlap_fraction argument (e.g., 0.5 for 50% overlap); see the sketch after the first prediction below.
* Additionally, if we want to predict on audio files that are already trimmed to the same duration as the training files, we can specify split_files_into_clips=False.
Calling .predict() will return a dataframe with numeric (continuous score) predictions from the model for each sample and class (by default these are the raw outputs of the model).
Let’s predict on the two field recordings:
[18]:
from glob import glob
field_recordings = glob('./woodcock_labeled_data/field_data/*')
field_recordings
[18]:
['./woodcock_labeled_data/field_data/60s_field_data_sample_1.wav',
'./woodcock_labeled_data/field_data/60s_field_data_sample_2.wav']
[19]:
prediction_scores_df = model.predict(field_recordings)
The predict function generated a dataframe with rows for each 2-second segment of each 1-minute audio clip. Let’s look at the first few rows:
[20]:
prediction_scores_df.head()
[20]:
| file | start_time | end_time | woodcock |
|------|------------|----------|----------|
| ./woodcock_labeled_data/field_data/60s_field_data_sample_1.wav | 0.0 | 2.0 | 1.785296 |
| | 2.0 | 4.0 | 1.707386 |
| | 4.0 | 6.0 | 1.753464 |
| | 6.0 | 8.0 | 2.018827 |
| | 8.0 | 10.0 | 1.863675 |
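As mentioned above, we could instead request overlapping clips. A minimal sketch using the overlap_fraction argument (this roughly doubles the number of rows):

# sketch: predict with 50% overlap between consecutive 2-second clips
overlapping_scores_df = model.predict(field_recordings, overlap_fraction=0.5)
overlapping_scores_df.head()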
Prediction can also be monitored with WandB by passing a wandb session to the wandb_session argument of predict. This is especially useful for prediction runs that may take a long time.
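A minimal sketch of monitored prediction, assuming an active wandb run (the training session above was already finished with wandb.finish(), so we start a new, hypothetical one here):

# sketch: log prediction progress to a fresh wandb run
prediction_session = wandb.init(project='opensoundscape_tutorials', name='prediction run')
scores = model.predict(field_recordings, wandb_session=prediction_session)
prediction_session.finish()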
Generate boolean predicted class labels (0/1) from the continuous scores
Note: Presence/absence predictions always have some error rates, sometimes large ones. It is not generally advisable to use these binary predictions as scientific observations without a thorough understanding of the model’s false-positive and false-negative rates.
There are two different ways we might want to predict class labels, reflecting the nature of the classes themselves:
* Single-target means that out of a set of classes, one and only one should be chosen for each sample. For instance, if our classes were days of the week, any single sample should be labeled with one and only one day of the week. In OpenSoundscape, use the function predict_single_target_labels() to convert scores to predicted single-target labels. For each sample, the class with the highest score will receive a label of 1 and all other classes will receive a label of 0.
* Multi-target means that a sample can have 0, 1, or more than 1 label. For instance, if our classes were the types of flowers in a photo, any given photo might have none of the classes, one class, or multiple classes at once. In OpenSoundscape, use the function predict_multi_target_labels() to convert scores to predicted multi-target labels. For each sample and each class, the class will be labeled 1 if its score is higher than a user-specified threshold and 0 otherwise. You can choose to use a single threshold for all classes, or specify a different threshold for each class.
[21]:
from opensoundscape.metrics import predict_single_target_labels
score_df = model.predict(field_recordings)
pred_df = predict_single_target_labels(score_df)
pred_df.head()
[21]:
| file | start_time | end_time | woodcock |
|------|------------|----------|----------|
| ./woodcock_labeled_data/field_data/60s_field_data_sample_1.wav | 0.0 | 2.0 | 1 |
| | 2.0 | 4.0 | 1 |
| | 4.0 | 6.0 | 1 |
| | 6.0 | 8.0 | 1 |
| | 8.0 | 10.0 | 1 |
The predict_multi_target_labels function allows you to select a threshold. If a score exceeds that threshold, the binary prediction is 1; otherwise, it is 0. You can also specify a list of thresholds, with one for each class.
[22]:
from opensoundscape.metrics import predict_multi_target_labels
multi_target_pred_df = predict_multi_target_labels(score_df,threshold=0.99)
multi_target_pred_df.head()
[22]:
woodcock | |||
---|---|---|---|
file | start_time | end_time | |
./woodcock_labeled_data/field_data/60s_field_data_sample_1.wav | 0.0 | 2.0 | 1 |
2.0 | 4.0 | 1 | |
4.0 | 6.0 | 1 | |
6.0 | 8.0 | 1 | |
8.0 | 10.0 | 1 |
Note that with multi-target labeling, more than one class can be predicted present for the same clip, because the classes are not assumed to be mutually exclusive. For a presence/absence model like the one above, single-target labeling is more appropriate.
Change the activation layer
We can modify the final activation layer to change the scores returned by the predict() function. Note that this does not impact the results of the binary predictions (described above), which are always calculated using a sigmoid transformation (for multi-target models) or a softmax function (for single-target models).
Options include:
* None (default): just the raw outputs of the network, which are in (-inf, inf)
* 'softmax': scores across all classes will sum to 1 for each sample
* 'softmax_and_logit': softmax the scores across all classes so they sum to 1, then apply the "logit" transformation to these scores, taking them from [0, 1] back to (-inf, inf)
* 'sigmoid': transforms each score individually to [0, 1] without requiring that all scores sum to 1
In this case, since we are just looking at the output of one class, we can use the 'sigmoid' activation layer to put scores on the interval [0, 1].
Let's generate these scores on the validation set. Since these samples are the same length as the training files, we'll specify split_files_into_clips=False (we just want one prediction per file; we don't want to divide each file into shorter clips).
[23]:
valid_scores = model.predict(
validation_df,
activation_layer='sigmoid',
split_files_into_clips=False
)
Compare the sigmoid scores to the true labels for this dataset, side by side:
[24]:
valid_scores.columns = ['pred_woodcock']
validation_df.join(valid_scores).sample(5)
[24]:
| filename | woodcock | pred_woodcock |
|----------|----------|---------------|
| ./woodcock_labeled_data/4afa902e823095e03ba23ebc398c35b7.wav | 1 | 0.989541 |
| ./woodcock_labeled_data/92647ab903049a9ee4125abdf7b24f2a.wav | 1 | 0.969963 |
| ./woodcock_labeled_data/75b2f63e032dbd6d197900495a16856f.wav | 1 | 0.957937 |
| ./woodcock_labeled_data/882de25226ed989b31274eead6630b47.wav | 1 | 0.999741 |
| ./woodcock_labeled_data/ad14ac7ffa729060712b442e55aebf0b.wav | 0 | 0.360559 |
We can directly compare the model's confidence that a woodcock is present with the original labels.
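To put a number on that agreement, a small sketch using sklearn's ROC AUC (sklearn was already used above for the train/validation split):

# sketch: quantify how well the sigmoid scores separate presence from absence
from sklearn.metrics import roc_auc_score
print('validation ROC AUC:', roc_auc_score(validation_df['woodcock'], valid_scores['pred_woodcock']))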
Parallelizing prediction
Two parameters can be used to increase prediction efficiency, depending on the computational resources available:
* num_workers: PyTorch's method of parallelizing across cores (CPUs). Choose 0 to predict on the root process, or >1 to use more than 1 CPU process.
* batch_size: the number of samples to predict on simultaneously. You can try increasing this by factors of two until you get a memory error, which means your batch size is too large for your system.
[25]:
score_df = model.predict(
validation_df,
batch_size=8,
num_workers=0,
)
Multi-class models
A multi-class model can have any number of classes, and can be either:
* multi-target: any number of classes can be positive for one sample
* single-target: exactly one class is positive for each sample
Models that are multi-target benefit from a modified loss function, and we have implemented a special class designed for multi-target problems called ResampleLoss. We can use it as follows:
[26]:
from opensoundscape.ml.cnn import use_resample_loss
model = CNN('resnet18',classes,2.0,single_target=False)
use_resample_loss(model)
print("model.single_target:", model.single_target)
model.single_target: False
Train
Training looks the same as in one-class models.
[27]:
model.train(
train_df,
validation_df,
save_path='./multilabel_train/',
epochs=1,
batch_size=64,
save_interval=100,
num_workers=0,
)
Training Epoch 0
Epoch: 0 [batch 0/1, 0.00%]
DistLoss: nan
Metrics:
Metrics:
MAP: nan
Validation.
Best Model Appears at Epoch 0 with Validation score 0.000.
/Users/SML161/opensoundscape/opensoundscape/ml/cnn.py:695: UserWarning: Recieved empty list of predictions (or all nan)
warnings.warn("Recieved empty list of predictions (or all nan)")
Note: since we used the same data as above, we just trained a 1 class model with “resample loss”. You should not actually use resample loss for single class models!
Predict
Prediction looks the same as demonstrated above, but make sure to think carefully:
* What activation_layer do you want?
* If creating boolean (0/1 or True/False) predictions for each sample and class, is my model single-target (use metrics.predict_single_target_labels) or multi-target (use metrics.predict_multi_target_labels)? A short sketch follows below.
For more detail on these choices, see the sections about activation layers and boolean predictions above.
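For instance, a minimal sketch of a single-target prediction workflow with this model, reusing the helper imported earlier:

# sketch: softmax scores followed by single-target label assignment
scores = model.predict(field_recordings, activation_layer='softmax')
predicted_labels = predict_single_target_labels(scores)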
Save and load models
Models can be easily saved to a file and loaded at a later time. If the model was saved with OpenSoundscape version >=0.6.1, the entire model object will be saved - including the class, cnn architecture, loss function, and training/validation datasets. Models saved with earlier versions of OpenSoundscape do not contain all of this information and may require that you know their class and architecture (see below).
Save and load a model
OpenSoundscape saves models automatically during training:
* The model automatically saves a copy of itself to self.save_path as epoch-X.model every save_interval epochs.
* The model keeps the file best.model updated with the weights that achieve the best score on the validation dataset. By default the model is evaluated using the mean average precision (MAP) score, but you can override model.eval() if you want to use a different metric for choosing the best model.
You can also save the model manually at any time with model.save(path).
[28]:
model1 = CNN('resnet18',classes,2.0,single_target=False)
# Save every 2 epochs
model1.train(
train_df,
validation_df,
epochs=3,
batch_size=8,
save_path='./binary_train/',
save_interval=2,
num_workers=0
)
model1.save('./binary_train/my_favorite.model')
Training Epoch 0
Epoch: 0 [batch 0/3, 0.00%]
DistLoss: 0.835
Metrics:
Metrics:
MAP: 0.751
Validation.
Metrics:
MAP: 1.000
Training Epoch 1
Epoch: 1 [batch 0/3, 0.00%]
DistLoss: 0.411
Metrics:
Metrics:
MAP: 0.819
Validation.
Metrics:
MAP: 1.000
Training Epoch 2
Epoch: 2 [batch 0/3, 0.00%]
DistLoss: 0.450
Metrics:
Metrics:
MAP: 0.978
Validation.
Metrics:
MAP: 1.000
Best Model Appears at Epoch 0 with Validation score 1.000.
Load
Re-load a saved model with the load_model function:
[29]:
from opensoundscape.ml.cnn import load_model
model = load_model('./binary_train/best.model')
Note on saving models and version compatibility
Loading a model in a different version of OpenSoundscape than the version that saved it may not work. To use a model across different versions of OpenSoundscape, save the model.network's state dict using model.save_weights(path), as described in the "predicting with pre-trained models" tutorial. You can load weights from a saved state dict with model.load_weights(path). We recommend saving both the full model object (.save()) and the raw weights (.save_weights()) for models you plan to use in the future.
Models saved with OpenSoundscape 0.4.x and 0.5.x can be loaded with load_outdated_model, but be sure to update the model.preprocessor after loading to match the settings used during training. See the tutorial "predicting with pre-trained models" for more details on loading models from earlier OpenSoundscape versions.
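A minimal sketch of that weights-only workflow, assuming you re-create a CNN with the same architecture and classes before loading:

# sketch: save only the network weights, then load them into a fresh model object
model.save_weights('./binary_train/model_weights.pth')
model2 = CNN('resnet18', classes=classes, sample_duration=2.0)
model2.load_weights('./binary_train/model_weights.pth')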
Predict using saved (or pre-trained) model
Using a saved or downloaded model to run predictions on audio files is as simple as:
1. Loading a previously saved model
2. Generating a list of files for prediction
3. Running model.predict() on the list of files
[30]:
# load the saved model
model = load_model('./binary_train/best.model')
#predict on a dataset
scores = model.predict(field_recordings, activation_layer='sigmoid')
NOTE: See the tutorial “predicting with pre-trained models” for loading and using models from earlier OpenSoundscape versions
Continue training from saved model
Similar to predicting using a saved model, we can also continue to train a model after loading it from a saved file.
Note that load_model() loads the entire model object, which includes optimizer parameters and learning rate parameters from the saved model, in addition to the network weights.
[31]:
# Load the saved model
model = load_model('./binary_train/best.model')
# Continue training from the checkpoint where the model was saved
model.train(train_df,validation_df,save_path='.',epochs=0)
Best Model Appears at Epoch 0 with Validation score 0.000.
Next steps
You have now seen the basic usage of training CNNs with OpenSoundscape and generating predictions.
Additional tutorials you might be interested in:
* Custom preprocessing: how to change spectrogram parameters, modify augmentation routines, etc.
* Custom training: how to modify and customize model training
* Predict with pre-trained CNNs: details on how to predict with pre-trained CNNs. Much of this information was covered in the tutorial above, but that tutorial also includes information about using models made with previous versions of OpenSoundscape.
Finally, clean up and remove files created during this tutorial:
[32]:
import shutil
dirs = ['./multilabel_train', './binary_train', './woodcock_labeled_data']
for d in dirs:
    try:
        shutil.rmtree(d)
    except Exception:
        pass
pass