Customize CNN training

This notebook demonstrates how to use classes from and architectures created using to

  • choose between single-target and multi-target model behavior

  • modify learning rates, learning rate decay schedule, and regularization

  • choose from various CNN architectures

  • train a multi-target model with a special loss function

  • use strategic sampling for imbalanced training data

  • customize preprocessing: train on spectrograms with a bandpassed frequency range

Rather than demonstrating their effects on training (model training is slow!), most examples in this notebook either don’t train the model or “train” it for 0 epochs for the purpose of demonstration.

For an introductory demonstration of model training, please see the “Train a CNN” tutorial. For a demo of how to apply a trained model to a dataset, see the “Predict with pretrained CNNs” tutorial.

Run this tutorial

This tutorial is more than a reference! It’s a Jupyter Notebook which you can run and modify on Google Colab or your own computer.

  • Open In Colab: The link opens the tutorial in Google Colab. Uncomment the “installation” line in the first cell to install OpenSoundscape.

  • Download via DownGit: The link downloads the tutorial file to your computer. Follow the Jupyter installation instructions, then open the tutorial file in Jupyter.

# if this is a Google Colab notebook, install opensoundscape in the runtime environment
if 'google.colab' in str(get_ipython()):
  %pip install opensoundscape


Import needed packages

from opensoundscape.preprocess import preprocessors
from import cnn, cnn_architectures

import torch
import pandas as pd
from pathlib import Path
import numpy as np
import random
import subprocess

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for big visuals
%config InlineBackend.figure_format = 'retina'

Download labeled audio files

The Kitzes Lab has created a small labeled dataset of short clips of American Woodcock vocalizations. You have two options for obtaining the folder of data, called woodcock_labeled_data:

  1. Run the following cell to download this small dataset. These commands require you to have tar installed on your computer, as they will download and unzip a compressed file in .tar.gz format.

  2. Download a .zip version of the files by clicking here. You will have to unzip this folder and place the unzipped folder in the same folder that this notebook is in.

subprocess.run(['curl','','-L', '-o','woodcock_labeled_data.tar.gz']) # Download the data
subprocess.run(["tar","-xzf", "woodcock_labeled_data.tar.gz"]) # Unzip the downloaded tar.gz file
subprocess.run(["rm", "woodcock_labeled_data.tar.gz"]) # Remove the file after its contents are unzipped
CompletedProcess(args=['rm', 'woodcock_labeled_data.tar.gz'], returncode=0)

Prepare audio data

To create a machine learning model, we need two dataframes of labeled clips, one for training and one for validation.

The steps to create these dataframes are described in more detail in other tutorials (e.g. the “Audio annotations” tutorial).

First, we need a dataframe with file paths in the index, so we manipulate the included one_hot_labels.csv slightly.

# Load one-hot labels dataframe
labels = pd.read_csv('./woodcock_labeled_data/one_hot_labels.csv').set_index('file')[['present']]

# Prepend the folder location to the file paths
labels.index = pd.Series(labels.index).apply(lambda f: './woodcock_labeled_data/'+f)

# Create class list
classes = labels.columns

# Inspect
./woodcock_labeled_data/d4c40b6066b489518f8da83af1ee4984.wav 1
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav 0
./woodcock_labeled_data/79678c979ebb880d5ed6d56f26ba69ff.wav 1
./woodcock_labeled_data/49890077267b569e142440fa39b3041c.wav 1
./woodcock_labeled_data/0c453a87185d8c7ce05c5c5ac5d525dc.wav 1

Next, randomly split these data into train and validation sets.

from sklearn.model_selection import train_test_split
train_df, valid_df = train_test_split(labels, test_size=0.2, random_state=0)
print(f"created train_df (len {len(train_df)}) and valid_df (len {len(valid_df)})")
created train_df (len 23) and valid_df (len 6)

Model architectures

We initialize a model object by specifying the architecture, a list of classes, and the duration of individual samples in seconds.

The architecture is the particular design of the CNN. This option can either be a string matching one of the architectures available by default in OpenSoundscape, or a custom PyTorch model object.

Default architectures

The module provides functions to create several common CNN architectures. These architectures are built into PyTorch, but the OpenSoundscape module helps us out by reshaping the final layer to match the number of classes we have.

Note that these will use default architecture parameters, including using pre-trained ImageNet weights. If you don’t want to use pre-trained weights, follow the method below of creating the architecture and passing it to the initialization of CNN.

See what architectures are available by default in OpenSoundscape:


For convenience, we can initialize a model object by providing the name of an architecture as a string, rather than the architecture object.

Create a model with a resnet34 architecture:

model = cnn.CNN(
    architecture = 'resnet34',
    classes = classes,
    sample_duration = 2.0)

For more control over the model architecture, create the architecture object first using the corresponding OpenSoundscape function, then pass it to CNN:

arch = cnn_architectures.resnet50(num_classes=len(classes))

model = cnn.CNN(arch, classes, sample_duration=2.0)

Use random weights

By default, OpenSoundscape’s models download weights pre-trained on ImageNet.

You can instead start from scratch with random weights by passing weights=None when creating an architecture. For instance, let’s create an AlexNet architecture with random weights:

my_arch = cnn_architectures.alexnet(num_classes=len(classes),weights=None)
model = cnn.CNN(my_arch, classes, 2.0)

Other custom architectures

We can create any PyTorch model architecture and pass it to the architecture argument when creating a model in OpenSoundscape. You can do this by

  • subclassing an existing PyTorch model

  • writing one from scratch

The minimum requirement is that it subclasses torch.nn.Module and defines a .forward() method. PyTorch’s autograd derives the backward pass automatically, so the module does not need a .backward() method.
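As a rough illustration of writing an architecture from scratch, here is a minimal sketch in plain PyTorch. The class name TinyCNN, its layer sizes, and the 3-channel 224x224 input are illustrative assumptions, not part of OpenSoundscape. The module defines only forward(); autograd takes care of the backward pass.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Hypothetical minimal CNN: two conv layers, then one linear classifier."""

    def __init__(self, num_classes: int, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse height and width to 1x1
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)       # (batch, 16, 1, 1)
        x = torch.flatten(x, 1)    # (batch, 16)
        return self.classifier(x)  # (batch, num_classes) raw scores


arch = TinyCNN(num_classes=1)
batch = torch.zeros(4, 3, 224, 224)  # 4 fake samples, channels first
scores = arch(batch)
print(scores.shape)  # torch.Size([4, 1])
```

An object like arch could then be passed as the architecture argument of cnn.CNN, in the same way as the architectures created earlier in this tutorial.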

Viewing the architecture

The architecture is stored in the model object’s .network attribute. We can view the network and access its parameters by examining this attribute and its sub-attributes. For instance, we can view a ResNet’s final fully connected (classification) layer using the .fc attribute:

model = cnn.CNN('resnet18', classes, 2.0)
model.network.fc
Linear(in_features=512, out_features=1, bias=True)

It is also possible to replace a model’s architecture entirely by setting model.architecture to a new architecture, but this is not recommended. Doing so completely removes anything the model has “learned,” since the learned weights are part of the architecture.

Single-target models

One decision about your architecture is whether your classification problem is single-target (exactly one label per sample) or multi-target (any number of labels per sample, including 0). Single-target models have a softmax activation layer which forces the sum of all class scores to be 1.0.

This is a separate decision from the number of classes your model can potentially identify. For example, if you are creating a model to identify only one species, your model should contain only one class, but it should still be a multi-target model. This allows your model to predict that the species isn’t present (i.e. the class score can be 0).
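To make this concrete, the sketch below compares the two activations on a one-class model using only the standard library. That multi-target models apply an independent sigmoid to each class is an assumption here (it is the usual choice); the function names are illustrative.

```python
import math


def softmax(logits):
    """Single-target activation: scores are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-target activation: each score is mapped to (0, 1) independently."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# One class, and the network strongly believes the species is absent
one_class_logits = [-6.0]
single_target_scores = softmax(one_class_logits)
multi_target_scores = sigmoid(one_class_logits)

print(single_target_scores)  # softmax over a single class can only return 1.0
print(multi_target_scores)   # sigmoid can report a score close to 0
```

With a single class, softmax has nothing to normalize against, so a single-target model could never report that the species is absent.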

In most cases in bioacoustic monitoring, models are multi-target. But if you would like to train a single-target model, just set single_target=True either when creating the model object or afterwards.

# Change the model to be single_target
model.single_target = True

# Or specify single_target when you create the object
model = cnn.CNN("resnet18", classes, 2.0, single_target=True)

Multi-target training with ResampleLoss

Training multi-target models is challenging and can benefit from a modified loss function. OpenSoundscape provides a loss function designed for this case, and we recommend using it when training multi-target models. You can add it to a model with an in-place helper function:

from import use_resample_loss
model = cnn.CNN('resnet18',classes,2.0)
use_resample_loss(model, train_df=train_df)

Spectrogram settings

The parameters used to create spectrograms are very important for classifier performance. The main way to modify these parameters is by setting a custom preprocessor.

OpenSoundscape also provides an additional option that can affect performance and training speed, the ability to change the size of the input spectrogram.

Custom preprocessing

The preprocessing tutorial gives in-depth descriptions of how to customize your preprocessing pipeline, as well as best practices for using these customizations, e.g. reviewing what the samples look like before training on them.

Here, we’ll just give a quick example of tweaking the preprocessing pipeline: providing the CNN with a bandpassed spectrogram object instead of the full frequency range.

model = cnn.CNN('resnet18', classes, 2.0)

# change the min and max frequencies for the spectrogram bandpass action
model.preprocessor.pipeline.bandpass.set(min_f=3000, max_f=5000)

Size of spectrogram

OpenSoundscape enables you to modify the size of the spectrogram input to the classifier.

Larger spectrograms have greater resolution, which can help the classifier pick up on finer details. However, potential accuracy improvements come at the cost of more resource-intensive training and prediction.

To change the image size, when creating the CNN set sample_shape = (height, width, channels). Most classifier architectures expect 3 channels.

model = cnn.CNN('resnet18', classes, 2.0, sample_shape=(448, 448, 3))
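For a rough sense of the cost, the sketch below counts the input values per sample at two sizes. The 224x224 baseline is an assumption chosen for comparison, and the count is only a proxy for compute and memory, not a measurement.

```python
def values_per_sample(height, width, channels):
    """Number of values in one input sample of the given shape."""
    return height * width * channels


baseline = values_per_sample(224, 224, 3)  # assumed baseline size
larger = values_per_sample(448, 448, 3)    # the shape used in the cell above

# Doubling both height and width quadruples the values in every sample
ratio = larger / baseline
print(baseline, larger, ratio)
```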

Learning parameters

In a general sense, a model’s learning rate determines how fast the model fits to the data. More specifically, it determines how much the model’s weights change every time it calculates the loss function.

Faster learning rates improve the speed of training and help the model leave local minima as it learns to classify, but if the learning rate is too fast, the model may not successfully fit the data or its fitting might be unstable.
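The trade-off can be seen on a toy problem. This sketch runs plain gradient descent on f(w) = w**2 with three learning rates; the function and the specific rates are illustrative assumptions and unrelated to any real CNN.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # the learning rate scales every weight update
    return w


slow = gradient_descent(lr=0.001)    # barely moves from the starting point
good = gradient_descent(lr=0.1)      # approaches the minimum at w = 0
unstable = gradient_descent(lr=1.1)  # overshoots further on every step

print(abs(slow), abs(good), abs(unstable))
```

After the same number of steps, the moderate rate ends closest to the minimum, the small rate has hardly moved, and the large rate has moved away from it.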

OpenSoundscape allows you to flexibly change parameters related to the model’s optimizer. This includes parameters related to the learning rate, as well as the emphasis the model’s training places on learning smaller, less complex weights, known as regularization.

First, let’s look at the model optimization (AKA “learning”) hyperparameters:

{'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}

Options for modifying the learning hyperparameters include:

  • Modify learning rate

  • Fine tune a model

  • Separate learning rates for feature and classifier blocks

  • Modify the learning rate schedule

  • Set the regularization weight decay

Modify learning rate

A basic way to modify the learning rate on an entire model is to change the lr parameter:
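The hyperparameters printed earlier look like a plain dictionary, and later cells in this tutorial assign to keys of a model's optimizer_params attribute. Assuming the attribute behaves like a dictionary, changing the learning rate would be a single key assignment. The sketch uses a stand-in dictionary with the printed default values so that it runs without a model.

```python
# Stand-in for model.optimizer_params, copied from the printed defaults.
# Assumption: the real attribute behaves like this plain dictionary.
optimizer_params = {'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}

# Lower the learning rate tenfold; the other hyperparameters are untouched
optimizer_params['lr'] = 0.001
print(optimizer_params)
```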


Fine tune a model

One instance where we might want to modify a learning rate is to “fine tune” a model.

After training a model for a while at a relatively high learning rate (think 0.01), we might want to “fine tune” the model, or set a lower learning rate, then train the model at the lower rate for a few epochs.

Let’s set a low learning rate for fine tuning:


Separate learning rates for feature and classifier blocks

Convolutional Neural Networks can be thought of as having two parts: a feature extractor which learns how to represent/“see” the input data, and a classifier which takes those representations and transforms them into predictions about the class identity of each sample.

For ResNet architectures, we can modify the learning rates for the feature extraction and classification blocks of the network separately. For example, we can specify a relatively fast learning rate for the classifier and a slower one for the features if we think the features from a pre-trained model are close to optimal but our set of classes differs from the pre-trained model’s.

We first use a helper function to separate the feature and classifier parameters, then specify parameters for each:

from import separate_resnet_feat_clf
r18_model = cnn.CNN('resnet18',classes,2.0)

separate_resnet_feat_clf(r18_model) #in place operation!

#now we can specify separate parameters for the 'feature' and 'classifier' portions of the network
r18_model.optimizer_params['feature']['lr'] = 0.001
r18_model.optimizer_params['classifier']['lr'] = 0.01

{'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}
{'feature': {'lr': 0.001, 'momentum': 0.9, 'weight_decay': 0.0005},
 'classifier': {'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}}
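In plain PyTorch, per-block learning rates like these are expressed as optimizer parameter groups. The sketch below shows the idea on a small stand-in network. Whether OpenSoundscape builds its optimizer this way, and whether it uses SGD, are assumptions; the network and its layer sizes are hypothetical.

```python
import torch
from torch import nn

# Hypothetical stand-in network with a feature block and a classifier block
net = nn.ModuleDict({
    'feature': nn.Sequential(nn.Linear(8, 4), nn.ReLU()),
    'classifier': nn.Linear(4, 1),
})

# Each dictionary is one parameter group with its own learning rate;
# momentum and weight_decay fall back to the shared keyword defaults
optimizer = torch.optim.SGD(
    [
        {'params': net['feature'].parameters(), 'lr': 0.001},
        {'params': net['classifier'].parameters(), 'lr': 0.01},
    ],
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0005,
)

group_lrs = [group['lr'] for group in optimizer.param_groups]
print(group_lrs)  # [0.001, 0.01]
```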

Learning rate schedule

It’s often helpful to decrease the learning rate over the course of training. Shrinking the weight updates as training goes on gradually shifts learning from a coarse search across possible weights to fine-tuning them.

By default, the learning rates are multiplied by 0.7 (the learning rate “cooling factor”) once every 10 epochs (the learning rate “update interval”).

Let’s modify that for a very fast training schedule, where we want to multiply the learning rates by 0.1 every epoch.

model.lr_cooling_factor = 0.1
model.lr_update_interval = 1
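To see what these two settings imply, the sketch below computes the learning rate under a step-decay schedule. The helper lr_at_epoch is hypothetical, and the exact epoch at which each cooling step takes effect (here, after every completed interval, counting epochs from 0) is an assumption.

```python
def lr_at_epoch(initial_lr, epoch, cooling_factor, update_interval):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    n_updates = epoch // update_interval  # completed cooling steps so far
    return initial_lr * cooling_factor ** n_updates


# Default schedule: multiply by 0.7 once every 10 epochs
default_schedule = [lr_at_epoch(0.01, e, 0.7, 10) for e in (0, 9, 10, 20)]

# Fast schedule: multiply by 0.1 every epoch
fast_schedule = [lr_at_epoch(0.01, e, 0.1, 1) for e in (0, 1, 2)]

print(default_schedule)
print(fast_schedule)
```

Under the default schedule the rate is still about half of its starting value after 20 epochs, while the fast schedule drops it a hundredfold within two epochs.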

Set the regularization weight decay

PyTorch optimizers can apply L2 regularization through a weight decay term, which gives training an incentive to keep the model’s weights small rather than large. The goal of this regularization is to reduce overfitting to the training data by reducing the complexity of the model.

Depending on how much emphasis you want to place on the L2 regularization, you can change the weight decay parameter. By default, it is 0.0005. The higher the value for the “weight decay” parameter, the more the model training algorithm prioritizes smaller weights.
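The effect of the weight decay value can be sketched with a single weight and a simplified update rule. This is plain SGD without momentum, with the data gradient set to zero so that only the penalty term acts; the function name and starting weight are illustrative assumptions.

```python
def sgd_step(w, grad, lr, weight_decay):
    """One plain SGD update (no momentum) with L2 weight decay."""
    # The L2 penalty adds weight_decay * w to the gradient,
    # which pulls every weight toward zero
    return w - lr * (grad + weight_decay * w)


w_plain = w_decay = 2.0
for _ in range(1000):
    # A zero data gradient isolates the effect of the penalty term
    w_plain = sgd_step(w_plain, grad=0.0, lr=0.01, weight_decay=0.0)
    w_decay = sgd_step(w_decay, grad=0.0, lr=0.01, weight_decay=0.0005)

print(w_plain, w_decay)
```

Without weight decay the weight stays where it started, while with the default value of 0.0005 it shrinks slowly toward zero. A larger value would shrink it faster.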



In this tutorial we’ve covered the more advanced options available to customize your CNN training.

While intuition can be a helpful guide, it’s not always intuitive which parameters will result in the best model. This is why it’s helpful to experiment with different parameters to see what works for you.

To facilitate experimentation, OpenSoundscape includes integration with Weights & Biases. See the original “Train a CNN” tutorial for more information on how to set this up.

Clean up: Run the following code to remove the downloaded files.

import shutil

for p in Path('.').glob('*.model'):