Advanced CNN training

This notebook demonstrates how to use classes from opensoundscape.torch.models.cnn and architectures created using opensoundscape.torch.architectures.cnn_architectures to

  • choose between single-target and multi-target model behavior
  • modify learning rates, learning rate decay schedule, and regularization
  • choose from various CNN architectures
  • train a multi-target model with a special loss function
  • use strategic sampling for imbalanced training data
  • customize preprocessing: train on spectrograms with a bandpassed frequency range

Rather than demonstrating their effects on training (model training is slow!), most examples in this notebook either don’t train the model or “train” it for 0 epochs for the purpose of demonstration.

For introductory demos (basic training, prediction, saving/loading models), see the “Beginner-friendly training and prediction with CNNs” tutorial (cnn.ipynb).

from opensoundscape.preprocess import preprocessors
from opensoundscape.torch.models import cnn
from opensoundscape.torch.architectures import cnn_architectures

import torch
import pandas as pd
from pathlib import Path
import numpy as np
import random
import subprocess

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for big visuals
%config InlineBackend.figure_format = 'retina'

Prepare audio data

Download labeled audio files

The Kitzes Lab has created a small labeled dataset of short clips of American Woodcock vocalizations. You have two options for obtaining the folder of data, called woodcock_labeled_data:

  1. Run the following cell to download this small dataset. These commands require you to have tar installed on your computer, as they will download and unzip a compressed file in .tar.gz format.
  2. Download a .zip version of the files by clicking here. You will have to unzip this folder and place the unzipped folder in the same folder that this notebook is in.

If you already have these files, you can skip or comment out this cell.

subprocess.run(['curl','','-L', '-o','woodcock_labeled_data.tar.gz']) # Download the data
subprocess.run(["tar","-xzf", "woodcock_labeled_data.tar.gz"]) # Unzip the downloaded tar.gz file
subprocess.run(["rm", "woodcock_labeled_data.tar.gz"]) # Remove the file after its contents are unzipped
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100     7    0     7    0     0      6      0 --:--:--  0:00:01 --:--:--     0
100 4031k  100 4031k    0     0  1405k      0  0:00:02  0:00:02 --:--:-- 3158k
CompletedProcess(args=['rm', 'woodcock_labeled_data.tar.gz'], returncode=0)

Create one-hot encoded labels

See the “Basic training and prediction with CNNs” tutorial for more details.

The audio data includes 2-second audio clips taken from an autonomous recording unit and a CSV of labels. We manipulate the label dataframe to give “one hot” labels - that is, a column for every class, with 1 for present or 0 for absent in each sample’s row. In this case, our classes are simply ‘absent’ for files without a woodcock and ‘present’ for files with a woodcock. Note that these classes are mutually exclusive, so we have a “single-target” problem (as opposed to a “multi-target” problem where multiple classes can simultaneously be present).

For more details on the steps below, see the “basic training and prediction with CNNs” tutorial.

#load Specky output: a table of labeled audio files
specky_table = pd.read_csv(Path("woodcock_labeled_data/woodcock_labels.csv"))
#update the paths to the audio files
specky_table.filename = ['./woodcock_labeled_data/'+f for f in specky_table.filename]

from opensoundscape.annotations import categorical_to_one_hot
one_hot_labels, classes = categorical_to_one_hot(specky_table[['woodcock']].values)
labels = pd.DataFrame(index=specky_table['filename'],data=one_hot_labels,columns=classes)
                                                              absent  present
./woodcock_labeled_data/d4c40b6066b489518f8da83af1ee4984.wav       0        1
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav       1        0
./woodcock_labeled_data/79678c979ebb880d5ed6d56f26ba69ff.wav       0        1
./woodcock_labeled_data/49890077267b569e142440fa39b3041c.wav       0        1
./woodcock_labeled_data/0c453a87185d8c7ce05c5c5ac5d525dc.wav       0        1
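
To make the “one hot” layout concrete, the following is a minimal pure-Python sketch of the same idea. The helper name to_one_hot is hypothetical and used only for illustration; it is not the OpenSoundscape categorical_to_one_hot function, whose exact behavior may differ.

```python
def to_one_hot(labels):
    """Return (rows, classes): one 0/1 row per sample, one column per class."""
    classes = sorted(set(labels))  # fixed, alphabetical column order
    rows = [[1 if label == c else 0 for c in classes] for label in labels]
    return rows, classes

# made-up categorical labels for three samples
rows, classes_demo = to_one_hot(["present", "absent", "present"])
print(classes_demo)
for row in rows:
    print(row)
```

Because each sample here has exactly one label, every row sums to 1, which is what makes this a single-target problem.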

Split into train and validation sets

Randomly split the data into training data and validation data.

from sklearn.model_selection import train_test_split
train_df, valid_df = train_test_split(labels, test_size=0.2, random_state=0)
# for multi-class need at least a few images for each batch

Create Preprocessors

Preprocessors take the audio data specified by the dataframe created above and prepare it for use by Pytorch, e.g., creating spectrograms and performing augmentation. The class CnnPreprocessor contains a set of preprocessing and augmentation parameters that we have developed as a good starting point for general bioacoustics recognition problems. You can modify the preprocessing and augmentation parameters after creating the object. For more detail, see the “Basic training and prediction with CNNs” tutorial and the “Custom preprocessors” tutorial.

train_dataset = preprocessors.CnnPreprocessor(train_df, overlay_df=train_df)

valid_dataset = preprocessors.CnnPreprocessor(valid_df, overlay_df=valid_df, return_labels=True)

Creating a model

In general, we initialize a model object by providing the architecture object (i.e., a pytorch model) and a list of classes.

arch = cnn_architectures.resnet50(num_classes=len(classes))
model = cnn.PytorchModel(arch,classes)
created PytorchModel model object with 2 classes

Alternatively, we can specify the name of an architecture as a string (see “Selecting CNN architectures” below for details):

model = cnn.PytorchModel('resnet18',classes)
created PytorchModel model object with 2 classes

Single-target versus multi-target

One important decision is whether your model is single-target (exactly one label per sample) or multi-target (any number of labels per sample, including 0). Single-target models have a softmax activation layer which forces the sum of all class scores to be 1.0. By default, models are created as multi-target, but you can set single_target=True either when creating the object or afterwards.

#change the model to be single_target
model.single_target = True

#or specify single_target when you create the object
model = cnn.PytorchModel(arch,classes,single_target=True)
created PytorchModel model object with 2 classes
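
The difference between the two behaviors can be illustrated without any model at all. The sketch below applies a softmax and an element-wise sigmoid to the same raw scores. The score values are invented for illustration, and the use of a sigmoid for the multi-target case is an assumption (it is a common choice for independent per-class scores, not something stated above).

```python
import math

def softmax(scores):
    """Scale scores so they are positive and sum to 1 (single-target)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Squash each score into (0, 1) independently (a common multi-target choice)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

raw_scores = [2.0, 1.5]  # hypothetical raw model outputs for two classes
single_target = softmax(raw_scores)
multi_target = sigmoid(raw_scores)
print(single_target, sum(single_target))
print(multi_target, sum(multi_target))
```

With softmax, raising one class score necessarily lowers the others; with independent sigmoids, both classes can score high at the same time.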

Model training parameters

We can modify various parameters about model training, including:

  • The learning rate
  • The learning rate schedule
  • Weight decay for regularization

Let’s take a peek at the current parameters, which are stored in a dictionary (the optimizer_params attribute used in the examples below).

{'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}

Learning rates

The learning rate determines how much the model’s weights change every time it calculates the loss function.

Faster learning rates improve the speed of training and help the model leave local minima as it learns to classify, but if the learning rate is too fast, the model may not successfully fit the data or its fitting might be unstable.

Often after training a model for a while at a relatively high learning rate (think 0.01), we might want to “fine tune” the model by training for a few epochs with a lower learning rate. Let’s set a low learning rate for fine tuning:
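
As a rough sketch of this step, assume the parameters are kept in a flat dictionary like the one printed above. The optimizer_params variable below is a plain stand-in dictionary rather than the model's own attribute, and the toy quadratic loss is invented purely to show why a smaller step size is steadier.

```python
# Stand-in for the parameter dictionary printed above (not the model itself).
optimizer_params = {'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}
optimizer_params['lr'] = 0.001  # lower learning rate for fine tuning

def descend(lr, steps=20, w=1.0):
    """Plain gradient descent on the toy loss f(w) = w**2 (gradient 2*w)."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_small = descend(lr=0.1)  # shrinks steadily toward the minimum at 0
w_large = descend(lr=1.1)  # overshoots further on every step and diverges
print(abs(w_small), abs(w_large))
```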


Separate learning rates for feature and classifier blocks

In the Resnet18Multiclass and Resnet18Binary classes, we can modify the learning rates for the feature extraction and classification blocks of the network separately. For example, we can specify a relatively fast learning rate for the classifier and a slower one for the features, if we think the features from a pre-trained model are close to optimal but we have a different set of classes than the pre-trained model.

r18_model = cnn.Resnet18Binary(classes)
r18_model.optimizer_params['feature']['lr'] = 0.001
r18_model.optimizer_params['classifier']['lr'] = 0.01
created PytorchModel model object with 2 classes
{'feature': {'lr': 0.001, 'momentum': 0.9, 'weight_decay': 0.0005}, 'classifier': {'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}}

Learning rate schedule

It’s often helpful to decrease the learning rate over the course of training. Reducing the amount by which the model’s weights are updated as time goes on causes learning to gradually switch from coarsely searching across possible weights to fine-tuning the weights.

By default, the learning rates are multiplied by 0.7 (the learning rate “cooling factor”) once every 10 epochs (the learning rate “update interval”).

Let’s modify that for a very fast training schedule, where we want to multiply the learning rates by 0.1 every epoch.

model.lr_cooling_factor = 0.1
model.lr_update_interval = 1
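
To see what such a schedule does to the learning rate, here is a small sketch. It assumes the cooling factor is applied once after each completed update interval (a step schedule); lr_at_epoch is a hypothetical helper, not an OpenSoundscape function.

```python
def lr_at_epoch(initial_lr, cooling_factor, update_interval, epoch):
    """Learning rate in effect during `epoch` (0-indexed) under a step schedule."""
    n_updates = epoch // update_interval  # number of times the factor was applied
    return initial_lr * cooling_factor ** n_updates

# default-style schedule: multiply by 0.7 every 10 epochs
default_lrs = [lr_at_epoch(0.01, 0.7, 10, e) for e in (0, 9, 10, 20)]
# fast schedule from the cell above: multiply by 0.1 every epoch
fast_lrs = [lr_at_epoch(0.01, 0.1, 1, e) for e in range(4)]
print(default_lrs)
print(fast_lrs)
```

Under the fast schedule the learning rate falls by three orders of magnitude within three epochs, so it only makes sense for very short training runs.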

Regularization weight decay

Pytorch optimizers perform L2 regularization, giving the optimizer an incentive for the model to have small weights rather than large weights. The goal of this regularization is to reduce overfitting to the training data by reducing the complexity of the model.

Depending on how much emphasis you want to place on the L2 regularization, you can change the weight decay parameter. By default, it is 0.0005. The higher the value for the “weight decay” parameter, the more the model training algorithm prioritizes smaller weights.
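
The effect of the weight decay parameter can be sketched with a single hand-written update step. This assumes plain SGD without momentum, where weight decay adds weight_decay * w to each gradient (the way torch.optim.SGD applies its weight_decay argument); sgd_step is a hypothetical helper for illustration only.

```python
def sgd_step(weights, grads, lr, weight_decay):
    """One SGD update where weight decay adds weight_decay * w to each gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]  # with no loss gradient, only the decay term acts

no_decay = sgd_step(weights, zero_grads, lr=0.1, weight_decay=0.0)
with_decay = sgd_step(weights, zero_grads, lr=0.1, weight_decay=0.5)
print(no_decay)
print(with_decay)
```

With a nonzero weight decay, every weight is pulled toward zero on each step even when the loss gradient is zero; a larger value pulls harder.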


Selecting CNN architectures

The opensoundscape.torch.architectures.cnn_architectures module provides functions to create several common CNN architectures. These architectures are built in to pytorch, but the OpenSoundscape module helps us out by reshaping the final layer to match the number of classes we have.

You could also create a custom architecture by subclassing an existing pytorch model or writing one from scratch. The minimum requirement is that it subclasses torch.nn.Module and implements a .forward() method; the backward pass is computed automatically by pytorch’s autograd.

In general, we can create any pytorch model architecture and pass it to the architecture argument when creating a model in opensoundscape. We can choose whether to use pre-trained (ImageNet) weights or start from scratch (use_pretrained=False for random weights). For instance, let’s create an alexnet architecture with random weights:

my_arch = cnn_architectures.alexnet(num_classes=len(classes),use_pretrained=False)

For convenience, we can also initialize a model object by providing the name of an architecture as a string, rather than the architecture object. For a list of valid architecture names, use cnn_architectures.list_architectures(). Note that these will use default architecture parameters, including using pre-trained ImageNet weights.

['resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152', 'alexnet', 'vgg11_bn', 'squeezenet1_0', 'densenet121', 'inception_v3']
model = cnn.PytorchModel(architecture='resnet18',classes=classes)
created PytorchModel model object with 2 classes

Pretrained weights

In OpenSoundscape, by default, model architectures are initialized with weights pretrained on the ImageNet image database. It takes some time for pytorch to download these weights from an online repository the first time an instance of a particular architecture is created with pretrained weights - pytorch will do this automatically and only once.

Using pretrained weights often speeds up training significantly, as the representation learned from ImageNet is a good start at beginning to interpret spectrograms, even though they are not true “pictures.”

If you prefer not to use pre-trained weights, or if you don’t have an internet connection, you can set the use_pretrained argument to False when creating an architecture:

arch = cnn_architectures.alexnet(num_classes=10,use_pretrained=False)

Freezing the feature extractor

Convolutional Neural Networks can be thought of as having two parts: a feature extractor which learns how to represent/“see” the input data, and a classifier which takes those representations and transforms them into predictions about the class identity of each sample.

You can freeze the feature extractor if you only want to train the final classification layer of the network but not modify any other weights. This could be useful for applying pre-trained classifiers to new data, i.e. “transfer learning”. To do so, set the freeze_feature_extractor argument to True when you create an architecture.

# See the "InceptionV3 class" section below for more information
arch = cnn_architectures.resnet50(num_classes=10, freeze_feature_extractor=True, use_pretrained=False)

InceptionV3 class

The Inception architecture requires slightly different training and preprocessing from the ResNet architectures and the other architectures implemented in OpenSoundscape (see below), because:

  1. the input image shape must be 299x299, and
  2. Inception’s forward pass gives output + auxiliary output.

The InceptionV3 class in cnn handles the necessary modifications in training and prediction for you, but you’ll need to make sure to pass images of the correct shape from your Preprocessor. Here’s an example:

from opensoundscape.torch.models.cnn import InceptionV3

#generate an Inception model
model = InceptionV3(classes=classes,use_pretrained=False)

#create a copy of the training dataset from above
inception_dataset = train_dataset.sample(frac=1)

#modify the preprocessor to give 299x299 image shape

#train and validate for 1 epoch
#note that Inception will complain if batch_size=1

preds, _, _ = model.predict(inception_dataset)
/Users/SML161/opt/miniconda3/envs/opso_py37/lib/python3.7/site-packages/torchvision/models/ FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  ' due to scipy/scipy#11299), please set init_weights=True.', FutureWarning)
created PytorchModel model object with 2 classes

Best Model Appears at Epoch 0 with F1 0.000.
(23, 2)

Changing the architecture of an existing model

The architecture is stored in the model object’s .network attribute. We can access parameters of the network or even replace it entirely:

#initialize the DenseNet architecture
new_arch = cnn_architectures.densenet121(num_classes=2, use_pretrained=False)

# replace the model's current architecture with the densenet architecture
model.network = new_arch

Sampling for imbalanced training data

The imbalanced data sampler will help to ensure that a single batch contains only a few classes during training, and that the classes will receive approximately equal representation within the batch. This may be useful for imbalanced training data (when some classes have far fewer training samples than others). However, in practice it may be better to upsample your training data for equal class representation.

model = cnn.PytorchModel('resnet18',classes)
model.sampler = 'imbalanced' #default is None
#we can now train the model as normal
model.train(train_dataset, valid_dataset, epochs=0)

#once we run train(), we can see that the train_loader is using an ImbalancedDatasetSampler
created PytorchModel model object with 2 classes

Best Model Appears at Epoch 0 with F1 0.000.
<opensoundscape.torch.sampling.ImbalancedDatasetSampler at 0x7fda3cb19510>
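
One common way to balance classes when sampling is to weight each sample by the inverse of its class frequency. The sketch below illustrates that general idea with the standard library only. It is an assumption about the technique, not a description of how ImbalancedDatasetSampler is implemented, and the label list is made up.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

labels_demo = ["absent"] * 8 + ["present"] * 2  # made-up, imbalanced labels
weights = inverse_frequency_weights(labels_demo)

# the total sampling weight per class is now equal
class_weight = Counter()
for label, w in zip(labels_demo, weights):
    class_weight[label] += w

rng = random.Random(0)  # seeded for repeatability
batch = rng.choices(labels_demo, weights=weights, k=1000)
print(dict(class_weight))
print(Counter(batch))
```

Although only 2 of the 10 samples are ‘present’, roughly half of the draws come from that class, because each rare sample is drawn proportionally more often.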

Multi-target training with CnnResampleLoss

Training multi-target models (a.k.a. multi-label: there can be any number of positive labels on each sample) is challenging and can benefit from using a modified loss function. OpenSoundscape provides a subclass of PytorchModel called CnnResampleLoss, which implements a loss function designed for training multi-target models. We recommend using this class rather than PytorchModel when training multi-target models. The use of the class is identical:

model = cnn.CnnResampleLoss('resnet18',classes)
#use as normal...
created PytorchModel model object with 2 classes

Training and predicting with custom preprocessors

The preprocessing tutorial gives in-depth descriptions of how to customize your preprocessing pipeline.

Here, we’ll just give a quick example of tweaking the preprocessing pipeline: providing the CNN with a bandpassed spectrogram object instead of the full frequency range.

It’s good practice to create the validation dataset from the training dataset (after any modifications are made), so that they perform the same preprocessing. You may or may not want to use augmentation on the validation dataset.

Bandpassed spectrograms

model = cnn.PytorchModel('resnet18', classes)

# turn on the bandpass action

# specify the min and max frequencies for the bandpass action
train_dataset.actions.bandpass.set(min_f=3000, max_f=5000)

# create a validation dataset that matches the modified train_dataset
valid_dataset = train_dataset.sample(n=0)
valid_dataset.df = valid_df
#valid_dataset.augmentation_off() #uncomment to turn off augmentation on validation set

# now we can train and validate on the bandpassed spectrograms
# don't forget that you'll need to apply the same bandpass actions to
# any datasets that you use for prediction as well
model.train(train_dataset, valid_dataset, epochs=0)
created PytorchModel model object with 2 classes

Best Model Appears at Epoch 0 with F1 0.000.
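
Conceptually, bandpassing a spectrogram means keeping only the rows whose frequencies fall inside the chosen range. The following sketch shows that idea on a tiny made-up spectrogram stored as nested lists. bandpass_rows is a hypothetical helper and the inclusive bounds are an assumption, so it may not match the bandpass action's exact behavior.

```python
def bandpass_rows(spectrogram, frequencies, min_f, max_f):
    """Keep only the rows whose frequency lies in [min_f, max_f] (inclusive)."""
    kept = [
        (f, row)
        for f, row in zip(frequencies, spectrogram)
        if min_f <= f <= max_f
    ]
    kept_freqs = [f for f, _ in kept]
    kept_rows = [row for _, row in kept]
    return kept_rows, kept_freqs

# made-up spectrogram: one row per frequency bin, one column per time step
frequencies = [1000, 2000, 3000, 4000, 5000, 6000]
spectrogram = [[i, i + 0.5] for i in range(len(frequencies))]

band, band_freqs = bandpass_rows(spectrogram, frequencies, min_f=3000, max_f=5000)
print(band_freqs)
print(band)
```

The model then only ever sees the 3000 to 5000 Hz band, which is why the same range has to be applied to any data used for prediction.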

Matching preprocessing parameters during prediction

If we predict using this model later (even if we load it from a saved file), we can create a dataset with the correct preprocessing parameters using model.train_dataset:

model_from_saved = cnn.load_model('./saved.model')
prediction_preprocessor = model_from_saved.train_dataset.sample(n=0)
#turn off augmentation for prediction
prediction_preprocessor.df = valid_df
print('Bandpassing parameters of prediction preprocessor:')
Bandpassing parameters of prediction preprocessor:
{'min_f': 3000, 'max_f': 5000, 'out_of_bounds_ok': False}

clean up

remove files

import shutil

for p in Path('.').glob('*.model'):