Custom CNN training

This notebook demonstrates how to use the classes in opensoundscape.torch.models.cnn to

  • schedule the learning rate decay
  • choose from various architectures
  • use strategic sampling for imbalanced training data
  • train on spectrograms with a bandpassed frequency range

Rather than demonstrating their effects on training (model training is slow!), most examples in this notebook either don’t train the model or “train” it for 0 epochs for the purpose of demonstration.

For introductory demos (basic training, prediction, saving/loading models), see the “basic training and prediction with CNNs” tutorial (cnn.ipynb).

[20]:
from opensoundscape.preprocess.preprocessors import BasePreprocessor, AudioToSpectrogramPreprocessor, CnnPreprocessor
from opensoundscape.torch.models.cnn import PytorchModel, Resnet18Multiclass, Resnet18Binary, InceptionV3

import torch
import pandas as pd
from pathlib import Path
import numpy as np
import random
import subprocess

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for big visuals
%config InlineBackend.figure_format = 'retina'

Prepare audio data

Download labeled audio files

The Kitzes Lab has created a small labeled dataset of short clips of American Woodcock vocalizations. You have two options for obtaining the folder of data, called woodcock_labeled_data:

  1. Run the following cell to download this small dataset. These commands require you to have tar installed on your computer, as they will download and unzip a compressed file in .tar.gz format.
  2. Download a .zip version of the files by clicking here. You will have to unzip this folder and place the unzipped folder in the same folder that this notebook is in.

If you already have these files, you can skip or comment out this cell.

[21]:
subprocess.run(['curl','https://pitt.box.com/shared/static/79fi7d715dulcldsy6uogz02rsn5uesd.gz','-L', '-o','woodcock_labeled_data.tar.gz']) # Download the data
subprocess.run(["tar","-xzf", "woodcock_labeled_data.tar.gz"]) # Unzip the downloaded tar.gz file
subprocess.run(["rm", "woodcock_labeled_data.tar.gz"]) # Remove the file after its contents are unzipped
[21]:
CompletedProcess(args=['rm', 'woodcock_labeled_data.tar.gz'], returncode=0)

Create one-hot encoded labels

See the “Basic training and prediction with CNNs” tutorial for more details.

The audio data includes 2-second audio clips taken from an autonomous recording unit and a CSV of labels. We manipulate the label dataframe to give “one hot” labels: a column for every class, with 1 for present or 0 for absent in each sample’s row. In this case, our classes are simply ‘absent’ for files without a woodcock and ‘present’ for files with a woodcock. Note that these classes are mutually exclusive, so we have a “single-target” problem (as opposed to a “multi-target” problem where multiple classes can simultaneously be present).

For more details on the steps below, see the “basic training and prediction with CNNs” tutorial.

[22]:
#load Specky output: a table of labeled audio files
specky_table = pd.read_csv(Path("woodcock_labeled_data/woodcock_labels.csv"))
#update the paths to the audio files
specky_table.filename = ['./woodcock_labeled_data/'+f for f in specky_table.filename]

from opensoundscape.annotations import categorical_to_one_hot
one_hot_labels, classes = categorical_to_one_hot(specky_table[['woodcock']].values)
labels = pd.DataFrame(index=specky_table['filename'],data=one_hot_labels,columns=classes)
labels.head()
[22]:
present absent
filename
./woodcock_labeled_data/d4c40b6066b489518f8da83af1ee4984.wav 1 0
./woodcock_labeled_data/e84a4b60a4f2d049d73162ee99a7ead8.wav 0 1
./woodcock_labeled_data/79678c979ebb880d5ed6d56f26ba69ff.wav 1 0
./woodcock_labeled_data/49890077267b569e142440fa39b3041c.wav 1 0
./woodcock_labeled_data/0c453a87185d8c7ce05c5c5ac5d525dc.wav 1 0
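
The one-hot layout shown above can be illustrated in plain Python. This is only a sketch of the idea, not the implementation of categorical_to_one_hot; the label values are hypothetical, and ordering the columns by first appearance is an assumption made for the example.

```python
def one_hot(labels):
    """Return (rows, classes) where each row has a 1 in its label's column."""
    # Assumption: columns are ordered by first appearance in the data
    classes = list(dict.fromkeys(labels))
    rows = [[1 if label == cls else 0 for cls in classes] for label in labels]
    return rows, classes


# Hypothetical labels for five clips
clip_labels = ["present", "absent", "present", "present", "present"]
rows, classes_found = one_hot(clip_labels)
print(classes_found)
for row in rows:
    print(row)
```

Because the classes are mutually exclusive, every row contains exactly one 1.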

Split into train and validation sets

Randomly split the data into training data and validation data.

[23]:
from sklearn.model_selection import train_test_split
train_df, valid_df = train_test_split(labels, test_size=0.2, random_state=0)
# for multi-class need at least a few images for each batch
len(train_df)
[23]:
23
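
For intuition, a random split like the one above can be sketched with the standard library alone. This is not how train_test_split is implemented, and the shuffled order will differ from scikit-learn's. The clip names are hypothetical, and 29 clips is an assumption chosen so that the training set has the 23 rows reported above.

```python
import math
import random


def split_train_valid(items, test_size=0.2, seed=0):
    """Shuffle items and split them into (train, valid) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Round the validation count up, as scikit-learn does for a fractional test_size
    n_valid = math.ceil(test_size * len(shuffled))
    return shuffled[n_valid:], shuffled[:n_valid]


# 29 hypothetical clip names
clips = [f"clip_{i:02d}.wav" for i in range(29)]
train_clips, valid_clips = split_train_valid(clips)
print(len(train_clips), len(valid_clips))
```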

Create Preprocessors

Preprocessors take the audio data specified by the dataframe created above and prepare it for use by Pytorch, e.g., creating spectrograms and performing augmentation. For more detail, see the “Basic training and prediction with CNNs” tutorial and the “Custom preprocessors” tutorial.

[24]:
from opensoundscape.preprocess.preprocessors import CnnPreprocessor

train_dataset = CnnPreprocessor(train_df, overlay_df=train_df)

valid_dataset = CnnPreprocessor(valid_df, overlay_df=valid_df, return_labels=True)

Model training parameters

We can modify various parameters about model training, including:

  • The learning rate
  • The learning rate schedule
  • Weight decay for regularization

Let’s take a peek at the current parameters, stored in a dictionary.

[25]:
from opensoundscape.torch.models.cnn import Resnet18Binary
model = Resnet18Binary(classes)
model.optimizer_params
created PytorchModel model object with 2 classes
[25]:
{'feature': {'lr': 0.001, 'momentum': 0.9, 'weight_decay': 0.0005},
 'classifier': {'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005}}

Learning rates

The learning rate determines how much the model’s weights change every time it calculates the loss function.

Higher learning rates speed up training and help the model leave local minima as it learns to classify, but if the learning rate is too high, the model may fail to fit the data or its fitting may be unstable.
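
The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, which has nothing to do with spectrograms or the OpenSoundscape optimizer; the three learning rates are arbitrary values chosen to show slow, fast, and unstable behavior.

```python
def descend(lr, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final weight."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # standard gradient descent update
    return w


for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: final |w| = {abs(descend(lr)):.3g}")
```

With the smallest rate the weight is still far from the minimum at 0 after 20 steps, the middle rate gets very close, and the largest rate overshoots on every step so the weight grows instead of shrinking.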

In Resnet18Multiclass and Resnet18Binary, we can modify the learning rates for the feature extraction and classification blocks of the network separately. For example, we can specify a relatively high learning rate for features and a lower one for the classifier (though this might not be a good idea in practice):

[26]:
model = Resnet18Binary(classes)
model.optimizer_params['feature']['lr'] = 0.01
model.optimizer_params['classifier']['lr'] = 0.001
created PytorchModel model object with 2 classes

Learning rate schedule

It’s often helpful to decrease the learning rate over the course of training. Reducing the size of the weight updates as training goes on gradually shifts the learning from a coarse search across possible weights to fine-tuning them.

By default, the learning rates are multiplied by 0.7 (the learning rate “cooling factor”) once every 10 epochs (the learning rate “update interval”).

Let’s modify that for a very fast training schedule, where we want to multiply the learning rates by 0.1 every epoch.

[27]:
model.lr_cooling_factor = 0.1
model.lr_update_interval = 1
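
To see what these two settings imply, the learning rate at each epoch can be computed by hand. The helper below is a hypothetical illustration of step decay, not a function from OpenSoundscape; it assumes the rate is multiplied by the cooling factor once per completed update interval, and the exact epoch at which the library applies each update may differ.

```python
def scheduled_lr(initial_lr, epoch, cooling_factor=0.7, update_interval=10):
    """Learning rate at a given epoch under step decay."""
    # Assumption: one multiplication by cooling_factor per completed interval
    return initial_lr * cooling_factor ** (epoch // update_interval)


# Default schedule: multiply by 0.7 every 10 epochs
for epoch in (0, 9, 10, 20):
    print(epoch, scheduled_lr(0.01, epoch))

# Fast schedule: multiply by 0.1 every epoch
for epoch in range(3):
    print(epoch, scheduled_lr(0.01, epoch, cooling_factor=0.1, update_interval=1))
```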

Regularization weight decay

The Resnet18 classes perform L2 regularization, giving the optimizer an incentive for the model to have small weights rather than large weights. The goal of this regularization is to reduce overfitting to the training data by reducing the complexity of the model.

Depending on how much emphasis you want to place on the L2 regularization, you can change the weight decay parameter. By default, it is 0.0005. The higher the value for the “weight decay” parameter, the more the model training algorithm prioritizes smaller weights.

[28]:
model.optimizer_params['feature']['weight_decay']=0.001
model.optimizer_params['classifier']['weight_decay']=0.001
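
The effect of the weight decay value can be illustrated with a single weight. The sketch below assumes the common SGD formulation in which weight_decay * w is added to the gradient, and it leaves out momentum for simplicity; the function names are made up for this example.

```python
def sgd_step(w, grad, lr=0.01, weight_decay=0.0005):
    """One plain SGD step with L2 weight decay (no momentum)."""
    # The decay term adds weight_decay * w to the gradient, pulling w toward 0
    return w - lr * (grad + weight_decay * w)


def decay_only(weight_decay, steps=1000, w=1.0):
    """Apply repeated steps with a zero data gradient, so only the decay acts."""
    for _ in range(steps):
        w = sgd_step(w, grad=0.0, weight_decay=weight_decay)
    return w


for wd in (0.0005, 0.001, 0.1):
    print(f"weight_decay={wd}: w after 1000 steps = {decay_only(wd):.4f}")
```

The larger the weight decay, the faster the weight shrinks toward zero when the data gives it no reason to stay large.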

Pretrained weights

In OpenSoundscape, most of the implemented architectures can use weights pretrained on the ImageNet image database, and this option is turned on by default. Downloading these weights takes some time the first time a model is created with pretrained weights.

Using pretrained weights often speeds up training significantly, as the representation learned from ImageNet is a good start at beginning to interpret spectrograms, even though they are not true “pictures.”

Currently, this feature cannot be turned off in the Resnet18 classes. However, if you prefer, you can turn this off in many classes when creating a custom architecture (see “Network architectures” below) by changing the use_pretrained argument to False, e.g.:

[29]:
# See "InceptionV3 architecture" section below for more information
model = InceptionV3(classes, use_pretrained=False)
created PytorchModel model object with 2 classes

Freezing the feature extractor

Convolutional Neural Networks can be thought of as having two parts: a feature extractor which learns how to represent/“see” the input data, and a classifier which takes those representations and transforms them into predictions about the class identity of each sample.

You can freeze the feature extractor if you only want to train the final classification layer of the network but not modify any other weights. This could be useful for applying pre-trained classifiers to new data. To do so, set the freeze_feature_extractor argument to True. Below, we set the use_pretrained argument to False to avoid downloading the weights.

[30]:
# See "InceptionV3 architecture" section below for more information
model = InceptionV3(classes, freeze_feature_extractor=True, use_pretrained=False)
created PytorchModel model object with 2 classes

Network architecture

It is possible to use a different model architecture than ResNet18. The opensoundscape.torch.models.cnn module (https://github.com/kitzeslab/opensoundscape/blob/master/opensoundscape/torch/models/cnn.py) contains two types of classes for doing so:

  • Custom classes for both the ResNet18 architecture (Resnet18Binary and Resnet18Multiclass) and the InceptionV3 architecture (InceptionV3 and InceptionV3ResampleLoss).
  • The PytorchModel class, which allows us to create a model with a different CNN architecture. The available architectures are listed in opensoundscape.torch.architectures.cnn_architectures (https://github.com/kitzeslab/opensoundscape/blob/master/opensoundscape/torch/architectures/cnn_architectures.py).

Below, we demonstrate the use of InceptionV3, how to create instances of other architectures, and how to change the architecture of a model.

InceptionV3 architecture

The Inception architecture requires slightly different training and preprocessing from the ResNet architectures and the other architectures implemented in OpenSoundscape (see below), because:

  1. the input image shape must be 299x299, and
  2. Inception’s forward pass gives output + auxiliary output.

The InceptionV3 class in cnn handles the necessary modifications in training and prediction for you, but you’ll need to make sure to pass images of the correct shape from your Preprocessor. Here’s an example:

[31]:
from opensoundscape.torch.models.cnn import InceptionV3

#generate an Inception model
model = InceptionV3(classes=classes,use_pretrained=False)

#create a copy of the training dataset
inception_dataset = train_dataset.sample(frac=1)

#modify the preprocessor to give 299x299 image shape
inception_dataset.actions.to_img.set(shape=[299,299])

#train and validate for 1 epoch
#note that Inception will complain if batch_size=1
model.train(inception_dataset,inception_dataset,epochs=1,batch_size=4)

#predict
preds, _, _ = model.predict(inception_dataset)
created PytorchModel model object with 2 classes
Epoch: 0 [batch 0/6 (0.00%)]
        Jacc: 0.500 Hamm: 0.500 DistLoss: 1.139

Validation.
(23, 2)
         Precision: 0.391304347826087
         Recall: 0.5
         F1: 0.4390243902439025
Saving weights, metrics, and train/valid scores.
Saving to epoch-0.model
Updating best model
Saving to best.model

Best Model Appears at Epoch 0 with F1 0.439.
(23, 2)

Pytorch stock architectures

The opensoundscape.torch.architectures.cnn_architectures module provides helper functions to generate various CNN architectures in Pytorch. These are well-known CNN architectures that Pytorch provides out of the box. The architectures provided include:

  • Other ResNet types (resnet34, resnet50, resnet101, resnet152)
  • AlexNet
  • Vgg11
  • Squeezenet
  • Densenet121

Also implemented are ResNet18 and InceptionV3, but in most cases, you should use the pre-implemented classes for those instead of loading them into a PytorchModel.

Calling a function from this module, e.g. alexnet(), will return a CNN architecture that we can use to instantiate a PytorchModel.

Below and in the following examples, we set use_pretrained=False to avoid downloading all of the weights for these models.

[32]:
from opensoundscape.torch.architectures.cnn_architectures import alexnet
from opensoundscape.torch.models.cnn import PytorchModel

#initialize the AlexNet architecture
arch = alexnet(num_classes=2, use_pretrained=False)

#generate a model object with this architecture
model = PytorchModel(architecture=arch, classes=['negative','positive'])
created PytorchModel model object with 2 classes

Changing the architecture of an existing model

Even after initializing a model with an architecture, we can change it by replacing the model’s .network:

[33]:
from opensoundscape.torch.architectures.cnn_architectures import densenet121

#initialize the DenseNet121 architecture
arch = densenet121(num_classes=2, use_pretrained=False)

# replace the alexnet architecture with the densenet architecture
model.network = arch

Use a custom-built architecture

You can also build a custom architecture and initialize a PytorchModel model with it, or replace a model’s .network with your custom architecture.

For example, we can use the architectures.resnet module to build the ResNet50 architecture (just for demonstration - we could also simply create this architecture using the resnet50() function in the cnn_architectures module).

[34]:
# import a module that builds ResNet architecture from scratch
from opensoundscape.torch.architectures.resnet import ResNetArchitecture

#initialize the ResNet50 architecture
net=ResNetArchitecture(
    num_cls=2,
    weights_init='ImageNet',
    num_layers=50,
)

#generate a regular resnet18 object
model = Resnet18Multiclass(classes=['negative','positive'])

#replace the model's network with the ResNet50 architecture
model.network = net

print('number of layers:')
print(model.network.num_layers)
created PytorchModel model object with 2 classes
number of layers:
50

Sampling for imbalanced training data

The imbalanced data sampler will help to ensure that a single batch contains only a few classes during training, and that the classes will receive approximately equal representation within the batch. This is useful for imbalanced training data (when some classes have far fewer training samples than others).
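
One common way to get roughly equal representation is to draw samples with probability inversely proportional to the size of their class. The sketch below shows that idea with the standard library only. It is an assumption about the general technique, not the code of ImbalancedDatasetSampler, and the dataset is hypothetical.

```python
import random
from collections import Counter


def balanced_sample(labels, n_draws, seed=0):
    """Draw sample indices with probability inversely proportional to class size."""
    counts = Counter(labels)
    # Each sample's weight is 1 / (size of its class), so every class has total weight 1
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=n_draws)


# Hypothetical imbalanced dataset: 90 'absent' clips and 10 'present' clips
labels_imbalanced = ["absent"] * 90 + ["present"] * 10
drawn = balanced_sample(labels_imbalanced, n_draws=10000)
print(Counter(labels_imbalanced[i] for i in drawn))
```

Although 'present' clips make up only 10% of this dataset, they account for about half of the draws, because each one is sampled (with replacement) far more often than any single 'absent' clip.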

[35]:
model = Resnet18Binary(classes)
model.sampler = 'imbalanced' #default is None

#...you can now train your model as normal
model.train(train_dataset, valid_dataset, epochs=0)

#once we run train(), we can see that the train_loader is using an ImbalancedDatasetSampler
print('sampler:')
model.train_loader.sampler
created PytorchModel model object with 2 classes

Best Model Appears at Epoch 0 with F1 0.000.
sampler:
[35]:
<opensoundscape.torch.sampling.ImbalancedDatasetSampler at 0x7fb80a937ac0>

Training with custom preprocessors

The preprocessing tutorial gives in-depth descriptions of how to customize your preprocessing pipeline.

Here, we’ll just give a quick example of tweaking the preprocessing pipeline: providing the CNN with a bandpassed spectrogram object instead of the full frequency range.

Bandpassed spectrograms

[36]:
model = Resnet18Binary(classes)

# turn on the bandpass action of the datasets
train_dataset.actions.bandpass.on()
valid_dataset.actions.bandpass.on()

# specify the min and max frequencies for the bandpass action
train_dataset.actions.bandpass.set(min_f=3000, max_f=5000)
valid_dataset.actions.bandpass.set(min_f=3000, max_f=5000)

# now we can train and validate on the bandpassed spectrograms
# don't forget that you'll need to apply the same bandpass actions to
# any datasets that you use for predicting on new audio files
model.train(train_dataset, valid_dataset, epochs=0)
created PytorchModel model object with 2 classes

Best Model Appears at Epoch 0 with F1 0.000.
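
Conceptually, bandpassing a spectrogram means keeping only the rows whose frequencies fall inside the chosen range. The sketch below illustrates this on a tiny hypothetical spectrogram stored as nested lists. It is not the library's bandpass action, and treating both boundaries as inclusive is an assumption.

```python
def bandpass_rows(spectrogram, frequencies, min_f, max_f):
    """Keep only the spectrogram rows whose frequency lies in [min_f, max_f]."""
    kept = [
        (f, row)
        for f, row in zip(frequencies, spectrogram)
        if min_f <= f <= max_f
    ]
    kept_frequencies = [f for f, _ in kept]
    kept_rows = [row for _, row in kept]
    return kept_rows, kept_frequencies


# Hypothetical spectrogram: 11 frequency bins (0 to 10 kHz) by 4 time frames
frequencies = [1000 * i for i in range(11)]
spectrogram = [[float(i)] * 4 for i in range(11)]

band_rows, band_freqs = bandpass_rows(spectrogram, frequencies, min_f=3000, max_f=5000)
print(band_freqs)
print(len(band_rows), "rows x", len(band_rows[0]), "frames")
```

The time axis is untouched; only the frequency axis is cropped, so the network sees a narrower band of the same clip.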

Clean up

Remove files

[37]:
import shutil
shutil.rmtree('./woodcock_labeled_data')

for p in Path('.').glob('*.model'):
    p.unlink()