Transfer learning: training shallow classifiers on embedding model outputs

NOTE: The primary class for transfer learning is now SongSpace. Please see the SongSpace tutorial notebook for a demonstration of the workflow to embed samples to a database and train shallow classifiers. Direct use of the APIs demonstrated in this notebook is for advanced users.

If you want to adapt BirdNET, Perch, HawkEars, or another foundation model to a new set of classes or a new domain, you’ll be doing what machine learning experts call transfer learning. This tutorial demonstrates transfer learning for PyTorch models. You can do the same with TensorFlow models such as BirdNET and Perch; those examples are in a separate notebook, training_birdnet_and_perch.ipynb, since you might need to set up your Python environment differently (by installing the tensorflow and tensorflow-hub packages).

Training shallow classification heads on fixed embeddings

This notebook first shows examples of how to train simple one-layer or multi-layer fully-connected neural networks (aka multi-layer perceptron networks, MLPs) on embeddings (aka features) generated by a pre-trained deep learning model. This workflow is called transfer learning because the learned feature extraction of the embedding model is transferred to a new domain. Ghani et al. [1] demonstrated that global bird classification models can act as feature extractors that can be used to train shallow classifiers on novel tasks and domains, even when few training samples are available.

Training a shallow classifier on embeddings, rather than training or fine-tuning an entire deep learning model, has three advantages: (1) classifiers can be developed with just a handful of training examples; (2) models fit very quickly, enabling an iterative human-in-the-loop workflow for active learning; (3) any model that generates embeddings can be used as the feature extractor; in particular, compiled models without open-source weights (e.g. BirdNET [2]) can be used as the embedding model.
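To make the idea concrete, here is a minimal, library-free sketch of the workflow: a frozen feature extractor produces embeddings once, and only a small linear head is trained on them. The `embed` function and its fixed random weights are hypothetical stand-ins for a pre-trained model such as HawkEars; the toy data and hyperparameters are illustrative assumptions and are not part of OpenSoundscape.

```python
import math
import random

rng = random.Random(0)
N_INPUTS, N_FEATURES = 6, 4

# Hypothetical stand-in for a frozen, pre-trained embedding model:
# fixed random weights that are never updated during training.
FROZEN_WEIGHTS = [[rng.gauss(0, 1) for _ in range(N_INPUTS)] for _ in range(N_FEATURES)]

def embed(x):
    """Map a raw input vector to an embedding (feature) vector."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

# Toy two-class data: class 1 inputs are shifted relative to class 0
inputs = [
    [rng.gauss(2.0 * label - 1.0, 1.0) for _ in range(N_INPUTS)]
    for label in (0, 1)
    for _ in range(50)
]
labels = [label for label in (0, 1) for _ in range(50)]

# Embed every sample once; the embeddings are reused for all training steps
embeddings = [embed(x) for x in inputs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights, bias):
    """Binary cross-entropy of the linear head on the pre-computed embeddings."""
    total = 0.0
    for e, y in zip(embeddings, labels):
        p = sigmoid(sum(w * ei for w, ei in zip(weights, e)) + bias)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

weights, bias, learning_rate = [0.0] * N_FEATURES, 0.0, 0.1
initial_loss = mean_loss(weights, bias)

for step in range(200):
    grad_w, grad_b = [0.0] * N_FEATURES, 0.0
    for e, y in zip(embeddings, labels):
        error = sigmoid(sum(w * ei for w, ei in zip(weights, e)) + bias) - y
        grad_w = [g + error * ei / len(labels) for g, ei in zip(grad_w, e)]
        grad_b += error / len(labels)
    # only the head's parameters change; FROZEN_WEIGHTS stay fixed
    weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b

final_loss = mean_loss(weights, bias)
print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because the embeddings are computed only once, each training step touches just a handful of parameters, which is why fitting is fast enough for an interactive workflow.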

Users can develop flexible and customizable transfer-learning workflows by generating embeddings and then using PyTorch or sklearn directly. This notebook demonstrates both (1) high-level functions and classes in OpenSoundscape that simplify the code needed to perform transfer learning; and (2) examples demonstrating the embedding and model fitting steps explicitly line-by-line.

Fine tuning pre-trained models

When you have large quantities of training data (e.g. thousands of samples), you might get the best performance by training the entire deep learning model, starting from a pre-trained model. Rather than using embeddings from a fixed foundation model, you train the feature extractor as well as the classification head. We demonstrate how to fine-tune in the second half of this tutorial. Fine tuning is much slower and more susceptible to over-fitting than training shallow classifiers, and may require tuning “hyper-parameters”. In general, it is similar to full model training as described in the train_cnn.ipynb tutorial notebook.

Note: to use models from the model zoo, install the bioacoustics_model_zoo as a package in your python environment:

pip install bioacoustics-model-zoo

[1] Ghani, B., T. Denton, S. Kahl, and H. Klinck. 2023. Global birdsong embeddings enable superior transfer learning for bioacoustic classification. Scientific Reports 13:22876.

[2] Kahl, Stefan, et al. “BirdNET: A deep learning solution for avian diversity monitoring.” Ecological Informatics 61 (2021): 101236.

Run this tutorial

This tutorial is more than a reference! It’s a Jupyter Notebook which you can run and modify on Google Colab or your own computer.

Link to tutorial	How to run tutorial
	The link opens the tutorial in Google Colab. Uncomment the “installation” line in the first cell to install OpenSoundscape.
	The link downloads the tutorial file to your computer. Follow the Jupyter installation instructions, then open the tutorial file in Jupyter.

[1]:

# if this is a Google Colab notebook, install opensoundscape in the runtime environment
if 'google.colab' in str(get_ipython()):
  %pip install "opensoundscape==0.13.0" "jupyter-client<8,>=5.3.4" "ipykernel==6.17.1" "bioacoustics-model-zoo==0.12.0"
  num_workers=0
else:
  # choose cpu parallelization count
  num_workers=4

Setup

Import needed packages

[2]:

# other utilities and packages
import torch
import pandas as pd
from pathlib import Path
import numpy as np
import random
from glob import glob
import sklearn
import sklearn.model_selection  # used below for train_test_split

from tqdm.autonotebook import tqdm
from sklearn.metrics import average_precision_score, roc_auc_score

#set up plotting
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for large visuals
%config InlineBackend.figure_format = 'retina'

# opensoundscape transfer learning tools
from opensoundscape.ml.shallow_classifier import MLPClassifier, fit, fit_classifier_on_embeddings
import opensoundscape as opso
import bioacoustics_model_zoo as bmz


Set random seeds

Set manual seeds for PyTorch and Python. These essentially “fix” the results of any stochastic steps in model training, ensuring that training results are reproducible. You probably don’t want to do this when you actually train your model, but it’s useful for debugging.

[3]:

opso.set_seed(0)
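As a rough illustration of why seeding makes stochastic steps repeatable, the sketch below shuffles sample indices with a seeded generator. It does not reproduce what `opso.set_seed` does internally; it only shows the general principle using Python's standard `random` module, and `sample_batch_order` is a hypothetical helper.

```python
import random

def sample_batch_order(seed, n_samples=10):
    """Shuffle sample indices the way a training loop might, using a seeded generator."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    return order

# the same seed always reproduces the same "random" order
run_a = sample_batch_order(seed=0)
run_b = sample_batch_order(seed=0)
print(run_a == run_b)
```

With the seed fixed, two runs visit samples in the same order, so differences between runs can be attributed to code changes rather than to chance.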

Download and prepare training data

Download example files

Download a set of aquatic soundscape recordings with annotations of Rana sierrae vocalizations.

Option 1: run the cell below

If you get a 403 error, DataDryad suspects you are a bot. Use Option 2 instead.

Option 2:

Download and unzip the rana_sierrae_2022.zip folder containing audio and annotations from this public Dryad dataset
Move the unzipped rana_sierrae_2022 folder into the current folder

[4]:

# # Note: the "!" preceding each line below allows us to run bash commands in a Jupyter notebook
# # If you are not running this code in a notebook, input these commands into your terminal instead
# !wget -O rana_sierrae_2022.zip https://datadryad.org/stash/downloads/file_stream/2722802;
# !unzip rana_sierrae_2022;

Prepare audio data

See the train_cnn.ipynb tutorial for a step-by-step walkthrough of this process, or just run the cells below to prepare a training set.

[5]:

# Set this variable to specify where the folder `rana_sierrae_2022` is located:
dataset_path = Path("./rana_sierrae_2022/")

# let's generate clip labels of 5s duration (to match HawkEars) using the raven annotations
# and some utility functions from opensoundscape
from opensoundscape.annotations import BoxedAnnotations

audio_and_raven_files = pd.read_csv(f"{dataset_path}/audio_and_raven_files.csv")
# update the paths to where we have the audio and raven files stored
audio_and_raven_files["audio"] = audio_and_raven_files["audio"].apply(
    lambda x: f"{dataset_path}/{x}"
)
audio_and_raven_files["raven"] = audio_and_raven_files["raven"].apply(
    lambda x: f"{dataset_path}/{x}"
)

annotations = BoxedAnnotations.from_raven_files(
    raven_files=audio_and_raven_files["raven"],
    audio_files=audio_and_raven_files["audio"],
    annotation_column="annotation",
)
# generate labels for 5s clips, including any labels that overlap by at least 0.2 seconds
labels = annotations.clip_labels(clip_duration=5, min_label_overlap=0.2)

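The effect of `clip_duration` and `min_label_overlap` can be sketched without any library. The function below is a simplified, hypothetical re-implementation for illustration only: it ignores details that `BoxedAnnotations.clip_labels` may handle differently (for example, annotations shorter than the overlap threshold, or partial final clips).

```python
def clip_labels(annotations, audio_duration, clip_duration=5.0, min_label_overlap=0.2):
    """Build one-hot clip labels from (start_time, end_time, label) annotations in seconds."""
    classes = sorted({label for _, _, label in annotations})
    rows = []
    n_clips = int(audio_duration // clip_duration)  # drop any partial final clip
    for i in range(n_clips):
        clip_start = i * clip_duration
        clip_end = clip_start + clip_duration
        row = {"start_time": clip_start, "end_time": clip_end}
        row.update({c: 0 for c in classes})
        for start, end, label in annotations:
            # duration of the annotation that falls inside this clip
            overlap = min(end, clip_end) - max(start, clip_start)
            if overlap >= min_label_overlap:
                row[label] = 1
        rows.append(row)
    return rows

# "A" straddles the 5 s boundary; "B" lasts only 0.1 s
toy_annotations = [(4.9, 5.6, "A"), (12.0, 12.1, "B")]
rows = clip_labels(toy_annotations, audio_duration=15.0)
for row in rows:
    print(row)
```

In this toy example the call labeled "A" overlaps the first clip by only 0.1 s, so it is assigned to the second clip alone, and the 0.1 s annotation "B" never reaches the 0.2 s threshold.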

Inspect labels

Count number of each annotation type:

Note that the ‘X’ label is for when the annotator was uncertain about the identity of a call. Labels A-E denote distinct call types.

[6]:

labels.sum()

[6]:

A    512
E    128
D     62
B     24
C     74
X    118
dtype: int64

Split into training and validation data

We’ll just focus on class ‘A’, the call type with the most annotations. We’ll randomly split the clips into training and validation data, acknowledging that this approach does not test the ability of the model to generalize. Since the samples in the training and validation sets could be adjacent 5-second clips from the same recording, good performance could simply mean that the model has memorized the training samples and that the validation set contains very similar samples.

[7]:

labels_train, labels_val = sklearn.model_selection.train_test_split(labels[["A"]])
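One way to address the caveat about adjacent clips is to split by recording, so that all clips from a file land on the same side of the split. The sketch below uses only the standard library; `split_by_file` is a hypothetical helper, and the file names are made up. With the labels table used here, the per-clip file names could presumably be taken from the "file" level of its index; scikit-learn's `GroupShuffleSplit` offers the same idea as a ready-made splitter.

```python
import random

def split_by_file(clip_files, validation_fraction=0.25, seed=0):
    """Return (train_idx, val_idx) such that no file contributes clips to both sets."""
    files = sorted(set(clip_files))
    rng = random.Random(seed)
    rng.shuffle(files)
    n_val = max(1, round(len(files) * validation_fraction))
    val_files = set(files[:n_val])
    train_idx = [i for i, f in enumerate(clip_files) if f not in val_files]
    val_idx = [i for i, f in enumerate(clip_files) if f in val_files]
    return train_idx, val_idx

# toy example: 4 recordings with 3 clips each
clip_files = ["a.wav"] * 3 + ["b.wav"] * 3 + ["c.wav"] * 3 + ["d.wav"] * 3
train_idx, val_idx = split_by_file(clip_files)
train_files = {clip_files[i] for i in train_idx}
val_files = {clip_files[i] for i in val_idx}
print(f"train files: {sorted(train_files)}, validation files: {sorted(val_files)}")
```

A split like this gives a more honest estimate of generalization, at the cost of a less evenly balanced validation set when there are few recordings.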

Train shallow classifiers on embedding model outputs

We’ll train our classifiers on a small annotated dataset, using the HawkEars Embedding Model as the feature extractor. (Perch and BirdNET can be used the same way; see the separate notebook mentioned above.)

Note: HawkEars Embedding model provides a single EfficientNet architecture, whereas the regular “HawkEars” class is an ensemble of 5 CNN architectures. For transfer learning tasks, the HawkEars_Embedding class is recommended.

[8]:

hawk = bmz.HawkEars_Embedding()

Create a shallow classifier that we’ll train with embeddings as inputs. The input size needs to match the size of the embeddings produced by our embedding model, so we read it from the input size of the embedding model’s own classifier layer. HawkEars embeddings are vectors of length 768.

[9]:

embedding_size = hawk.classifier.in_features

class_list = labels_train.columns.tolist()
clf = MLPClassifier(
    input_size=embedding_size,
    output_size=labels_train.shape[1],
    hidden_layer_sizes=(),
    classes=class_list,
)

We can run a single function that will embed the training and validation samples, then train the classifier.

This will take a minute or two, since all of the samples need to be embedded with HawkEars. On a GPU it takes about 30 seconds to embed the samples.

[10]:

emb_train, label_train, emb_val, label_val, metrics = fit_classifier_on_embeddings(
    embedding_model=hawk,
    classifier_model=clf,
    train_df=labels_train,
    validation_df=labels_val,
    steps=1000,
    embedding_batch_size=128,
    embedding_num_workers=num_workers,
)

Embedding the training samples without augmentation

Embedding the validation samples

Fitting the classifier
Step 100/1000, Loss: 0.414, Val Loss: 0.417, val AU ROC: 0.883, val MAP: 0.739
Step 200/1000, Loss: 0.388, Val Loss: 0.374, val AU ROC: 0.886, val MAP: 0.746
Step 300/1000, Loss: 0.316, Val Loss: 0.360, val AU ROC: 0.888, val MAP: 0.750
Step 400/1000, Loss: 0.252, Val Loss: 0.354, val AU ROC: 0.889, val MAP: 0.753
Step 500/1000, Loss: 0.277, Val Loss: 0.351, val AU ROC: 0.891, val MAP: 0.755
Step 600/1000, Loss: 0.267, Val Loss: 0.349, val AU ROC: 0.891, val MAP: 0.756
Step 700/1000, Loss: 0.255, Val Loss: 0.348, val AU ROC: 0.891, val MAP: 0.758
Step 800/1000, Loss: 0.287, Val Loss: 0.348, val AU ROC: 0.892, val MAP: 0.761
Step 900/1000, Loss: 0.253, Val Loss: 0.347, val AU ROC: 0.892, val MAP: 0.762
Step 1000/1000, Loss: 0.302, Val Loss: 0.347, val AU ROC: 0.892, val MAP: 0.763
Loaded best model with validation loss: 0.347 at step 993 of 1000
Training complete
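The final log line reports that the best model, judged by validation loss, was loaded at the end of training. A generic version of that pattern is sketched below on a toy problem. This is an assumption about the general technique, not OpenSoundscape's actual code; `fit_with_best_checkpoint` and its arguments are hypothetical.

```python
import copy

def fit_with_best_checkpoint(params, training_step, validation_loss, steps):
    """Run `steps` updates and return the parameters with the lowest validation loss."""
    best_loss, best_step = float("inf"), 0
    best_params = copy.deepcopy(params)
    for step in range(1, steps + 1):
        params = training_step(params)
        loss = validation_loss(params)
        if loss < best_loss:
            # keep a copy so later updates cannot overwrite the best parameters
            best_loss, best_step = loss, step
            best_params = copy.deepcopy(params)
    return best_params, best_loss, best_step

# toy problem: each step moves w by 0.5; validation loss is lowest at w = 2.0,
# so training past step 4 makes the validation loss worse
best_params, best_loss, best_step = fit_with_best_checkpoint(
    params={"w": 0.0},
    training_step=lambda p: {"w": p["w"] + 0.5},
    validation_loss=lambda p: (p["w"] - 2.0) ** 2,
    steps=10,
)
print(f"best step: {best_step}, validation loss: {best_loss}, w: {best_params['w']}")
```

Selecting the checkpoint on the validation set means the reported validation metrics are slightly optimistic, so a separate held-out test set is useful for a final evaluation.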

[11]:

# using pre-computed embeddings, we can quickly fit a new shallow classifier without needing to re-embed the audio each time
# let's fit a classifier with a 2-layer squeezed architecture (only 10 hidden units)
clf2 = MLPClassifier(
    input_size=embedding_size,
    output_size=labels_train.shape[1],
    hidden_layer_sizes=(10,),
    classes=class_list,
)
clf2.fit(emb_train, label_train, emb_val, label_val, steps=50, logging_interval=25)

Step 25/50, Loss: 0.520, Val Loss: 0.543, val AU ROC: 0.876, val MAP: 0.722
Step 50/50, Loss: 0.411, Val Loss: 0.432, val AU ROC: 0.883, val MAP: 0.739
Loaded best model with validation loss: 0.432 at step 50 of 50
Training complete

[11]:

{'loss': 0.43158774822950363,
 'auroc': 0.8831756478815302,
 'map': 0.7391909378451691,
 'per_class_auroc': [0.8831756478815302]}

Let’s evaluate our first classifier on the validation set:

[12]:

# make sure embeddings are on same device as classifier
emb_val = emb_val.to(next(clf.parameters()).device)

# run the classifier on validation set sample embeddings to get class prediction logit scores
preds = clf(emb_val).detach().cpu().numpy()

# evaluate with threshold agnostic metrics: Average Precision and ROC AUC
print(
    f"average precision score: {average_precision_score(label_val,preds,average=None)}"
)
print(f"area under ROC: {roc_auc_score(label_val,preds,average=None)}")

average precision score: 0.7626967655893723
area under ROC: 0.8918963389551625
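Both metrics depend only on how the scores rank the samples, which is why they can be computed directly from logits without choosing a threshold. The sketch below spells out the two definitions in plain Python for a single class; it is a simplified illustration (assuming no tied scores for average precision), not a replacement for the scikit-learn functions used above.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
print(f"AU-ROC: {toy_auroc:.3f}, average precision: {toy_ap:.3f}")

# any order-preserving rescaling of the scores leaves both metrics unchanged
rescaled = [10 * s - 3 for s in toy_scores]
```

Here one of the two positives is outranked by a negative, giving an AU-ROC of 0.75 and an average precision of about 0.833.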

To visualize the performance, let’s plot histograms of the classifier’s logit scores for positive and negative samples.

The plot shows that precision is good for scores above 0 (few negatives get high scores), but recall is only moderate (many positive samples get low scores).

[13]:

plt.rcParams["figure.figsize"] = [5, 2]
plt.hist(preds[label_val == 1], bins=20, alpha=0.5, label="positives")
plt.hist(preds[label_val == 0], bins=20, alpha=0.5, label="negatives")
_ = plt.legend()
_ = plt.xlabel("logit score")

../_images/tutorials_transfer_learning_27_0.png
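The precision and recall statements above can be made quantitative by counting outcomes at a chosen score threshold. The sketch below uses made-up labels and logits, not the values from this notebook, and `precision_recall` is a hypothetical helper.

```python
import math

def precision_recall(labels, logits, threshold=0.0):
    """Precision and recall when samples scoring above `threshold` are called positive."""
    predicted = [score > threshold for score in logits]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
    fn = sum(1 for y, p in zip(labels, predicted) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# a logit of 0 corresponds to a sigmoid probability of 0.5
probability_at_zero = 1.0 / (1.0 + math.exp(-0.0))

toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_logits = [2.1, 0.4, -0.3, -1.2, -0.2, -0.8, -1.5, -2.0]
strict = precision_recall(toy_labels, toy_logits, threshold=0.0)
lenient = precision_recall(toy_labels, toy_logits, threshold=-0.5)
print(f"threshold  0.0: precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"threshold -0.5: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

Lowering the threshold recovers more positives at the cost of admitting some negatives, which is the trade-off the histogram makes visible.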

Save the shallow classifier for future use

[ ]:

save_dir = Path("./temp")
save_dir.mkdir(exist_ok=True)
clf.save(save_dir / "custom_hawkears_classifier.pt")

We can then load and apply the model to new data (perhaps in a different notebook/script)

[17]:

# create the classifier object from saved file
loaded_clf = MLPClassifier.load(save_dir / "custom_hawkears_classifier.pt")

# we can directly apply it to pre-computed embeddings
preds = loaded_clf(emb_val)

# or we can swap it into a full CNN to apply to audio samples end-to-end, without pre-computing embeddings first
hawk_custom = bmz.HawkEars_Embedding()
hawk_custom.classifier = loaded_clf
hawk_custom.predict(opso.birds_path)

[17]:

			A
file	start_time	end_time
/Users/SML161/opensoundscape/opensoundscape/sample_data/birds_10s.wav	0.0	3.0	-1.407029
	3.0	6.0	-0.982606
	6.0	9.0	-1.813512
	9.0	12.0	-1.774074

Alternatively, we can embed the training and validation sets first, then train as many different shallow classifier variants as we want.

(Note that fit_classifier_on_embeddings returns the embeddings of the training and validation sets, so if you’ve already run that function you don’t need to re-generate the embeddings.)

Generally, embedding may take a while for large datasets, but training the shallow classifier will be very fast because the network is small and there is no preprocessing or data loading.

For example, here we compare fitting classifiers with zero, one, or two hidden layers on the same data:

[18]:

# uncomment the lines below to generate training and validation set embeddings, if you don't have them from the previous cells
# emb_train = hawk.embed(labels_train, return_dfs=False, batch_size=128, num_workers=num_workers)
# emb_val = hawk.embed(labels_val, return_dfs=False, batch_size=128, num_workers=num_workers)

[19]:

%%capture
# define classifiers with zero, one, and two hidden layers, and fit each one
num_classes = 1
trained_classifiers = []
hidden_layer_sizes = [(), (100,), (100, 100)] # for no hidden layers, one hidden layer, two hidden layers
for hidden_layers in hidden_layer_sizes:
    clf = MLPClassifier(embedding_size, num_classes, hidden_layer_sizes=hidden_layers)
    clf.fit(
        emb_train, labels_train.values, emb_val, labels_val.values, steps=1000
    )
    trained_classifiers.append(clf)

Let’s compare the performance of shallow classifiers with one layer (no hidden layers), and with one or two hidden layers

[20]:

# evaluate
for clf, hidden_layers in zip(trained_classifiers, hidden_layer_sizes):
    train_preds = clf(emb_train)
    preds = clf(emb_val)
    print(f"{len(hidden_layers)} Hidden Layers")
    print(
        f"\tTrain AU-ROC: {roc_auc_score(labels_train.values,train_preds.detach().numpy(),average=None):0.3f}"
    )
    print(
        f"\tVal AU-ROC: {roc_auc_score(labels_val.values,preds.detach().numpy(),average=None):0.3f}"
    )
    print(
        f"\tNumber of parameters: {sum(p.numel() for p in clf.parameters() if p.requires_grad):,}"
    )

0 Hidden Layers
        Train AU-ROC: 0.940
        Val AU-ROC: 0.892
        Number of parameters: 769
1 Hidden Layers
        Train AU-ROC: 0.996
        Val AU-ROC: 0.869
        Number of parameters: 77,001
2 Hidden Layers
        Train AU-ROC: 1.000
        Val AU-ROC: 0.854
        Number of parameters: 87,101

We can see that hidden layers allow the classifier to completely over-fit the training data: the training AU-ROC reaches 1.000, but performance on the validation set decreases compared to a model with no hidden layers. Compare the number of parameters each model is fitting: the hidden layers add far more flexibility than a training set of this size can constrain.
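The parameter counts printed above follow directly from the layer sizes. The sketch below assumes plain fully connected layers with bias terms (one weight per input-output pair plus one bias per output); under that assumption, the printed counts are reproduced with an input size of 768. `count_mlp_parameters` is a hypothetical helper, not an OpenSoundscape function.

```python
def count_mlp_parameters(input_size, output_size, hidden_layer_sizes=()):
    """Count weights and biases in a stack of fully connected layers."""
    layer_sizes = [input_size, *hidden_layer_sizes, output_size]
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

counts = {
    hidden: count_mlp_parameters(768, 1, hidden)
    for hidden in [(), (100,), (100, 100)]
}
for hidden, n_params in counts.items():
    print(f"hidden layers {hidden}: {n_params:,} parameters")
```

Adding a single 100-unit hidden layer multiplies the number of trainable parameters by roughly 100, which helps explain the gap between training and validation scores.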

Regularization

Let’s see if we can use strong regularization to prevent over-fitting.

[21]:

%%capture
# define classifier with one hidden layer, and fit
num_classes = 1
trained_classifiers = []
hidden_layers = (100,)
weight_decay_regularization = [0,1e-5,1e-3,0.1]
for weight_decay in weight_decay_regularization:
    clf = MLPClassifier(embedding_size, num_classes, hidden_layer_sizes=hidden_layers)
    optimizer = torch.optim.Adam(clf.parameters(), weight_decay=weight_decay)
    clf.fit(
        emb_train, labels_train.values, emb_val, labels_val.values, steps=1000, optimizer=optimizer
    )
    trained_classifiers.append(clf)

Let’s compare the performance of shallow classifiers with one hidden layer and various levels of regularization.

[22]:

# evaluate
auroc_train = []
auroc_val = []
for clf, weight_decay in zip(trained_classifiers, weight_decay_regularization):
    train_preds = clf(emb_train)
    preds = clf(emb_val)
    auroc_train.append(
        roc_auc_score(labels_train.values, train_preds.detach().numpy(), average=None)
    )
    auroc_val.append(
        roc_auc_score(labels_val.values, preds.detach().numpy(), average=None)
    )
plt.plot(weight_decay_regularization, auroc_train, label="train", marker="o")
plt.plot(weight_decay_regularization, auroc_val, label="val", marker="o")
plt.xscale("log")
plt.xlabel("weight decay regularization")
_ = plt.ylabel("AU-ROC")

../_images/tutorials_transfer_learning_41_0.png

We can see that strong regularization prevents overfitting: the train and validation AU-ROC scores are similar with weight_decay=0.1. However, in this case, the performance is still slightly worse than simply using a 1-layer model.
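To see what the `weight_decay` argument does, it helps to look at a single update. With L2-style weight decay, a term proportional to each weight is added to its gradient, so weights shrink toward zero unless the data gradient pushes back. The sketch below uses a plain gradient step for clarity; it is not the Adam update used above, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(weights, grads, learning_rate, weight_decay=0.0):
    """One gradient step in which the penalty adds weight_decay * w to each gradient."""
    return [w - learning_rate * (g + weight_decay * w) for w, g in zip(weights, grads)]

# with zero data gradient, weight decay alone shrinks every weight by 10% per step
weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]
for _ in range(3):
    weights = sgd_step(weights, zero_grads, learning_rate=1.0, weight_decay=0.1)
print(weights)
```

Larger values of `weight_decay` pull the weights toward zero more strongly, which limits how closely the classifier can fit individual training samples.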

Train on variants of the embeddings generated with audio-space augmentations

Augmentation is a powerful technique for preventing machine learning models from overfitting by increasing the effective size and diversity of the training set.

First, let’s visualize what augmentation is doing by applying augmentation 3 times to each of 3 samples with a Rana sierrae vocalization

[23]:

# visualize augmented samples: generate each of 3 samples 3 times with random augmentations
from opensoundscape.preprocess.utils import show_tensor_grid

plt.rcParams["figure.figsize"] = [8, 8]
positives = labels_train[labels_train["A"] == 1]
samples = [
    hawk.generate_samples(
        positives.sample(3, random_state=3), bypass_augmentations=False
    )
    for i in range(3)
]
samples = [s for sublist in samples for s in sublist]  # flatten list of lists
# note that HawkEars produces samples 'upside down' compared to opensoundscape convention
show_tensor_grid([torch.flip(s.data, dims=[1]) for s in samples], columns=3)
pass

../_images/tutorials_transfer_learning_45_0.png

Now, let’s embed the samples

The fit_classifier_on_embeddings function supports generating variants of training samples with augmentation via the parameter n_augmentation_variants. The default, 0, does not perform augmentation. Specifying a positive integer tells the function to generate each sample n times using stochastic augmentation. The specific augmentations performed are defined by the embedding model’s .preprocessor.

We can also generate the augmented samples directly using opensoundscape.ml.shallow_classifier.augmented_embed, similarly to how we generated embeddings above then trained various models on them. Note that preprocessing and sample loading is repeated for each iteration of augmented data creation, so augmented_embed will take n_augmentation_variants times longer than embedding without augmentation. The benefit is that augmenting the audio samples before embedding tends to improve model performance more than simply augmenting the embeddings themselves (e.g. by adding random noise).

Note that the time to embed the samples will be n_augmentation_variants times the time required to embed without augmentation. In this case, embedding the training samples 4 times on a GPU takes about 2 minutes.
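The bookkeeping behind augmented embedding can be sketched as follows: every training sample is augmented and embedded once per variant, and its label is repeated alongside. The `random_gain` augmentation and `toy_embed` function are hypothetical stand-ins for the embedding model's preprocessor and network, and the ordering of the outputs is an assumption rather than a description of what `augmented_embed` returns.

```python
import random

def augmented_variants(samples, labels, augment, embed, n_augmentation_variants, seed=0):
    """Embed each sample once per variant, applying a fresh random augmentation each time."""
    rng = random.Random(seed)
    embeddings, variant_labels = [], []
    for _ in range(n_augmentation_variants):
        for sample, label in zip(samples, labels):
            embeddings.append(embed(augment(sample, rng)))
            variant_labels.append(label)  # the label is unchanged by augmentation
    return embeddings, variant_labels

def random_gain(sample, rng):
    """Hypothetical augmentation: scale the signal by a random gain."""
    gain = rng.uniform(0.5, 1.5)
    return [gain * value for value in sample]

def toy_embed(sample):
    """Hypothetical embedding: two summary statistics of the signal."""
    return [sum(sample), max(sample)]

toy_samples = [[0.1, 0.2, 0.3], [0.5, 0.4, 0.1]]
toy_labels = [1, 0]
toy_emb, toy_lab = augmented_variants(
    toy_samples, toy_labels, random_gain, toy_embed, n_augmentation_variants=4
)
print(len(toy_emb), toy_lab)
```

Two samples with four variants each yield eight embeddings, which is why both the embedding time and the training set size scale with `n_augmentation_variants`.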

[24]:

from opensoundscape.ml.shallow_classifier import augmented_embed

train_emb_aug, train_label_aug = augmented_embed(
    hawk,
    labels_train,
    batch_size=128,
    num_workers=num_workers,
    n_augmentation_variants=4,
)

/Users/SML161/opensoundscape/opensoundscape/ml/cnn.py:2955: UserWarning: The columns of input samples df differ from `model.classes`. Discarding sample df columns.
  warnings.warn(

We embed the validation set as normal, without any augmentation

[ ]:

# uncomment and run if you don't already have emb_val from previous steps
# emb_val = hawk.embed(labels_val, return_dfs=False, batch_size=128, num_workers=0)

Fitting the classifier on the augmented variants’ embeddings looks the same as before:

[25]:

%%capture
classifier_model = MLPClassifier(embedding_size, 1, hidden_layer_sizes=())
fit(
    classifier_model,
    train_emb_aug,
    train_label_aug,
    emb_val,
    labels_val.values,
    steps=1000,
)

Evaluate on the validation set

[26]:

# torch.as_tensor accepts either a tensor or an array without triggering a copy-construct warning
preds = classifier_model(torch.as_tensor(emb_val).to(torch.device("cpu")))
roc_auc_score(labels_val.values, preds.detach().numpy(), average=None)

[26]:

0.8860345536816125

Fit SKLearn Classifiers on embeddings

scikit-learn provides various classification algorithms as alternatives to the MLPClassifier implemented in OpenSoundscape via PyTorch. It’s straightforward to fit any sklearn model on embeddings:

[27]:

from sklearn.ensemble import RandomForestClassifier

# initialize a random forest class from sklearn
rf = RandomForestClassifier()

# fit the model on training set embeddings
rf.fit(emb_train, labels_train.values[:, 0])

# evaluate on the validation set
preds = rf.predict(emb_val)
roc_auc_score(labels_val.values, preds, average=None)

[27]:

0.7376388317564788
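A caveat about the score above: `predict` returns hard class labels, while AU-ROC is designed for continuous scores. Computed from hard labels, it reflects a single operating point rather than the full ranking. Scikit-learn classifiers such as `RandomForestClassifier` also provide `predict_proba`, whose positive-class column (for example `rf.predict_proba(emb_val)[:, 1]`) can be passed to `roc_auc_score` instead. The toy sketch below, with made-up scores and a hypothetical `auroc` helper, shows how thresholding changes the metric.

```python
def auroc(labels, scores):
    """Fraction of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 0, 1, 1, 1]
toy_probabilities = [0.1, 0.2, 0.45, 0.4, 0.6, 0.9]  # like predict_proba output
toy_hard = [1 if p > 0.5 else 0 for p in toy_probabilities]  # like predict output

auroc_from_scores = auroc(toy_labels, toy_probabilities)
auroc_from_hard = auroc(toy_labels, toy_hard)
print(f"from probabilities: {auroc_from_scores:.3f}, from hard labels: {auroc_from_hard:.3f}")
```

The two values differ because thresholding collapses many distinct scores into ties, so comparisons with the MLP results earlier in this notebook are fairer when continuous scores are used.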

Here’s another example with K nearest neighbors classification:

[28]:

from sklearn.neighbors import KNeighborsClassifier

# initialize classifier
knc = KNeighborsClassifier()

# fit on training set embeddings
knc.fit(emb_train, labels_train.values[:, 0])

# evaluate on validation set
preds = knc.predict(emb_val)
roc_auc_score(labels_val.values, preds, average=None)

[28]:

0.7130604689428219

Fit a classifier that is a layer in an existing OpenSoundscape model

If you have a fully connected layer at the end of an existing OpenSoundscape model, training that layer works similarly to training a separate MLPClassifier object. We can use the fit function to train the layer on pre-generated embeddings (output of previous network layer) to avoid the slow-down associated with preprocessing samples for every training step.

For example, let’s load up a CNN, embed the samples, then fit only its classifier

[29]:

from opensoundscape import CNN

imagenet_pretrained = CNN(
    "resnet18", classes=["A"], sample_duration=2, sample_rate=32000
)
train_emb = imagenet_pretrained.embed(
    labels_train, batch_size=64, num_workers=num_workers
)
val_emb = imagenet_pretrained.embed(labels_val, batch_size=64, num_workers=num_workers)

[30]:

val_emb.shape, labels_val.values.shape, train_emb.shape, labels_train.values.shape

[30]:

((504, 512), (504, 1), (1512, 512), (1512, 1))

[31]:

# modify the last layer of the CNN to have a single output for the class 'A'
# replace fc layer with 1-output layer and initialize with random weights
imagenet_pretrained.change_classes(["A"])

# fit the fc layer within the opso CNN by passing the layer to the `fit` function
metrics = fit(
    model=imagenet_pretrained.classifier,
    train_features=train_emb,
    train_labels=labels_train,
    validation_features=val_emb,
    validation_labels=labels_val,
    steps=400,
)
print("Validation set metrics for best model:")
metrics

Step 100/400, Loss: 0.438, Val Loss: 0.452, val AU ROC: 0.762, val MAP: 0.460
Step 200/400, Loss: 0.468, Val Loss: 0.444, val AU ROC: 0.771, val MAP: 0.459
Step 300/400, Loss: 0.430, Val Loss: 0.440, val AU ROC: 0.777, val MAP: 0.470
Step 400/400, Loss: 0.413, Val Loss: 0.436, val AU ROC: 0.782, val MAP: 0.490
Loaded best model with validation loss: 0.436 at step 400 of 400
Training complete
Validation set metrics for best model:

[31]:

{'loss': 0.4364142268896103,
 'auroc': 0.7820855614973262,
 'map': 0.49007373639027385,
 'per_class_auroc': [0.7820855614973262]}

[32]:

# evaluate:

# can use regular prediction since we modified the in-network classifier
# but this will be slower than just running the fc layer on the embeddings, since it requires
# preprocessing and running the entire CNN architecture forward pass
preds = imagenet_pretrained.predict(labels_val, batch_size=128)
roc_auc_score(labels_val.values, preds, average=None)

[32]:

0.7820855614973262

We could equivalently get the prediction by passing the embeddings through the trained fc layer. The outputs should be equivalent to prediction starting from the audio clips:

[33]:

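# run the fc layer on whichever device the network's weights are stored on
device = next(imagenet_pretrained.network.parameters()).device
preds2 = (
    imagenet_pretrained.network.fc(torch.tensor(val_emb.values).to(device))
    .detach()
    .cpu()
    .numpy()
)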
print(
    f"max difference between running classifier vs emb->classifier: {np.max(np.abs((preds.values - preds2))):0.6f}"
)

max difference between running classifier vs emb->classifier: 0.000000

We could also replace the cnn’s model.network.fc (single fully connected layer) with an MLPClassifier object, if we want the classifier to be more than one fully-connected layer.

Fine tuning

Fine tuning means that you will train the entire deep learning model, with millions of parameters, rather than just using the pre-trained model as a feature extractor and training a classifier on those features. Fine tuning works best when you have thousands of training samples, and can require careful tuning of hyperparameters like the learning rate schedule. In general, though, it provides much more capacity for a model to learn the specific features relevant to a classification task and has the potential to outperform the shallow classification approaches demonstrated above. Over-fitting will be your main enemy here, and the typical tools of augmentation and regularization will be required.

Another advantage of fine-tuning is that we can freely modify the preprocessing parameters. By re-training the model weights, we can adapt the model to understand our new preprocessing settings, e.g. changing the clip duration, spectrogram window samples, or bandpassing range.

Fine-tuning is exactly like training models from scratch - except that you start with the weights of a pre-trained model.

Note that some models provided in the Bioacoustics Model Zoo, like BirdNET and Perch, are provided as “compiled” feature extractors that cannot be re-trained or fine tuned. Typically, the PyTorch models can be fully fine-tuned.

Unlike shallow classification, we have to run forward and backward passes of the entire deep learning model for each sample, repeatedly for each epoch. This makes full training comparatively slow and computation-heavy.

[34]:

model = bmz.HawkEars_Embedding()  # get pre-trained model
model.change_classes(labels_train.columns)

It is typical to set a lower learning rate for the parameters of the feature extractor than for the classification head. It may also be beneficial to first train the classification head with a frozen feature extractor, then train everything together.

We can control the classifier_lr separately from the lr for other parameters using the model.optimizer_params dictionary. In the case of HawkEars, the default from the Bioacoustics Model Zoo already sets a higher learning rate for the classifier than for the feature extractor.

[35]:

model.optimizer_params

[35]:

{'class': torch.optim.adamw.AdamW,
 'kwargs': {'lr': 0.001, 'weight_decay': 0.0005},
 'classifier_lr': 0.01}
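PyTorch optimizers accept a list of parameter-group dictionaries, each with its own `lr`, which is the usual mechanism for giving the classifier a different learning rate from the feature extractor. How OpenSoundscape translates `classifier_lr` into such groups internally is not shown here; the helper below, the `classifier.` name prefix, and the string placeholders for parameter tensors are all hypothetical.

```python
def build_param_groups(named_parameters, lr, classifier_lr, classifier_prefix="classifier."):
    """Split parameters into feature-extractor and classifier groups with separate learning rates."""
    feature_params = [p for name, p in named_parameters if not name.startswith(classifier_prefix)]
    classifier_params = [p for name, p in named_parameters if name.startswith(classifier_prefix)]
    return [
        {"params": feature_params, "lr": lr},
        {"params": classifier_params, "lr": classifier_lr},
    ]

# strings stand in for parameter tensors in this sketch
named_parameters = [
    ("conv1.weight", "w_conv1"),
    ("conv1.bias", "b_conv1"),
    ("classifier.weight", "w_clf"),
    ("classifier.bias", "b_clf"),
]
groups = build_param_groups(named_parameters, lr=0.001, classifier_lr=0.01)
for group in groups:
    print(group)
```

With real tensors, a list in this format could be passed as the first argument to an optimizer such as `torch.optim.AdamW`, so that the freshly initialized classifier adapts faster than the pre-trained feature extractor.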

[36]:

# we could also choose to freeze all of the weights except the classifier
# model.freeze_feature_extractor()

Now we can train the CNN just like any other OpenSoundscape CNN object (see CNN training tutorial notebook for details)

[ ]:

model.train(
    train_df=labels_train,
    validation_df=labels_val,
    epochs=10,  # CNN training is counted in epochs rather than steps
    batch_size=64,
    num_workers=num_workers,
    save_path="./temp",
)

[38]:

# clean up by removing temp dir
import os

for f in glob("./temp/*"):
    os.remove(f)
os.rmdir("./temp")