{
"cells": [
{
"cell_type": "markdown",
"id": "e5d6e344-3ed7-4b0a-aa36-2b83d4842bff",
"metadata": {},
"source": [
"# Train a CNN\n",
"\n",
"Convolutional neural networks (CNNs) are popular tools for creating automated machine learning classifiers on images or image-like samples. By converting audio into a two-dimensional frequency vs. time representation such as a spectrogram, we can generate image-like samples that can be used to train CNNs. \n",
"\n",
"This tutorial demonstrates the basic use of OpenSoundscape's `preprocessors` and `cnn` modules for training CNNs and making predictions using CNNs.\n",
"\n",
"Under the hood, OpenSoundscape uses PyTorch for machine learning tasks. By using the class `opensoundscape.ml.cnn.CNN`, you can train and predict with PyTorch's powerful CNN architectures in just a few lines of code. \n",
"\n",
"## Run this tutorial\n",
"\n",
"This tutorial is more than a reference! It's a Jupyter Notebook which you can run and modify on Google Colab or your own computer.\n",
"\n",
"|Link to tutorial|How to run tutorial|\n",
"| :- | :- |\n",
"| [Open in Colab](https://colab.research.google.com/github/kitzeslab/opensoundscape/blob/master/docs/tutorials/train_cnn.ipynb) | The link opens the tutorial in Google Colab. Uncomment the \"installation\" line in the first cell to install OpenSoundscape. |\n",
"| [Download tutorial](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/kitzeslab/opensoundscape/blob/master/docs/tutorials/train_cnn.ipynb) | The link downloads the tutorial file to your computer. Follow the [Jupyter installation instructions](https://opensoundscape.org/en/latest/installation/jupyter.html), then open the tutorial file in Jupyter. |"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "b52ecca1-702b-4fa3-a48b-61025f55d8fd",
"metadata": {},
"outputs": [],
"source": [
"# if this is a Google Colab notebook, install opensoundscape in the runtime environment\n",
"if 'google.colab' in str(get_ipython()):\n",
" %pip install \"opensoundscape==0.13.0\" \"jupyter-client<8,>=5.3.4\" \"ipykernel==6.17.1\"\n",
" num_workers=0\n",
"else:\n",
" num_workers=4"
]
},
{
"cell_type": "markdown",
"id": "c4d88b73-77d1-4c00-a83a-8466fd79e15e",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"id": "59c9eee8-c65c-4df1-95d0-15dda341ee0a",
"metadata": {},
"source": [
"### Import needed packages"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "972e3e01-c85f-415d-95cc-9b695332f738",
"metadata": {},
"outputs": [],
"source": [
"# the cnn module provides classes for training/predicting with various types of CNNs\n",
"from opensoundscape import CNN\n",
"\n",
"# other utilities and packages\n",
"import torch\n",
"import pandas as pd\n",
"from pathlib import Path\n",
"import numpy as np\n",
"import random \n",
"import subprocess\n",
"from glob import glob\n",
"import sklearn\n",
"\n",
"#set up plotting\n",
"from matplotlib import pyplot as plt\n",
"plt.rcParams['figure.figsize']=[15,5] #for large visuals\n",
"%config InlineBackend.figure_format = 'retina'"
]
},
{
"cell_type": "markdown",
"id": "22adf5d6-403d-4a06-bc85-477cdc60ec07",
"metadata": {},
"source": [
"### Set random seeds\n",
"\n",
"Set manual seeds for PyTorch, Python, and NumPy. These essentially \"fix\" the results of any stochastic steps in model training, ensuring that training results are reproducible. You probably don't want to do this when you actually train your model, but it's useful for debugging."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "68e09bd5-e86d-44e0-8ffa-0f8ee699c31f",
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(0)\n",
"random.seed(0)\n",
"np.random.seed(0)"
]
},
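{
"cell_type": "markdown",
"id": "sketch-seed-reproducibility",
"metadata": {},
"source": [
"To see why fixing seeds makes stochastic steps repeatable, here is a minimal sketch using only Python's built-in `random` module. The helper `draw_samples` is a hypothetical name introduced for illustration and is not part of OpenSoundscape; the same idea applies to the PyTorch and NumPy generators seeded above.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def draw_samples(seed, n=5):\n",
"    # Use a dedicated generator so the global random state is left untouched\n",
"    rng = random.Random(seed)\n",
"    return [rng.random() for _ in range(n)]\n",
"\n",
"\n",
"first_run = draw_samples(seed=0)\n",
"second_run = draw_samples(seed=0)\n",
"other_seed = draw_samples(seed=1)\n",
"\n",
"print(first_run == second_run)  # same seed, same sequence\n",
"print(first_run == other_seed)  # different seed, different sequence\n",
"```\n",
"\n",
"Two generators created with the same seed yield identical sequences, which is what makes a seeded training run repeatable."
]
},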
{
"cell_type": "markdown",
"id": "e1c60bac-280a-4d72-80b6-2659f6ecd83d",
"metadata": {},
"source": [
"### Download files\n",
"\n",
"Training a machine learning model requires some pre-labeled data. These data, in the form of audio recordings or spectrograms, are labeled with whether or not they contain the sound of the species of interest. \n",
"\n",
"These data can be obtained from online databases such as Xeno-Canto.org, or by labeling one's own ARU data using a program like Cornell's Raven sound analysis software. In this example we are using a set of annotated avian soundscape recordings that were annotated using the software Raven Pro 1.6.4 (Bioacoustics Research Program 2022):\n",
"\n",
"> An annotated set of audio recordings of Eastern North American birds containing frequency, time, and species information. Lauren M. Chronister, Tessa A. Rhinehart, Aidan Place, Justin Kitzes.\n",
"> https://doi.org/10.1002/ecy.3329 \n",
"\n",
"These are the same data that are used by the annotation and preprocessing tutorials, so you can skip this step if you've already downloaded them there."
]
},
{
"cell_type": "markdown",
"id": "947448da",
"metadata": {},
"source": [
"### Download example files\n",
"Download a set of example audio files and Raven annotations:\n",
"\n",
"Option 1: run the cell below\n",
"\n",
"- if you get a 403 error, DataDryad suspects you are a bot. Use Option 2. \n",
"\n",
"Option 2:\n",
"\n",
"- Download and unzip both `annotation_Files.zip` and `mp3_Files.zip` from https://datadryad.org/stash/dataset/doi:10.5061/dryad.d2547d81z \n",
"- Move the unzipped contents into a subfolder of the current folder called `./annotated_data/`"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "7d8bf5cf-6c0b-43d6-a3bc-62657597fbec",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2026-03-20 12:45:32-- https://datadryad.org/stash/downloads/file_stream/641805\n",
"Resolving datadryad.org (datadryad.org)... 54.200.119.136, 44.227.122.190, 54.213.102.118, ...\n",
"Connecting to datadryad.org (datadryad.org)|54.200.119.136|:443... connected.\n",
"HTTP request sent, awaiting response... 301 Moved Permanently\n",
"Location: https://datadryad.org/downloads/file_stream/641805 [following]\n",
"--2026-03-20 12:45:33-- https://datadryad.org/downloads/file_stream/641805\n",
"Reusing existing connection to datadryad.org:443.\n",
"HTTP request sent, awaiting response... 403 Forbidden\n",
"2026-03-20 12:45:33 ERROR 403: Forbidden.\n",
"\n",
"--2026-03-20 12:45:33-- https://datadryad.org/stash/downloads/file_stream/641807\n",
"Resolving datadryad.org (datadryad.org)... 44.227.122.190, 54.213.102.118, 54.200.191.246, ...\n",
"Connecting to datadryad.org (datadryad.org)|44.227.122.190|:443... connected.\n",
"HTTP request sent, awaiting response... 301 Moved Permanently\n",
"Location: https://datadryad.org/downloads/file_stream/641807 [following]\n",
"--2026-03-20 12:45:33-- https://datadryad.org/downloads/file_stream/641807\n",
"Reusing existing connection to datadryad.org:443.\n",
"HTTP request sent, awaiting response... 403 Forbidden\n",
"2026-03-20 12:45:33 ERROR 403: Forbidden.\n",
"\n",
"mkdir: annotated_data: File exists\n",
"Archive: annotation_Files.zip\n",
" End-of-central-directory signature not found. Either this file is not\n",
" a zipfile, or it constitutes one disk of a multi-part archive. In the\n",
" latter case the central directory and zipfile comment will be found on\n",
" the last disk(s) of this archive.\n",
"unzip: cannot find zipfile directory in one of annotation_Files.zip or\n",
" annotation_Files.zip.zip, and cannot find annotation_Files.zip.ZIP, period.\n",
"Archive: mp3_Files.zip\n",
" End-of-central-directory signature not found. Either this file is not\n",
" a zipfile, or it constitutes one disk of a multi-part archive. In the\n",
" latter case the central directory and zipfile comment will be found on\n",
" the last disk(s) of this archive.\n",
"unzip: cannot find zipfile directory in one of mp3_Files.zip or\n",
" mp3_Files.zip.zip, and cannot find mp3_Files.zip.ZIP, period.\n"
]
}
],
"source": [
"# Note: the \"!\" preceding each line below allows us to run bash commands in a Jupyter notebook\n",
"# If you are not running this code in a notebook, input these commands into your terminal instead\n",
"!wget -O annotation_Files.zip https://datadryad.org/stash/downloads/file_stream/641805;\n",
"!wget -O mp3_Files.zip https://datadryad.org/stash/downloads/file_stream/641807;\n",
"!mkdir -p annotated_data;\n",
"!unzip annotation_Files.zip -d ./annotated_data/annotation_Files;\n",
"!unzip mp3_Files.zip -d ./annotated_data/mp3_Files;"
]
},
{
"cell_type": "markdown",
"id": "82705d0a-f5f7-4104-8ea7-461ca7f72e4e",
"metadata": {},
"source": [
"## Prepare audio data\n",
"\n",
"To prepare audio data for machine learning, we need to convert our annotated data into clip-level labels.\n",
"\n",
"These steps are covered in depth in other tutorials, so we'll just set our clip labels up quickly for this example.\n",
"\n",
"First, get exactly matched lists of audio files and their corresponding selection files:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "61cbd28e-1e20-4709-95e7-dadf7f8b3f2c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Path to the folder where the dataset was downloaded\n",
"dataset_path = Path(\"./annotated_data/\")\n",
"\n",
"# Make a list of all of the selection table files\n",
"selection_files = glob(f\"{dataset_path}/annotation_Files/*/*.txt\")\n",
"\n",
"# Create a list of audio files, one corresponding to each Raven file\n",
"# (Audio files have the same names as selection files with a different extension)\n",
"audio_files = [\n",
" f.replace(\"annotation_Files\", \"mp3_Files\").replace(\n",
" \".Table.1.selections.txt\", \".mp3\"\n",
" )\n",
" for f in selection_files\n",
"]"
]
},
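{
"cell_type": "markdown",
"id": "sketch-selection-to-audio-path",
"metadata": {},
"source": [
"The pairing above relies on each audio file sharing its name with its selection file. As a rough sketch of that mapping on a single path, the snippet below applies the same two string replacements. The function name `selection_to_audio_path` and the file name `example_clip` are hypothetical and used only for illustration.\n",
"\n",
"```python\n",
"def selection_to_audio_path(selection_path):\n",
"    # Swap the top-level folder and the Raven suffix to get the audio path\n",
"    return selection_path.replace('annotation_Files', 'mp3_Files').replace(\n",
"        '.Table.1.selections.txt', '.mp3'\n",
"    )\n",
"\n",
"\n",
"# Hypothetical file name, used only to illustrate the mapping\n",
"example_selection = (\n",
"    'annotated_data/annotation_Files/Recording_1/'\n",
"    'example_clip.Table.1.selections.txt'\n",
")\n",
"example_audio = selection_to_audio_path(example_selection)\n",
"print(example_audio)\n",
"```\n",
"\n",
"Because the two lists are matched purely by name, it can be worth checking that every derived audio path actually exists (for example with `Path(p).exists()`) before creating annotations."
]
},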
{
"cell_type": "markdown",
"id": "adc6709e-9508-4f08-b1ea-30d8662161b1",
"metadata": {},
"source": [
"Next, convert the selection files and audio files to a `BoxedAnnotations` object, which contains the time, frequency, and label information for all annotations for every recording in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "77f3f7a5-e074-4313-a1bd-6b5a4c98612e",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/SML161/opensoundscape/opensoundscape/annotations.py:347: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" all_annotations_df = pd.concat(all_file_dfs).reset_index(drop=True)\n"
]
}
],
"source": [
"from opensoundscape.annotations import BoxedAnnotations\n",
"\n",
"# Create a BoxedAnnotations object from the Raven selection files\n",
"annotations = BoxedAnnotations.from_raven_files(\n",
" raven_files=selection_files, audio_files=audio_files, annotation_column=\"Species\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "0b8c74cb-3fbf-4f29-8ed5-d62f51b645a4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%capture\n",
"# Parameters to use for label creation\n",
"clip_duration = 3\n",
"clip_overlap = 0\n",
"min_label_overlap = 0.25\n",
"species_of_interest = [\"NOCA\", \"EATO\", \"SCTA\", \"BAWW\", \"BCCH\", \"AMCR\", \"NOFL\"]\n",
"\n",
"# Create dataframe of one-hot labels\n",
"clip_labels = annotations.clip_labels(\n",
" clip_duration = clip_duration, \n",
" clip_overlap = clip_overlap,\n",
" min_label_overlap = min_label_overlap,\n",
" class_subset = species_of_interest # You can comment this line out if you want to include all species.\n",
")"
]
},
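{
"cell_type": "markdown",
"id": "sketch-clip-labels-overlap",
"metadata": {},
"source": [
"To build intuition for how `clip_duration`, `clip_overlap`, and `min_label_overlap` interact, here is a simplified sketch in plain Python. It is not OpenSoundscape's implementation: the helpers `clip_windows` and `label_clips` and the annotation times are assumptions made for illustration, and the library may treat edge cases (such as a final partial clip) differently.\n",
"\n",
"```python\n",
"def clip_windows(total_duration, clip_duration, clip_overlap=0.0):\n",
"    # Start times advance by the clip duration minus the overlap\n",
"    step = clip_duration - clip_overlap\n",
"    windows = []\n",
"    start = 0.0\n",
"    while start + clip_duration <= total_duration:\n",
"        windows.append((start, start + clip_duration))\n",
"        start += step\n",
"    return windows\n",
"\n",
"\n",
"def label_clips(annotations, windows, min_label_overlap):\n",
"    # A clip gets a label when an annotation overlaps it by at least\n",
"    # min_label_overlap seconds\n",
"    labels = []\n",
"    for clip_start, clip_end in windows:\n",
"        present = set()\n",
"        for ann_start, ann_end, species in annotations:\n",
"            overlap = min(clip_end, ann_end) - max(clip_start, ann_start)\n",
"            if overlap >= min_label_overlap:\n",
"                present.add(species)\n",
"        labels.append(present)\n",
"    return labels\n",
"\n",
"\n",
"# Hypothetical annotations: (start_time, end_time, species code)\n",
"annotations = [(2.9, 3.5, 'NOCA'), (7.0, 8.0, 'EATO')]\n",
"windows = clip_windows(total_duration=9, clip_duration=3, clip_overlap=0)\n",
"labels = label_clips(annotations, windows, min_label_overlap=0.25)\n",
"for window, present in zip(windows, labels):\n",
"    print(window, sorted(present))\n",
"```\n",
"\n",
"In this toy example the NOCA annotation overlaps the first clip by only about 0.1 s, so that clip stays unlabeled, while the second clip (0.5 s of overlap) receives the label."
]
},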
{
"cell_type": "code",
"execution_count": 29,
"id": "71d2b3ae-a37b-4e2a-a0c0-4bd41fce40ae",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from opensoundscape.visualization import inspect\n",
"\n",
"print(\"clips with Northern Cardinal (NOCA) labels:\")\n",
"widget = inspect(\n",
" clip_labels[clip_labels[\"NOCA\"] == True], N=4, bandpass_range=(0, 8000)\n",
")\n",
"\n",
"print(\"clips with Northern Flicker (NOFL) labels:\")\n",
"widget = inspect(\n",
" clip_labels[clip_labels[\"NOFL\"] == True], N=4, bandpass_range=(0, 8000)\n",
")\n",
"\n",
"print(f\"clips with none of the selected species ({species_of_interest}) labeled:\")\n",
"widget = inspect(clip_labels[clip_labels.sum(1) == 0], N=4, bandpass_range=(0, 8000))"
]
},
{
"cell_type": "markdown",
"id": "d7ec6fac-fb79-43dc-86c9-d66230189a94",
"metadata": {},
"source": [
"## Create train, validation, and test datasets\n",
"\n",
"To train and test a model, we use three datasets:\n",
"\n",
"* The **training dataset** is used to fit your machine learning model to the audio data. \n",
"* The **validation dataset** is a held-out dataset that is used to select hyperparameters (e.g. how many steps to train for) during training.\n",
"* The **test dataset** is another held-out dataset that we use to check how the model performs on data that were not available at all during training.\n",
"\n",
"While both the training and validation datasets are used while training the model, the test dataset is never touched until the model is fully trained and completed.\n",
"\n",
"The training and validation datasets may be gathered from the same source as each other. In contrast, the test dataset is often gathered from a different source to assess whether the model's performance generalizes to a real-world problem. For example, training and validation data might be drawn from an online database like Xeno-Canto, whereas the testing data is from your own field data. \n",
"\n",
"### Create a test dataset\n",
"\n",
"We'll separate the test dataset first. For a good assessment of the model's generalization, we want the test set to be independent of the training and validation datasets. For example, we don't want to use clips from the same source recording in the training dataset and the test dataset.\n",
"\n",
"For this example, we'll use the recordings in the folders `Recording_1`, `Recording_2` and `Recording_3` as our training and validation data, and use the recordings in folder `Recording_4` as our test data. "
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "d8190cbf-d9ad-400d-ad44-789eead2a656",
"metadata": {},
"outputs": [],
"source": [
"# Select all files from Recording_4 as a test set\n",
"mask = clip_labels.reset_index()[\"file\"].apply(lambda x: \"Recording_4\" in x).values\n",
"test_set = clip_labels[mask]\n",
"\n",
"# All other files will be used for training and validation\n",
"train_and_val_set = clip_labels.drop(test_set.index)\n",
"\n",
"# Save .csv tables of the train/validation and test sets to keep a record of them\n",
"train_and_val_set.to_csv(\"./annotated_data/train_and_val_set.csv\")\n",
"test_set.to_csv(\"./annotated_data/test_set.csv\")"
]
},
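{
"cell_type": "markdown",
"id": "sketch-split-by-recording",
"metadata": {},
"source": [
"The key property of this split is that no source recording contributes clips to both sets. A minimal sketch of that idea on a plain list of paths is shown below; the function `split_by_recording` and the file names are hypothetical. It matches on whole path components, which is slightly stricter than the substring test used above.\n",
"\n",
"```python\n",
"from pathlib import PurePosixPath\n",
"\n",
"\n",
"def split_by_recording(files, test_folder):\n",
"    # Match on whole path components so 'Recording_4' does not match 'Recording_40'\n",
"    test = [f for f in files if test_folder in PurePosixPath(f).parts]\n",
"    train_val = [f for f in files if test_folder not in PurePosixPath(f).parts]\n",
"    return train_val, test\n",
"\n",
"\n",
"# Hypothetical file names for illustration\n",
"files = [\n",
"    'annotated_data/mp3_Files/Recording_1/a.mp3',\n",
"    'annotated_data/mp3_Files/Recording_2/b.mp3',\n",
"    'annotated_data/mp3_Files/Recording_4/c.mp3',\n",
"    'annotated_data/mp3_Files/Recording_4/d.mp3',\n",
"]\n",
"train_val_files, test_files = split_by_recording(files, 'Recording_4')\n",
"\n",
"# No source file should appear in both sets\n",
"assert set(train_val_files).isdisjoint(test_files)\n",
"print(len(train_val_files), len(test_files))\n",
"```\n",
"\n",
"A disjointness check like the final assertion is a cheap safeguard against accidental leakage between the test set and the training data."
]
},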
{
"cell_type": "markdown",
"id": "5b7fe29b-d6e7-4593-9b44-813c5aafb00b",
"metadata": {},
"source": [
"If you wanted, you could load the train/validation and test sets from these saved CSV files."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "81f53802-c25f-4cbe-ab7f-531b80f38cec",
"metadata": {},
"outputs": [],
"source": [
"train_and_val_set = pd.read_csv(\n",
" \"./annotated_data/train_and_val_set.csv\", index_col=[0, 1, 2]\n",
")\n",
"test_set = pd.read_csv(\"./annotated_data/test_set.csv\", index_col=[0, 1, 2])"
]
},
{
"cell_type": "markdown",
"id": "afb99584-33fc-4889-83b5-4c912e3c3188",
"metadata": {},
"source": [
"### Split training and validation datasets\n",
"\n",
"Now, separate the remaining non-test data into training and validation datasets.\n",
"\n",
"The idea of keeping a separate validation dataset is that, throughout training, we can 'peek' at the performance on the validation set to choose hyperparameters. (This is in contrast to the test dataset, which we will not look at until we've finished training our model.)\n",
"\n",
"One important hyperparameter is the number of **steps** to train for, in order to prevent overfitting. Each batch constitutes one training step. \n",
"\n",
"If a model's performance on a training dataset continues to improve as it trains, but its performance on the validation dataset plateaus or decreases, this could indicate the model is **overfitting** on the training dataset, learning information specific to those particular samples instead of gaining the ability to generalize to new data.\n",
"\n",
"In that case, we want to revert to the version of the model with the best validation set performance."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "2f47db9c-bf65-46b9-b64b-040d13ea17e1",
"metadata": {},
"outputs": [],
"source": [
"# Split our training data into training and validation sets\n",
"train_df, valid_df = sklearn.model_selection.train_test_split(\n",
" train_and_val_set, test_size=0.1, random_state=0\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "74268296-4323-46c5-8a47-9f343f77844f",
"metadata": {},
"outputs": [],
"source": [
"train_df.to_csv(\"./annotated_data/train_set.csv\")\n",
"valid_df.to_csv(\"./annotated_data/valid_set.csv\")"
]
},
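{
"cell_type": "markdown",
"id": "sketch-best-validation-epoch",
"metadata": {},
"source": [
"The idea of reverting to the model with the best validation performance can be sketched in a few lines. The score lists below are made up for illustration, and `best_epoch` is a hypothetical helper rather than an OpenSoundscape function.\n",
"\n",
"```python\n",
"def best_epoch(validation_scores):\n",
"    # Return the index of the evaluation with the highest validation score\n",
"    return max(range(len(validation_scores)), key=lambda i: validation_scores[i])\n",
"\n",
"\n",
"# Hypothetical scores: training keeps improving while validation peaks, then declines\n",
"train_scores = [0.70, 0.80, 0.88, 0.93, 0.97]\n",
"valid_scores = [0.68, 0.77, 0.82, 0.81, 0.78]\n",
"\n",
"chosen = best_epoch(valid_scores)\n",
"print(chosen)  # index of the checkpoint to keep\n",
"```\n",
"\n",
"Here the training score is highest at the last evaluation, but the validation score peaks earlier, so the earlier checkpoint is the one to keep."
]
},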
{
"cell_type": "markdown",
"id": "21d30e3e-eda1-4476-8ebf-db4b0844a1d0",
"metadata": {},
"source": [
"### Resample data for even class representation\n",
"\n",
"Before training, we will balance the number of samples of each class in the training set. This helps the model learn all of the classes, rather than paying too much attention to the classes with the most labeled annotations. "
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "5a75f8ae-c81b-4a1b-b62e-87fe1b64eca0",
"metadata": {},
"outputs": [],
"source": [
"from opensoundscape.data_selection import resample\n",
"\n",
"# upsample (repeat samples) so that all classes have 800 samples\n",
"balanced_train_df = resample(train_df, n_samples_per_class=800, random_state=0)"
]
},
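{
"cell_type": "markdown",
"id": "sketch-class-balancing",
"metadata": {},
"source": [
"As a rough illustration of what balancing means, the sketch below resamples each class to a fixed size using only the standard library. It is a simplified single-label version written for this explanation: `balance_classes` and the clip names are hypothetical, and OpenSoundscape's `resample` operates on multi-label dataframes and may differ in its details.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def balance_classes(samples_by_class, n_per_class, seed=0):\n",
"    rng = random.Random(seed)\n",
"    balanced = {}\n",
"    for cls, samples in samples_by_class.items():\n",
"        if len(samples) >= n_per_class:\n",
"            # Downsample: choose n_per_class samples without replacement\n",
"            balanced[cls] = rng.sample(samples, n_per_class)\n",
"        else:\n",
"            # Upsample: keep every sample and repeat random ones to fill the gap\n",
"            extra = rng.choices(samples, k=n_per_class - len(samples))\n",
"            balanced[cls] = list(samples) + extra\n",
"    return balanced\n",
"\n",
"\n",
"# Hypothetical clip names per class\n",
"samples_by_class = {\n",
"    'NOCA': ['n0', 'n1', 'n2', 'n3', 'n4'],\n",
"    'EATO': ['e0', 'e1'],\n",
"}\n",
"balanced = balance_classes(samples_by_class, n_per_class=4)\n",
"print({cls: len(samples) for cls, samples in balanced.items()})\n",
"```\n",
"\n",
"Upsampling repeats existing clips rather than adding new information, so it changes how often each class is seen during training without increasing the variety of examples."
]
},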
{
"cell_type": "markdown",
"id": "a9730295-df2d-4fca-85d8-a7d756b1763f",
"metadata": {},
"source": [
"## Set up model\n",
"\n",
"Now we create a model object. We have to select several parameters when creating this object: its `architecture`, `classes`, and `sample_duration`. \n",
"\n",
"Some additional parameters can also be changed at this step, such as the preprocessor used to create spectrograms and the shape of the spectrograms. \n",
"\n",
"For more detail on this step, see the [\"Customize CNN training\"](tutorials/CNN.html) tutorial.\n"
]
},
{
"cell_type": "markdown",
"id": "fe66d592-fb5b-4e9d-a832-9ae123b9a442",
"metadata": {},
"source": [
"### Create CNN object"
]
},
{
"cell_type": "markdown",
"id": "2c5061ad-3fae-4b00-967e-f1101ff5165e",
"metadata": {},
"source": [
"Now, create a CNN object with the ResNet34 architecture, the classes we put into the dataframe above, and the same sample duration as we selected above.\n",
"\n",
"The first time you run this script for a particular architecture, OpenSoundscape will download the desired architecture."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c61f98fb-0791-4e3d-ab51-ee36ae3e1dd5",
"metadata": {},
"outputs": [],
"source": [
"# Create a CNN object designed to recognize 3-second samples\n",
"from opensoundscape import CNN\n",
"\n",
"# Use resnet34 architecture\n",
"architecture = \"resnet34\"\n",
"\n",
"# Can use this code to get your classes, if needed\n",
"class_list = list(train_df.columns)\n",
"\n",
"model = CNN(\n",
" architecture=architecture,\n",
" classes=class_list,\n",
" sample_duration=clip_duration, # 3s, selected above\n",
" sample_rate=32000,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f92a8de1-3d6b-4f03-bd61-dae8c17f1ddf",
"metadata": {},
"source": [
"### Check model device\n",
"\n",
"If a GPU is available on your computer, the CNN object automatically selects it for accelerating performance. You can override `.device` to use a specific device such as `cpu` or `cuda:3`."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "9de0c6df-d999-4791-b358-312a076f6888",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"model.device is: mps\n"
]
}
],
"source": [
"print(f\"model.device is: {model.device}\")"
]
},
{
"cell_type": "markdown",
"id": "2c901111-323f-485d-bb45-f97a8abedafb",
"metadata": {},
"source": [
"### Set up WandB model logging\n",
"\n",
"While this step is optional, it is very helpful for model training. In this step, we set up model logging on a service called **Weights & Biases** (AKA WandB). \n",
"\n",
"Weights & Biases is a free website you can use to monitor model training. It is integrated with OpenSoundscape to include helpful functions such as checking on your model's training progress in real time, visualizing the spectrograms created for training your model, comparing multiple tries at training the same model, and more. For more information, check out this [blog post](https://wandb.ai/wandb_fc/repo-spotlight/reports/Community-Spotlight-OpenSoundscape--Vmlldzo0MDcwMTI4). \n",
"\n",
"The instructions below will help you set up `wandb` logging:\n",
"\n",
"* Create an account on the [Weights and Biases website](https://wandb.ai/). \n",
"* The first time you use `wandb`, you'll need to run `wandb.login()` in Python or `wandb login` on the command line, then enter the API key from your [settings](https://wandb.ai/settings) page\n",
"* In a Python script where you want to log model training, use `wandb.init()` as demonstrated below. The \"Entity\" or team option allows runs and projects to be shared across members in a group, making it easy to collaborate and see progress of other team members' runs.\n",
"\n",
"\n",
"As training progresses, performance metrics will be plotted to the wandb logging platform and visible on this run's web page. For example, this [wandb web page](https://wandb.ai/kitzeslab/opensoundscape%20training%20demo/runs/w1xyk7zr/workspace?workspace=user-samlapp) shows the content logged to wandb when this notebook was run by the Kitzes Lab. By default, OpenSoundscape + WandB integration creates several pages with information about the model:\n",
"\n",
"- Overview: hyperparameters, run description, and hardware available during the run\n",
"- Charts: \"Samples\" panel with audio and images of preprocessed samples (useful for checking that your preprocessing performs as expected and your labels are correct)\n",
"- Charts: graphs of each class's performance metrics over training time\n",
"- Model: summary of model architecture\n",
"- Logs: standard output of training script\n",
"- System: computational performance metrics including memory, CPU use, etc\n",
"\n",
"When training several models and comparing performance, the \"Project\" page of WandB provides comparisons of metrics and hyperparameters across training runs."
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "113a1a3c-1b0b-4159-83d7-43f7cc1a0d24",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33msamlapp\u001b[0m (\u001b[33mdeepbirddetect\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n"
]
},
{
"data": {
"text/html": [
"creating run (0.6s)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Tracking run with wandb version 0.21.0"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Run data is saved locally in /Users/SML161/opensoundscape/docs/tutorials/wandb/run-20260320_124550-qlbb9o4q"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Syncing run Train CNN to Weights & Biases (docs) "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" View project at https://wandb.ai/kitzeslab/OpenSoundscape%20tutorials"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" View run at https://wandb.ai/kitzeslab/OpenSoundscape%20tutorials/runs/qlbb9o4q"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import wandb\n",
"\n",
"try:\n",
" wandb.login()\n",
" wandb_session = wandb.init(\n",
" entity=\"kitzeslab\", # replace with your entity/group name\n",
" project=\"OpenSoundscape tutorials\",\n",
" name=\"Train CNN\",\n",
" )\n",
"except Exception: # if wandb.login or wandb.init fails, don't use wandb logging\n",
" print(\"failed to create wandb session. wandb session will be None\")\n",
" wandb_session = None"
]
},
{
"cell_type": "markdown",
"id": "f865c2ff-441b-40eb-a6d9-7665452c5add",
"metadata": {},
"source": [
"## Train the CNN\n",
"\n",
"Finally, train the CNN for 30 steps. Typically, we would train the model for more than 30 steps, but because training is slow and is much better done outside of a Jupyter Notebook, we just include this as a short demonstration of training.\n",
"\n",
"Training proceeds in steps, with each step showing the model a **batch** containing a bunch of samples (usually at least 64). The machine learning model predicts on every sample in the batch, then the model weights are updated based on those samples. Larger batches can increase training speed, but require more memory. If you get a memory error, try reducing the batch size.\n",
"\n",
"We use default training parameters, but many aspects of CNN training can be customized (see the \"Customize CNN training\" tutorial for examples)."
]
},
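{
"cell_type": "markdown",
"id": "sketch-steps-per-epoch",
"metadata": {},
"source": [
"The relationship between samples, batches, and steps is simple arithmetic. The sketch below uses a made-up dataset size and assumes the final, smaller batch is kept rather than dropped; `steps_per_epoch` is a hypothetical helper, not an OpenSoundscape function.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def steps_per_epoch(n_samples, batch_size):\n",
"    # The final batch may be smaller than batch_size, so round up\n",
"    return math.ceil(n_samples / batch_size)\n",
"\n",
"\n",
"# Hypothetical dataset size, for illustration only\n",
"n_samples = 1000\n",
"batch_size = 64\n",
"per_epoch = steps_per_epoch(n_samples, batch_size)\n",
"print(per_epoch)  # batches needed to see every sample once\n",
"print(30 / per_epoch)  # number of passes over the data covered by 30 steps\n",
"```\n",
"\n",
"Halving the batch size roughly doubles the number of steps needed for one pass over the data, while lowering the memory needed per step."
]
},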
{
"cell_type": "code",
"execution_count": 39,
"id": "981bffa6-842e-4e76-bbf1-ad92a3a72dee",
"metadata": {},
"outputs": [],
"source": [
"checkpoint_folder = Path(\"model_training_checkpoints\")\n",
"checkpoint_folder.mkdir(exist_ok=True)\n",
"\n",
"# Note: if you don't want to compute and log metrics for each individual class, set\n",
"# model.compute_per_class_metrics = False"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ea86e7f-5533-4815-bf34-31e141002dd2",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Data passed to `wandb.Image` should consist of values in the range [0, 255], image data will be normalized to this range, but behavior will be removed in a future version of wandb.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Training Epoch 0\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "171da26a9cf04fc1903b0ddc29c97729",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/109 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 0 [batch 0/109, 0.00%] \n",
"\tEpoch Running Average Loss: 0.728\n",
"\tMost Recent Batch Loss: 0.728\n",
"Epoch: 0 [batch 100/109, 91.74%] \n",
"\tEpoch Running Average Loss: 0.446\n",
"\tMost Recent Batch Loss: 0.400\n",
"\n",
"Validation.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cb6fe1d67f0f42bc9df7aa34b10d51c9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/8 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Training Epoch 1\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "70e67ad836d44c8da8e48fcb37dba4a5",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/109 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 1 [batch 0/109, 0.00%] \n",
"\tEpoch Running Average Loss: 0.313\n",
"\tMost Recent Batch Loss: 0.313\n",
"Epoch: 1 [batch 100/109, 91.74%] \n",
"\tEpoch Running Average Loss: 0.296\n",
"\tMost Recent Batch Loss: 0.256\n",
"\n",
"Validation.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "80a7d5322828417e80bfa7dd9bdc6746",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/8 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Best Model Appears at Epoch 1 with Validation score 0.910.\n"
]
}
],
"source": [
"# %%capture --no-stdout --no-display\n",
"# Uncomment the line above to silence outputs from this cell\n",
"\n",
"model.train(\n",
" balanced_train_df,\n",
" valid_df,\n",
" steps=30,\n",
" batch_size=64,\n",
" log_interval=100, # log progress every 100 batches\n",
" num_workers=num_workers, # parallelized cpu tasks for preprocessing\n",
" wandb_session=wandb_session,\n",
" save_path=checkpoint_folder, # location to save checkpoints\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2b498f89-d856-45b5-bfe6-b9e94e603ada",
"metadata": {},
"source": [
"Once this is finished running, you have trained the CNN. \n",
"\n",
"To generate predictions on audio files using the CNN, use the `.predict()` method of the CNN object. Here, we apply a sigmoid activation layer which maps the CNN's outputs (all real numbers) to a 0-1 range. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "6bde9106",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0b1ddf34ac7d410c9cb4534cd9f3f0ef",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/5 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
" NOCA \\\n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.045070 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.032290 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.006733 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.101767 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.212231 \n",
"\n",
" EATO \\\n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.400547 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.556500 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.075441 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.144876 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.157232 \n",
"\n",
" SCTA \\\n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.001133 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.000280 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.047699 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.015483 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.001565 \n",
"\n",
" BAWW \\\n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.157310 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.006704 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.021585 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.013932 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.032767 \n",
"\n",
" BCCH \\\n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.027055 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.013564 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.100145 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.113902 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.314256 \n",
"\n",
" AMCR \\\n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.200161 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.000709 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.007114 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.182582 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.001502 \n",
"\n",
" NOFL \n",
"file start_time end_time \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 75.0 78.0 0.033820 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 12.0 15.0 0.004292 \n",
"annotated_data/mp3_Files/Recording_2/Recording_... 60.0 63.0 0.005321 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.005553 \n",
"annotated_data/mp3_Files/Recording_1/Recording_... 9.0 12.0 0.003562 "
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores_df = model.predict(valid_df.head(), activation_layer=\"sigmoid\")\n",
"scores_df.head()"
]
},
{
"cell_type": "markdown",
"id": "8aea4709",
"metadata": {},
"source": [
"## Saving and exporting the model\n",
"\n",
"There are a few different ways we can save the trained model depending on downstream use cases. In general, you can simply use the `model.save(path)` function to save the model in a JSON format that can be reloaded by OpenSoundscape (`opso.load_model(path)` or `CNN.load(path)`). To load the raw dictionary of saved content, use `torch.load(path,weights_only=False)`. \n",
"\n",
"When you want to continue training from a saved model file, it is helpful to use `model.save(pickle=True)` which saves a compressed version of the entire Python object, rather than a JSON-like format. Unlike the default saving method, this saved object retains temporary model training states like the optimizer and learning rate scheduler states. However, note that when you saved a pickled model object, you could encounter issues re-loading it in different Python environments or different versions of OpenSoundscape. After saving a pickled model, you can reload it in the same way as normal: `opso.load_model(path)` or `CNN.load(path)`. \n",
"\n",
"### ONNX Export\n",
"ONNX (Open Neural Network Exchange) is a cross-platform format for representing neural networks on a wide range of hardware and operating systems. Exporting a model to ONNX is useful for edge computing and other inference-only applications where you want independence from PyTorch. Note that not all models can be exported to ONNX. In particular, the supported method for creating a model that can be exported to ONNX is to initialize the model with the TorchSpectrogramPreprocessor class as the preprocessor. This preprocessor is designed for compatability with PyTorch's `torch.onnx.export` function. Here's an example:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a8d82b80",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/SML161/miniconda3/envs/opso_dev/lib/python3.13/site-packages/onnxscript/converter.py:457: DeprecationWarning: Expression.__init__ got an unexpected keyword argument 'lineno'. Support for arbitrary keyword arguments is deprecated and will be removed in Python 3.15.\n",
" expr = ast.Expression(expr, lineno=expr.lineno, col_offset=expr.col_offset)\n",
"/Users/SML161/miniconda3/envs/opso_dev/lib/python3.13/site-packages/onnxscript/converter.py:457: DeprecationWarning: Expression.__init__ got an unexpected keyword argument 'col_offset'. Support for arbitrary keyword arguments is deprecated and will be removed in Python 3.15.\n",
" expr = ast.Expression(expr, lineno=expr.lineno, col_offset=expr.col_offset)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[torch.onnx] Obtain model graph for `ONNXModel()` with `torch.export.export(..., strict=False)`...\n",
"[torch.onnx] Obtain model graph for `ONNXModel()` with `torch.export.export(..., strict=False)`... ✅\n",
"[torch.onnx] Run decomposition...\n",
"[torch.onnx] Run decomposition... ✅\n",
"[torch.onnx] Translate the graph into ONNX...\n",
"[torch.onnx] Translate the graph into ONNX... ✅\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/SML161/miniconda3/envs/opso_dev/lib/python3.13/site-packages/onnx/reference/ops/op_range.py:13: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)\n",
" return (np.arange(starts, ends, steps).astype(starts.dtype),)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Applied 9 of general pattern rewrite rules.\n"
]
}
],
"source": [
"from opensoundscape import CNN, preprocessors\n",
"\n",
"model = CNN(\n",
" architecture=\"efficientnet_b0\",\n",
" classes=[0, 1, 2, 3],\n",
" sample_duration=3,\n",
" preprocessor_cls=preprocessors.TorchSpectrogramPreprocessor,\n",
" sample_rate=32000,\n",
")\n",
"onnx_program = model.save_onnx(\"./opso_efficientnet.onnx\", activation_layer=\"sigmoid\")"
]
},
{
"cell_type": "markdown",
"id": "088f7597",
"metadata": {},
"source": [
"This saves an ONNX program for inference (prediction). The program produces three outputs by default: the pre-processed audio sample (a spectrogram), the embedding layer outputs, and the final classifier outputs. You can turn off any of these outputs using the arguments to `.save_onnx()`. If `model.network.classifier_layer` is not set, the function will not know which layer to use for embeddings, and will instead create a program that only exports the pre-processed sample and the final classifier outputs. \n",
"\n",
"You can also directly use the lower-level functions `opso.export.to_onnx_program` to export custom model classes, or inspect the code in that function to build a custom onnx export method. \n",
"\n",
"The ONNX program can be run in various ways once it is exported. In Python, you can run onnx programs using the onnx_runtime package. Here's a sample script:\n",
"\n",
"\n",
"```python\n",
"import onnx, onnxruntime\n",
"import numpy as np\n",
"\n",
"combined_model = onnx.load(\"opso_efficientnet.onnx\")\n",
"output_names = [node.name for node in combined_model.graph.output]\n",
"\n",
"onnx.checker.check_model(combined_model)\n",
"\n",
"\n",
"EP_list = [\"CPUExecutionProvider\"] # [\"CUDAExecutionProvider\", \"CPUExecutionProvider\"]\n",
"ort_session = onnxruntime.InferenceSession(\"opso_efficientnet.onnx\", providers=EP_list)\n",
"\n",
"# make up some random inputs\n",
"audio_samples_per_input = (\n",
" combined_model.graph.input[0].type.tensor_type.shape.dim[2].dim_value\n",
")\n",
"batch_size = 3\n",
"input_batched = np.random.rand(batch_size, 1, audio_samples_per_input).astype(\n",
" np.float32\n",
")\n",
"\n",
"# compute ONNX Runtime output prediction\n",
"ort_inputs = {ort_session.get_inputs()[0].name: input_batched}\n",
"ort_outs = ort_session.run(None, ort_inputs)\n",
"\n",
"# restore the name-value dictionary mapping of outputs\n",
"outs_dict = {name: ort_outs[i] for i, name in enumerate(output_names)}\n",
"print(f\"shape of outputs for inference on one batch of batch size {batch_size}:\")\n",
"print({k: v.shape for k, v in outs_dict.items()})\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b1946c1f",
"metadata": {},
"source": [
"We don't expect this CNN to actually be good at classifying sounds, since we only trained it with a few examples and for a few steps. We'd want to train with hundreds of examples per class for 1,000 steps as a starting point for training a useful model. \n",
"\n",
"For guidance on how to use machine learning classifiers, see the Classifieres 101 Guide on opensoundscape.org and the tutorial on predicting with pre-trained CNNs.\n",
"\n",
"For transfer learning from pre-trained CNNs, see the transfer learning tutorial notebook.\n",
"\n",
"**Clean up:** Run the following cell to delete the files created in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "440ca518-abcd-4bac-94e8-12ff8b8e46b1",
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"from pathlib import Path\n",
"\n",
"# uncomment to remove the training files\n",
"# shutil.rmtree('./annotated_data')\n",
"\n",
"if Path(\"./wandb\").exists():\n",
" shutil.rmtree(\"./wandb\")\n",
"if Path(\"./model_training_checkpoints\").exists():\n",
" shutil.rmtree(\"./model_training_checkpoints\")\n",
"try:\n",
" Path(\"opso_efficientnet.onnx\").unlink()\n",
"except:\n",
" pass\n",
"try:\n",
" Path(\"annotation_Files.zip\").unlink()\n",
"except:\n",
" pass\n",
"try:\n",
" Path(\"mp3_Files.zip\").unlink()\n",
"except:\n",
" pass"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "opso_dev",
"language": "python",
"name": "opso_dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}