API Documentation


audio.py: Utilities for dealing with audio files

class opensoundscape.audio.Audio(samples, sample_rate)

Container for audio samples

bandpass(low_f, high_f, order=9)

bandpass audio signal frequencies

uses a phase-preserving algorithm (scipy.signal’s butter and sosfiltfilt)

  • low_f – low frequency cutoff (-3 dB) in Hz of bandpass filter
  • high_f – high frequency cutoff (-3 dB) in Hz of bandpass filter
  • order – butterworth filter order (integer) ~= steepness of cutoff

Return duration of Audio

duration (float): The duration of the Audio
classmethod from_bytesio(bytesio, sample_rate=None, resample_type='kaiser_fast')

classmethod from_file(path, sample_rate=None, max_duration=None, resample_type='kaiser_fast')

Load audio from files

Deal with the various possible input types to load an audio file and generate a spectrogram

  • path (str, Path) – path to an audio file
  • sample_rate (int, None) – resample audio with value and resample_type, if None use source sample_rate (default: None)
  • resample_type – method used to resample (default: kaiser_fast)
  • max_duration – the maximum length of an input file, None is no maximum (default: None)

attributes samples and sample_rate




save Audio to file

Parameters:path – destination for output

create frequency spectrum from an Audio object using fft

Returns:fft, frequencies

Given a time, convert it to the corresponding sample

Parameters:time – The time to multiply with the sample_rate
Returns:The rounded sample
Return type:sample
trim(start_time, end_time)

trim Audio object in time

  • start_time – time in seconds for start of extracted clip
  • end_time – time in seconds for end of extracted clip

a new Audio object containing samples from start_time to end_time
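
As a rough illustration of how the time-to-sample conversion and trim relate, the sketch below multiplies a time by the sample rate, rounds it, and uses the result to slice a plain list. The helper names `time_to_sample_sketch` and `trim_sketch` are hypothetical and work on lists, not on Audio objects; clamping a negative start to 0 is an assumption.

```python
def time_to_sample_sketch(time, sample_rate):
    """Convert a time in seconds to the nearest sample index."""
    return int(round(time * sample_rate))


def trim_sketch(samples, sample_rate, start_time, end_time):
    """Return the samples between start_time and end_time (seconds)."""
    # Assumption: a negative start index is clamped to the first sample
    start = max(0, time_to_sample_sketch(start_time, sample_rate))
    end = time_to_sample_sketch(end_time, sample_rate)
    return samples[start:end]


# One second of a dummy signal at 10 samples per second
samples = list(range(10))
clip = trim_sketch(samples, sample_rate=10, start_time=0.2, end_time=0.5)
```

Here `clip` holds the samples at indices 2, 3, and 4.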

exception opensoundscape.audio.OpsoLoadAudioInputError

Custom exception indicating we can’t load input

exception opensoundscape.audio.OpsoLoadAudioInputTooLong

Custom exception indicating length of audio is too long

Audio Tools

audio_tools.py: set of tools that filter or modify audio files or sample arrays (not Audio objects)

opensoundscape.audio_tools.bandpass_filter(signal, low_f, high_f, sample_rate, order=9)

perform a butterworth bandpass filter on a discrete time signal using scipy.signal’s butter and sosfiltfilt (phase-preserving version of sosfilt)

  • signal – discrete time signal (audio samples, list of float)
  • low_f – -3 dB point for highpass filter (Hz)
  • high_f – -3 dB point for lowpass filter (Hz)
  • sample_rate – samples per second (Hz)
  • order=9 – higher values -> steeper dropoff

filtered time signal

opensoundscape.audio_tools.butter_bandpass(low_f, high_f, sample_rate, order=9)

generate coefficients for bandpass_filter()

  • low_f – low frequency of butterworth bandpass filter
  • high_f – high frequency of butterworth bandpass filter
  • sample_rate – audio sample rate
  • order=9 – order of butterworth filter

set of coefficients used in sosfiltfilt()

opensoundscape.audio_tools.clipping_detector(samples, threshold=0.6)

count the number of samples above a threshold value

  • samples – a time series of float values
  • threshold=0.6 – minimum value of sample to count as clipping

number of samples exceeding threshold
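
A minimal sketch of the counting step described above, written in plain Python. The name `clipping_detector_sketch` is hypothetical, and the comparison is an assumption: the sketch counts values strictly greater than the threshold and does not take absolute values, which may differ from the library's behavior.

```python
def clipping_detector_sketch(samples, threshold=0.6):
    """Count samples whose value is strictly greater than threshold."""
    # Assumption: no absolute value is taken, so large negative samples are not counted
    return sum(1 for s in samples if s > threshold)


samples = [0.1, 0.7, -0.9, 0.61, 0.6]
n_clipped = clipping_detector_sketch(samples)
```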

opensoundscape.audio_tools.convolve_file(in_file, out_file, ir_file, input_gain=1.0)

apply an impulse_response to a file using ffmpeg’s afir convolution

ir_file is an audio file containing a short burst of noise recorded in a space whose acoustics are to be recreated

this makes the file ‘sound as if’ it were recorded in the location where the impulse response (ir_file) was recorded

  • in_file – path to an audio file to process
  • out_file – path to save output to
  • ir_file – path to impulse response file
  • input_gain=1.0 – ratio for in_file sound’s amplitude in (0,1)

os response of ffmpeg command

opensoundscape.audio_tools.mixdown_with_delays(files_to_mix, destination, delays=None, levels=None, duration='first', verbose=0, create_txt_file=False)

use ffmpeg to mixdown a set of audio files, each starting at a specified time (padding beginnings with zeros)

  • files_to_mix – list of audio file paths
  • destination – path to save mixdown to
  • delays=None – list of delays (how many seconds of zero-padding to add at beginning of each file)
  • levels=None – optionally provide a list of relative levels (amplitudes) for each input
  • duration='first' – ffmpeg option for duration of output file: match duration of ‘longest’, ‘shortest’, or ‘first’ input file
  • verbose=0 – if >0, prints ffmpeg command and doesn’t suppress ffmpeg output (command line output is returned from this function)
  • create_txt_file=False – if True, also creates a second output file which lists all files that were included in the mixdown

ffmpeg command line output

opensoundscape.audio_tools.silence_filter(filename, smoothing_factor=10, window_len_samples=256, overlap_len_samples=128, threshold=None)

Identify whether a file is silent (0) or not (1)

Load samples from an mp3 file and identify whether or not it is likely to be silent. Silence is determined by finding the energy in windowed regions of these samples, and normalizing the detected energy by the average energy level in the recording.

If any windowed region has energy above the threshold, returns 1; otherwise returns 0.

  • filename (str) – file to inspect
  • smoothing_factor (int) – modifier to window_len_samples
  • window_len_samples – number of samples per window segment
  • overlap_len_samples – number of samples to overlap each window segment
  • threshold – threshold value (experimentally determined)

0 if file contains no significant energy over background; 1 if file contains significant energy over background

If threshold is None: returns net_energy over background noise

opensoundscape.audio_tools.window_energy(samples, window_len_samples=256, overlap_len_samples=128)

Calculate audio energy with a sliding window

Calculate the energy in an array of audio samples

  • samples (np.ndarray) – array of audio samples loaded using librosa.load
  • window_len_samples – samples per window
  • overlap_len_samples – number of samples shared between consecutive windows

list of energy level (float) for each window
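
The sliding-window calculation can be sketched in plain Python as follows. The name `window_energy_sketch` is hypothetical, and the energy definition (mean of squared samples per window) is an assumption; the library may define energy differently. Trailing samples that do not fill a whole window are dropped here.

```python
def window_energy_sketch(samples, window_len_samples=256, overlap_len_samples=128):
    """Return one energy value per window, sliding by (window - overlap) samples."""
    hop = window_len_samples - overlap_len_samples
    if hop <= 0:
        raise ValueError("overlap must be smaller than the window length")
    energies = []
    for start in range(0, len(samples) - window_len_samples + 1, hop):
        window = samples[start:start + window_len_samples]
        # Assumption: energy is the mean of the squared samples in the window
        energies.append(sum(s * s for s in window) / window_len_samples)
    return energies


energies = window_energy_sketch(
    [1.0, 1.0, 0.0, 0.0, 2.0, 2.0], window_len_samples=4, overlap_len_samples=2
)
```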



Run a command returning output, error

cmd: A string containing some command
(stdout, stderr): A tuple of standard out and standard error

Run a command returning the return code

cmd: A string containing some command
return_code: The return code of the function




Get the default configuration file as a dictionary

dict: A dictionary containing the default Opensoundscape configuration

Validate a configuration string

config: A string containing an Opensoundscape configuration
dict: A dictionary of the validated Opensoundscape configuration

Validate a configuration file

fname: A filename containing an Opensoundscape configuration
dict: A dictionary of the validated Opensoundscape configuration

Console Checks

Utilities related to console checks on docopt args


console.py: Entrypoint for opensoundscape


Run sphinx-build for our project


The Opensoundscape entrypoint for console interaction

Data Selection

opensoundscape.data_selection.binary_train_valid_split(input_df, label, label_column='Labels', train_size=0.8, random_state=101)

Split a dataset into train and validation dataframes

Given a Dataframe and a label in column “Labels” (singly labeled) generate a train dataset with ~80% of each label and a valid dataset with the rest.

  • input_df – A singly-labeled CSV file
  • label – One of the labels in the column label_column to use as a positive label (1), all others are negative (0)
  • label_column – Name of the column that labels should come from [default: “Labels”]
  • train_size – The decimal fraction to use for the training set [default: 0.8]
  • random_state – The random state to use for train_test_split [default: 101]
train_df: A Dataframe containing the training set
valid_df: A Dataframe containing the validation set

Given a multi-labeled dataframe, generate a singly-labeled dataframe

Given a Dataframe with a “Labels” column that is multi-labeled (e.g. “hello|world”) split the row into singly labeled rows.

Parameters:input_df – A Dataframe with a multi-labeled “Labels” column (separated by “|”)
output_df: A Dataframe with singly-labeled “Labels” column
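
To illustrate the splitting rule, here is a plain-Python sketch that treats each row as a dictionary rather than using a DataFrame. The function name `expand_multi_labeled_sketch` and the "Source" column are hypothetical; only the “Labels” column and the “|” separator come from the description above.

```python
def expand_multi_labeled_sketch(rows, label_column="Labels", sep="|"):
    """Return one row per label for rows whose label field holds several labels."""
    output = []
    for row in rows:
        for label in row[label_column].split(sep):
            new_row = dict(row)  # copy so the input rows are not modified
            new_row[label_column] = label
            output.append(new_row)
    return output


rows = [
    {"Source": "a.wav", "Labels": "hello|world"},
    {"Source": "b.wav", "Labels": "hello"},
]
singly_labeled = expand_multi_labeled_sketch(rows)
```

The first row is expanded into two rows (one for "hello", one for "world"), while the second row is kept as is.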
opensoundscape.data_selection.upsample(input_df, label_column='Labels', random_state=None)

Given an input DataFrame, upsample to maximum value

Upsampling removes the class imbalance in your dataset. Rows for each label are repeated up to max_count // rows. Then, we randomly sample the rows to fill up to max_count.

  • input_df – A DataFrame to upsample
  • label_column – The column to draw unique labels from
  • random_state – Set the random_state during sampling
df: An upsampled DataFrame
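
The two-step procedure (repeat whole groups, then randomly fill the remainder) can be sketched without pandas as shown below. This is an illustration under assumptions, not the library's implementation: rows are dictionaries, the name `upsample_sketch` is hypothetical, and the remainder is drawn without replacement.

```python
import random
from collections import defaultdict


def upsample_sketch(rows, label_column="Labels", random_state=None):
    """Repeat and sample rows so every label occurs as often as the most common one."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)
    max_count = max(len(group) for group in by_label.values())

    output = []
    for group in by_label.values():
        # Step 1: repeat the whole group max_count // len(group) times
        upsampled = group * (max_count // len(group))
        # Step 2: fill the remainder with a random sample (without replacement)
        upsampled += rng.sample(group, max_count - len(upsampled))
        output.extend(upsampled)
    return output


rows = [{"Labels": "a", "id": i} for i in range(5)] + [{"Labels": "b", "id": i} for i in range(2)]
upsampled = upsample_sketch(rows, random_state=0)
```

With five "a" rows and two "b" rows, the result contains five rows of each label.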


class opensoundscape.datasets.SingleTargetAudioDataset(df, label_dict, filename_column='Destination', from_audio=True, label_column=None, height=224, width=224, add_noise=False, debug=None, random_trim_length=None, max_overlay_num=0, overlay_prob=0.2, overlay_weight='random')

Single Target Audio -> Image Dataset

Given a DataFrame with audio files in one of the columns, generate a Dataset of spectrogram images for basic machine learning tasks.

This class provides access to several types of augmentations that act on audio and images with the following arguments:

  • add_noise – for adding RandomAffine and ColorJitter noise to images
  • random_trim_length – for only using a short random clip extracted from the training data
  • max_overlay_num / overlay_prob / overlay_weight – controlling the maximum number of additional spectrograms to overlay, the probability of overlaying an individual spectrogram, and the weight for the weighted sum of the spectrograms

Additional augmentations on tensors are available when calling train() from the module opensoundscape.torch.train.


  • df – A DataFrame with a column containing audio files
  • label_dict – a dictionary mapping numeric labels to class names, for example: {0:‘American Robin’,1:‘Northern Cardinal’}; pass None if you wish to retain numeric labels
  • filename_column – The column in the DataFrame which contains paths to data [default: Destination]
  • from_audio – Whether the raw dataset is audio [default: True]
  • label_column – The column with numeric labels if present [default: None]
  • height – Height for resulting Tensor [default: 224]
  • width – Width for resulting Tensor [default: 224]
  • add_noise – Apply RandomAffine and ColorJitter filters [default: False]
  • debug – Save images to a directory [default: None]
  • random_trim_length – Extract a clip of this many seconds of audio starting at a random time. If None, the original clip will be used [default: None]
  • max_overlay_num – the maximum number of additional images to overlay, each with probability overlay_prob [default: 0]
  • overlay_prob – Probability of an image from a different class being overlaid (combined as a weighted sum) on the training image. Typical values: 0, 0.66 [default: 0.2]
  • overlay_weight – the weight given to the overlaid image during augmentation. When ‘random’, will randomly select a different weight between 0.2 and 0.5 for each overlay. When not ‘random’, should be a float between 0 and 1 [default: ‘random’]

{ “X”: (3, H, W) , “y”: (1) if label_column != None }
image_from_audio(audio, mode='RGB')

Create a PIL image from audio

  • audio – audio object
  • mode – PIL image mode, e.g. “L” or “RGB” [default: RGB]

overlay_random_image(original_image, original_length, original_class, original_path)

Overlay an image from another class

Select a random file from a different class. Trim if necessary to the same length as the given image. Overlay the images on top of each other with a weight

class opensoundscape.datasets.SplitterDataset(wavs, annotations=False, label_corrections=None, overlap=1, duration=5, output_directory='segments', include_last_segment=False)

A PyTorch Dataset for splitting WAV files

  • wavs – A list of WAV files to split
  • annotations – Should we search for corresponding annotations files? (default: False)
  • label_corrections – Specify a correction labels CSV file w/ column headers “raw” and “corrected” (default: None)
  • overlap – How much overlap should there be between samples (units: seconds, default: 1)
  • duration – How long should each segment be? (units: seconds, default: 5)
  • output_directory – Where should segments be written? (default: segments/)
  • include_last_segment – Do you want to include the last segment? (default: False)

  • Segments will be written to the output_directory
output: A list of CSV rows containing the source audio, segment begin time (seconds), segment end time (seconds), segment audio, and present classes separated by ‘|’ if annotations were requested
opensoundscape.datasets.annotations_with_overlaps_with_clip(df, begin, end)

Determine if any rows overlap with current segment

  • df – A dataframe containing a Raven annotation file
  • begin – The begin time of the current segment (unit: seconds)
  • end – The end time of the current segment (unit: seconds)
sub_df: A dataframe of annotations which overlap with the begin/end times
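
The overlap test itself is a standard interval comparison, sketched below on a list of dictionaries instead of a dataframe. The function name and the keys `begin_time`, `end_time`, and `class` are hypothetical stand-ins for the annotation file's columns, and treating intervals that only touch at an endpoint as non-overlapping is an assumption.

```python
def annotations_overlapping_clip_sketch(annotations, begin, end):
    """Keep annotations whose time interval overlaps the segment [begin, end]."""
    # Two intervals overlap when each one starts before the other one ends
    return [a for a in annotations if a["begin_time"] < end and a["end_time"] > begin]


annotations = [
    {"begin_time": 0.0, "end_time": 2.0, "class": "a"},
    {"begin_time": 4.5, "end_time": 6.0, "class": "b"},
    {"begin_time": 10.0, "end_time": 11.0, "class": "c"},
]
overlapping = annotations_overlapping_clip_sketch(annotations, begin=5.0, end=10.0)
```

Only the annotation for class "b" overlaps the 5 to 10 second segment.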

Generate MD5 sum for a string

input_string: An input string
output: A string containing the md5 hash of input string
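
A minimal sketch of hashing a string with Python's standard hashlib module. The name `md5_of_string_sketch` is hypothetical, and encoding the string as UTF-8 before hashing is an assumption.

```python
import hashlib


def md5_of_string_sketch(input_string):
    """Return the hexadecimal MD5 digest of a string."""
    # Assumption: the string is encoded as UTF-8 before hashing
    return hashlib.md5(input_string.encode("utf-8")).hexdigest()


digest = md5_of_string_sketch("hello")
```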

Grad Cam


opensoundscape.helpers.binarize(x, threshold)

return a list of 0, 1 by thresholding vector x

opensoundscape.helpers.bound(x, bounds)

restrict x to a range of bounds = [min, max]
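
Both helpers are short enough to sketch in plain Python. The names `binarize_sketch` and `bound_sketch` are hypothetical, and the strict comparison in the thresholding step (values equal to the threshold map to 0) is an assumption.

```python
def binarize_sketch(x, threshold):
    """Map each value to 1 if it exceeds threshold, otherwise 0."""
    # Assumption: values exactly equal to the threshold map to 0
    return [1 if value > threshold else 0 for value in x]


def bound_sketch(x, bounds):
    """Clamp x to the closed range bounds = [min, max]."""
    low, high = bounds
    return min(max(x, low), high)


labels = binarize_sketch([0.1, 0.5, 0.9], threshold=0.5)
clamped = bound_sketch(5, [0, 3])
```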


get file name without extension from a path


convert a hexadecimal, Unix time string to a datetime timestamp


check for nan by equating x to itself

opensoundscape.helpers.jitter(x, width, distribution='gaussian')

Jitter (add random noise to) each value of x

  • x – scalar, array, or nd-array of numeric type
  • width – multiplier for random variable (stdev for ‘gaussian’ or r for ‘uniform’)
  • distribution – ‘gaussian’ (default) or ‘uniform’. If ‘gaussian’: draw jitter from gaussian with mu = 0, std = width. If ‘uniform’: draw jitter from uniform on [-width, width]

x + random jitter
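
The two distributions can be sketched with the standard random module as shown below. This version handles only scalars and flat lists, not nd-arrays; the name `jitter_sketch` and the `rng` argument (added so results can be reproduced) are hypothetical.

```python
import random


def jitter_sketch(x, width, distribution="gaussian", rng=None):
    """Add random noise to a scalar or to each value of a flat list."""
    rng = rng if rng is not None else random.Random()

    def draw():
        if distribution == "gaussian":
            return rng.gauss(0.0, width)  # mu = 0, std = width
        if distribution == "uniform":
            return rng.uniform(-width, width)  # uniform on [-width, width]
        raise ValueError(f"unknown distribution: {distribution!r}")

    if isinstance(x, (int, float)):
        return x + draw()
    return [value + draw() for value in x]


values = [1.0, 2.0, 3.0]
jittered = jitter_sketch(values, width=0.1, distribution="uniform", rng=random.Random(0))
```

With the uniform distribution, every jittered value stays within `width` of its original value.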



opensoundscape.helpers.linear_scale(array, in_range=(0, 1), out_range=(0, 255))

Translate from range in_range to out_range

  • in_range – The starting range [default: (0, 1)]
  • out_range – The output range [default: (0, 255)]
new_array: A translated array
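
The translation between ranges is a standard linear map, sketched here for a flat list. The name `linear_scale_sketch` is hypothetical, and values outside in_range are not clipped in this sketch.

```python
def linear_scale_sketch(array, in_range=(0, 1), out_range=(0, 255)):
    """Linearly map values from in_range to out_range."""
    in_low, in_high = in_range
    out_low, out_high = out_range
    scale = (out_high - out_low) / (in_high - in_low)
    return [out_low + (value - in_low) * scale for value in array]


scaled = linear_scale_sketch([0, 0.5, 1])
```

With the default ranges, 0 maps to 0, 0.5 maps to 127.5, and 1 maps to 255.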
opensoundscape.helpers.min_max_scale(array, feature_range=(0, 1))

rescale values in an array linearly to feature_range

opensoundscape.helpers.rescale_features(X, rescaling_vector=None)

rescale all features by dividing by the max value for each feature

optionally provide the rescaling vector (1xlen(X) np.array), so that you can rescale a new dataset consistently with an old one

returns rescaled feature set and rescaling vector


run a bash command with Popen, return response


sigmoid function



Calculate speed of sound in meters per second

Calculate speed of sound for a given temperature in Celsius (Humidity has a negligible effect on speed of sound and so this functionality is not implemented)

Parameters:temperature – ambient temperature in Celsius
Returns:the speed of sound in meters per second
opensoundscape.localization.localize(receiver_positions, arrival_times, temperature=20.0, invert_alg='gps', center=True, pseudo=True)

Perform TDOA localization on a sound event

Localize a sound event given relative arrival times at multiple receivers. This function implements a localization algorithm from the equations described in the class handout (“Global Positioning Systems”). Localization can be performed in a global coordinate system in meters (i.e., UTM), or relative to recorder positions in meters.

  • receiver_positions – a list of [x,y,z] positions for each receiver Positions should be in meters, e.g., the UTM coordinate system.
  • arrival_times – a list of TDOA times (onset times) for each recorder The times should be in seconds.
  • temperature – ambient temperature in Celsius
  • invert_alg – what inversion algorithm to use
  • center – whether to center recorders before computing localization result. Computes localization relative to centered plot, then translates solution back to original recorder locations. (For behavior of original Sound Finder, use True)
  • pseudo – whether to use the pseudorange error (True) or sum of squares discrepancy (False) to pick the solution to return (For behavior of original Sound Finder, use False. However, in initial tests, pseudorange error appears to perform better.)

The solution (x,y,z,b) with the lower sum of squares discrepancy. b is the error in the pseudorange (distance to mics), b=c*delta_t (delta_t is time error)

opensoundscape.localization.lorentz_ip(u, v=None)

Compute Lorentz inner product of two vectors

For vectors u and v, the Lorentz inner product for 3-dimensional case is defined as

u[0]*v[0] + u[1]*v[1] + u[2]*v[2] - u[3]*v[3]

Or, for 2-dimensional case as

u[0]*v[0] + u[1]*v[1] - u[2]*v[2]
u: vector with shape either (3,) or (4,)
v: vector with same shape as u; if None (default), sets v = u
float: value of Lorentz IP
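
Both formulas follow the same pattern: sum the products of all but the last components, then subtract the product of the last components. A plain-Python sketch (the name `lorentz_ip_sketch` is hypothetical):

```python
def lorentz_ip_sketch(u, v=None):
    """Lorentz inner product: spatial dot product minus the product of the last components."""
    if v is None:
        v = u
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    spatial = sum(a * b for a, b in zip(u[:-1], v[:-1]))
    return spatial - u[-1] * v[-1]


ip_self = lorentz_ip_sketch([1, 2, 3, 4])        # 1 + 4 + 9 - 16
ip_pair = lorentz_ip_sketch([1, 2, 3], [4, 5, 6])  # 4 + 10 - 18
```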
opensoundscape.localization.travel_time(source, receiver, speed_of_sound)

Calculate time required for sound to travel from a source to a receiver

  • source – cartesian position [x,y] or [x,y,z] of sound source
  • receiver – cartesian position [x,y] or [x,y,z] of sound receiver
  • speed_of_sound – speed of sound in m/s

time in seconds for sound to travel from source to receiver
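
The travel time is the Euclidean distance divided by the speed of sound. The sketch below also includes a temperature-dependent speed of sound using a common dry-air approximation; that formula is an assumption, since the source does not show the expression the library uses, and both function names are hypothetical.

```python
import math


def speed_of_sound_sketch(temperature=20.0):
    """Approximate speed of sound in dry air (m/s) for a temperature in Celsius."""
    # Assumption: common ideal-gas approximation, not necessarily the library's formula
    return 331.3 * math.sqrt(1 + temperature / 273.15)


def travel_time_sketch(source, receiver, speed_of_sound):
    """Time in seconds for sound to travel between two cartesian positions."""
    return math.dist(source, receiver) / speed_of_sound


c = speed_of_sound_sketch(20.0)
t = travel_time_sketch([0, 0], [3, 4], speed_of_sound=343.0)
```

At 20 degrees Celsius the approximation gives a speed of roughly 343 m/s, and a source 5 m from the receiver corresponds to a travel time of 5/343 seconds.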


Pulse Finder

PyTorch Prediction

DEPRECATED: use opensoundscape.torch.predict instead

These functions are currently used only to support localization.py. The module contains a pytorch prediction function (deprecated) and some additional functionality for using gradcam.

opensoundscape.pytorch_prediction.activation_region_limits(gcam, threshold=0.2)

calculate bounds of a GradCam activation region

  • gcam – a 2-d array gradcam activation array generated by gradcam_region()
  • threshold=0.2 – minimum value of gradcam (0-1) to count as ‘activated’

[ [min row, max_row], [min_col, max_col] ] indices of gradcam elements exceeding threshold
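
Finding the bounds amounts to collecting the row and column indices of all elements above the threshold and taking their minima and maxima. The sketch below uses nested lists instead of arrays; the name `activation_region_limits_sketch` is hypothetical, and returning None when nothing exceeds the threshold is an assumption.

```python
def activation_region_limits_sketch(gcam, threshold=0.2):
    """Return [[min_row, max_row], [min_col, max_col]] of elements above threshold."""
    rows = [r for r, row in enumerate(gcam) if any(v > threshold for v in row)]
    cols = [c for row in gcam for c, v in enumerate(row) if v > threshold]
    if not rows:
        return None  # assumption: no activated elements means no region
    return [[min(rows), max(rows)], [min(cols), max(cols)]]


gcam = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.3, 0.0],
    [0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
limits = activation_region_limits_sketch(gcam)
```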

opensoundscape.pytorch_prediction.activation_region_to_box(activation_region, threshold=0.2)

draw a rectangle of the activation box as a boolean array (useful for plotting a mask over a spectrogram)

  • activation_region – a 2-d gradcam activation array
  • threshold=0.2 – minimum value of activation to count as ‘activated’

mask 2-d array of 0, 1 where 1’s form a solid box of activated region

opensoundscape.pytorch_prediction.gradcam_region(model, img_paths, img_shape, predictions=None, save_gcams=True, box_threshold=0.2)

Compute the GradCam activation region (the area of an image that was most important for classification in the CNN)

  • model – a pytorch model object
  • img_paths – list of paths to image files
  • predictions=None – [list of float] optionally, provide model predictions per file to avoid re-computing
  • save_gcams=True – bool, if False only box regions around gcams are saved

limits of the box surrounding the gcam activation region, as indices: [ [min row, max row], [min col, max col] ]
gcams: (only returned if save_gcams == True) arrays with gcam activation values, shape = shape of image



opensoundscape.pytorch_prediction.in_box(x, y, box_lims)

check if an x, y position falls within a set of limits

  • x – first index
  • y – second index
  • box_lims – [[x low,x high], [y low,y high]]

Returns: True if (x,y) is in box_lims, otherwise False
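
A sketch of the check, assuming the limits are inclusive on both ends (the source does not say whether they are). The name `in_box_sketch` is hypothetical.

```python
def in_box_sketch(x, y, box_lims):
    """Return True if (x, y) lies within [[x_low, x_high], [y_low, y_high]]."""
    (x_low, x_high), (y_low, y_high) = box_lims
    # Assumption: limits are inclusive on both ends
    return x_low <= x <= x_high and y_low <= y <= y_high


inside = in_box_sketch(2, 3, [[1, 4], [2, 5]])
outside = in_box_sketch(0, 3, [[1, 4], [2, 5]])
```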

opensoundscape.pytorch_prediction.predict(model, img_paths, img_shape, batch_size=1, num_workers=12, apply_softmax=True)

get multi-class model predictions from a pytorch model for a set of images

  • model – a pytorch model object (not path to weights)
  • img_paths – a list of paths to RGB png spectrograms
  • batch_size=1 – pytorch parallelization parameter
  • num_workers=12 – pytorch parallelization parameter
  • apply_softmax=True – if True, performs a softmax on raw output of network

returns: df of predictions indexed by file


raven.py: Utilities for dealing with Raven files


Check Raven annotations files for a non-null class

directory: The path which contains Raven annotations file

Generate a CSV to specify any class overrides

directory: The path which contains Raven annotations files ending in *.selections.txt.lower
csv (string): A multiline string containing a CSV file with two columns
raw and corrected

Convert Raven annotation files to lowercase

directory: The path which contains Raven annotations file
opensoundscape.raven.query_annotations(directory, cls)

Given a directory of Raven annotations, query for a specific class

directory: The path which contains Raven annotations file cls: The class which you would like to query for
output (string): A multiline string containing annotation file and rows matching the query cls

Species Table


spectrogram.py: Utilities for dealing with spectrograms

class opensoundscape.spectrogram.Spectrogram(spectrogram, frequencies, times)

Immutable spectrogram container


create an amplitude vs time signal from spectrogram

by summing pixels in the vertical dimension

freq_range=None: sum Spectrogram only in this range of [low, high] frequencies in Hz (if None, all frequencies are summed)
Returns:a time-series array of the vertical sum of spectrogram value
bandpass(min_f, max_f)

extract a frequency band from a spectrogram

crops the 2-d array of the spectrograms to the desired frequency range

  • min_f – low frequency in Hz for bandpass
  • max_f – high frequency in Hz for bandpass

bandpassed spectrogram object

classmethod from_audio(audio, window_type='hann', window_samples=512, overlap_samples=256, decibel_limits=(-100, -20))

create a Spectrogram object from an Audio object

  • window_type="hann" – see scipy.signal.spectrogram docs for description of window parameter
  • window_samples=512 – number of audio samples per spectrogram window (pixel)
  • overlap_samples=256 – number of samples shared by consecutive windows
  • decibel_limits=(-100,-20) – limit the dB values to (min,max) (lower values set to min, higher values set to max)

opensoundscape.spectrogram.Spectrogram object

classmethod from_file()

create a Spectrogram object from a file

Parameters:file – path of image to load
Returns:opensoundscape.spectrogram.Spectrogram object
limit_db_range(min_db=-100, max_db=-20)

Limit the decibel values of the spectrogram to range from min_db to max_db

values less than min_db are set to min_db; values greater than max_db are set to max_db

similar to Audacity’s gain and range parameters

  • min_db – values lower than this are set to this
  • max_db – values higher than this are set to this

Spectrogram object with db range applied

linear_scale(feature_range=(0, 1))

Linearly rescale spectrogram values to a range of values using in_range as decibel_limits

Parameters:feature_range – tuple of (low,high) values for output
Returns:Spectrogram object with values rescaled to feature_range
min_max_scale(feature_range=(0, 1))

Linearly rescale spectrogram values to a range of values using in_range as minimum and maximum

Parameters:feature_range – tuple of (low,high) values for output
Returns:Spectrogram object with values rescaled to feature_range
net_amplitude(signal_band, reject_bands=None)

create amplitude signal in signal_band and subtract amplitude from reject_bands

rescale the signal and reject bands by dividing by their bandwidths in Hz (amplitude of each reject_band is divided by the total bandwidth of all reject_bands; amplitude of signal_band is divided by bandwidth of signal_band)

  • signal_band – [low,high] frequency range in Hz (positive contribution)
  • reject_bands – list of [low,high] frequency ranges in Hz (negative contribution)

return: time-series array of net amplitude

plot(inline=True, fname=None, show_colorbar=False)

Plot the spectrogram with matplotlib.pyplot

  • inline=True
  • fname=None – specify a string path to save the plot to (ending in .png/.pdf)
  • show_colorbar – include image legend colorbar from pyplot
to_image(shape=None, mode='RGB', spec_range=[-100, -20])

create a Pillow Image from spectrogram. Linearly rescales values from spec_range (default [-100, -20]) to [255,0] (i.e., -20 db is loudest -> black, -100 db is quietest -> white)

  • destination – a file path (string)
  • shape=None – tuple of image dimensions, eg (224,224)
  • mode="RGB" – RGB for 3-channel color or “L” for 1-channel grayscale
  • spec_range=[-100,-20] – the lowest and highest possible values in the spectrogram

Pillow Image object

trim(start_time, end_time)

extract a time segment from a spectrogram

  • start_time – in seconds
  • end_time – in seconds

spectrogram object from extracted time segment


a set of utilities for converting between scientific and common names of bird species in different naming systems (xeno canto and bird net)


convert bird net common name (ignoring dashes, spaces, case) to scientific name as lowercase-hyphenated


convert bird net common name (ignoring dashes, spaces, case) to scientific name as lowercase-hyphenated


list of scientific-names (lowercase-hyphenated) of species in the loaded species table


convert scientific name as lowercase-hyphenated to birdnet common name as lowercasenospaces


convert scientific name as lowercase-hyphenated to xeno-canto common name as lowercasenospaces


convert xeno-canto common name (ignoring dashes, spaces, case) to scientific name as lowercase-hyphenated

Torch Spectrogram Augmentation

These functions were implemented for PyTorch in the following repository: https://github.com/zcaceres/spec_augment (the original paper is available at https://arxiv.org/abs/1904.08779)

Torch Training

opensoundscape.torch.train.train(save_dir, model, train_dataset, valid_dataset, optimizer, loss_fn, epochs=25, batch_size=1, num_workers=0, log_every=5, tensor_augment=False, debug=False, print_logging=True)

Train a model


  • save_dir – A directory to save intermediate results
  • model – A binary torch model, e.g. torchvision.models.resnet18(pretrained=True); must override classes, e.g. model.fc = torch.nn.Linear(model.fc.in_features, 2)
  • train_dataset – The training Dataset, e.g. created by SingleTargetAudioDataset()
  • valid_dataset – The validation Dataset, e.g. created by SingleTargetAudioDataset()
  • optimizer – A torch optimizer, e.g. torch.optim.SGD(model.parameters(), lr=1e-3)
  • loss_fn – A torch loss function, e.g. torch.nn.CrossEntropyLoss()
  • epochs – The number of epochs [default: 25]
  • batch_size – The size of the batches [default: 1]
  • num_workers – The number of cores to use for batch preparation [default: 0]
  • log_every – Log statistics when epoch % log_every == 0 [default: 5]
  • tensor_augment – Whether or not to use the tensor augment procedures [default: False]
  • debug – Whether or not to write intermediate images [default: False]

Side Effects:
Write a file epoch-{epoch}.tar containing (rate of log_every):

  • Model state dictionary
  • Optimizer state dictionary
  • Labels in YAML format
  • Train: loss, accuracy, precision, recall, and f1 score
  • Validation: accuracy, precision, recall, and f1 score
  • train_dataset.label_dict

Write a metadata file with parameter values to save_dir/metadata.txt
model parameters are saved to