Agile Bioacoustic Modeling with SongSpace
SongSpace provides a workflow for active or “agile” learning for bioacoustics data. Embed audio into a database, query the database with vector search or classifiers, and select clips for active learning review or final verification for ecological analyses.
Embeddings are saved in a HopLite database. The folder that stores the (SQL) embedding database also stores classifiers and tables for labeled datasets. The full workspace can be saved with ss.save(path) and reloaded with SongSpace.open(path).
Run this tutorial
If running in Colab, uncomment the installation line below.
[1]:
# if 'google.colab' in str(get_ipython()):
#     %pip install "opensoundscape==0.12.1" "bioacoustics-model-zoo==0.12.0"
[2]:
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import bioacoustics_model_zoo as bmz
from opensoundscape.annotations import BoxedAnnotations
from opensoundscape.vector_database import load_or_create_hoplite_usearch_db
from opensoundscape.ml.song_space import SongSpace
from opensoundscape.ml.shallow_classifier import select_from_hoplite
from opensoundscape.visualization import annotate, inspect
Download example files
Download a set of aquatic soundscape recordings with annotations of Rana sierrae vocalizations.
Option 1: run the cell below
If you get a 403 error, DataDryad suspects you are a bot. Use Option 2.
Option 2:
Download and unzip the rana_sierrae_2022.zip folder containing audio and annotations from this public Dryad dataset.
Move the unzipped rana_sierrae_2022 folder into the current folder.
[3]:
# # Note: the "!" preceding each line below allows us to run bash commands in a Jupyter notebook
# # If you are not running this code in a notebook, input these commands into your terminal instead
# !wget -O rana_sierrae_2022.zip https://datadryad.org/stash/downloads/file_stream/2722802;
# !unzip rana_sierrae_2022;
Prepare audio data
See the train_cnn.ipynb tutorial for a step-by-step walkthrough of this process, or just run the cells below to prepare a training set.
[4]:
# Set this variable to specify where the folder `rana_sierrae_2022` is located:
dataset_path = Path("./rana_sierrae_2022/")
# generate clip labels of 3 s duration (to match HawkEars) using the Raven annotations
# and some utility functions from opensoundscape (BoxedAnnotations is imported above)
audio_and_raven_files = pd.read_csv(f"{dataset_path}/audio_and_raven_files.csv")
# update the paths to where we have the audio and raven files stored
audio_and_raven_files["audio"] = audio_and_raven_files["audio"].apply(
lambda x: f"{dataset_path}/{x}"
)
audio_and_raven_files["raven"] = audio_and_raven_files["raven"].apply(
lambda x: f"{dataset_path}/{x}"
)
annotations = BoxedAnnotations.from_raven_files(
raven_files=audio_and_raven_files["raven"],
audio_files=audio_and_raven_files["audio"],
annotation_column="annotation",
)
# generate labels for 3 s clips, including any labels that overlap by at least 0.2 seconds
labels = annotations.clip_labels(
clip_duration=3, min_label_overlap=0.2, final_clip=None
)
/Users/SML161/opensoundscape/opensoundscape/annotations.py:347: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
all_annotations_df = pd.concat(all_file_dfs).reset_index(drop=True)
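To make the clip-labeling step concrete, here is a minimal pure-Python sketch of the idea: cut a recording into fixed-length clips and mark a clip positive when an annotation box overlaps it by at least a minimum duration. The function `label_clips` and its arguments are hypothetical names used only for illustration; this is not the opensoundscape implementation, which also handles multiple classes and multiple files.

```python
def label_clips(annotations, audio_duration, clip_duration=3.0, min_overlap=0.2):
    """Return (start, end, label) tuples for consecutive full-length clips.

    annotations: list of (start_time, end_time) boxes in seconds for one class.
    Hypothetical helper, for illustration only.
    """
    clips = []
    n_clips = int(audio_duration // clip_duration)  # discard the final partial clip
    for i in range(n_clips):
        clip_start = i * clip_duration
        clip_end = clip_start + clip_duration
        label = 0
        for box_start, box_end in annotations:
            # time shared by the clip and the annotation (negative if disjoint)
            overlap = min(clip_end, box_end) - max(clip_start, box_start)
            if overlap >= min_overlap:
                label = 1
                break
        clips.append((clip_start, clip_end, label))
    return clips


# one box straddling the 3 s boundary, one box that is too short to count
boxes = [(2.9, 3.5), (7.0, 7.1)]
for clip in label_clips(boxes, audio_duration=10.0):
    print(clip)
```

Only the 3-6 s clip is labeled positive here: the first box overlaps the 0-3 s clip by just 0.1 s, and the second box is shorter than the 0.2 s minimum.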
Prepare labels
[5]:
dataset_path = Path("./rana_sierrae_2022/")
audio_and_raven_files = pd.read_csv(dataset_path / "audio_and_raven_files.csv")
audio_and_raven_files["audio"] = audio_and_raven_files["audio"].apply(
lambda x: str(dataset_path / x)
)
audio_and_raven_files["raven"] = audio_and_raven_files["raven"].apply(
lambda x: str(dataset_path / x)
)
annotations = BoxedAnnotations.from_raven_files(
raven_files=audio_and_raven_files["raven"],
audio_files=audio_and_raven_files["audio"],
annotation_column="annotation",
)
labels = annotations.clip_labels(clip_duration=3, min_label_overlap=0.2)
/Users/SML161/opensoundscape/opensoundscape/annotations.py:347: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
all_annotations_df = pd.concat(all_file_dfs).reset_index(drop=True)
[6]:
target_source_class = "C"
target_model_class = "RanaSierrae_C"
# start with one recording of target class
binary_labels = labels[[target_source_class]].rename(
columns={target_source_class: target_model_class}
)
seed_train = binary_labels.loc[
["rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220623_060000_0-10s.mp3"]
]
other = binary_labels.drop(seed_train.index)
validation, unlabeled = train_test_split(other, test_size=0.8, random_state=0)
print("seed_train:", seed_train.shape)
print("validation:", validation.shape)
print("pool:", unlabeled.shape)
seed_train: (4, 1)
validation: (536, 1)
pool: (2148, 1)
All audio clips from the single audio file we’ll start with for positives:
[7]:
_ = inspect(seed_train, bandpass_range=(0, 2500))
Build database and SongSpace
The default machine learning embedding model is Perch V2, a TensorFlow model provided via the Bioacoustics Model Zoo. If you wish to avoid installing TensorFlow, consider specifying feature_extractor='perch2_onnx' for an ONNX-formatted version (currently CPU only), or selecting another model such as 'birdnet' (TFLite) or 'bs-convnext' (PyTorch). Alternatively, advanced users can provide a custom embedding model with an .embed() method matching the opensoundscape.CNN.embed() API.
It is critical to maintain consistency of the machine learning model within a single SongSpace: you cannot change embedding models (or model versions) for different datasets or tasks within a single SongSpace. You’ll need to make a new SongSpace and re-ingest all audio files if you change backbone embedding models.
[10]:
ss = SongSpace("./Perch2SongSpace")
Connected to existing database with 2,691 embeddings from 672 files.
/Users/SML161/miniconda3/envs/opso_dev/lib/python3.13/site-packages/bioacoustics_model_zoo/perch_v2.py:215: UserWarning: Disabling TensorFlow's XLA compilation (setting tf.config.optimizer.set_jit(False)) because otherwise TF models on Mac hang at runtime as of Tensorflow 2.21.0
warnings.warn(
[11]:
import opensoundscape as opso
opso.set_seed(0)
[12]:
# Embed and register datasets in SongSpace.
ss.ingest_audio(
seed_train,
dataset_name="round1_train",
batch_size=32,
)
ss.ingest_audio(
validation,
dataset_name="validation",
allow_training=False,
batch_size=32,
)
ss.ingest_audio(
unlabeled,
dataset_name="pool_unlabeled",
batch_size=32,
)
ss.list_datasets()
all samples already have embeddings in the database
all samples already have embeddings in the database
all samples already have embeddings in the database
[12]:
['round1_train', 'validation', 'pool_unlabeled']
[13]:
ss.save()
Saved SongSpace to ./Perch2SongSpace with 0 classifiers and 3 datasets.
Similarity search for similar samples
[14]:
# Similarity search
matches_for_each_query = ss.similarity_search(seed_train, k=20, exact_search=True)
best_matches = matches_for_each_query.sort_values(
by="sort_score", ascending=False
).head(20)
# Review and annotate interactively.
_ = annotate(
best_matches,
bandpass_range=(0, 2500),
annotation_buttons=["Accept", "Reject"],
N=20,
)
embedding query samples
/Users/SML161/opensoundscape/opensoundscape/ml/cnn.py:2955: UserWarning: The columns of input samples df differ from `model.classes`. Discarding sample df columns.
warnings.warn(
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1782311451.465561 4562402 service.cc:153] XLA service 0x393fcb4c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1782311451.465592 4562402 service.cc:161] StreamExecutor [0]: Host, Default Version (Driver: 0.0.0; Runtime: 0.0.0; Toolkit: 0.0.0; DNN: 0.0.0)
I0000 00:00:1782311451.726795 4562402 dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
W0000 00:00:1782311452.145856 4591712 cpp_gen_intrinsics.cc:74] Empty bitcode string provided for eigen. Optimizations relying on this IR will be disabled.
I0000 00:00:1782311452.146537 4591712 rsqrt.cc:179] Falling back to 1 / sqrt(x) for f32 false
I0000 00:00:1782311452.566058 4562402 device_compiler.h:208] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
performing similarity search for each of 4 query samples
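As a rough picture of what an exact (brute-force) similarity search does, the sketch below scores every stored embedding against a query and keeps the k best. It assumes cosine similarity as the metric, which this tutorial does not state, and the names `cosine_similarity`, `exact_top_k`, and the toy three-dimensional "embeddings" are hypothetical. It is not the SongSpace or HopLite implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, database, k):
    """Score every database vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# toy database: clip id -> embedding (real embeddings have many more dimensions)
database = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}
matches = exact_top_k([1.0, 0.0, 0.0], database, k=2)
print(matches)
```

An exact search like this compares the query to every stored vector, so its cost grows linearly with the database size; approximate indexes trade a little recall for speed on large databases.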
[13]:
# ingest labels from the interactive labeling widget
new_pos = best_matches[best_matches["Accept"] == True][
["file", "start_time", "end_time"]
].copy()
new_pos[target_model_class] = 1
new_neg = best_matches[best_matches["Reject"] == True][
["file", "start_time", "end_time"]
].copy()
new_neg[target_model_class] = 0
search_labels = (
pd.concat([new_pos, new_neg], ignore_index=True)
.drop_duplicates()
.set_index(["file", "start_time", "end_time"])[[target_model_class]]
)
# add these labels to their own dataset in the SongSpace
ss.ingest_audio(
search_labels,
dataset_name="search_labels",
batch_size=32,
num_workers=0,
)
ss.save()
search_labels[target_model_class].value_counts()
all samples already have embeddings in the database
Saved SongSpace to ./Perch2SongSpace with 0 classifiers and 4 datasets.
[13]:
RanaSierrae_C
1 8
Name: count, dtype: int64
Train first classifier
[19]:
clf_round1 = ss.fit_classifier(
classes=[target_model_class],
train_datasets=["round1_train", "search_labels"],
validation_dataset="validation",
weak_negatives_proportion=10.0, # lots of weak negatives, since we have just a few positives!
weak_negatives_weight=0.05,
steps=100,
batch_size=128,
validation_interval=50,
logging_interval=50,
)
clf_round1.val_metrics
training classifier for 1 classes with 12 training samples and 536 validation samples
Finding matching window IDs for samples in label_df...
Finding matching window IDs for samples in label_df...
Epoch 50/100, Loss: 0.031, Val Loss: 1.820
val AU ROC: 0.803
val MAP: 0.326
Epoch 100/100, Loss: 0.013, Val Loss: 2.213
val AU ROC: 0.802
val MAP: 0.332
Loaded best model with validation loss: 1.820 at step 50 of 100
Training complete
[19]:
{'loss': 1.8197819471359253,
'auroc': 0.8019947863538479,
'map': 0.3318269957719899,
'per_class_auroc': [0.8019947863538479]}
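One plausible reading of the weak-negative settings is that unlabeled clips are added to training as negatives with a small per-sample weight, so they nudge the classifier without overwhelming the few human labels. The sketch below shows that idea as a weighted binary cross-entropy over raw logits. This interpretation is an assumption, `weighted_bce` is a hypothetical helper, and it is not the fit_classifier implementation.

```python
import math


def weighted_bce(logits, targets, weights):
    """Weighted mean binary cross-entropy computed from raw logits."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        # numerically stable form of log(1 + exp(z)) - y * z
        loss = max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
        total += w * loss
    return total / sum(weights)


# two human-labeled clips followed by two "weak negative" clips (assumed unlabeled)
logits = [2.0, -1.0, 0.5, 3.0]
targets = [1, 0, 0, 0]

full_weight = weighted_bce(logits, targets, weights=[1.0, 1.0, 1.0, 1.0])
down_weighted = weighted_bce(logits, targets, weights=[1.0, 1.0, 0.05, 0.05])
print(f"all samples weighted equally: {full_weight:.3f}")
print(f"weak negatives at weight 0.05: {down_weighted:.3f}")
```

With a small weight, a high-scoring weak negative (which may in fact be an unlabeled positive) contributes little to the loss, which is the usual motivation for down-weighting such samples.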
Save the classifier in the SongSpace, if we like it enough.
[20]:
if "rana_round1" in ss.list_classifiers():
    ss.remove_classifier("rana_round1")
ss.add_classifier("rana_round1", clf_round1)
ss.save()
Saved SongSpace to ./Perch2SongSpace with 1 classifiers and 4 datasets.
Evaluate a saved classifier on a specific dataset.
[21]:
round1_metrics = ss.evaluate("rana_round1", "validation")
round1_metrics
Finding matching window IDs for samples in label_df...
[21]:
{'RanaSierrae_C': {'average_precision': 0.3318269957719899,
'roc_auc': 0.8019947863538479},
'macro_average_precision': np.float64(0.3318269957719899),
'macro_roc_auc': np.float64(0.8019947863538479)}
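For readers unfamiliar with the two reported metrics, the sketch below computes ROC AUC and average precision from scratch on a toy set of scores. The functions are hypothetical illustrations of the standard definitions; how the library computes these values internally is not shown in this tutorial.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# toy scores: two positives, three negatives
scores = [4.0, 2.5, 1.0, -0.5, -2.0]
labels = [1, 0, 1, 0, 0]
print(f"ROC AUC: {roc_auc(scores, labels):.3f}")
print(f"average precision: {average_precision(scores, labels):.3f}")
```

ROC AUC is insensitive to class balance, while average precision drops quickly when negatives outrank the rare positives, which is why the two numbers can differ a lot on a dataset with few positive clips.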
Active learning round: review high-scoring candidates
[22]:
pool_scores = ss.predict_on_dataset(
classifier_name="rana_round1", dataset_name="pool_unlabeled"
)
# drop samples already labeled
labeled_idx = set(search_labels.index).union(set(seed_train.index))
pool_scores = pool_scores[~pool_scores.index.isin(labeled_idx)]
topk = pool_scores.nlargest(20, target_model_class).reset_index()
# Review and annotate interactively.
_ = annotate(
topk, bandpass_range=(0, 2500), annotation_buttons=["Accept", "Reject"], N=20
)
topk.head()
Finding matching window IDs for samples in label_df...
[22]:
| | file | start_time | end_time | RanaSierrae_C | Accept | Reject |
|---|---|---|---|---|---|---|
| 0 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 0.0 | 3.0 | 4.038275 | None | None |
| 1 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 3.0 | 6.0 | 4.014554 | None | None |
| 2 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 3.934664 | None | None |
| 3 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 3.896363 | None | None |
| 4 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 3.894268 | None | None |
Ingest labels
[23]:
new_pos = topk[topk["Accept"] == True][["file", "start_time", "end_time"]].copy()
new_pos[target_model_class] = 1
new_neg = topk[topk["Reject"] == True][["file", "start_time", "end_time"]].copy()
new_neg[target_model_class] = 0
round2_train = (
pd.concat([new_pos, new_neg], ignore_index=True)
.drop_duplicates()
.set_index(["file", "start_time", "end_time"])[[target_model_class]]
)
ss.ingest_audio(
round2_train,
dataset_name="round2_train",
batch_size=32,
num_workers=0,
)
ss.save()
round2_train[target_model_class].value_counts()
all samples already have embeddings in the database
Saved SongSpace to ./Perch2SongSpace with 1 classifiers and 5 datasets.
[23]:
RanaSierrae_C
0 8
1 4
Name: count, dtype: int64
Build a new classifier
[24]:
clf_round2 = ss.fit_classifier(
classes=[target_model_class],
train_datasets=["round1_train", "search_labels", "round2_train"],
validation_dataset="validation",
weak_negatives_proportion=1.0,
weak_negatives_weight=0.001,
steps=200,
batch_size=128,
validation_interval=30,
logging_interval=30,
)
if "rana_round2" in ss.list_classifiers():
    ss.remove_classifier("rana_round2")
ss.add_classifier("rana_round2", clf_round2)
round2_metrics = ss.evaluate("rana_round2", "validation")
round2_metrics
training classifier for 1 classes with 24 training samples and 536 validation samples
Finding matching window IDs for samples in label_df...
Finding matching window IDs for samples in label_df...
Epoch 30/200, Loss: 0.342, Val Loss: 0.629
val AU ROC: 0.818
val MAP: 0.575
Epoch 60/200, Loss: 0.198, Val Loss: 0.539
val AU ROC: 0.812
val MAP: 0.558
Epoch 90/200, Loss: 0.131, Val Loss: 0.494
val AU ROC: 0.805
val MAP: 0.563
Epoch 120/200, Loss: 0.094, Val Loss: 0.469
val AU ROC: 0.801
val MAP: 0.560
Epoch 150/200, Loss: 0.071, Val Loss: 0.454
val AU ROC: 0.796
val MAP: 0.555
Epoch 180/200, Loss: 0.056, Val Loss: 0.444
val AU ROC: 0.793
val MAP: 0.548
Loaded best model with validation loss: 0.444 at step 180 of 200
Training complete
Finding matching window IDs for samples in label_df...
[24]:
{'RanaSierrae_C': {'average_precision': 0.5429201413929045,
'roc_auc': 0.7913408137821604},
'macro_average_precision': np.float64(0.5429201413929045),
'macro_roc_auc': np.float64(0.7913408137821604)}
Active learning round 2: review high-scoring candidates
[25]:
pool_scores = ss.predict_on_dataset(
classifier_name="rana_round2", dataset_name="pool_unlabeled"
)
# drop samples already labeled
labeled_idx = (
set(search_labels.index).union(set(seed_train.index)).union(set(round2_train.index))
)
pool_scores = pool_scores[~pool_scores.index.isin(labeled_idx)]
topk = pool_scores.nlargest(20, target_model_class).reset_index()
# Review and annotate interactively.
_ = annotate(
topk, bandpass_range=(0, 2500), annotation_buttons=["Accept", "Reject"], N=20
)
topk.head()
Finding matching window IDs for samples in label_df...
[25]:
| | file | start_time | end_time | RanaSierrae_C | Accept | Reject |
|---|---|---|---|---|---|---|
| 0 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 5.725680 | None | None |
| 1 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 3.0 | 6.0 | 5.206980 | None | None |
| 2 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 0.0 | 3.0 | 4.048398 | None | None |
| 3 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 3.0 | 6.0 | 3.675750 | None | None |
| 4 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 9.0 | 12.0 | 3.413406 | None | None |
Ingest labels
[26]:
new_pos = topk[topk["Accept"] == True][["file", "start_time", "end_time"]].copy()
new_pos[target_model_class] = 1
new_neg = topk[topk["Reject"] == True][["file", "start_time", "end_time"]].copy()
new_neg[target_model_class] = 0
round3_train = (
pd.concat([new_pos, new_neg], ignore_index=True)
.drop_duplicates()
.set_index(["file", "start_time", "end_time"])[[target_model_class]]
)
ss.ingest_audio(
round3_train,
dataset_name="round3_train",
batch_size=32,
num_workers=0,
)
ss.save()
round3_train[target_model_class].value_counts()
Saved SongSpace to ./Perch2SongSpace with 2 classifiers and 6 datasets.
[26]:
Series([], Name: count, dtype: int64)
Build a new classifier
[27]:
clf_round3 = ss.fit_classifier(
classes=[target_model_class],
train_datasets=["round1_train", "search_labels", "round2_train", "round3_train"],
validation_dataset="validation",
weak_negatives_proportion=1.0,
weak_negatives_weight=0.001,
steps=200,
batch_size=128,
validation_interval=30,
logging_interval=30,
)
if "rana_round3" in ss.list_classifiers():
    ss.remove_classifier("rana_round3")
ss.add_classifier("rana_round3", clf_round3)
ss.save()
round3_metrics = ss.evaluate("rana_round3", "validation")
round3_metrics
training classifier for 1 classes with 24 training samples and 536 validation samples
Finding matching window IDs for samples in label_df...
Finding matching window IDs for samples in label_df...
Epoch 30/200, Loss: 0.349, Val Loss: 0.644
val AU ROC: 0.818
val MAP: 0.557
Epoch 60/200, Loss: 0.203, Val Loss: 0.553
val AU ROC: 0.811
val MAP: 0.549
Epoch 90/200, Loss: 0.134, Val Loss: 0.506
val AU ROC: 0.804
val MAP: 0.553
Epoch 120/200, Loss: 0.096, Val Loss: 0.481
val AU ROC: 0.799
val MAP: 0.551
Epoch 150/200, Loss: 0.073, Val Loss: 0.466
val AU ROC: 0.793
val MAP: 0.540
Epoch 180/200, Loss: 0.058, Val Loss: 0.456
val AU ROC: 0.790
val MAP: 0.528
Loaded best model with validation loss: 0.456 at step 180 of 200
Training complete
Saved SongSpace to ./Perch2SongSpace with 3 classifiers and 6 datasets.
Finding matching window IDs for samples in label_df...
[27]:
{'RanaSierrae_C': {'average_precision': 0.5286647807313382,
'roc_auc': 0.7881672900374023},
'macro_average_precision': np.float64(0.5286647807313382),
'macro_roc_auc': np.float64(0.7881672900374023)}
We now have a classifier to use for downstream tasks (validation ROC AUC of about 0.79 after round 3).
Select clips for manual verification
Use stratified or thresholded selection from the full embedded database.
select_from_hoplite provides several options for filtering. We can loop over the variables of interest to select stratified clips.
Filtering options include:
first and last date
earliest and latest time
minimum and maximum score
list of recordings (audio file paths)
list of deployments
list of projects
We also specify which classes we want to extract clips for, how many, and under which strategy:
top_k: highest scoring k (eg, 5) clips matching the filters
random_k: randomly selected k clips matching the filters
all: all clips matching the filters
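The three strategies listed above can be sketched in a few lines of plain Python. The helper `select_clips`, its arguments, and the inclusive score bounds are assumptions made for illustration; this is not the select_from_hoplite implementation, which queries the embedding database rather than an in-memory list.

```python
import random


def select_clips(scored_clips, strategy="top_k", k=5,
                 min_score=None, max_score=None, seed=None):
    """scored_clips: list of (clip_id, score). Hypothetical helper."""
    # apply the optional score filters first (bounds assumed inclusive)
    kept = [
        (clip_id, score)
        for clip_id, score in scored_clips
        if (min_score is None or score >= min_score)
        and (max_score is None or score <= max_score)
    ]
    if strategy == "all":
        return kept
    if strategy == "top_k":
        return sorted(kept, key=lambda item: item[1], reverse=True)[:k]
    if strategy == "random_k":
        rng = random.Random(seed)  # seeded generator for reproducible draws
        return rng.sample(kept, min(k, len(kept)))
    raise ValueError(f"unknown strategy: {strategy}")


pool = [("a", 2.1), ("b", -0.4), ("c", 0.7), ("d", 1.5)]
print(select_clips(pool, "top_k", k=2))
print(select_clips(pool, "all", min_score=0))
print(select_clips(pool, "random_k", k=2, seed=0))
```

Top-k selection is useful for confirming presence, random-k within a score range gives an unbiased sample for estimating precision, and "all" is appropriate when every detection above a threshold will be reviewed.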
[ ]:
# select the global 5 most confident 'RanaSierrae_C' clips from the pool according to the round 3 classifier
clips = select_from_hoplite(
db=ss.db,
classifier=ss.classifiers["rana_round3"],
classes=["RanaSierrae_C"],
strategy="top_k",
k=5,
)
inspect(clips, bandpass_range=(0, 2500))
clips
| | file | start_time | end_time | datetime | deployment | project | window_id | class |
|---|---|---|---|---|---|---|---|---|
| 0 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 3.0 | 6.0 | 2022-06-23 06:15:00 | mp3 | round1_train | 1264 | RanaSierrae_C |
| 1 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 3.0 | 6.0 | 2022-06-22 20:15:00 | mp3 | round1_train | 1220 | RanaSierrae_C |
| 2 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 0.0 | 3.0 | 2022-06-22 18:15:00 | mp3 | round1_train | 599 | RanaSierrae_C |
| 3 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 2022-06-22 20:15:00 | mp3 | round1_train | 1219 | RanaSierrae_C |
| 4 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 2022-06-23 06:00:00 | mp3 | round1_train | 3 | RanaSierrae_C |
We can reload the SongSpace in another Python session, which will retain all the saved classifiers, labeled datasets, and embeddings. The clip dataframes created by these examples can be saved to CSV for annotation in Dipper or other review software.
[26]:
# reload the SongSpace, as we would in a new script/notebook
from opensoundscape import SongSpace
from opensoundscape.visualization import inspect
ss_reloaded = SongSpace.open("./Perch2SongSpace")
print(f"Classifiers: {ss_reloaded.list_classifiers()}")
print(f"Datasets: {ss_reloaded.list_datasets()}")
Connecting to existing db at Perch2SongSpace
Connected database has 2,691 embeddings from 672 files.
Classifiers: ['rana_round1', 'rana_round2', 'rana_round3']
Datasets: ['round1_train', 'validation', 'pool_unlabeled', 'search_labels', 'round2_train', 'round3_train']
/Users/SML161/miniconda3/envs/opso_dev/lib/python3.13/site-packages/bioacoustics_model_zoo/perch_v2.py:208: UserWarning: Disabling TensorFlow's XLA compilation (setting tf.config.optimizer.set_jit(False)) because otherwise TF models on Mac hang at runtime as of Tensorflow 2.21.0
warnings.warn(
Example: Stratify by date range and deployment
A typical stratification pattern for reviewing clips for an occupancy analysis:
[ ]:
date_ranges = [
("2022-06-20", "2022-06-21"),
("2022-06-22", "2022-06-23"),
("2022-06-24", "2022-06-25"),
("2022-06-26", "2022-06-27"),
]
clips = ss_reloaded.stratified_selection(
ss_reloaded.classifiers["rana_round3"],
classes=["RanaSierrae_C"],
stratify_deployments=True,
k=1,
date_ranges=date_ranges,
)
# table ready for Dipper review with stratification by "date_range" and "deployment" in binary annotation mode
# clips.to_csv('RanaSierrae_C_clips_for_review.csv')
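The stratification pattern used here can be pictured as grouping clips by (date range, deployment) and keeping the top k per group. The sketch below does this in plain Python on a toy list of clips. The function `stratified_top_k`, the deployment names, and the inclusive date bounds are assumptions for illustration, not the stratified_selection implementation.

```python
from collections import defaultdict
from datetime import date


def stratified_top_k(clips, date_ranges, k=1):
    """clips: list of dicts with 'deployment', 'date' (datetime.date) and 'score'."""
    groups = defaultdict(list)
    for clip in clips:
        for start, end in date_ranges:
            # date range bounds are assumed to be inclusive
            if date.fromisoformat(start) <= clip["date"] <= date.fromisoformat(end):
                groups[(f"{start} to {end}", clip["deployment"])].append(clip)
                break
    selected = []
    for key in sorted(groups):
        best = sorted(groups[key], key=lambda c: c["score"], reverse=True)[:k]
        selected.extend({**clip, "date_range": key[0]} for clip in best)
    return selected


# hypothetical deployments and scores
clips = [
    {"deployment": "pond_1", "date": date(2022, 6, 20), "score": 1.2},
    {"deployment": "pond_1", "date": date(2022, 6, 21), "score": 2.4},
    {"deployment": "pond_2", "date": date(2022, 6, 20), "score": 0.3},
    {"deployment": "pond_1", "date": date(2022, 6, 23), "score": -0.5},
]
date_ranges = [("2022-06-20", "2022-06-21"), ("2022-06-22", "2022-06-23")]
for row in stratified_top_k(clips, date_ranges, k=1):
    print(row["date_range"], row["deployment"], row["score"])
```

Reviewing the single best clip per deployment and period keeps the verification effort bounded while still giving every site-by-period cell a chance to be confirmed as a detection.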
Other stratification and filtering patterns
Let’s now select the highest-scoring k=2 clips globally for each of 4 date ranges. We’ll enforce a score threshold of 0.
[27]:
selected = ss_reloaded.stratified_selection(
classifier=ss_reloaded.classifiers["rana_round3"],
classes=["RanaSierrae_C"],
strategy="top_k",
k=2,
min_score=0,
date_ranges=date_ranges,
)
for date_range, clips in selected.groupby("date_range"):
    print(f"Date range: {date_range}")
    # show only the clips belonging to this date range
    inspect(clips, bandpass_range=(0, 2500))
Date range: 2022-06-20 to 2022-06-21
Date range: 2022-06-22 to 2022-06-23
Date range: 2022-06-24 to 2022-06-25
Date range: 2022-06-26 to 2022-06-27
Example of clip selection for all clips above a threshold
[30]:
search_results_df = ss_reloaded.select(
ss_reloaded.classifiers["rana_round3"],
classes=["RanaSierrae_C"],
strategy="all",
min_score=0,
)
len(search_results_df)
[30]:
538
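The scores in this tutorial range from negative values to above 5, so they are not probabilities. Assuming they are raw logits of a sigmoid output (the tutorial does not say so explicitly), a threshold of 0 corresponds to a probability of 0.5. The sketch below converts between the two scales; `sigmoid` and `logit` are ordinary math helpers, not library functions.

```python
import math


def sigmoid(logit_score):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit_score))


def logit(probability):
    """Inverse of sigmoid: map a probability in (0, 1) to a raw logit."""
    return math.log(probability / (1.0 - probability))


print(f"score 0 -> probability {sigmoid(0.0):.2f}")
print(f"probability 0.90 -> score {logit(0.9):.2f}")
```

Under that assumption, a stricter cutoff such as 90% confidence would correspond to a min_score of roughly 2.2.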
Select random clips in a score bin.
Plus: use a random state to create reproducible results.
Plus: restrict the time range.
[31]:
import datetime
search_results_df = ss_reloaded.select(
classifier="rana_round3",
classes=["RanaSierrae_C"],
strategy="random_k",
min_score=-2,
max_score=1,
random_state=0,
# accepts either time strings or datetime.time objects
time_range=("00:00:00", datetime.time(8, 0, 0)),
)
search_results_df.head(3)
[31]:
| | file | start_time | end_time | datetime | deployment | project | window_id | score | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 6.0 | 9.0 | 2022-06-26 01:00:00 | mp3 | round1_train | 1527 | 0.630615 | RanaSierrae_C |
| 1 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 9.0 | 12.0 | 2022-06-21 07:00:00 | mp3 | round1_train | 2200 | -0.090810 | RanaSierrae_C |
| 2 | rana_sierrae_2022/mp3/sine2022a_MSD-0558_20220... | 3.0 | 6.0 | 2022-06-20 04:30:00 | mp3 | round1_train | 1343 | -0.179905 | RanaSierrae_C |
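A time-of-day filter that accepts either strings or datetime.time objects can be sketched as below. The helpers `to_time` and `in_time_range` are hypothetical, and the sketch assumes inclusive bounds and a range that does not wrap past midnight; the library's own handling of these cases is not described in this tutorial.

```python
import datetime


def to_time(value):
    """Accept 'HH:MM:SS' strings or datetime.time objects."""
    if isinstance(value, datetime.time):
        return value
    return datetime.time.fromisoformat(value)


def in_time_range(timestamp, time_range):
    """True if the time of day of `timestamp` falls within (start, end)."""
    start, end = (to_time(v) for v in time_range)
    return start <= timestamp.time() <= end


time_range = ("00:00:00", datetime.time(8, 0, 0))
print(in_time_range(datetime.datetime(2022, 6, 26, 1, 0, 0), time_range))
print(in_time_range(datetime.datetime(2022, 6, 22, 20, 15, 0), time_range))
```

The first timestamp (01:00) falls inside the midnight-to-08:00 window and the second (20:15) does not.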
Finally, we can also directly count the number of clips scoring in a bin, without retrieving clip information.
This is memory-efficient on large datasets because we don’t need to aggregate the clip information.
[ ]:
from opensoundscape.ml.shallow_classifier import count_dets_hoplite
import pandas as pd
counts = count_dets_hoplite(
db=ss_reloaded.db,
classifier=ss_reloaded.classifiers["rana_round3"],
classes=["RanaSierrae_C"],
score_bins=[(-2, -1), (-1, 0), (0, 1), (1, 2)],
)
pd.DataFrame(counts)
| score bin (low) | score bin (high) | RanaSierrae_C |
|---|---|---|
| -2 | -1 | 734 |
| -1 | 0 | 1012 |
| 0 | 1 | 383 |
| 1 | 2 | 106 |
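Counting per bin needs only one running counter per bin, which is why it stays cheap on large datasets. The sketch below streams over scores and tallies them. The function `count_in_bins` is hypothetical, and treating each bin as half-open (low <= score < high) is an assumption; it is not the count_dets_hoplite implementation.

```python
def count_in_bins(scores, score_bins):
    """Count scores with low <= score < high for each (low, high) bin."""
    counts = {bin_: 0 for bin_ in score_bins}
    for score in scores:  # works with any iterable, including a generator
        for low, high in score_bins:
            if low <= score < high:
                counts[(low, high)] += 1
                break
    return counts


score_bins = [(-2, -1), (-1, 0), (0, 1), (1, 2)]
toy_scores = [-1.5, -0.2, 0.0, 0.4, 1.9, 3.0]  # 3.0 falls outside every bin
print(count_in_bins(toy_scores, score_bins))
```

Because only the counters are kept, memory use is independent of the number of clips scored.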
[ ]:
# Optional cleanup: uncomment and run to remove the SongSpace folder containing the embedding database, classifiers, and datasets
# import shutil
# shutil.rmtree('./Perch2SongSpace/')