Practical Guide (with examples)
In this part of the documentation, the usage of the package and its wrapper scripts is described in detail. As basty is a pipeline consisting of relatively well-separated stages, it is easiest to describe each stage separately, in correspondence with its script. The main scripts for creating a project and performing behavior mapping are located in `scripts/project`. The scripts should be run after carefully setting the configuration files and paths. Default parameters are configured in `basty/project/helper`. If you want to use a different set of parameters, you need to edit the scripts.
Initialization
Before everything else, a project must be initialized. Each project has its own directory where its files are stored. Since experiment recordings are usually quite long, it can easily become unfeasible to keep all the data in random access memory. Thus, results are stored as `.pkl` files (`pandas.DataFrame`s) or `.npy` files (`numpy` arrays), and `ExptRecord` instances (stored as `.z` files with `joblib`) are updated after each stage.
Initialization of a project is done with the `init_project.py` script, as the name suggests. All it does is initialize a `Project` class instance from `basty.project.experiment_processing` with suitable parameters and the path of `main_cfg.yaml`. The keyword arguments `annotation_priority`, `inactive_annotation`, `noise_annotation`, and `arouse_annotation` specify how to deal with annotation files. For instance, `inactive_annotation` determines the string corresponding to label `0`.
```shell
python init_project.py --main-cfg-path /path/to/main_cfg.yaml
```
Reading the Data and Feature Extraction
The second script one should proceed with is `extract_features.py`. This script wraps four steps:

- Computing pose values (includes reading the experiment data and pre-processing) and storing `pose.pkl` (`--compute-pose-values`),
- Computing spatio-temporal features and storing `snap_stft.pkl` & `delta_stft.pkl` (`--compute-spatio-temporal-features`),
- Applying wavelet transformation to `snap_stft.pkl` and storing `wsft.npy` (`--compute-postural-dynamics`),
- Performing flattening and frame normalization, and storing `behavioral_reprs.npy` (`--compute-behavioral-representations`).
You can simply pass the `--compute-all` argument to run all four steps together in order. The two commands below are equivalent.
```shell
python extract_features.py --main-cfg-path /path/to/main_cfg.yaml --compute-all
python extract_features.py --main-cfg-path /path/to/main_cfg.yaml \
    --compute-pose-values --compute-spatio-temporal-features --compute-postural-dynamics --compute-behavioral-representations
```
Since the preprocessing consists of many steps, such as interpolation, smoothing, and adaptive filtering based on confidence scores, there are many parameters. Unless you're dealing with something extreme, the default values below should work fine. A Kalman filter is implemented but not recommended, as it has many parameters that strongly affect the resulting signal. You can turn off a filter by setting its threshold or window size to 0.
```python
pose_prep_kwargs = {
    "compute_oriented_pose": True,
    "compute_egocentric_frames": False,
    "local_outlier_threshold": 9,
    "local_outlier_winsize": 15,
    "decreasing_llh_winsize": 30,
    "decreasing_llh_lower_z": 0,
    "low_llh_threshold": 0.0,
    "median_filter_winsize": 6,
    "boxcar_filter_winsize": 6,
    "jump_quantile": 0,
    "interpolation_method": "linear",
    "interpolation_kwargs": {},
    "kalman_filter_kwargs": {},
}
```
It is possible to compute snap spatio-temporal features without computing delta spatio-temporal features (or vice versa) by changing the values of the arguments below.
```python
stft_kwargs = {
    "delta": compute_delta,
    "snap": compute_snap,
}
```
The `use_cartesian_blent` argument of `FeatureExtraction.compute_behavioral_representations()` allows averaging over the `x` and `y` components of pose spatio-temporal features after the wavelet transformation. If you do not make frames egocentric (which amplifies erroneous tracking), this is helpful for making feature representations orientation-independent. Another argument of `FeatureExtraction.compute_behavioral_representations()` is `norm`, which determines the type of frame normalization (such as `L1`, `L2`, `max`). `L1` normalization is recommended.
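As a minimal illustration of `L1` frame normalization (independent of basty's internals, with a made-up feature matrix), each frame's feature vector is scaled to unit sum, so it can be read as a discrete probability distribution:

```python
import numpy as np

# Hypothetical behavioral representations: 4 frames x 5 wavelet features.
reprs = np.abs(np.random.default_rng(1).normal(size=(4, 5)))

# L1 frame normalization: divide each row (frame) by its L1 norm.
row_sums = reprs.sum(axis=1, keepdims=True)
l1_normalized = reprs / row_sums

# Every frame now sums to 1, i.e., behaves like a probability distribution,
# which is why a probabilistic metric like Hellinger is natural downstream.
print(np.allclose(l1_normalized.sum(axis=1), 1.0))  # True
```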
Detecting Activities
The next stage of the pipeline detects activity bouts and sleep epochs (i.e., long bouts); this stage is called "experiment outlining" in the pipeline. It consists of two steps, namely "detecting dormant epochs" (`--outline-dormant-epochs`) and "detecting active bouts" (`--outline-active-bouts`).
```shell
python outline_experiments.py --main-cfg-path /path/to/main_cfg.yaml --outline-all
python outline_experiments.py --main-cfg-path /path/to/main_cfg.yaml \
    --outline-dormant-epochs --outline-active-bouts
```
Both of these steps can be performed in an unsupervised or supervised fashion. You can determine which approach to follow with the boolean parameter `use_supervised_learning`. It is recommended to use the unsupervised approach for outlining dormant epochs and supervised learning for detecting active bouts. The `label_conversion_dict` parameter is used to convert annotations into more general categories for supervised learning. For instance, in the `label_conversion_dict` configuration below for `ExptDormantEpochs.outline_dormant_epochs`, label `0` corresponds to dormancy and label `1` corresponds to the aroused state.
```python
label_conversion_dict = {
    0: [
        "Idle&Other",
        "HaltereSwitch",
        "Feeding",
        "Grooming",
        "ProboscisPump",
    ],
    1: [
        "Moving",
    ],
    2: [
        "Noise",
    ],
}
```
Similarly, for `ExptDormantEpochs.outline_active_bouts`, we can configure it as follows, where `0` stands for quiescence and `1` for micro-activities.
```python
label_conversion_dict = {
    0: [
        "Idle&Other",
    ],
    1: [
        "Feeding",
        "Grooming",
        "ProboscisPump",
        "HaltereSwitch",
        "Moving",
    ],
    2: [
        "Noise",
    ],
}
```
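The conversion itself amounts to a reverse lookup from annotation string to coarse label. A minimal sketch (not basty's code, with made-up per-frame annotations) might look like:

```python
# Mirrors the configuration above for active-bout detection.
label_conversion_dict = {
    0: ["Idle&Other"],
    1: ["Feeding", "Grooming", "ProboscisPump", "HaltereSwitch", "Moving"],
    2: ["Noise"],
}

# Invert it into a string -> coarse-label lookup table.
annotation_to_label = {
    name: label
    for label, names in label_conversion_dict.items()
    for name in names
}

# Hypothetical per-frame annotations from an annotated experiment.
annotations = ["Idle&Other", "Grooming", "Noise", "Moving"]
coarse_labels = [annotation_to_label[a] for a in annotations]
print(coarse_labels)  # [0, 1, 2, 1]
```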
Mapping Activities to Behavioral Embedding Spaces
The Python script `map_behaviors.py` provides some critical functionality for behavior mapping, together with many redundant and impractical features. We will use `map_behaviors.py` to compute behavioral embedding spaces with different approaches (e.g., joint and disparate behavioral embeddings, using supervised or unsupervised dimensionality reduction). The other, not-so-useful functionalities this script provides are related to clustering and the analysis of clusters. As basty evolved into a supervised or semi-supervised pipeline, we did not evaluate the performance of the clustering-related components.
Our behavioral nearest neighbor prediction scheme requires the computation of behavioral embedding spaces called semi-supervised pair embeddings. In this approach, we run a semi-supervised extension of UMAP on each pair of annotated and unannotated behavioral experiments. As a result, for $R^{+}$ annotated and $R^{-}$ unannotated experiments, $R^{+} \times R^{-}$ low-dimensional behavioral spaces are generated.
```shell
python map_behaviors.py --main-cfg-path /path/to/main_cfg.yaml --compute-semisupervised-pair-embeddings
```
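The pairing can be pictured with a simple enumeration; the experiment names below are placeholders:

```python
from itertools import product

annotated = ["expt_A", "expt_B"]               # R+ = 2 annotated experiments
unannotated = ["expt_X", "expt_Y", "expt_Z"]   # R- = 3 unannotated experiments

# One semi-supervised pair embedding is computed per (annotated, unannotated) pair.
pairs = list(product(annotated, unannotated))
print(len(pairs))  # 6, i.e., R+ x R- embedding spaces
```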
For more detailed and accurate explanations of the dimensionality reduction parameters, please see the UMAP documentation. The defaults below should work well in most cases. We recommend sticking to the Hellinger distance: as behavioral representations are $\textnormal{L}_1$-normalized vectors, treating them as probability distributions makes more sense.
```python
UMAP_kwargs = {}
UMAP_kwargs["embedding_n_neighbors"] = 75
UMAP_kwargs["embedding_min_dist"] = 0.0
UMAP_kwargs["embedding_spread"] = 1.0
UMAP_kwargs["embedding_n_components"] = 2
UMAP_kwargs["embedding_metric"] = "hellinger"
UMAP_kwargs["embedding_low_memory"] = False

use_annotations_to_mask = (True, False)
embedding_kwargs = {
    **UMAP_kwargs,
    "use_annotations_to_mask": use_annotations_to_mask,
}
```
The parameter `use_annotations_to_mask` is a pair of boolean values, which controls the subset of frames used in the computation of behavioral embeddings. If the first value is `True`, then only the annotated frames of the supervised experiment are used. Otherwise, all frames detected as dormant and active in the previous stage of the pipeline are used. The same holds for the unsupervised experiment with respect to the second value of the tuple, but only if `evaluation_mode` is `true` in the main configuration (as a matter of fact, an unsupervised experiment is supposed to lack annotations in the first place).
The parameters below are for density-based clustering. See the HDBSCAN documentation if you want to experiment with clustering in the behavioral embedding spaces.
```python
HDBSCAN_kwargs = {}
HDBSCAN_kwargs["prediction_data"] = True
HDBSCAN_kwargs["approx_min_span_tree"] = True
HDBSCAN_kwargs["cluster_selection_method"] = "eom"
HDBSCAN_kwargs["cluster_selection_epsilon"] = 0.0
HDBSCAN_kwargs["min_cluster_size"] = 500
HDBSCAN_kwargs["min_cluster_samples"] = 5

clustering_kwargs = {**HDBSCAN_kwargs}
```
Computing Behavioral Scores & Predicting Behavioral Categories
Lastly, the `predict_behavioral_categories.py` script computes behavioral scores and predicts the corresponding behavioral categories by performing a nearest neighbor based analysis in the semi-supervised pair embeddings. It can be configured using various command line arguments. To see the available arguments and their possible values, run the command below.
```shell
python predict_behavioral_categories.py --help
```
The output is the following.
```text
usage: predict_behavioral_categories.py [-h] --main-cfg-path MAIN_CFG_PATH --num-neighbors NUM_NEIGHBORS [--neighbor-weights {distance,sq_distance}]
                                        [--neighbor-weights-norm {count,log_count,sqrt_count,proportion}] [--activation {softmax,standard}]
                                        [--voting-weights {entropy,uncertainity}] [--voting {hard,soft}] [--save-weights | --no-save-weights]
```
Here, as the name suggests, `--num-neighbors` is the number of nearest neighbors contributing to the behavioral score. This value typically lies in the wide interval of $5$ to $100$. However, as the value of `--num-neighbors` increases, imbalanced classes (if they exist) become a more serious problem. As a matter of fact, behavioral categories usually have very imbalanced occurrence distributions. To deal with this, one might want to use the `--neighbor-weights-norm` option. For instance, `--neighbor-weights-norm log_count` normalizes each score by the $\log$ number of occurrences of the corresponding behavioral category. By default, this script does not apply any such normalization. The argument `--neighbor-weights` can be used to weight points by the inverse of their distance, i.e., closer neighbors have a greater influence than neighbors further away. Before combining the behavioral weights contributed by each annotated experiment, one may also want to map those weights to the interval $(0,1)$ to avoid complications related to differences in scale. This can be achieved with `--activation softmax` or `--activation standard`. The other two arguments, `--voting-weights` and `--voting`, configure how the behavioral weights contributed by different annotated experiments are combined into a final behavioral score. If `--voting soft` is passed, the behavioral weights are directly included in the summation. If `--voting hard` is passed, the behavioral weights are binarized before the summation, the maximum category becoming $1$ and the others $0$. It is also possible to adjust the contribution of each annotated experiment with respect to its level of uncertainty by using `--voting-weights`. In other words, if a behavioral weight mainly favors one category and is certain about it, its contribution has a greater influence on the final score.
A sensible example list of arguments for `predict_behavioral_categories.py` is given below.
```shell
python predict_behavioral_categories.py --main-cfg-path /path/to/main_cfg.yaml \
    --num-neighbors 15 --neighbor-weights distance --neighbor-weights-norm log_count \
    --activation standard --voting soft
```
Finally, a report of the predictions can be generated with the `export_behavior_predictions.py` script as below.
```shell
python export_behavior_predictions.py --main-cfg-path /path/to/main_cfg.yaml
```