Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics (2024)

Ethical compliance

All experimental procedures were approved by the Harvard Medical School Institutional Animal Care and Use Committee (protocol number 04930) and were performed in compliance with the ethical regulations of Harvard University as well as the Guide for Animal Care and Use of Laboratory Animals.

Animal care and behavioral experiments

Unless otherwise noted, behavioral recordings were performed on 8–16-week-old C57/BL6 mice (The Jackson Laboratory stock no. 000664). Mice were transferred to our colony at 6–8 weeks of age and housed in a reverse 12-h light/12-h dark cycle. We single-housed mice after stereotactic surgery and group-housed them otherwise. On recording days, mice were brought to the laboratory, habituated in darkness for at least 20 min, and then placed in the behavioral arena for 30–60 min. We recorded 6 male mice for 10 sessions (6 h) in the initial round of open field recordings; 5 male mice for 52 sessions (50 h) during the accelerometry recordings; 16 male mice for 16 sessions (8 h) during the environmental enrichment experiment; and 5 male mice for 9 sessions (6 h) during the thermistor recordings. The dopamine photometry recordings were obtained from a recent study22. They include 6 C57/BL6 mice and 8 DAT-IRES-cre (The Jackson Laboratory stock no. 006660) mice of both sexes, recorded for 378 sessions. Of these, we selected a random subset of 95 sessions (~50 h) for benchmarking keypoint-MoSeq.

Stereotactic surgery procedures

For all stereotactic surgeries, mice were anesthetized using 1–2% isoflurane in oxygen, at a flow rate of 1 l min−1 for the duration of the procedure. Anteroposterior (AP) and mediolateral (ML) coordinates (in millimeters) were zeroed relative to bregma, and the dorsoventral (DV) coordinate was zeroed relative to the pial surface. All mice were monitored daily for 4 days following surgery and were allowed to recover for at least 1 week. Mice were then habituated to handling and brief head-fixation before beginning recordings.

For dopamine recordings, 400 nl of AAV5.CAG.dLight1.1 (Addgene, 111067; titer: 4.85 × 1012) was injected at a 1:2 dilution into the DLS (AP 0.260; ML 2.550; DV −2.40), and a single 200-μm-diameter, 0.37–0.57-NA fiber cannula was implanted 200 μm above the injection site (see ref. 22 for additional details).

For accelerometry recordings, we surgically attached a Mill-Max connector (DigiKey, ED8450-ND) and head bar to the skull and secured it with dental cement (Metabond). A nine-degree-of-freedom absolute orientation IMU (Bosch, BNO055) was mounted on the Mill-Max connector using a custom printed circuit board (PCB) with a net weight below 1 g.

For thermistor surgeries, we adapted a protocol previously described37. We first prepared the implant (GAG22K7MCD419, TE Connectivity) by stripping the leads and soldering them to two male Mill-Max pins (0.05-inch pitch, 851-93-050-10-001000). The pins and their solder joints were then entirely covered in Prime-Dent light-curable cement, and cured for 10–20 s, to ensure the longevity and stability of the electrical connection. Each implant was tested by touching two leads of a multimeter (set to measure resistance) to the female side of the Mill-Max, breathing gently on the thermistor, and checking for a resistance drop from roughly 20 kΩ to 18 kΩ.

To implant the thermistor, a midline incision was made from ~1 mm behind lambda to ~1 mm anterior to the nasal suture, and the skull cleaned and lightly scored. A craniotomy was made just anterior to the nasal suture (well posterior to the position originally reported37), large enough for the thermistor to fit fully inside. The thermistor was fully inserted along the AP axis so that it lay flat in the horizontal plane inside the nasal cavity. The craniotomy was then sealed with KwikSil, and the thermistor wire was secured to the skull 1–2 mm posterior to the craniotomy with cyanoacrylate glue (Loctite 454). Then dental cement (Metabond) was used to attach the Mill-Max connector in an upright position between bregma and lambda, and a head bar was cemented to the skull at lambda.

Microsoft Azure recording setup

For the initial set of open field recordings (Figs. 1, 2, 3a–g and 5g–l), mice were recorded in a square arena with transparent floor and walls (30 cm length and width). Microsoft Azure Kinect cameras captured simultaneous depth and near-IR video at 30 Hz. Six cameras were used in total: one above, one below and four side cameras at right angles at the same height as the mouse.

Accelerometry recordings

For the accelerometry recordings, we used a single Microsoft Azure Kinect camera placed above the mouse, and an arena with transparent floor and opaque circular walls (45-cm diameter). Data were transferred from the IMU using a lightweight tether attached to a custom-built active commutator. The IMU was connected to a Teensy microcontroller, which was programmed using the Adafruit BNO055 library with default settings (sample rate: 100 Hz, units: m/s2). To synchronize the IMU measurements and video recordings, we used an array of near-IR LEDs to display a rapid sequence of random 4-bit codes that updated throughout the recording. The code sequence was later extracted from the behavioral videos and used to fit a piecewise linear model between timestamps from the videos and timestamps from the IMU.

Thermistor recordings

To record mouse respiration and movement at high frame rates, we built a multi-camera recording arena using six Basler ace acA1300-200um Monochrome USB 3.0 Cameras (Edmund Optics, 33-978) that recorded from above, from below and from four side views. The cameras were triggered at 120 Hz using an Arduino. Video compression was performed in real time on a GPU using a custom library (https://github.com/calebweinreb/multicamera_acquisition/). Mice were recorded in an open-top glass cube and illuminated with 32 near-IR high-power LED stars (LEDSupply, CREEXPE-FRD-3). To avoid reflections and saturation effects, the bottom camera was triggered slightly out of phase with the top cameras, and the LEDs were split into two groups: one group below the arena that turned on during the bottom camera’s exposure, and one group above the arena that turned on during the top and side cameras’ exposure.

To record the thermistor signal, we designed a custom PCB that used an op-amp (INA330AIDGST, Texas Instruments) to transform the thermistor’s resistance fluctuations into voltages, and another circuit element to keep the voltage within the 0–3.3 V range. The PCB was connected to an Arduino (separate from the one controlling the cameras) that recorded the output. The PCB parts list, schematic and microcontroller code are available upon reasonable request to the laboratory of S.R.D.

Before behavioral recording sessions with the thermistor, mice were briefly head-fixed, and a cable with a custom headstage was inserted into the head-mounted Mill-Max adaptor. The cable was commutated with an assisted electric commutator from Doric Lenses and connected to the input of the op-amp on the custom PCB. To synchronize the thermistor and video data, we piped a copy of the camera trigger signal from the camera-Arduino to the thermistor-Arduino and recorded this signal alongside the thermistor output.

Environmental enrichment recordings

To test the effects of environmental enrichment on behavior, we built an arena for overhead video recording of an open-topped home cage. The home cage was surrounded on each side by a 16-inch vertical barrier, illuminated from above by three near-IR LED stars (LEDSupply, CREEXPE-FRD-3) and recorded with a Basler ace acA1300-200um Monochrome USB 3.0 Camera (Edmund Optics, 33-978). For half the recordings, the cage was filled with bedding, nesting material, chew sticks and a transparent, dome-shaped hut. For the other half, the cage was completely empty (except for the mouse).

Software

The following publicly available software packages were used for analysis: Python (version 3.8), NumPy (version 1.24.3), Scikit-learn (version 1.2.2), PyTorch (version 1.9), Jax (version 0.3.22), SciPy (version 1.10.1), Matplotlib (version 3.7.1), Statsmodels (version 0.13.5), Motionmapperpy (version 1.0), DeepLabCut (version 2.2.1), SLEAP (version 1.2.3), B-SOiD (version 1.5.1), VAME (version 1.1), GIMBAL (version 0.0.1), HRNet (unversioned), LightningPose (version 0.0.4) and segmentation_models_pytorch (version 0.3.3).

Statistics

All reported P values for comparisons between distributions were derived from Mann–Whitney U tests unless stated otherwise. In all comparisons to ‘shuffle’, the shuffle represents a cyclic permutation of the data.

Processing depth videos

Applying MoSeq to depth videos involves: (1) mouse tracking and background subtraction; (2) egocentric alignment and cropping; (3) PCA; and (4) probabilistic modeling. We applied steps 2–4 as described in the MoSeq2 pipeline25. For step 1, we trained a convolutional neural network (CNN) with a Unet++44 architecture to segment the mouse from background using ~5,000 hand-labeled frames as training data.

Keypoint tracking for Microsoft Azure IR recordings

We used CNNs with an HRNet45 architecture (https://github.com/stefanopini/simple-HRNet/) with a final stride of two for pose tracking. The networks were trained on ~1,000 hand-labeled frames each for the overhead, below-floor and side-view Microsoft Azure cameras. Frame labeling was crowdsourced through a commercial service (Scale AI). The crowdsourced labels were comparable to those from experts in our laboratory (Extended Data Fig. 1d). For the overhead camera, we tracked two ears and six points along the dorsal midline (tail base, lumbar spine, thoracic spine, cervical spine, head and nose). For the below-floor camera, we tracked the tip of each forepaw, the tip and base of each hind paw, and four points along the ventral midline (tail base, genitals, abdomen and nose). For the side cameras, we tracked the same eight points as for the overhead camera and included the six limb points that were used for the below-floor camera (14 total). We trained a separate CNN for each camera angle. Target activations were formed by centering a Gaussian with a 10-pixel (px) standard deviation on each keypoint. We used the location of the maximum pixel in each output channel of the neural network to determine keypoint coordinates and used the value at that pixel to set the confidence score. The resulting mean absolute error (MAE) between network detections and manual annotations was 2.9 px for the training data and 3.2 px for held-out data. We also trained DeepLabCut and SLEAP models on the overhead-camera and below-floor-camera datasets. For DeepLabCut, we used version 2.2.1, setting the architecture to resnet50 and the ‘pos_dist_thresh’ parameter to 10, resulting in train and test MAEs of 3.4 px and 3.8 px, respectively. For SLEAP, we used version 1.2.3 with the baseline_large_rf.single.json configuration, resulting in train and test MAEs of 3.5 px and 4.7 px. For Lightning Pose40, we used version 0.0.4 and default parameters with ‘pca_singleview’ and ‘temporal’ loss terms.

Keypoint tracking for thermistor recordings

We trained separate keypoint detection networks for the Basler camera arena (used for the thermistor recordings). CNNs with an HRNet architecture were trained on ~1,000 hand-labeled frames each for the overhead and below-floor cameras and ~3,000 hand-labeled frames for the side-view cameras. The same keypoints were tracked as for the Microsoft Azure dataset.

3D pose inference

Using 2D keypoint detections from six cameras, 3D keypoint coordinates were triangulated and then refined using GIMBAL, a model-based approach that leverages anatomical constraints and motion continuity29. To fit GIMBAL, we computed initial 3D keypoint estimates using robust triangulation (that is, by taking the median across all camera pairs, as in 3D-DeepLabCut46) and then filtered to remove outliers using the EllipticEnvelope method from sklearn; we then fit the skeletal parameters and directional priors for GIMBAL using expectation maximization with 50 pose states. Finally, we applied the fitted GIMBAL model to each recording, using the following parameters for all keypoints: obs_outlier_variance = 1e6, obs_inlier_variance = 10, pos_dt_variance = 10. The latter parameters were chosen based on the accuracy of the resulting 3D keypoint estimates, as assessed from visual inspection. Camera calibration and initial triangulation were performed using a custom library (https://github.com/calebweinreb/multicam-calibration/tree/main/multicam_calibration/).
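As an illustrative sketch of these two steps (the triangulate_pair helper stands in for a calibrated two-view triangulation routine, and the contamination level is an assumed value rather than one from our pipeline):

```python
import itertools
import numpy as np
from sklearn.covariance import EllipticEnvelope

def robust_triangulation(points_2d, triangulate_pair):
    """Median-based ('robust') triangulation across all camera pairs.

    points_2d : dict mapping camera index -> (T, K, 2) keypoint detections.
    triangulate_pair : placeholder for a calibrated two-view triangulation
        routine returning a (T, K, 3) array for one camera pair.
    """
    cameras = sorted(points_2d)
    estimates = [triangulate_pair(points_2d[i], points_2d[j], i, j)
                 for i, j in itertools.combinations(cameras, 2)]
    return np.median(np.stack(estimates), axis=0)   # consensus across pairs

def outlier_pose_mask(coords_3d, contamination=0.05):
    """Flag outlier poses with sklearn's EllipticEnvelope (True = outlier)."""
    flat = coords_3d.reshape(len(coords_3d), -1)    # one pose vector per frame
    labels = EllipticEnvelope(contamination=contamination).fit_predict(flat)
    return labels == -1
```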

Keypoint change score

We defined the keypoint ‘change score’ as the total velocity of keypoints after egocentric alignment. The goal of the change score is to highlight sudden shifts in pose. It was calculated by: (1) transforming keypoints into egocentric coordinates; (2) smoothing the transformed coordinates with a Gaussian kernel (sigma = 1 frame); (3) calculating the total change in coordinates across each frame; and (4) z-scoring. Formally, the score can be defined as:

$${\rm{Change}}\,{\rm{score}}\left(t\right)={z\,{\mathrm{score}}}(\left|\;{y}_{{t}}-{y}_{{t-1}}\right|)$$

where yt are the keypoint coordinates after Gaussian smoothing.
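A minimal sketch of this calculation, assuming the keypoints have already been transformed into egocentric coordinates:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import zscore

def change_score(egocentric_coords, sigma=1):
    """Z-scored total frame-to-frame change of smoothed, egocentrically
    aligned keypoints (steps 2-4 above); egocentric_coords is (T, K, D)."""
    smoothed = gaussian_filter1d(egocentric_coords, sigma, axis=0)  # sigma = 1 frame
    change = np.abs(np.diff(smoothed, axis=0)).sum(axis=(1, 2))     # |y_t - y_{t-1}| summed
    return zscore(change)
```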

Spectral analysis of keypoint jitter

To analyze keypoint jitter, we quantified the magnitude of fluctuations across a range of frequencies by computing a spectrogram for each keypoint along each coordinate axis. Spectrograms were computed using the python function scipy.signal.spectrogram with nperseg = 128 and noverlap = 124. The spectrograms were then combined through averaging: each keypoint was assigned a spectrogram by averaging over the two coordinate axes, and the entire animal was assigned a spectrogram by averaging over all keypoints.
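A sketch of the spectrogram computation and averaging, assuming a 30-Hz frame rate and (T, K, D) keypoint coordinates:

```python
import numpy as np
from scipy.signal import spectrogram

def keypoint_spectrograms(coords, fs=30, nperseg=128, noverlap=124):
    """Spectrograms per keypoint (averaged over coordinate axes) and for the
    whole animal (averaged over keypoints)."""
    T, K, D = coords.shape
    per_keypoint = []
    for k in range(K):
        axes = []
        for d in range(D):
            freqs, times, Sxx = spectrogram(coords[:, k, d], fs=fs,
                                            nperseg=nperseg, noverlap=noverlap)
            axes.append(Sxx)
        per_keypoint.append(np.mean(axes, axis=0))
    per_keypoint = np.stack(per_keypoint)          # (K, n_freqs, n_times)
    whole_animal = per_keypoint.mean(axis=0)
    return freqs, times, per_keypoint, whole_animal
```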

We used the keypoint-specific spectrograms to calculate cross-correlations with −log10 (neural network detection confidence), as well as the ‘error magnitude’ (Fig. 1g). Error magnitude was defined as the distance between the detected 2D location of a keypoint (based on a single camera angle) and a re-projection of its 3D position (based on consensus across six camera angles; see ‘3D pose inference’ above). We also computed the cross-correlation between nose and tail-base fluctuations at each frequency, as measured by the overhead and below-floor cameras, respectively. Finally, we averaged spectral power across keypoints to compute the cross-correlation with model transition probabilities (Fig. 1g). The model transition probabilities were defined for each frame as the fraction of N = 20 model fits in which a transition occurred on that frame. Formally, if z(i) denotes the syllable sequence learned by model fit i, then the transition probability at time t is calculated as

$$\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\delta \left({z}_{{t}}^{\left({i}\right)}\ne {z}_{{t-1}}^{\left({i}\right)}\right)$$

Applying keypoint-MoSeq

Datasets were modeled separately and multiple models with different random seeds were fit for each dataset (see Supplementary Table 1 for number of fits per dataset).

Modeling consisted of two phases: (1) fitting an AR-HMM to a fixed pose trajectory derived from PCA of egocentric-aligned keypoints; and (2) fitting a full keypoint-MoSeq model initialized from the AR-HMM. References in the text to ‘MoSeq applied to keypoints’ or ‘MoSeq (keypoints)’, for example, in Figs. 1 and 2, refer to output of step 1. Both steps are described below, followed by a detailed description of the model and inference algorithm in the ‘mathematical notation’ section. In all cases, we excluded rare states (frequency < 0.5%) from downstream analysis. We have made the code available as a user-friendly package via https://keypoint-moseq.readthedocs.io/en/latest/. With a consumer GPU, keypoint-MoSeq requires 30–60 min of computation time to model 5 h of data. The computation time scales linearly with dataset size.

Fitting an initial AR-HMM

We first modified the keypoint coordinates, defining keypoints with confidence below 0.5 as missing data and imputing their values via linear interpolation, and then augmenting all coordinates with a small amount of random noise; the noise values were uniformly sampled from the interval [−0.1, 0.1] and helped prevent degeneracy during model fitting. Importantly, these preprocessing steps were only applied during AR-HMM fitting—the original coordinates were used when fitting the full keypoint-MoSeq model.
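A minimal sketch of this preprocessing step, assuming (T, K, D) coordinates and (T, K) confidences:

```python
import numpy as np

def preprocess_for_arhmm(coords, confidences, threshold=0.5, jitter=0.1, seed=0):
    """Interpolate low-confidence keypoints and add small uniform noise
    (used only for the initial AR-HMM fit)."""
    rng = np.random.default_rng(seed)
    out = coords.copy()
    frames = np.arange(len(coords))
    for k in range(coords.shape[1]):
        missing = confidences[:, k] < threshold           # treat as missing data
        if missing.any() and (~missing).any():
            for d in range(coords.shape[2]):
                out[missing, k, d] = np.interp(
                    frames[missing], frames[~missing], coords[~missing, k, d])
    out += rng.uniform(-jitter, jitter, size=out.shape)   # prevents degeneracy
    return out
```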

Next, we centered the coordinates on each frame, aligned them using the tail–nose angle, and then transformed them using PCA with whitening. The number of principal components (PCs) was chosen for each dataset as the minimum required to explain 90% of total variance. This resulted in four PCs for the overhead-camera 2D datasets, six PCs for the below-floor camera 2D datasets and six PCs for the 3D dataset.
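An illustrative sketch of the alignment and projection for 2D data (the tail and nose indices and the rotation convention are assumptions; the keypoint-MoSeq package implements this internally):

```python
import numpy as np
from sklearn.decomposition import PCA

def align_and_project(coords, tail_ix, nose_ix, variance_cutoff=0.9):
    """Center, align along the tail-nose axis, and whiten with PCA, keeping the
    minimum number of PCs explaining 90% of variance; coords is (T, K, 2)."""
    centered = coords - coords.mean(axis=1, keepdims=True)
    direction = centered[:, nose_ix] - centered[:, tail_ix]
    angle = np.arctan2(direction[:, 1], direction[:, 0])
    c, s = np.cos(angle), np.sin(angle)
    # per-frame rotations mapping the tail-nose direction onto +x (row-vector convention)
    R = np.stack([np.stack([c, -s], -1), np.stack([s, c], -1)], -2)   # (T, 2, 2)
    aligned = np.einsum('tkd,tde->tke', centered, R)
    flat = aligned.reshape(len(coords), -1)
    explained = np.cumsum(PCA().fit(flat).explained_variance_ratio_)
    n_pcs = int(np.searchsorted(explained, variance_cutoff)) + 1
    return PCA(n_components=n_pcs, whiten=True).fit_transform(flat)
```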

We then used Gibbs sampling to infer the states and parameters of an AR-HMM, including the state sequence z, the autoregressive parameters A, b and Q, and the transition parameters π and β. The hyperparameters for this step, listed in ‘mathematical notation’ below, were generally identical to those in the original depth MoSeq model. The one exception was the stickiness hyperparameter κ, which we adjusted separately for each dataset to ensure a median state duration of 400 ms.

Fitting a full keypoint-MoSeq model

We next fit the full set of variables for keypoint-MoSeq, which include the AR-HMM variables mentioned above, as well as the location v and heading h, latent pose trajectory x, per-keypoint noise level σ2 and per-frame/per-keypoint noise scale s. Fitting was performed using Gibbs sampling for 500 iterations, at which point the log joint probability appeared to have stabilized.

The hyperparameters for this step are enumerated in ‘mathematical notation’. In general, we used the same hyperparameter values across datasets. The two exceptions were the stickiness hyperparameter κ, which again had to be adjusted to maintain a median state duration of 400 ms, and s0, which determines a prior on the noise scale. Because low-confidence keypoint detections often have high error, we set s0 using a logistic curve that transitions between a high-noise regime (s0 = 100) for detections with low confidence and a low-noise regime (s0 = 1) for detections with high confidence:

$${s}_{0}=1+100{\left(1+{\rm{e}}^{20\left({\rm{confidence}}-0.4\right)}\right)}^{-1}$$
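In code, this mapping from detection confidence to prior noise scale is simply:

```python
import numpy as np

def local_noise_prior(confidence):
    """Prior noise scale s0 as a function of detection confidence: ~100 for
    low-confidence detections, ~1 for high-confidence detections."""
    return 1 + 100 / (1 + np.exp(20 * (confidence - 0.4)))
```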

The κ value used for each dataset is reported in Supplementary Table 2.

Trajectory plots

To visualize the modal trajectory associated with each syllable (Fig. 2e), we (1) computed the full set of trajectories for all instances of all syllables, (2) used a local density criterion to identify a single representative instance of each syllable and (3) computed a final trajectory using the nearest neighbors of the representative trajectory.

Computing the trajectory of individual syllable instances

Let yt, vt and ht denote the keypoint coordinates, centroid and heading of the mouse at time t, and let F(v, h; y) denote the rigid transformation that egocentrically aligns y using centroid v and heading h. Given a syllable instance with onset time T, we computed the corresponding trajectory XT by centering and aligning the sequence of poses \(({y}_{{T-5}},\ldots ,{y}_{{T+15}})\) using the centroid and heading at time T. In other words

$${X}_{{T}}=\left[F\left({v}_{{T}},{h}_{{T}}{\rm{;}}{y}_{{T-5}}\right),\ldots ,F\left({v}_{{T}},{h}_{{T}}{\rm{;}}{y}_{{T+15}}\right)\right]$$
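A minimal sketch of this computation for 2D keypoints, assuming per-frame centroid and heading arrays:

```python
import numpy as np

def instance_trajectory(coords, centroids, headings, onset, pre=5, post=15):
    """Trajectory X_T for one syllable instance: poses y_{T-5}, ..., y_{T+15}
    centered and rotated using the centroid and heading at onset time T.
    coords (T, K, 2), centroids (T, 2), headings (T,) in radians."""
    window = coords[onset - pre: onset + post + 1]
    v, h = centroids[onset], headings[onset]
    c, s = np.cos(h), np.sin(h)
    R = np.array([[c, -s], [s, c]])   # maps the heading direction onto +x (row vectors)
    return (window - v) @ R
```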

Identifying a representative instance of each syllable

The collection of trajectories computed above can be thought of as a set of points in a high dimensional trajectory space (for K keypoints in 2D, this space would have dimension 40K). Each point has a syllable label, and the segregation of these labels in the trajectory space represents the kinematic differences between syllables. To capture these differences, we computed a local probability density function for each syllable, and a global density function across all syllables. We then selected a representative trajectory X for each syllable by maximizing the ratio:

$$\frac{{\rm{Local}}\, {\rm{density}}(X)}{{\rm{Global}}\, {\rm{density}}(X)}$$

The density functions were computed as the mean distance from each point to its 50 nearest neighbors. For the global density, the nearest neighbors were selected from among all instances of all syllables. For the local densities, the nearest neighbors were selected from among instances of the target syllable.

Computing final trajectories for each syllable

For each syllable and its representative trajectory X, we identified the 50 nearest neighbors of X from among other instances of the same syllable and then computed a final trajectory as the mean across these nearest neighbors. The trajectory plots in Fig. 2e consist of ten evenly spaced poses along this trajectory, that is, the poses at times \(T-5,{T}-3,\ldots ,T+13\).

Testing robustness to missing data

To test the ability of keypoint-MoSeq to infer syllables and sequences in the face of missing data, we artificially ablated random subsets of keypoints at randomly timed intervals and then modeled the ablated data (Extended Data Fig. 2c–f). The ablation intervals began on every 10th second of the recording and lasted between 33 ms and 3 s (uniformly at random). For each interval, anywhere between 1 and 8 keypoints were selected (uniformly at random). Ablation entailed (1) erasing the keypoint coordinates and then filling the gap by linear interpolation; (2) setting the corresponding confidence values to 0. We then applied keypoint-MoSeq 20 times with different random seeds, using a single, fixed set of parameters derived previously from standard model fitting on the unablated dataset. Fixing the parameters ensured that syllable labels would be comparable across repeated model fits.

Cross-syllable likelihoods

We defined each cross-syllable likelihood as the probability (on average) that instances of one syllable could have arisen based on the dynamics of another syllable. The probabilities were computed based on the discrete latent states zt, continuous latent states xt and autoregressive parameters A, b and Q output by keypoint-MoSeq. The instances I(n) of syllable n were defined as the set of all sequences \(({t}_{\rm{s}},\ldots ,{t}_{\rm{e}})\) of consecutive timepoints such that zt = n for all \({t}_{\rm{s}}\le t\le {t}_{\rm{e}}\) and \({z}_{{t}_{\rm{s}}-1}\ne n\ne {z}_{{t}_{\rm{e}}+1}\). For each such instance, one can calculate the probability \(P\left({x}_{{t}_{\rm{s}}},\ldots ,{x}_{{t}_{\rm{e}}}\right|{A}_{{m}},{b}_{{m}},{Q}_{{m}})\) that the corresponding sequence of latent states arose from the autoregressive dynamics of syllable m. The cross-syllable likelihood Cnm is defined in terms of these probabilities as

$${C}_{{nm}}=\frac{1}{\left|I(n)\right|}\sum _{\left({{t}}_{\rm{s}},\ldots ,{{t}}_{\rm{e}}\right)\in {I}({{n}})}\frac{P\left({x}_{{t}_{\rm{s}}},\ldots ,{x}_{{t}_{\rm{e}}}\,|\,{A}_{{m}},{b}_{{m}},{Q}_{{m}}\right)}{P\left({x}_{{t}_{\rm{s}}},\ldots ,{x}_{{t}_{\rm{e}}}\,|\,{A}_{{n}},{b}_{{n}},{Q}_{{n}}\right)}$$

Generating synthetic keypoint data

To generate the synthetic keypoint trajectories used for Extended Data Fig. 2h, we fit a linear dynamical system (LDS) to egocentrically aligned keypoint trajectories and then sampled randomly generated outputs from the fitted model. The LDS was identical to the model underlying keypoint-MoSeq (see ‘mathematical notation’), except that it only had one discrete state, lacked centroid and heading variables and allowed separate noise terms for the x and y coordinates of each keypoint.

Expected marginal likelihood score

Because keypoint-MoSeq can at best produce point estimates of the model parameters—which will differ from run to run—users typically run the model several times and then rank the resulting fits. For ranking model fits, we defined a custom metric called the expected marginal likelihood score. The score evaluates a given set of autoregressive parameters (A, b, Q) by the expected value of the marginal log likelihood: \({E}_{{x} \sim P\left(x|\;y\right)}\log P\left(x|A,b,Q\right)\). In practice, given an ensemble of pose trajectories x(i) and parameters \({\theta }^{(i)}=(A,b,Q)\) derived from N separate MCMC chains, the scores are computed as:

$${\rm{Score}}\left({\theta }^{\left(i\right)}\right)=\frac{1}{N-1}\sum _{j\ne i}\log P\left({x}^{\left(\,j\right)}\,|\,{\theta }^{(i)}\right)$$

The scores shown in Extended Data Fig. 3j–m were computed using an ensemble of N = 20 chains. We chose this custom score instead of a more standard metric (such as held-out likelihood) because computing the latter is intractable for the keypoint-MoSeq model.

Environmental enrichment analysis

We fit a single keypoint-MoSeq model to the environmental enrichment dataset, which included recordings in an enriched home cage and control recordings in an empty cage. The transition graph (Extended Data Fig. 8b) was generated with keypoint-MoSeq’s analysis pipeline (https://keypoint-moseq.readthedocs.io/en/latest/analysis.html#syllable-transition-graph/) using node positions from a force-directed layout. Detection of differentially used syllables was also performed using the analysis pipeline, which applies a Kruskal–Wallis test for significant differences in the per-session frequency of each syllable (https://keypoint-moseq.readthedocs.io/en/latest/analysis.html#compare-between-groups/). Syllables were clustered into three groups by applying community detection (networkx.community.louvain_communities) to a complete graph whose nodes were syllables and whose edges were weighted by the bigram probabilities \({b}_{ij}=P({z}_{t}=i,\,{z}_{t+1}=j)\).
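An illustrative sketch of the clustering step, assuming integer syllable labels (note that Louvain chooses the number of communities automatically, so three groups are not guaranteed):

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def syllable_communities(z, n_syllables, seed=0):
    """Louvain communities on a complete graph of syllables with edges
    weighted by bigram probabilities; z is a 1D array of per-frame labels."""
    bigram = np.zeros((n_syllables, n_syllables))
    for i, j in zip(z[:-1], z[1:]):
        bigram[i, j] += 1                       # counts of (z_t = i, z_{t+1} = j)
    bigram /= bigram.sum()

    G = nx.Graph()
    for i in range(n_syllables):
        for j in range(i + 1, n_syllables):
            G.add_edge(i, j, weight=bigram[i, j] + bigram[j, i])
    return louvain_communities(G, weight='weight', seed=seed)
```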

Applying published methods for behavior analysis

We applied B-SOiD, VAME and MotionMapper using default parameters, except for the parameter scans in Extended Data Fig. 5 (see Supplementary Table 3 for a summary of all parameter choices). In general, we were unable to uniformly improve the performance of any method by deviating from these default parameters. For example, switching VAME’s state-partition method from hidden Markov model (HMM) to k-means led to higher change score alignment (Extended Data Fig. 5a) but caused a decrease in alignment to supervised behavior labels (Fig. 5e,f shows performance under an HMM; performance under k-means is not shown). Our application of each method is described in detail below.

B-SOiD is an automated pipeline for behavioral clustering that: (1) preprocesses keypoint trajectories to generate pose and movement features; (2) performs dimensionality reduction on a subset of frames using uniform manifold approximation and projection; (3) clusters points in the uniform manifold approximation and projection space; and (4) uses a classifier to extend the clustering to all frames12. We fit B-SOiD separately for each dataset. In each case, steps 2–4 were performed multiple times with different random seeds (see Supplementary Table 1 for number of fits per dataset), and the pipeline was applied with standard parameters; 50,000 randomly sampled frames were used for dimensionality reduction and clustering, and the min_cluster_size range was set to 0.5–1%. Because B-SOiD uses a hardcoded window of 100 ms to calculate pose and movement features, we reran the pipeline with falsely inflated frame rates for the window-size scan in Extended Data Fig. 5a. In all analyses involving B-SOiD, rare states (frequency < 0.5%) were excluded from the analysis.

VAME is a pipeline for behavioral clustering that: (1) preprocesses keypoint trajectories and transforms them into egocentric coordinates; (2) fits a recurrent neural network; and (3) clusters the latent code of the recurrent neural network13. We applied these steps separately to each dataset, in each case running step 3 multiple times with different random seeds (see Supplementary Table 1 for number of fits per dataset). For step 1, we used the same parameters as in keypoint-MoSeq—egocentric alignment was performed along the tail–nose axis, and we set the pose_confidence threshold to 0.5. For step 2, we set time_window = 30 and zdims = 30 for all datasets, except for the zdim-scan in Extended Data Fig. 5a. VAME provides two different options for step 3: fitting an HMM (default) or applying k-means (alternative). We fit an HMM for all datasets and additionally applied k-means to the initial open field dataset. In general, we approximately matched the number of states/clusters in VAME to the number identified by keypoint-MoSeq, except when scanning over state number in Extended Data Fig. 5a. In all analyses involving VAME, rare states (frequency < 0.5%) were excluded from analysis.

MotionMapper performs unsupervised behavioral segmentation by: (1) applying a wavelet transform to preprocessed pose data; (2) nonlinearly embedding the transformed data in 2D; and (3) clustering the 2D data with a watershed transform17. We applied these steps separately to each dataset, in each case running steps 2–3 multiple times with different random seeds (see Supplementary Table 1 for number of fits per dataset). There are several published implementations of MotionMapper, which perform essentially the same set of transformations but differ in programming language. We obtained similar results from a recent Python implementation from the Berman laboratory (https://github.com/bermanlabemory/motionmapperpy/) and a published MATLAB implementation30. All results in the paper are from the Python implementation, which we applied as follows. Data were first egocentrically aligned along the tail–nose axis and then projected into eight dimensions using PCA. Ten log-spaced frequencies between 0.25 Hz and 15 Hz were used for the wavelet transform, and dimensionality reduction was performed using t-distributed stochastic neighbor embedding. The threshold for watershedding was chosen to produce at least 25 clusters, consistent with keypoint-MoSeq for the overhead-camera data. Rare states (frequency < 0.5%) were excluded from analysis. For the parameter scan in Extended Data Fig. 5a, we varied each of these parameters while holding the others fixed, including the threshold for watershedding, the number of initial PCA dimensions, and the frequency range of wavelet analysis. We also repeated a subset of these analyses using an alternative autoencoder-based dimensionality reduction approach, as described in the motionmapperpy tutorial (https://github.com/bermanlabemory/motionmapperpy/blob/master/demo/motionmapperpy_mouse_demo.ipynb/).

Predicting kinematics from state sequences

We trained decoding models based on spline regression to predict kinematic parameters (height, velocity and turn speed) from state sequences output by keypoint-MoSeq and other behavior segmentation methods (Fig. 3e and Extended Data Fig. 5c). Let zt represent an unsupervised behavioral state sequence and let B denote a spline basis, where Bt,i is the value of spline i at frame t. We generated such a basis using the ‘bs’ function from the Python package ‘patsy’, passing in six log-spaced knot locations (1.0, 2.0, 3.9, 7.7, 15.2 and 30.0) and obtaining basis values over a 300-frame interval. This resulted in a 300-by-5 basis matrix B. The spline basis and state sequence were combined to form a 5N-dimensional design matrix, where N is the number of distinct behavioral states. Specifically, for each instance \(({t}_{\rm{s}},\ldots ,{t}_{\rm{e}})\) of state n (see ‘Cross-syllable likelihoods’ for a definition of state instances), we inserted the first \({t}_{\rm{e}}-{t}_{\rm{s}}\) frames of B into dimensions \(5n,\ldots ,5n+5\) of the design matrix, aligning the first frame of B to frame ts in the design matrix. Kinematic features were regressed against the design matrix using Ridge regression from scikit-learn and fivefold cross-validation. We used a range of values from 10−3 to 103 for the regularization parameter α and reported the results with greatest accuracy.
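An illustrative sketch of the design-matrix construction and decoding; the basis is built here through patsy's dmatrix interface and a single fixed α is used, so the exact call that produced the 300-by-5 basis and the full α scan are not reproduced:

```python
import numpy as np
from patsy import dmatrix
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def decode_kinematics(z, kinematics, n_states, alpha=1.0):
    """Spline-basis decoding of a kinematic variable from a state sequence.
    z : (T,) integer state labels; kinematics : (T,) variable to predict."""
    frames = np.arange(300)
    B = np.asarray(dmatrix(
        "bs(x, knots=(1, 2, 3.9, 7.7, 15.2, 30), degree=3) - 1", {"x": frames}))
    n_basis = B.shape[1]
    X = np.zeros((len(z), n_basis * n_states))

    # insert basis rows into the block of columns for state n, aligned to each onset
    onsets = np.flatnonzero(np.diff(z, prepend=-1) != 0)
    ends = np.append(onsets[1:], len(z))
    for t_s, t_e in zip(onsets, ends):
        n = int(z[t_s])
        L = min(t_e - t_s, len(B))
        X[t_s:t_s + L, n * n_basis:(n + 1) * n_basis] = B[:L]

    return cross_val_score(Ridge(alpha=alpha), X, kinematics, cv=5)
```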

Rearing analysis

To compare the dynamics of rear-associated states across methods, we systematically identified all instances of rearing in our initial open field dataset. During a stereotypical rear, mice briefly stood on their hind legs and extended their head upwards, leading to a transient increase in height from its modal value of 3–5 cm to a peak of 7–10 cm. Rears were typically brief, with mice exiting and then returning to a prone position within a few seconds. We encoded these features using the following criteria. First, rear onsets were defined as increases in height from below 5 cm to above 7 cm that occurred within the span of a second, with onset formally defined as the first frame where the height exceeded 5 cm. Next, rear offsets were defined as decreases in height from above 7 cm to below 5 cm that occurred within the span of a second, with offset formally defined as the first frame where the height fell below 7 cm. Finally, we defined complete rears as onset–offset pairs defining an interval with length between 0.5 s and 2 s. Height was determined from the distribution of depth values in cropped, aligned and background-segmented videos. Specifically, we used the 98th percentile of the distribution in each frame.
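An approximate sketch of these criteria applied to a per-frame height trace (the handling of overlapping detection windows is a simplification):

```python
import numpy as np

def find_rears(height, fps=30, low=5.0, high=7.0, min_dur=0.5, max_dur=2.0):
    """Rear detection from a per-frame height trace (cm): onsets are <1-s rises
    from below `low` to above `high`, offsets are the reverse, and complete
    rears are onset-offset pairs lasting 0.5-2 s."""
    win = fps                                    # one-second window
    onsets, offsets = [], []
    for t in range(len(height) - win):
        seg = height[t:t + win]
        if seg[0] < low and seg.max() > high and (not onsets or t > onsets[-1]):
            onsets.append(t + int(np.argmax(seg > low)))      # first frame above `low`
        if seg[0] > high and seg.min() < low and (not offsets or t > offsets[-1]):
            offsets.append(t + int(np.argmax(seg < high)))    # first frame below `high`
    rears = []
    for on in onsets:
        later = [off for off in offsets if off > on]
        if later and min_dur * fps <= later[0] - on <= max_dur * fps:
            rears.append((on, later[0]))
    return rears
```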

Accelerometry processing

From the IMU, we obtained absolute rotations ry, rp and rr (yaw, pitch and roll) and accelerations ax, ay and az (dorsal/ventral, posterior/anterior and left/right). To control for subtle variations in implant geometry and chip calibration, we centered the distribution of sensor readings for each variable on each session. We defined total acceleration as the norm of the three acceleration components:

$$\left|a\right|=\sqrt{{a}_{{x}}^{2}+{a}_{{y}}^{2}+{a}_{{z}}^{2}}$$

Similarly, we defined total angular velocity as the norm |ω| of the rotation derivative:

$$\omega =\left(\frac{{\rm{d}}{r}_{{y}}}{{{\rm{d}}t}},\frac{{\rm{d}}{r}_{{p}}}{{{\rm{d}}t}},\frac{{\rm{d}}{r}_{{r}}}{{{\rm{d}}t}}\right)$$

Finally, to calculate jerk, we smoothed the acceleration signal with a 50-ms Gaussian kernel, generating a time series \(\widetilde{a}\), and then computed the norm of its derivative:

$${\rm{Jerk}}=\left|\frac{{\rm{d}}\widetilde{a}}{{{\rm{d}}t}}\right|$$
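A sketch of these computations, assuming (T, 3) acceleration and rotation arrays sampled at 100 Hz:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def imu_summary(acc, rot, fs=100, jerk_smoothing_ms=50):
    """Total acceleration |a|, total angular velocity |omega| and jerk from
    per-session-centered IMU readings."""
    acc = acc - acc.mean(axis=0)                      # per-session centering
    rot = rot - rot.mean(axis=0)

    total_acc = np.linalg.norm(acc, axis=1)
    omega = np.gradient(rot, 1 / fs, axis=0)          # d r / dt
    total_ang_vel = np.linalg.norm(omega, axis=1)

    sigma = jerk_smoothing_ms / 1000 * fs             # 50-ms Gaussian kernel in samples
    acc_smooth = gaussian_filter1d(acc, sigma, axis=0)
    jerk = np.linalg.norm(np.gradient(acc_smooth, 1 / fs, axis=0), axis=1)
    return total_acc, total_ang_vel, jerk
```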

Aligning dopamine fluctuations to behavior states

For a detailed description of photometry data acquisition and preprocessing, see ref. 22. Briefly, photometry signals were: (1) normalized using ΔF/F0 with a 5-s window; (2) adjusted against a reference to remove motion artifacts and other non-ligand-associated fluctuations; (3) z-scored using a 20-s sliding window; and (4) temporally aligned to the 30-Hz behavioral videos.

Given a set of state onsets (either for a single state or across all states), we computed the onset-aligned dopamine trace by averaging the dopamine signal across onset-centered windows. From the resulting traces, each of which can be denoted as a time series of dopamine signal values (\({d}_{{-T}},\ldots ,{d}_{{T}}\)), we defined the total fluctuation size (Fig. 4d) and temporal asymmetry (Fig. 4e) as

$$\begin{array}{l}{\rm{Temporal}}\,{\rm{asymmetry}}=\frac{1}{15}\mathop{\sum }\limits_{t=0}^{15}{d}_{\rm{t}}-\frac{1}{15}\mathop{\sum }\limits_{t=-15}^{0}{d}_{{t}} \\{\rm{Total}}\, {\rm{fluctuation}}\,{\rm{size}}=\mathop{\sum}\limits_{t=-15}^{15}\left|{d}_{{t}}\right|\end{array}$$

A third metric—the average dopamine during each state (Extended Data Fig. 7b)—was defined simply as the mean of the dopamine signal across all frames bearing that state label. For each metric, shuffle distributions were generated by repeating the calculation with a temporally reversed copy of the dopamine time series.
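A sketch of the onset-aligned trace and the two metrics, using means over the pre-onset and post-onset windows:

```python
import numpy as np

def onset_aligned_metrics(dopamine, onsets, half_window=15):
    """Onset-aligned dopamine trace plus temporal asymmetry and total
    fluctuation size over a +/- 15-frame window around onset."""
    w = half_window
    valid = [t for t in onsets if w <= t < len(dopamine) - w]
    trace = np.mean([dopamine[t - w: t + w + 1] for t in valid], axis=0)
    asymmetry = trace[w:].mean() - trace[:w + 1].mean()   # post-onset minus pre-onset
    fluctuation_size = np.abs(trace).sum()
    return trace, asymmetry, fluctuation_size
```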

Supervised behavior benchmark

Videos and behavioral annotations for the supervised open field behavior benchmark (Fig. 5a–c) were obtained from ref. 31. The dataset contains 20 videos that are each 10–20-min long. Each video includes frame-by-frame annotations of five possible behaviors: locomote, rear, face groom, body groom and defecate. We excluded ‘defecate’ from the analysis because it was extremely rare (<0.1% of frames).

For pose tracking, we used DeepLabCut’s SuperAnimal inference API, which performs inference on videos without the need to annotate poses in those videos47. Specifically, we used the pretrained SuperAnimal-TopViewMouse model, which applies DLCRNet-50 for pose estimation. Keypoint detections were obtained using the API function deeplabcut.video_inference_superanimal, which performs video adaptation: a multi-resolution ensemble (that is, the image height resized to 400, 500 and 600 with a fixed aspect ratio) and rapid self-training (the model trained on zero-shot predictions with confidence above 0.1) for 1,000 iterations to counter domain shift and reduce jittery predictions.

Keypoint coordinates and behavioral annotations for the supervised social behavior benchmark (Fig. 5d–f) were obtained from the CalMS21 dataset32 (task1). The dataset contains 70 videos of resident–intruder interactions with frame-by-frame annotations of four possible behaviors: attack, investigate, mount or other. All unsupervised behavior segmentation methods were fitted to 2D keypoint data for the resident mouse.

We used four metrics13 to compare supervised annotations and unsupervised states from each method: NMI, homogeneity, adjusted Rand score and purity. All metrics besides purity were computed using the Python library scikit-learn (that is, with the functions normalized_mutual_info_score, homogeneity_score and adjusted_rand_score). The purity score was defined as in ref. 13.
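A sketch of these metrics, with purity computed from the contingency matrix (see ref. 13 for the original definition):

```python
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             normalized_mutual_info_score)
from sklearn.metrics.cluster import contingency_matrix

def agreement_metrics(annotations, states):
    """NMI, homogeneity, adjusted Rand score and purity between supervised
    annotations and unsupervised state labels."""
    C = contingency_matrix(annotations, states)
    return {
        'NMI': normalized_mutual_info_score(annotations, states),
        'homogeneity': homogeneity_score(annotations, states),
        'adjusted_rand': adjusted_rand_score(annotations, states),
        'purity': C.max(axis=0).sum() / C.sum(),   # majority class per cluster
    }
```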

Thermistor signal processing

During respiration, the movement of air through a mouse’s nasal cavity generates fluctuations in temperature that can be detected by a thermistor; temperature decreases during inhalations (because the mouse is warmer than the air around it) and rises between inhalations. Below we refer to the between-inhalation intervals as ‘exhales’ but note that they may also contain pauses in respiration—pauses and exhales likely cannot be distinguished because warming of the thermistor occurs whether or not air is flowing.

To segment inhales and exhales using the thermistor signal, we first applied a 60-Hz notch filter (scipy.signal.iirnotch, q = 10) and a low-pass filter (scipy.signal.butter, order = 3, cutoff = 40 Hz, analog = false) to the raw signal, and then used a median filter to subtract the slow DC offset component of the signal. We then performed peak detection using scipy.signal.find_peaks (minimum inter-peak distance of 50 ms, minimum and maximum widths of 10 ms and 1,500 ms, respectively). To distinguish true peaks (inhalation onsets) from spurious peaks (noise), we varied the minimum prominence parameter from 10−4 to 1 while keeping other parameters fixed, and then used the value at which the number of peaks stabilized. The signal was then analyzed twice—once at the chosen value, and again with a slightly more permissive minimum prominence (1/8 of the chosen value). Any low-amplitude breaths detected with the more permissive setting that overlapped with periods of breathing between 1 Hz and 6 Hz were added to the detections. This same process was then repeated to find exhale onsets but with the thermistor signal inverted. Finally, inhales and exhales were paired, and any instances of two inhales/exhales in a row were patched by inserting an exhale/inhale at the local extremum between them. Detections were then inspected manually, and any recordings with excessive noise, unusually high breathing rates (>14 Hz), or unusual autocorrelation profiles were removed from further analyses.
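A simplified sketch of the filtering and peak-detection steps (the two-pass prominence logic, inhale/exhale pairing and manual quality control are omitted, and the ~1-s median-filter window is an assumption):

```python
from scipy.signal import butter, filtfilt, find_peaks, iirnotch, medfilt

def detect_inhalations(raw, fs, prominence):
    """Filtering and peak detection for inhalation onsets; `prominence` is the
    minimum prominence chosen by the scan described above."""
    b, a = iirnotch(60, Q=10, fs=fs)                    # remove 60-Hz line noise
    x = filtfilt(b, a, raw)
    b, a = butter(3, 40, btype='low', fs=fs)            # low-pass at 40 Hz
    x = filtfilt(b, a, x)
    kernel = int(fs) | 1                                # ~1-s median filter (odd length)
    x = x - medfilt(x, kernel_size=kernel)              # subtract slow DC offset

    peaks, _ = find_peaks(
        x,
        distance=int(0.05 * fs),                        # 50-ms minimum inter-peak distance
        width=(int(0.01 * fs), int(1.5 * fs)),          # widths between 10 and 1,500 ms
        prominence=prominence,
    )
    return x, peaks
```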

Classifying sniff-aligned syllables

To test whether syllables were significantly sniff aligned, we compared the probability of inhalation in the 50 ms before versus 50 ms after syllable onset. Specifically, for each syllable, we quantified the pre-inhalation versus post-inhalation fraction across all instances of that syllable, and then compared the pre-distribution and post-distribution values using a paired t-test. Syllables with P < 0.001 were considered significant.
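A minimal sketch, assuming the per-instance pre-onset and post-onset inhalation fractions have already been computed:

```python
from scipy.stats import ttest_rel

def is_sniff_aligned(pre_fractions, post_fractions, alpha=0.001):
    """Paired t-test of inhalation fraction in the 50 ms before versus after
    syllable onset, across all instances of one syllable."""
    stat, p = ttest_rel(pre_fractions, post_fractions)
    return p < alpha, p
```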

Fly gait analysis

For the analysis of fly behavior, we used a published dataset of keypoint coordinates39, which were derived from behavioral videos originally reported in ref. 17. The full dataset contains 1-h recordings (100 fps) of single flies moving freely on a backlit 100-mm-diameter arena. Keypoints were tracked using LEAP (test accuracy ~2.5 px). MotionMapper results (including names for each cluster) were also included in the published dataset. We chose four 1-h sessions (uniformly at random) for analysis with keypoint-MoSeq. All results reported here were derived from this 4-h dataset.

The analysis of syllable probabilities across the stride cycle (Fig. 6i–k) was limited to periods of ‘fast locomotion’, as defined by the MotionMapper labeling (state label 7). To identify the start and end of each stride cycle, we applied PCA to egocentric keypoint coordinates (restricted to fast locomotion frames). We found that the first PC oscillated in a manner reflecting the fly’s gait, and thus smoothed the first PC using a one-frame Gaussian filter and performed peak detection on the smoothed signal. Each inter-peak interval was defined as one stride. Stances and swings (Fig. 6j and Extended Data Fig. 10g) were defined by backward and forward motion of the leg tips, respectively (in egocentric coordinates).
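A sketch of the stride segmentation, assuming egocentric coordinates and a boolean fast-locomotion mask:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks
from sklearn.decomposition import PCA

def stride_boundaries(egocentric_coords, fast_locomotion_mask):
    """Stride segmentation from the oscillating first PC of egocentric keypoints
    during fast locomotion; returned indices are relative to the masked frames."""
    masked = egocentric_coords[fast_locomotion_mask]
    pc1 = PCA(n_components=1).fit_transform(masked.reshape(len(masked), -1))[:, 0]
    smoothed = gaussian_filter1d(pc1, 1)            # one-frame Gaussian filter
    peaks, _ = find_peaks(smoothed)
    return list(zip(peaks[:-1], peaks[1:]))         # each inter-peak interval = one stride
```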

Mathematical notation

1. \({\chi }^{-2}(\nu ,{\tau }^{2})\) denotes the scaled inverse chi-squared distribution.

2. \(\otimes\) denotes the Kronecker product.

3. \({\Delta }^{N}\) is the N-dimensional simplex.

4. \({I}_{N}\) is the N × N identity matrix.

5. \({{\bf{1}}}_{N\times M}\) is the N × M matrix of ones.

6. \(\bf{x}_{{t}_{1}:{t}_{2}}\) denotes the concatenation \(\left[\bf{x}_{{t}_{1}},\bf{x}_{{t}_{1}+1},\ldots ,\bf{x}_{{t}_{2}}\right]\) where \({t}_{1} < {t}_{2}\).

Generative model

Keypoint-MoSeq learns syllables by fitting an SLDS model48, which decomposes an animal’s pose trajectory into a sequence of stereotyped dynamical motifs. In general, SLDS models explain time-series observations y1, …, yT through a hierarchy of latent states, including continuous states \(\bf{x}_{{t}}\in {{\mathbb{R}}}^{{M}}\) that represent the observations yt in a low-dimensional space, and discrete states zt ∈ {1, …, N} that govern the dynamics of xt over time. In keypoint-MoSeq, the discrete states correspond to syllables, the continuous states correspond to pose, and the observations are keypoint coordinates. We further adapted SLDS by (1) including a sticky hierarchical Dirichlet process (HDP) prior; (2) explicitly modeling the animal’s location and heading; and (3) including a robust (heavy-tailed) observation distribution for keypoints. Below we review SLDS models in general and then describe each of the customizations implemented in keypoint-MoSeq.

SLDSs

The discrete states zt ∈ {1, …, N} are assumed to form a Markov chain, meaning

$${z}_{{t+1}}{\rm{| }}{z}_{{t}}\sim {\rm{Cat}}\left({\pi }_{{z}_{{t}}}\right)$$

where \({\pi }_{i}\in {\Delta }^{N}\) is the probability of transitioning from discrete state i to each other state. Conditional on the discrete states zt, the continuous states xt follow an L-order vector autoregressive process with Gaussian noise. This means that the expected value of each xt is a linear function of the previous L states \(\bf{x}_{{t-L:t-1}}\), as shown below

$${x}_{{t}}{\rm{| }}{z}_{{t}},{x}_{{t-L:t-1}}{\mathscr{\sim }}{\mathscr{N}}\left({A}_{{z}_{t}}{x}_{{t-L:t-1}}+{b}_{{z}_{{t}}},{Q}_{{z}_{{t}}}\right)$$

where \({A}_{{i}}\in {{\mathbb{R}}}^{{M\times {LM}}}\) is the autoregressive dynamics matrix, \(\bf{b}_{{i}}\in {{\mathbb{R}}}^{{M}}\) is the dynamics bias vector, and \({Q}_{{i}}\in {{\mathbb{R}}}^{{M\times M}}\) is the dynamics noise matrix for each discrete state i = 1, …, N. The dynamics parameters Ai, bi and Qi have a matrix normal inverse Wishart (MNIW) prior

$$\left[{A}_{{i}}{\rm{| }}\bf{b}_{{i}}\right],{Q}_{{i}}\sim {\rm{MNIW}}\left({\nu }_{0},{S}_{0},{M}_{0},{K}_{0}\right)$$

where ν0 > M − 1 is the degrees of freedom, \({S}_{0}\in {{\mathbb{R}}}^{{M\times M}}\) is the prior covariance matrix, \({M}_{0}\in {{\mathbb{R}}}^{{M\times \left({LM}+1\right)}}\) is the prior mean dynamics matrix, and \({K}_{0}\in {{\mathbb{R}}}^{\left({{LM}}+1\right)\times \left({{L}M}+1\right)}\) is the prior scale matrix. Finally, in the standard formulation of SLDS (which we modify for keypoint data, as described below), each observation \(\bf{y}_{{t}}\in {{\mathbb{R}}}^{{D}}\) is a linear function of xt plus noise:

$$\bf{y}_{{t}}{\rm{| }}{z}_{{t}},\bf{x}_{{t}}{\mathscr{\sim }}{\mathscr{N}}\left(C\bf{x}_{{t}}+\bf{d},S\right)$$

Here we assume that the observation parameters C, d and S do not depend on zt.

Sticky HDP

A key feature of depth MoSeq is the use of a sticky-HDP prior for the transition matrix. In general, HDP priors allow the number of distinct states in an HMM to be inferred directly from the data. The ‘sticky’ variant of the HDP prior includes an additional hyperparameter κ that tunes the frequency of self-transitions in the discrete state sequence zt, and thus the distribution of syllable durations. As in depth MoSeq, we implement a sticky-HDP prior using the weak limit approximation49, as shown below:

$$\begin{array}{ll}\beta & \sim \text{Dir}\left(\gamma /N,\ldots ,\gamma /N\right)\\ {\pi }_{{i}}{\rm{| }}\beta & \sim \text{Dir}\left(\alpha {\beta }_{1},\ldots ,\alpha {\beta }_{{i}}+\kappa ,\ldots ,\alpha {\beta }_{{N}}\right)\end{array}$$

where κ is added in the ith position. Here \(\beta \in {\Delta }^{\rm{N}}\) is a global vector of augmented syllable transition probabilities, and the hyperparameters γ, α and κ control the sparsity of states, the weight of the sparsity prior and the bias toward self-transitions, respectively.

SLDS for postural dynamics

Keypoint coordinates reflect not only the pose of an animal, but also its location and heading. To disambiguate these factors, we define a canonical, egocentric reference frame in which the postural dynamics are modeled. The canonically aligned poses are then transformed into global coordinates using explicit centroid and heading variables that are learned by the model.

Concretely, let \({Y}_{\rm{t}}\in {{\mathbb{R}}}^{\rm{K\times D}}\) represent the coordinates of K keypoints at time t, where \(D\in \{2,3\}\). We define latent variables \(\bf{v}_{t}\in {{\mathbb{R}}}^{D}\) and \({h}_{{t}}\in \left[0,2\pi \right]\) to represent the animal’s centroid and heading angle. We assume that each heading angle ht has an independent, uniform prior and that the centroid is autocorrelated as follows:

$$\begin{array}{c}{h}_{{t}}\sim {\rm{Unif}}(0,2\pi )\\ \bf{v}_{{t}}| \bf{v}_{{t-1}}\sim {\mathscr{N}}\left(\bf{v}_{{t-1}},{\sigma }_{{\rm{loc}}}^{2}\right)\end{array}$$

At each time point t, the pose Yt is generated via rotation and translation of a centered and oriented pose \({\widetilde{Y}}_{{t}}\) that depends on the current continuous latent state xt:

$${Y}_{{t}}={\tilde{Y}}_{{t}}R({h}_{{t}})+{{\bf{1}}}_{K}\bf{v}_{\rm{t}}^{\top }\,{\rm{where\; vec}}({\tilde{Y}}_{{t}})\sim {\mathscr{N}}((\varGamma \otimes {I}_{D})(C\bf{x}_{{t}}+\bf{d}),{S}_{{t}})$$

where R(ht) is a matrix that rotates by angle ht in the xy plane, and \(\varGamma \in {R}^{{K\times \left(K-1\right)}}\) is defined by the truncated singular value decomposition \(\varGamma \Delta {\varGamma }^{{\rm{\top }}}={I}_{{K}}-{{\bf{1}}}_{{K\times K}}/K\). Note that Γ encodes a linear transformation that isometrically maps \({{\mathbb{R}}}^{{\left(K-1\right)\times D}}\) to the set of all centered keypoint arrangements in \({{\mathbb{R}}}^{{K\times D}}\), and thus ensures that \({\mathbb{E}}\left({\widetilde{Y}}_{{t}}\right)\) is always centered50. The parameters \(C\in {{\mathbb{R}}}^{{\left(K-1\right)D\times M}}\) and \(\bf{d}\in {{\mathbb{R}}}^{{\left(K-1\right)D}}\) are initialized using PCA applied to the transformed keypoint coordinates \({\varGamma }^{{T}}{\widetilde{Y}}_{{t}}\). In principle C and d can be adjusted further during model fitting, and we describe the corresponding Gibbs updates in the inference section below. In practice, however, we keep C and d fixed to their initial values when fitting keypoint-MoSeq.
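For reference, Γ can be computed directly from the truncated singular value decomposition of the centering matrix; a minimal sketch:

```python
import numpy as np

def centering_transform(K):
    """Gamma from the truncated SVD of the centering matrix I_K - 1/K."""
    centering = np.eye(K) - np.ones((K, K)) / K
    U, svals, _ = np.linalg.svd(centering)
    Gamma = U[:, :K - 1]                            # drop the null (all-ones) direction
    # Gamma diag(svals) Gamma^T reconstructs the centering matrix
    assert np.allclose((Gamma * svals[:K - 1]) @ Gamma.T, centering)
    return Gamma
```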

Robust observations

To account for occasional large errors during keypoint tracking, we use the heavy-tailed Student’s t-distribution, which corresponds to a normal distribution whose variance is itself a random variable. Here, we instantiate the random variances explicitly as a product of two parameters: a baseline variance σk for each keypoint and a time-varying scale st,k. We assume:

$$\begin{array}{cc}{\sigma }_{k}^{2} & \sim {\chi }^{-2}\left({\nu }_{\sigma },{\sigma }_{0}^{2}\right)\\ {s}_{t,k}^{2} & \sim {\chi }^{-2}\left({\nu }_{s},{s}_{0,t,k}\right)\end{array}$$

where νσ > 0 and νs > 0 are degrees of freedom, \({\sigma }_{0}^{2} > 0\) is a baseline scaling parameter, and \({s}_{0,t,k} > 0\) is a local scaling parameter, which encodes a prior on the scale of error for each keypoint on each frame. Where possible, we calculated the local scaling parameters as a function of the neural network confidences for each keypoint. The function was calibrated using the empirical relationship between confidence values and error sizes. The overall noise covariance St is generated from σk and st,k as follows:

$${S}_{t}={\rm{diag}}\left({\sigma }_{1}^{2}{s}_{t,1}^{2},\ldots ,{\sigma }_{K}^{2}{s}_{t,K}^{2}\right)\otimes {I}_{D}$$

Related work

Keypoint-MoSeq extends the model used in depth MoSeq16, where a low-dimensional pose trajectory xt (derived from egocentrically aligned depth videos) is used to fit an AR-HMM with a transition matrix π, autoregressive parameters Ai, bi and Qi and discrete states zt like those described here. Indeed, conditional on xt, the models for keypoint-MoSeq and depth MoSeq are identical. The main differences are that keypoint-MoSeq treats xt as a latent variable (that is, updates it during fitting), includes explicit centroid and heading variables, and uses a robust noise model.

Disambiguating poses from position and heading is a common task in unsupervised behavior algorithms, and researchers have adopted a variety of approaches. VAME13, for example, isolates pose by centering and aligning data ahead of time, whereas B-SOiD12 transforms the keypoint data into a vector of relative distances and angles. The statistical pose model GIMBAL29, on the other hand, introduces latent heading and centroid variables that are inferred simultaneously with the rest of the model. Keypoint-MoSeq adopts this latter approach, which can remove spurious correlations between egocentric features that can arise from errors in keypoint localization.

Inference algorithm

Our full model contains latent variables v, h, x, z and s and parameters A, b, Q, C, d, σ, β and π. We fit each of these variables—except for C and d—using Gibbs sampling, in which each variable is iteratively resampled from its posterior distribution conditional on the current values of all the other variables. The posterior distributions P(π, β | z) and P(A, b, Q | z, x) are unchanged from the original MoSeq paper and will not be reproduced here (see ref. 16, pages 42–44, and note the changes of notation Q→Σ, z→x and x→y). The Gibbs updates for variables C, d, σ, s, v and h are described below.

Resampling P(C, d | s, σ, x, v, h, Y)

Let \({\tilde{\bf{x}}}_{\rm{t}}\) represent xt with a 1 appended and define

$${\tilde{S}}_{t}=({\varGamma }^{\top }{\rm{diag}}(\mathop{\sigma}\nolimits_{1}^{2}{s}_{t,1},\ldots ,\mathop{\sigma }\nolimits_{K}^{2}{s}_{t,K})\varGamma )\otimes {I}_{D}$$

The posterior update is \(\left(C,{\bf{d}}\right){\sim}{\mathscr{N}}\left({\text{vec}}\left(C,{\bf{d}}\right){{|}}{\mu }_{n},{\varSigma }_{n}\right)\) where

$${\varSigma }_{n}={({\sigma }_{C}^{-2}I+{S}_{x,x})}^{-1}{\rm{and}}\,{\mu }_{n}={\varSigma }_{n}{S}_{y,x}$$

with

$$\begin{array}{r}{S}_{x,x}=\mathop{\sum }\limits_{t=1}^{T}{\tilde{\bf{x}}}_{t}{\tilde{\bf{x}}}_{t}^{\top }\otimes {\varGamma }^{\top}{\tilde{S}}_{t}^{-1}\varGamma \otimes {I}_{D}\,{\rm{and}}\,{S}_{y,x}=\mathop{\sum }\limits_{t=1}^{T}\left({\tilde{\bf{x}}}_{t}^{\top }\otimes {\tilde{S}}^{-1}\varGamma \otimes {I}_{D}\right){\rm{vec}}{({\tilde{Y}}_{t})}^{\top}\end{array}$$

Resampling P(s | C, d, σ, x, v, h, Y)

Each st,k is conditionally independent with posterior

$$\begin{array}{c}{s}_{t,k}{\rm{| }}C,\bf{d},{\sigma }_{k},\bf{x},Y\sim {\chi }^{\,-2}\left({\nu }_{s}+D,\left({\nu }_{s}{s}_{0}+{\sigma }_{k}^{-2}\parallel\Big(\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)\Big)}_{k}-{\widetilde{Y}}_{t,k}{\parallel }^{2}\right)/\left({\nu }_{s}+D\right)\right)\end{array}$$

Resampling P(σ | C, d, s, x, v, h, Y)

Each σk is conditionally independent with posterior

$$\begin{array}{c}{\sigma }_{k}^{2}\sim {\chi }^{-2}\left({\nu }_{\sigma }+{DT},\left({\nu }_{\sigma }{\sigma }_{0}^{2}+{S}_{y}\right){\left({\nu }_{\sigma }+{DT}\right)}^{-1}\right)\end{array}$$

where \({S}_{y}={\sum }_{t=1}^{T}{\Vert \varGamma {(C\bf{x}_{t}+\bf{d})}_{k}-{\tilde{Y}}_{t,k}\Vert }^{2}/{s}_{t,k}\)

Resampling P(v | C, d, σ, s, x, h, Y)

Because the translations v1, …, vT form an LDS, they can be updated by Kalman sampling. The observation potentials have the form \({\mathscr{N}}\left(\bf{v}_{t}{\rm{| }}\mu ,{\gamma }^{2}{I}_{D}\right)\) where

$$\mu =\sum _{k}\frac{{\gamma }_{t}^{2}}{{\sigma }_{k}^{2}{s}_{t,k}}[{Y}_{t,k}-R{\left({h}_{t}\right)}^{{\rm{\top }}}\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}],\frac{1}{{\gamma }_{t}^{2}}=\sum _{k}\frac{1}{{\sigma }_{k}^{2}{s}_{t,k}}$$

Resampling P(h | C, d, σ, s, x, v, Y)

The posterior of ht is the von-Mises distribution \(\text{vM}\)(θ, κ) where κ and \(\theta \in \left[0,2\pi \right]\) are the unique parameters satisfying \(\left[\kappa \cos \left(\theta \right),\kappa \sin \left(\theta \right)\right]=\left[{S}_{1,1}+{S}_{2,2},{S}_{1,2}-{S}_{2,1}\right]\) for

$$\begin{array}{c}S=\mathop{\sum }\limits_{k}\frac{1}{{s}_{t,k}{\sigma }_{k}^{2}}\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}{\left({Y}_{t,k}-\bf{v}_{t}\right)}^{{\rm{\top }}}\end{array}$$

Resampling P(x | C, d, σ, s, v, h, Y)

To resample x, we first express its temporal dependencies as a first-order autoregressive process, and then apply Kalman sampling. The change of variables is

$$\begin{array}{r}A^{\prime} =\left[\begin{array}{ccccc} & I & & & \\ & & I & & \\ & & & I & \\ {A}_{1} & {A}_{2} & \ldots & {A}_{L} & \bf{b}\end{array}\right] \; Q^{\prime} =\left[\begin{array}{cccc}0 & & & \\ & 0 & & \\ & & 0 & \\ & & & Q\end{array}\right] \; C^{\prime} =\left[\begin{array}{cc}0 & 0\\ \vdots & \vdots \\ 0 & 0\\ C & \bf{d}\end{array}\right]\; \bf{x}_{t}^{\prime} =\left[\begin{array}{c}\bf{x}_{t-L+1}\\ \vdots \\ \bf{x}_{t}\\ 1\end{array}\right]\end{array}$$

Kalman sampling can then be applied to sample the conditional distribution

$$\begin{array}{r}P({\bf{x}^{\prime}}_{1:T}| {\tilde{Y}}_{1:T})\propto \mathop{\prod }\limits_{t=1}^{T}{\mathscr{N}}({\bf{x}^{\prime} }_{t}| {A^{\prime} }^{({z}_{t})}{\bf{x}^{\prime} }_{t-1},{Q^{\prime} }^{({z}_{t})}){\mathscr{N}}({\rm{vec}}({\tilde{Y}}_{t})| {C^\prime} {\bf{x}^{\prime} }_{t},{S}_{t}).\end{array}$$

(Assume x′ is left-padded with zeros for negative time indices.)

Hyperparameters

We used the following hyperparameter values throughout the paper.

Transition matrix

$$\begin{array}{l}N=100\\ \gamma =1,000\\ \alpha =100\\ \kappa \ \ {\rm{fit \; to \; each \; dataset}}\end{array}$$

Autoregressive process

$$\begin{array}{l}M \ \ {\rm{set \; using \; PCA \; explained \; variance \; curve}}\\ L=3\\ {\nu }_{0}=M+2\\ {S}_{0}=0.01{I}_{M}\\ {M}_{0}=[{0}_{M\times M(L-1)}\,{I}_{M}\,{1}_{M\times 1}]\\ {K}_{0}=10{I}_{ML+1}\end{array}$$

Observation process

$$\begin{array}{l}{\sigma }_{0}^{2}=1\\ {\nu }_{\sigma }={10}^{5}\\ {\nu }_{s}=5\\ {s}_{0,t,k}\,{\rm{set}}\;{\rm{based}}\;{\rm{on}}\;{\rm{neural}}\;{\rm{network}}\;{\rm{confidence}}\end{array}$$

Centroid autocorrelation

$$\begin{array}{c}{\sigma }_{\text{loc}}^{2}=0.4\end{array}$$

Derivation of Gibbs updates

Derivation of C, d updates

To simplify notation, define

$$\begin{array}{r}{\tilde{S}}_{t}={\rm{diag}}({\sigma }_{1}^{2}{s}_{t,1},\ldots ,{\sigma }_{K}^{2}{s}_{t,K}),\,{\tilde{\bf{x}}}_{t}=(\bf{x}_{t},1),\,\tilde{C}=(C,\bf{d})\end{array}$$

The likelihood of the centered and aligned keypoint locations \(\tilde{Y}\) can be expanded as follows

$$\begin{array}{ll} & P\left(\tilde{Y}{\rm{| }}\tilde{C},\tilde{\bf{x}},\tilde{S}\right)=\mathop{\prod }\limits_{t=1}^{T}{\mathscr{N}}\left(\text{vec}\left({\tilde{Y}}_{t}\right){\rm{| }}\left(\varGamma \otimes {I}_{D}\right)\tilde{C}{\tilde{\bf{x}}}_{t},{\tilde{S}}_{t}\otimes {I}_{D}\right)\\ \propto & \exp \left[-\frac{1}{2}\mathop{\sum }\limits_{t=1}^{T}\left({\tilde{\bf{x}}}_{t}^{{\rm{\top }}}{\tilde{C}}^{{\rm{\top }}}\left({\varGamma }^{{\rm{\top }}}{\tilde{S}}_{t}^{-1}\varGamma \otimes {I}_{D}\right)\tilde{C}{\tilde{\bf{x}}}_{t}-2\text{vec}{\left({\tilde{Y}}_{t}\right)}^{{\rm{\top }}}\left({\tilde{S}}_{t}^{-1}\varGamma \otimes {I}_{D}\right)\tilde{C}{\tilde{\bf{x}}}_{t}\right)\right]\\ \propto & \exp \left[-\frac{1}{2}\mathop{\sum }\limits_{t=1}^{T}\left(\text{vec}{\left(\tilde{C}\right)}^{\top }\left({\tilde{\bf{x}}}_{t}{\tilde{\bf{x}}}_{t}^{\top }\otimes {\varGamma }^{\top }{\tilde{S}}_{t}^{-1}\varGamma \otimes {I}_{D}\right)\text{vec}\left(\tilde{C}\right)\right)\right.\\ &\quad\,\, \left.\left(-2\text{vec}{\left(\tilde{C}\right)}^{{\rm{\top }}}\left({\tilde{\bf{x}}}_{t}^{{\rm{\top }}}\otimes {\tilde{S}}_{t}^{-1}\varGamma \otimes {I}_{D}\right)\text{vec}\left({\tilde{Y}}_{t}\right)\right)\,\right]\\ \propto & \exp \left[-\frac{1}{2}\left(\text{vec}{\left(\tilde{C}\right)}^{\top }{S}_{x,x}\text{vec}\left(\tilde{C}\right)-2\text{vec}{\left(\tilde{C}\right)}^{\top }{S}_{x,y}\right)\right]\end{array}$$

where

$$\begin{array}{c}{S}_{x,x}=\mathop{\sum }\limits_{t=1}^{T}{\tilde{\bf{x}}}_{t}{\tilde{\bf{x}}}_{t}^{{\rm{\top }}}\otimes {\varGamma }^{{\rm{\top }}}{\tilde{S}}_{t}^{-1}\varGamma \otimes {I}_{D}\;\text{and}\;{S}_{x,y}=\mathop{\sum }\limits_{t=1}^{T}\left({\tilde{\bf{x}}}_{t}\otimes {\varGamma }^{{\rm{\top }}}{\tilde{S}}_{t}^{-1}\otimes {I}_{D}\right)\text{vec}\left({\tilde{Y}}_{t}\right)\end{array}$$

Multiplying by the prior \(\text{vec}\left(\tilde{C}\right){\mathscr{\sim }}{\mathscr{N}}\left(0,{\sigma }_{C}^{2}I\right)\) yields

$$\begin{array}{r}P(\tilde{C}| \tilde{Y},\tilde{\bf{x}},\tilde{S})\propto {\mathscr{N}}({\rm{vec}}(\tilde{C})| {\mu }_{n},{\varSigma }_{n})\end{array}$$

where

$$\begin{array}{cc}{\varSigma }_{n}={\left({\sigma }_{C}^{-2}I+{S}_{x,x}\right)}^{-1} \; \text{and} \; {\mu }_{n}={\varSigma }_{n}{S}_{x,y}\end{array}$$
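
A direct, unoptimized translation of this update is sketched below. It builds an explicit per-frame design matrix instead of exploiting the Kronecker identities used in the derivation, and it assumes numpy, a row-major "vec" that stacks per-keypoint D-blocks, and an illustrative value for the prior variance σC²; all argument names are ours.

```python
import numpy as np

def resample_C_d(rng, Y_tilde, x, s, sigmasq, Gamma, sigmasq_C=1.0):
    """Draw Ctilde = (C, d) jointly from its Gaussian conditional.

    Y_tilde : (T, K, D) centered/aligned keypoints
    x       : (T, M) latent trajectories
    s       : (T, K) per-frame, per-keypoint noise scales
    sigmasq : (K,) per-keypoint noise variances
    Gamma   : (K, K1) fixed embedding matrix from the observation model
    """
    T, K, D = Y_tilde.shape
    M, K1 = x.shape[1], Gamma.shape[1]
    n = K1 * D * (M + 1)                                        # size of vec(Ctilde)

    G = np.kron(Gamma, np.eye(D))                               # (K*D, K1*D)
    Lam = np.eye(n) / sigmasq_C                                 # posterior precision (starts at prior)
    eta = np.zeros(n)
    for t in range(T):
        x_til = np.append(x[t], 1.0)                            # (x_t, 1)
        # design maps vec(Ctilde) (row-major) to the mean of vec(Y_tilde_t)
        design = G @ np.kron(np.eye(K1 * D), x_til[None, :])    # (K*D, n)
        w = np.repeat(1.0 / (sigmasq * s[t]), D)                # per-coordinate precisions
        Lam += design.T @ (w[:, None] * design)
        eta += design.T @ (w * Y_tilde[t].reshape(-1))
    mu = np.linalg.solve(Lam, eta)
    L = np.linalg.cholesky(Lam)
    theta = mu + np.linalg.solve(L.T, rng.standard_normal(n))   # sample from N(mu, Lam^{-1})
    Ctilde = theta.reshape(K1 * D, M + 1)
    return Ctilde[:, :M], Ctilde[:, M]                          # C, d
```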

Derivation of σk, st,k updates

For each time t and keypoint k, let \({\bar{Y}}_{t,k}=\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}\) denote the predicted location of keypoint k. The likelihood of the centered and aligned keypoint location \({\tilde{Y}}_{t,k}\) is

$$P({\tilde{Y}}_{t,k}| {\bar{Y}}_{t,k},{s}_{t,k},{\sigma }_{k})={\mathscr{N}}({\tilde{Y}}_{t,k}| {\bar{Y}}_{t,k},\,{\sigma }_{k}^{2}{s}_{t,k}{I}_{D})\propto {({\sigma }_{k}^{2}{s}_{t,k})}^{-D/2}\exp \left[-\frac{{\Vert {\tilde{Y}}_{t,k}-{\bar{Y}}_{t,k}\Vert }^{2}}{2{\sigma }_{k}^{2}{s}_{t,k}}\right]$$

We can then calculate posteriors \(P\left({s}_{t,k}{\rm{| }}{\sigma }_{k}\right)\) and \(P\left({\sigma }_{k}{\rm{| }}{s}_{t,k}\right)\) as follows

$$\begin{array}{ll}P\left({s}_{t,k}{\rm{| }}{\sigma }_{k},{\tilde{Y}}_{t,k},{\bar{Y}}_{t,k}\right) & \propto {\chi }^{\,-2}\left({s}_{t,k}{\rm{| }}{\nu }_{s},{s}_{0}\right){\mathscr{N}}\left({\tilde{Y}}_{t,k}{\rm{| }}{\bar{Y}}_{t,k},\,{\sigma }_{k}^{2}{s}_{t,k}{I}_{D}\right)\\ & \propto {s}_{t,k}^{-1-\left({\nu }_{s}+D\right)/2}\exp \left[\frac{-{\nu }_{s}{s}_{0}}{2{s}_{t,k}}-\frac{\parallel {\tilde{Y}}_{t,k}-{\bar{Y}}_{t,k}{\parallel }^{2}}{2{\sigma }_{k}^{2}{s}_{t,k}}\right]\\ & \propto {\chi }^{\,-2}\left({s}_{t,k}{\rm{| }}{\nu }_{s}+D,\left({\nu }_{s}{s}_{0}+{\sigma }_{k}^{-2}\parallel {\tilde{Y}}_{t,k}-{\bar{Y}}_{t,k}{\parallel }^{2}\right){\left({\nu }_{s}+D\right)}^{-1}\right)\end{array}$$

$$\begin{array}{ll}P\left({\sigma }_{k}{\rm{| }}{\{{s}_{t,k},{\tilde{Y}}_{t,k},{\bar{Y}}_{t,k}\}}_{t=1}^{T}\right) & \propto {\chi }^{\,-2}\left({\sigma }_{k}^{2}{\rm{| }}{\nu }_{\sigma },{\sigma }_{0}^{2}\right)\mathop{\prod }\limits_{t=1}^{T}{\mathscr{N}}\left({\tilde{Y}}_{t,k}{\rm{| }}{\bar{Y}}_{t,k},{\sigma }_{k}^{2}{s}_{t,k}{I}_{D}\right)\\ & \propto {\sigma }_{k}^{-2-{\nu }_{\sigma }-{DT}}\exp \left[\frac{-{\nu }_{\sigma }{\sigma }_{0}^{2}}{2{\sigma }_{k}^{2}}-\frac{1}{2{\sigma }_{k}^{2}}\mathop{\sum }\limits_{t=1}^{T}\frac{\parallel {\tilde{Y}}_{t,k}-{\bar{Y}}_{t,k}{\parallel }^{2}}{{s}_{t,k}}\right]\\ & \propto {\chi }^{\,-2}\left({\sigma }_{k}^{2}{\rm{| }}{\nu }_{\sigma }+{DT},\left({\nu }_{\sigma }{\sigma }_{0}^{2}+{S}_{y}\right){\left({\nu }_{\sigma }+{DT}\,\right)}^{-1}\right)\end{array}$$

where \({S}_{y}={\sum }_{t}\parallel {\tilde{Y}}_{t,k}-{\bar{Y}}_{t,k}{\parallel }^{2}/{s}_{t,k}\)
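
Both conditionals are scaled inverse-χ² distributions, which can be sampled by rescaling an ordinary χ² draw. A minimal numpy sketch, assuming D = 2 keypoint dimensions and squared residuals ‖Ỹt,k − Ȳt,k‖² precomputed as an array of shape (T, K); names are ours.

```python
import numpy as np

def resample_s(rng, err_sq, sigmasq, nu_s, s_0, D=2):
    """Per-frame, per-keypoint scales; err_sq[t, k] = ||Ytilde_{t,k} - Ybar_{t,k}||^2."""
    nu_post = nu_s + D
    tau2_post = (nu_s * s_0 + err_sq / sigmasq) / nu_post        # broadcasts over (T, K)
    return nu_post * tau2_post / rng.chisquare(nu_post, size=err_sq.shape)

def resample_sigmasq(rng, err_sq, s, nu_sigma, sigmasq_0, D=2):
    """Per-keypoint variances, pooling over all T frames."""
    T = err_sq.shape[0]
    S_y = (err_sq / s).sum(axis=0)                               # (K,)
    nu_post = nu_sigma + D * T
    tau2_post = (nu_sigma * sigmasq_0 + S_y) / nu_post
    return nu_post * tau2_post / rng.chisquare(nu_post, size=S_y.shape)
```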

Derivation of vt update

We assume an improper uniform prior on vt, hence

$$\begin{array}{ll}P\left(\bf{v}_{t}{\rm{| }}{Y}_{t}\right) & \propto P\left({Y}_{t}{\rm{| }}\bf{v}_{t}\right)P\left(\bf{v}_{t}\right)\propto P\left({Y}_{t}{\rm{| }}\bf{v}_{t}\right)\\ & {\mathscr{\propto }}{\mathscr{N}}\left(\text{vec}\left(\left({Y}_{t}-{{\bf{1}}}_{K}\bf{v}_{t}^{{\rm{\top }}}\right)R{\left({h}_{t}\right)}^{{\rm{\top }}}\right){\rm{| }}\varGamma \left(C\bf{x}_{t}+\bf{d}\right),{S}_{t}\right)\\ & =\mathop{\prod}\limits _{k}{\mathscr{N}}\left(R\left({h}_{t}\right)\left({Y}_{t,k}-\bf{v}_{t}\right){\rm{| }}\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k},{s}_{t,k}{\sigma }_{k}^{2}{I}_{D}\right)\\ & =\mathop{\prod}\limits _{k}{\mathscr{N}}\left({v}_{t}{\rm{| }}{Y}_{t,k}-R{\left({h}_{t}\right)}^{{\rm{\top }}}\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k},{s}_{t,k}{\sigma }_{k}^{2}{I}_{D}\right)\\ & {\mathscr{=}}{\mathscr{N}}\left(\bf{v}_{t}{\rm{| }}{\mu }_{t},{\gamma }_{t}^{2}{I}_{D}\right)\end{array}$$

where

$$\begin{array}{c}{\mu }_{t}=\mathop{\sum }\limits_{k}\frac{{\gamma }_{t}^{2}}{{\sigma }_{k}^{2}{s}_{t,k}}\left({Y}_{t,k}-R{\left({h}_{t}\right)}^{{\rm{\top }}}\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}\right),\quad \frac{1}{{\gamma }_{t}^{2}}=\mathop{\sum }\limits_{k}\frac{1}{{\sigma }_{k}^{2}{s}_{t,k}}\end{array}$$
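
This is a precision-weighted average of the per-keypoint centroid estimates. A single-frame numpy sketch, assuming D = 2 and a standard counter-clockwise convention for R(ht) (the actual convention is fixed by the model definition elsewhere); names are ours.

```python
import numpy as np

def resample_v(rng, Y, x, h, s, sigmasq, C, d, Gamma):
    """Resample the centroid v_t for one frame.

    Y : (K, 2) raw keypoints; x : (M,) latent pose; h : heading angle (radians).
    """
    R = np.array([[np.cos(h), -np.sin(h)], [np.sin(h), np.cos(h)]])
    mu_k = Gamma @ (C @ x + d).reshape(-1, 2)              # rows are Gamma(Cx_t + d)_k
    resid = Y - mu_k @ R                                   # rows are Y_{t,k} - R(h_t)^T Gamma(Cx_t + d)_k
    w = 1.0 / (sigmasq * s)                                # per-keypoint precisions, (K,)
    gamma2 = 1.0 / w.sum()                                 # posterior variance gamma_t^2
    mu = gamma2 * (w[:, None] * resid).sum(axis=0)         # posterior mean mu_t
    return mu + np.sqrt(gamma2) * rng.standard_normal(2)   # draw from N(mu_t, gamma_t^2 I)
```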

Derivation of ht update

We assume a proper uniform prior on ht, hence

$$\begin{array}{ll}P\left({h}_{t}{\rm{| }}{Y}_{t}\right) & \propto P\left({Y}_{t}{\rm{| }}{h}_{t}\right)P\left({h}_{t}\right)\propto P\left({Y}_{t}{\rm{| }}{h}_{t}\right)\\ & \propto \exp \left[\mathop{\sum }\limits_{k}\frac{{\left({Y}_{t,k}-\bf{v}_{t}\right)}^{{\rm{\top }}}R\left({h}_{t}\right)\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}}{{s}_{t,k}{\sigma }_{k}^{2}}\right]\\ & =\exp \left[\mathop{\sum }\limits_{k}\frac{{\rm{tr}}\left[R\left({h}_{t}\right)\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}{\left({Y}_{t,k}-\bf{v}_{t}\right)}^{{\rm{\top }}}\right]}{{s}_{t,k}{\sigma }_{k}^{2}}\right]\\ & =\exp {\rm{tr}}\left[R\left({h}_{t}\right)S\right] \; \text{where} \; S=\mathop{\sum }\limits_{k}\varGamma {\left(C\bf{x}_{t}+\bf{d}\right)}_{k}{\left({Y}_{t,k}-\bf{v}_{t}\right)}^{{\rm{\top }}}/\left({s}_{t,k}{\sigma }_{k}^{2}\right)\\ & \propto \exp \left[\cos \left({h}_{t}\right)\left({S}_{1,1}+{S}_{2,2}\right)+\sin \left({h}_{t}\right)\left({S}_{1,2}-{S}_{2,1}\right)\right]\end{array}$$

Let \(\left[\kappa \cos \left(\theta \right),\kappa \sin \left(\theta \right)\right]\) represent \(\left[{S}_{1,1}+{S}_{2,2},{S}_{1,2}-{S}_{2,1}\right]\) in polar coordinates. Then

$$\begin{array}{cc}P\left({h}_{t}{\rm{| }}{Y}_{t}\right) & \propto \exp \left[\kappa \left(\cos \left({h}_{t}\right)\cos \left(\theta \right)+\sin \left({h}_{t}\right)\sin \left(\theta \right)\right)\right]\\ & =\exp \left[\kappa \cos \left({h}_{t}-\theta \right)\right]\propto \text{vM}\left({h}_{t}{\rm{| }}\theta ,\kappa \right)\end{array}$$
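
The resulting von Mises conditional can be sampled directly, for example with numpy's generator. The sketch below follows the construction of S and the polar-coordinate step above; the argument names and the D = 2 rotation convention are ours.

```python
import numpy as np

def resample_h(rng, Y, v, x, s, sigmasq, C, d, Gamma):
    """Resample the heading h_t for one frame from its von Mises conditional."""
    mu_k = Gamma @ (C @ x + d).reshape(-1, 2)              # rows are Gamma(Cx_t + d)_k
    w = 1.0 / (sigmasq * s)                                # per-keypoint precisions, (K,)
    # S = sum_k Gamma(Cx_t + d)_k (Y_{t,k} - v_t)^T / (s_{t,k} sigma_k^2)
    S = (mu_k * w[:, None]).T @ (Y - v)
    kappa = np.hypot(S[0, 0] + S[1, 1], S[0, 1] - S[1, 0])
    theta = np.arctan2(S[0, 1] - S[1, 0], S[0, 0] + S[1, 1])
    return rng.vonmises(theta, kappa)                      # h_t ~ vM(theta, kappa)
```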

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
