We try several ways of combining embeddings (concatenation ("cat"), averaging, and fusion) across three models (Hubert, Wav2vec2, Torchcrepe).
We relate our scene-embedding and timestamp-embedding models in two ways. In "fusion_cat_xwc_time", the timestamp embeddings are averaged over each fixed time interval, and the interval averages are concatenated. In the other models, we simply average the embeddings of the three models (Hubert, Wav2vec2, Torchcrepe).
Note that Hubert is used in both its xlarge and large variants, whereas Wav2vec2 is used only in its large variant.
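
The two ways of turning timestamp embeddings into a scene embedding can be illustrated with a minimal NumPy sketch. It is not the actual implementation: the function names `scene_by_mean` and `scene_by_interval_cat` and the parameter `n_intervals` are hypothetical, and the sketch assumes the timestamp embeddings form an array of shape `(T, D)` and that "averaging" means averaging over the time axis.

```python
import numpy as np


def scene_by_mean(timestamp_emb):
    """Average all timestamp embeddings over time: (T, D) -> (D,)."""
    return np.asarray(timestamp_emb, dtype=float).mean(axis=0)


def scene_by_interval_cat(timestamp_emb, n_intervals):
    """Average within each time interval, then concatenate the averages.

    (T, D) -> (n_intervals * D,). The number of intervals is an assumption
    made for illustration; the source does not state the interval length.
    """
    emb = np.asarray(timestamp_emb, dtype=float)
    if n_intervals < 1 or emb.shape[0] < n_intervals:
        raise ValueError("need at least one frame per interval")
    # Split the time axis into nearly equal contiguous chunks.
    chunks = np.array_split(emb, n_intervals, axis=0)
    return np.concatenate([chunk.mean(axis=0) for chunk in chunks])


# Toy input: 6 frames, 2 dimensions each.
emb = np.arange(12, dtype=float).reshape(6, 2)
mean_scene = scene_by_mean(emb)             # shape (2,)
cat_scene = scene_by_interval_cat(emb, 3)   # shape (6,)
```

Compared with a single global average, the interval variant keeps coarse temporal order, at the cost of a scene embedding that is `n_intervals` times larger.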