Transformer-based automatic music mood classification using multi-modal framework

Bibliographic Details
Main Authors: Suresh Kumar, Sujeesha Ajithakumari; Rajan, Rajeev
Format: Article
Language: English
Published: 2023
Online Access: http://sedici.unlp.edu.ar/handle/10915/152120
Description
Summary: According to studies, music affects our moods, and we also tend to choose music based on our current mood. Audio-based techniques can achieve promising results, but lyrics also carry relevant information about a song's mood that may not be present in the audio. A multi-modal system combining textual and acoustic features can therefore provide enhanced accuracy. Sequential networks such as long short-term memory (LSTM) networks and gated recurrent unit (GRU) networks are widely used in state-of-the-art natural language processing (NLP) models. A transformer model uses self-attention to compute representations of its inputs and outputs; unlike recurrent neural networks (RNNs), which process inputs sequentially, transformers can parallelize over input positions during training. In this work, we propose a multi-modal music mood classification system based on transformers and compare its performance with a bi-directional GRU (Bi-GRU)-based system, with and without attention. The performance is also compared with other state-of-the-art approaches. The proposed transformer-based model achieved higher accuracy than the Bi-GRU-based multi-modal system with single-layer attention, reaching a maximum accuracy of 77.94%.
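
The abstract contrasts self-attention, which processes all sequence positions in parallel, with recurrent networks, which step through the sequence. As a rough illustration only (not the authors' implementation; the model dimensions, fusion strategy, and four-class mood setup below are assumptions), a minimal multi-modal classifier in PyTorch might encode lyrics with a transformer encoder and fuse the result with acoustic features:

```python
import torch
import torch.nn as nn

class MultiModalMoodClassifier(nn.Module):
    """Sketch of a transformer-based multi-modal mood classifier.

    All dimensions are hypothetical; the abstract does not specify them.
    """

    def __init__(self, vocab_size=10000, d_model=128, n_heads=4,
                 n_layers=2, audio_dim=40, n_moods=4):
        super().__init__()
        # Text branch: token embeddings + transformer encoder (self-attention).
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Audio branch: project acoustic features (e.g., spectral statistics).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Classifier head over the concatenated modalities.
        self.classifier = nn.Linear(2 * d_model, n_moods)

    def forward(self, token_ids, audio_feats):
        # Self-attention attends over all lyric positions at once,
        # unlike an RNN, which consumes the sequence step by step.
        text = self.encoder(self.embed(token_ids))   # (B, T, d_model)
        text = text.mean(dim=1)                      # pool over positions
        audio = self.audio_proj(audio_feats)         # (B, d_model)
        fused = torch.cat([text, audio], dim=-1)     # late fusion by concat
        return self.classifier(fused)                # mood logits

# Usage with random stand-in data (batch of 8, 50-token lyrics).
model = MultiModalMoodClassifier()
tokens = torch.randint(0, 10000, (8, 50))
audio = torch.randn(8, 40)
logits = model(tokens, audio)   # shape: (8, 4)
```

Late fusion by concatenation is just one common strategy for combining textual and acoustic features; the paper's actual fusion scheme may differ.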