Aerial scene classification in remote sensing: An approach using image and sound data and self-supervised learning
Publisher
Universidade do Estado do Amazonas
Abstract
Scene classification is a computer vision task in which a model interprets a whole
context or environment rather than classifying a single object, as in image
classification. It is an active research area because it supports important tasks
such as content-based retrieval and smart content moderation.
When applied to remote sensing data, it is also crucial for understanding
the environment around us, with applications such as city monitoring and land use
classification. For aerial scene classification in particular, many studies rely on
convolutional neural networks and therefore depend on a large number of image
annotations. For this reason, training techniques such as self-supervised learning (SSL),
in which the model first learns to generate representations from pseudolabels before
performing the main task, have been increasingly applied in recent literature. Furthermore,
work on the ADVANCE and SoundingEarth datasets has shown that multimodal data combining
geolocated images and sounds can improve model performance in this task. This work
therefore demonstrates the use of SSL and audiovisual remote sensing data together with
vision transformers, a recent deep learning architecture based on attention mechanisms,
for generating embeddings. First, pre-training was conducted on SoundingEarth, using
batch triplet loss to pull matching image and sound pairs closer together and push
non-matching pairs apart. Subsequently, these representations were fed to a logistic
regression model to classify aerial scenes from ADVANCE. The results showed precision,
recall, and F1-Score above 80% for models trained with both image and sound embeddings.
With image embeddings alone, the same metrics were also above 80%; with audio
embeddings alone, they were above 40%.
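
The batch triplet loss mentioned in the abstract can be illustrated with a minimal
NumPy sketch. It is not the author's implementation: the function name
batch_triplet_loss, the Euclidean distance, the margin of 1.0, and the choice to treat
every other sound in the batch as a negative for each image are all assumptions made
for illustration, since the abstract does not specify them.

```python
import numpy as np


def batch_triplet_loss(image_emb, audio_emb, margin=1.0):
    """Hypothetical batch triplet loss over paired image/audio embeddings.

    Row i of image_emb and row i of audio_emb are assumed to be a positive
    pair; every other audio row in the batch is treated as a negative.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    audio_emb = np.asarray(audio_emb, dtype=float)

    # dist[i, j] = Euclidean distance between image i and audio j.
    diff = image_emb[:, None, :] - audio_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Distance of each image to its own (positive) audio clip, shape (B, 1).
    pos = np.diag(dist)[:, None]

    # Hinge term: the positive should be closer than each negative by `margin`.
    hinge = np.maximum(0.0, pos - dist + margin)

    # Exclude the diagonal, where the "negative" is the positive itself.
    negatives = ~np.eye(dist.shape[0], dtype=bool)
    return hinge[negatives].mean()


# Toy batch: four well-separated embeddings, identical across modalities.
aligned = 10.0 * np.eye(4)
loss_aligned = batch_triplet_loss(aligned, aligned)

# Same embeddings with the audio rows shifted, so every pair is mismatched.
loss_shuffled = batch_triplet_loss(aligned, np.roll(aligned, 1, axis=0))
```

In this toy batch the loss vanishes when each image sits on top of its own sound and
far from the others, and it grows once the pairs are mismatched, which is the behaviour
the pre-training objective relies on.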
