arXiv

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

June 2, 2026 · Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

Abstract:

Marine plankton are fundamental to aquatic food webs and are essential for global CO2 sequestration; therefore, accurate species identification is vital for assessing ocean health and climate feedback mechanisms. However, current classification models often struggle to generalize across different instruments and environments. This limitation stems from training datasets that are isolated and labels that lack consistency.

To overcome these challenges, we present Planktonzilla-17M, a unified dataset that aggregates publicly available plankton image collections from thirteen distinct imaging systems. As the largest and most comprehensive plankton image dataset to date, it contains 17.4 million images accompanied by standardized taxonomy and geo-environmental metadata. Specifically, the dataset includes 3.74 million plankton images categorized into more than 602 taxonomic classes, with 201 of these classes identified at the species level.

Utilizing this extensive dataset, we conducted a controlled comparison between supervised learning and CLIP-style image-text training, both employing a shared Vision Transformer (ViT) backbone. Our results indicate that a supervised classifier performs on par with or better than CLIP-style training when the latter utilizes taxonomic lineage as text input. Conversely, we observed that BioCLIP and BioCLIP2 exhibit poor performance in zero-shot and few-shot scenarios when applied to plankton. These findings underscore the limitations of existing biological foundation models in marine imaging domains and demonstrate that leveraging Planktonzilla-17M significantly enhances plankton classification performance.
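To make the two training objectives in this comparison concrete, the sketch below contrasts a supervised cross-entropy loss with a symmetric CLIP-style contrastive loss whose text side is a caption built from a taxonomic lineage. It is an illustration under stated assumptions, not the authors' implementation: the function names, the caption template, and the temperature of 0.07 are hypothetical, and small Python lists stand in for the embeddings a ViT backbone and a text encoder would produce.

```python
import math


def lineage_to_text(lineage):
    """Join taxonomic ranks into one caption (hypothetical template)."""
    return "a photo of " + " ".join(lineage)


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_entropy(logits, targets):
    """Supervised objective: mean negative log-likelihood of the true class."""
    total = sum(log_softmax(row)[t] for row, t in zip(logits, targets))
    return -total / len(logits)


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style objective: image i should match text i and vice versa.

    The temperature is an assumed value, not taken from the paper.
    """
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    # Cosine similarity between every image and every text, scaled.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    diag = list(range(len(sim)))              # matched pairs sit on the diagonal
    return 0.5 * (cross_entropy(sim, diag) + cross_entropy(sim_t, diag))


caption = lineage_to_text(["Animalia", "Arthropoda", "Copepoda"])
matched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Here `matched` is close to zero because each image embedding aligns with its own caption embedding, while `swapped` is large because every pair is mismatched. The supervised loss scores each image against a fixed set of class indices, whereas the contrastive loss scores it against the other captions in the batch, so two images of the same taxon in one batch act as negatives for each other unless that case is handled separately.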

Source: arXiv