Object Detection Examples

Overview

http://www.viametoolkit.org/wp-content/uploads/2018/02/skate_detection.png

This document corresponds to the object detection example folder within a VIAME desktop installation. Object detection identifies and localizes objects of interest within images or video frames, producing bounding box detections with associated class labels and confidence scores.

VIAME includes a variety of detection algorithms spanning classical computer vision, traditional machine learning, and modern deep learning approaches. Most detectors are trainable from user-provided annotations – see the object detector training examples for details on training custom models.

Detection pipelines can be run from the command line using the viame tool, or from one of the user interfaces within VIAME (e.g. DIVE, VIEW, SEAL). In the DIVE interface, pipelines are organized into menu groups based on the first word of the pipeline file name. Detection pipelines appear under the Detector menu (e.g. Detector -> Generic Proposals for detector_generic_proposals.pipe). In the VIEW interface, all pipelines are available in the pipelines dropdown.

VIAME detectors fall into three broad categories:

  1. Deep learning detectors – learned models trained on annotated data (GPU recommended)

  2. Classical ML detectors – SVM classifiers applied over proposal regions or features

  3. Motion / heuristic detectors – detect objects via motion, shape, or other cues (no training)

Deep Learning Detectors

Deep learning detectors are the most common choice for production use. They learn to recognize objects from annotated training data and typically require a GPU for both training and inference. VIAME supports several deep learning detection frameworks. For details on training these detectors, see the object detector training examples.

Netharn Cascade Faster R-CNN (CFRNN)

The Netharn CFRNN detector is the current default deep learning detector in VIAME. It uses a Cascade Faster R-CNN architecture with a ResNeXt-101 backbone, providing strong accuracy across a wide range of object types and sizes. It is the most commonly used detector in production VIAME deployments including fish detection, scallop detection, and marine mammal surveys.

Key properties:

  • Two-stage detector with cascade refinement for improved localization

  • ResNeXt-101 backbone pretrained on ImageNet

  • Input resolution: 640x640 (configurable)

  • Automatic windowed/tiled inference for high-resolution imagery

  • Supports continuing training from a previously trained checkpoint

  • Handles mixed aspect ratios and high class imbalance well

  • Best with 500+ annotations per class

  • GPU required (PyTorch)

A grid/tiling variant is also available (Netharn CFRNN Grid) that is optimized for small objects in dense scenes (20+ objects per frame). The tiled approach processes overlapping image chips to better detect small targets.

RF-DETR (Real-Time Foundation DETR)

RF-DETR is a transformer-based detector that uses a DETR (DEtection TRansformer) architecture with a DINOv2 backbone. Its transformer attention mechanism makes it particularly effective in dense scenes with frequent occlusion, where objects overlap and partially obscure one another.

Key properties:

  • Transformer-based end-to-end detector (no anchors or NMS needed during training)

  • Three model sizes: nano (384px), base (560px), large (728px)

  • EMA (Exponential Moving Average) for training stability

  • Excels in dense scenes with overlapping objects and multi-scale variation

  • Good choice for medium-to-large objects with sufficient training data (400+)

  • GPU required (PyTorch)

MIT-YOLO (v9)

MIT-YOLO is a modern YOLO variant based on the YOLOv9-c architecture. It is a single-stage detector optimized for speed and accuracy, making it a good choice when fast inference is important or training data is moderate.

Key properties:

  • Single-stage detector with fast inference

  • YOLOv9-c architecture at 640x640 resolution

  • Handles multi-scale variation well

  • Well suited for real-time or near-real-time detection

  • Moderate data requirements (200+ annotations per class recommended)

  • GPU required (PyTorch)

Darknet YOLO

Darknet YOLO is a mature YOLO implementation supporting YOLOv4 and YOLOv7 variants at various resolutions. It is widely used and well-tested in production VIAME deployments, including arctic seal detection from aerial imagery.

Key properties:

  • Multiple architecture variants: YOLOv4-CSP (small/medium), YOLOv7 (small)

  • Configurable input resolution (512–832px)

  • Supports windowed/tiled inference for high-resolution imagery

  • Requires the Darknet YOLO add-on

  • GPU required

MMDetection Cascade R-CNN

MMDetection provides access to the OpenMMLab detection framework, primarily used for Cascade R-CNN with various backbones. It supports distributed training and advanced features like EQLv2 loss for handling imbalanced class distributions.

Key properties:

  • Cascade R-CNN architecture via OpenMMLab framework

  • Supports multiple backbones including ConvNeXt

  • Input resolution: up to 1333px

  • Distributed training support (multi-GPU, Slurm, MPI)

  • GPU required (PyTorch)

Detectron2 Faster R-CNN

The Detectron2 Faster R-CNN detector uses Facebook’s Detectron2 framework with a ResNet-50 + FPN backbone.

Key properties:

  • Two-stage Faster R-CNN with Feature Pyramid Network

  • ResNet-50 backbone pretrained on COCO

  • Input resolution: 800px (configurable)

  • GPU required (PyTorch)

LitDet (Faster R-CNN, SSD, and Others)

LitDet provides PyTorch Lightning-based implementations of several detection architectures with a focus on ease of use and reproducibility. The Faster R-CNN variant works well for large objects and tall aspect ratios in sparse scenes.

Key properties:

  • Faster R-CNN: ResNet-50 + FPN backbone at 640px – best for large or tall objects

  • SSD: VGG-16 backbone at 300px – lightweight and fast

  • Additional architectures: SSDLite, RetinaNet, FCOS

  • Fine-tuning from COCO pretrained weights

  • TensorBoard logging enabled by default

  • GPU required (PyTorch)

Netharn Mask R-CNN

Mask R-CNN extends Faster R-CNN with an additional branch for predicting segmentation masks alongside bounding boxes. Use this when both detection boxes and pixel-level masks are needed. Used in marine mammal surveys for precise animal outlines.

Key properties:

  • Produces both bounding boxes and instance segmentation masks

  • ResNet-50 backbone at 720px or 1280px

  • Useful for training data that includes polygon annotations

  • GPU required (PyTorch)

Zero-Shot / Foundation Models

These detectors use large pretrained vision models and can detect objects without any task-specific training data.

HuggingFace Zero-Shot Detector

The HuggingFace zero-shot detector uses a pretrained Grounding DINO model to detect objects from text descriptions without any task-specific training data.

Key properties:

  • No training required – detects objects by text query

  • Based on Grounding DINO (IDEA-Research/grounding-dino-tiny)

  • Multi-scale detection with NMS post-processing

  • Useful for quick exploration or bootstrapping annotations before training

  • GPU recommended

MaskCut (Unsupervised Object Discovery)

MaskCut uses DINO ViT (Vision Transformer) features to discover and segment objects in images without any training or text prompts. It generates object masks through an unsupervised spectral clustering approach.

Key properties:

  • No training or prompts required – fully unsupervised

  • Uses DINOv2 backbone features (small or base)

  • CRF post-processing for refined mask boundaries

  • Useful for exploring what objects exist in unlabeled data

  • GPU recommended

Novelty-Aware Detectors

These detectors extend standard detection with the ability to flag novel or out-of-distribution objects that were not seen during training.

ReMax DINO Detector

Combines a DINO (DEtection with Implicit Orientation) detector with ReMax novelty scoring. Detects known object classes while also assigning a novelty probability to each detection, flagging objects that may be from unseen categories.

Key properties:

  • Open-set detection: identifies both known and unknown object classes

  • DINO backbone with Swin Transformer

  • Useful for survey work where unexpected species may appear

  • Requires the learn add-on

  • GPU required (PyTorch)

ReMax ConvNeXt Detector

Similar to ReMax DINO but uses a ConvNeXt backbone via MMDetection. Also provides novelty scoring alongside standard detection.

Key properties:

  • Open-set detection with ConvNeXt modern CNN backbone

  • MMDetection Cascade R-CNN architecture

  • Requires the learn add-on

  • GPU required (PyTorch)

Classical ML Detectors

SVM Classifier

The SVM (Support Vector Machine) classifier operates over detection proposals or features extracted by another detector. It classifies proposal regions into object categories using learned SVM models.

Key properties:

  • Classical machine learning approach – fast training and inference

  • Can operate over fish-specific or generic object proposals

  • Low data requirements (50+ annotations per class can be sufficient)

  • No GPU required – runs entirely on CPU

  • Good as a quick baseline or for rapid prototyping

  • Requires .svm model files in a category_models directory

Motion / Heuristic Detectors

GMM Motion Detector

The GMM (Gaussian Mixture Model) motion detector identifies moving objects by building a statistical background model and detecting foreground regions that deviate from it.

Key properties:

  • No training required – unsupervised background subtraction

  • Best for stationary camera scenarios with moving objects

  • No GPU required

  • Not suitable for static objects or moving cameras

Canny Edge Detector

Detects circular and elliptical objects using Canny edge detection followed by contour linking and circle fitting. Configurable edge thresholds and contour size filters.

Key properties:

  • Detects circular/elliptical shapes via edge contours

  • Configurable edge thresholds and size filtering

  • No training required, no GPU required

Difference of Gaussians (DoG) Detector

Scale-space blob detector using a Gaussian pyramid with Difference of Gaussians. Detects dark and bright blobs within specified size ranges using sub-pixel extrema detection.

Key properties:

  • Detects blob-like objects across multiple scales

  • Configurable for dark blobs, bright blobs, or both

  • No training required, no GPU required

Simple Hough Detector

The Hough circle detector uses the OpenCV Hough Circle Transform to detect circular objects in images. Primarily useful as a simple demonstration or for specialized applications with circular targets.

Choosing a Detection Method

Running the Command Line Examples

Each example script sources the VIAME setup script to configure paths, then runs a detection pipeline via the viame tool. Pipelines read images from an input list file and write detections to computed_detections.csv in VIAME CSV format.

Example CLI scripts in this folder include:

  • run_generic_proposals – run the generic object proposal detector

  • run_fish_without_motion – run the default fish detector (requires add-on)

  • run_gmm_motion – run the GMM motion detector on video

  • run_huggingface_zeroshot – run the zero-shot detector (no training needed)

To run an example on Linux:

./run_generic_proposals.sh

Or on Windows:

run_generic_proposals.bat

Detections are written to computed_detections.csv in the current directory. The input images are specified in a text file (one image path per line), which can be modified to point to your own imagery.

Detection parameters can be overridden from the command line using the -s flag:

viame configs/pipelines/detector_generic_proposals.pipe \
      -s input:video_filename=my_images.txt

Running Examples in the GUI

In the DIVE interface, detection pipelines are available under the Detector menu. The menu is organized by the first word of the pipeline file name. For example, detector_generic_proposals.pipe appears as Detector -> Generic Proposals. To run a detector, load imagery into DIVE, then select the desired pipeline from the Detector menu.

Domain-specific detectors from installed add-ons (e.g. fish detectors, seal detectors) also appear in the Detector menu once the corresponding add-on is installed.

In the VIEW interface, all detection pipelines are available in the pipelines dropdown.

Domain-Specific Detectors

Several VIAME add-on packages include pretrained detectors for specific domains:

  • default-fish – fish detection models for underwater imagery

  • sea-lion – marine mammal detection (YOLO, CFRNN, Mask R-CNN, and fusion)

  • arctic-seal – arctic seal detection from aerial EO and IR imagery

  • habcam – scallop and fish detection from HabCam benthic survey imagery

  • seamap – fish species detection from SEAMAP survey imagery

These add-ons can be installed via the VIAME add-on manager. Once installed, their detection pipelines become available both on the command line and in the GUI.