Quick-Start Guide

VIAME (Video and Image Analytics for Multiple Environments) is a do-it-yourself AI system for analyzing imagery and video, primarily targeting marine species analytics but also useful as a general computer vision toolkit.

VIAME Flavors

VIAME comes in a few different interfaces with slightly different capabilities. Some interfaces are deprecated (e.g. VIEW) in favor of newer replacements (DIVE), though will still remain as an option for a few specialized cases, just not developed significantly further. Not listed in this document thoroughly are programming APIs for developers.

DIVE – Web and Desktop Annotator

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_dive_annotator.png https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_dive_dataset_list.png

Originally created as the VIAME-Web interface (with a public server hosted at https://viame.kitware.com), a desktop version of this web annotator and model trainer is also available in both Windows .msi installers and .zip release formats.

This tool is currently the most general purpose annotator, and supports polygons, lines, points, or boxes, and can train models over multiple videos or image sequences using standard models.

SEAL – Multi-Model Desktop Annotator

Supports annotating detection or track boxes in multiple camera views simultaneously. If a transformation is loaded mapping pixels from one view to the other (e.g. boxes created in one camera view will show up in the other). Supports 2-4 cameras side-by-side in the viewer. Can only train models on one sequence from one camera at a time.

SEARCH – Standalone Search Tool

An older tool, used explicitly for image/video search and rapid model generation through a procedure called iterative query refinement (IQR), wherein the user provides an exemplar of what they’re looking for then the system provides new results for the user to accept or reject. While this is happening, a simple model is trained for the query which can be saved out and re-used in annotators.

VIEW – Original Desktop Annotator

Original VIAME desktop annotator for generating either detection or track-level annotations (boxes or polygons). Contains many optimizations for annotating and running pipelines on large (high resolution) imagery. Coded in C++ for efficiency. Can only train models over a single video or image sequence, with limited model selection. Some users prefer its style of track annotation or use it on high resolution clips.

Project Files

Project files are a collection of scripts targeting either groups of images or videos. They are documented later on in this guide. Project files are also used to launch some of the annotation GUIs in the desktop version of the software, or to train models across multiple sequences headless (without a GUI) to prevent the GUI from using any system resources while training, e.g. VRAM, reserving more for the training process.

Example Folders

In the “examples” folder of a VIAME install are a series of standalone .bat (Windows) or .sh (Linux) launchers broken down based on functionality covering all aspects of the system.

Capabilities Breakdown

Legend: Y = Full Support, P = Partial Support, ~ = Planned, N = No Support, via = Available through Example/Project Files

Platform & Installation Support

Feature

Example Files

Project Files

VIEW

SEARCH

SEAL

DIVE

Runnable from Desktop Installers on Local Desktop

Y

Y

Y

Y

Y

Y

Runnable Remotely over RDP (Windows) or VNC (Linux)

Y

Y

Y

Y

Y

Y

Runnable in Web Browser using Remote Server

N

N

N

N

N

Y

Windows .zip / Linux .tar.gz Installations Provided

Y

Y

Y

Y

Y

Y

Windows .exe / .msi Installers Provided

Y

Y

Y

Y

Y

~

Docker Instances Provided

Y

Y

N

N

N

Y

Feature Support by Interface

Feature

Example Files

Project Files

VIEW

SEARCH

SEAL

DIVE

Standard box-level annotation support in GUI

via

via

Y

P 1

Y

Y

Polygon-level annotation support in GUI

via

via

N

N

N

Y

Pixel-mask annotation support in GUI

N

N

N

N

N

Y

Key-point annotation support in GUI

via

via

P 2

N

N

Y

Joint annotation across 2 to 4 cameras simultaneously

N

N

N

N

Y

~

Detection model training on single sequence or video

N

Y

Y

P 3

Y

Y

Detection model training on multiple sequences or video

N

Y

N

P 3

N

Y

Ability to run arbitrary detection or tracking pipelines

Y

Y

Y

N

Y

Y

Ability to run detection pipelines on multiple cameras

P

Y

N

N

Y

N

Ability to perform image search and iterative refinement

via

via

N

Y

N

~

Annotation support on very large images in GUI

via

via

Y

N

P 4

P 4

Annotation support on images of varying resolutions

via

via

Y

N

N

Y

Ability to run stereo measurement pipelines

Y

Y

N

N

P 5

Y

Ability to run image enhancement under the hood

Y

Y

Y

Y

Y

Y

Ability to output enhanced images

via

Y

N

N

N

Y

Ability to output mosaiced images

Y

Y

N

N

N

N

Automatic scoring and evaluation of detections

Y

P

N

N

N

~

1 Can only confirm or reject boxes
2 Via drawing small boxes
3 SVM models only
4 Basic support, longer load time
5 Poor visualization of results

GPU vs CPU Installations

VIAME is designed to run on 8 Gb+ VRAM NVIDIA Graphics cards (1 or more), but:

  • Many algorithms can run with less and a generic 4 Gb patch is available on the install page

  • Also depends on if talking about just inference (pre-trained model running, uses less) or training

  • This is just for algorithms and processing pipelines; annotation GUIs can be run on CPU

Additionally:

  • Some algorithms are meant to run on CPU (motion tracker, baseline pixel classification)

  • Some algorithms are meant to run on GPU, but can run on CPU (deep frame classification)

  • Some are designed for GPU, and can run on but take forever on CPU (most deep CNN detectors, many deep learning training routines)

How do I know if I have a GPU?

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_device_manager_gpu.png

On Windows, look in Device Manager. Sometimes computers have more than one card (one embedded on the motherboard, then a 2nd in a plugin slot). Next, search for the card to know its specifications. On Linux, many terminal commands can tell you which GPU you have (e.g. nvidia-smi, lspci | grep -i nvidia).

Types of Annotation and Detection Models

There are four main types of annotations and detection models:

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_annotation_box_level.png

Box-Level: A bounding box around the object of interest.

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_annotation_frame_level.png

Frame-Level: The entire frame is classified (e.g. the whole image has a label).

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_annotation_pixel_level.png

Pixel-Level: Pixel masks or polygons tracing the exact outline of objects.

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_annotation_keypoints.png

Keypoints: Specific points of interest on objects (e.g. head, tail).

Detections vs Tracks

Detections and tracks are synonymous across examples and user interfaces. A track is a (temporal) sequence of single-frame detections, but a detection can also be viewed as a track with just a single state.

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_detection_example.png

Detection

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_track_example.png

Track

Annotation Formats

For details on annotation file formats, see the Detection File Conversions section.

VIAME-CSV is the primary input/output format supported by default, with a single line for either each detection, or each detection state in a track. It has 9 required fields comma separated, with optional additional columns for keypoints, attributes, polygons, and masks.

COCO JSON adaptation is also supported by some GUIs, with added track support.

Annotation Best Practices

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_annotation_best_practices.png https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_annotation_gui_example.png

When creating bounding box annotations:

  • The goal is for the center of bounding box to remain over the center of the tracked object without clipping too many extremity pixels

  • Attempt to avoid dramatic box size changes that aren’t associated with an object’s movement or overly large boxes

  • Need to consider efficiency (time) vs quality tradeoffs when deciding to do boxes vs pixel masks, box quality, keypoints + boxes, etc.

3 Core Model Training Workflows

https://www.viametoolkit.org/wp-content/uploads/2026/04/quickstart_training_workflows_diagram.png

Workflow #1: Traditional Deep Learning from Scratch

(a) Load up imagery in annotator
(b) Annotate imagery manually
(c) Export detection or tracks files
(d) Repeat for as many sequences as possible in diverse backgrounds
(e) Run model training
(f) Evaluate model performance
(g) Repeat (d) thru (f) as desired on detector fail cases, focusing additional annotation on sequences with the most errors

Pros: Models perform better than most other solutions when trained with enough training data.

Cons: Requires a large amount of training data and user time to generate it.

Workflow #2: Deep Learning with Partial Automation

(a) Load up imagery in annotator
(b) Run an automated detector (can be IQR based, default model, other pre-trained detector, or user generated deep detector)
(c) Correct and export detection or tracks files
(d) Repeat for as many sequences as desired in diverse backgrounds
(e) Run model training
(f) Evaluate model performance
(g) Repeat (b) thru (f) as desired on detector fail cases

Pros: Can speed up annotation if automated detector is decent enough.

Cons: If automated detector is poor it can take more effort to correct automated outputs instead of doing annotations from scratch.

Workflow #3: IQR (Video Search with Adjudication) for Rapid Model Generation

(a) Create searchable index for a video archive (either at full frame level, detection level on top of pre-trained detectors, or track level)
(b) Launch search GUI
(c) Use search GUI to generate IQR (.svm) models
(d) Save models to category directory
(e) Evaluate models

Pros: Can be done with very little user effort, mostly computer runtime. Can be used to rapidly generate models for new classes.

Cons: GUI generally crashes after about 6 iterations due to memory issues. Models generally not as good as deep models trained on enough training data (but can be better for cases with not a lot of training data).