Validating the AI model

Guide to using the model "Report" page

The Report page on ReefCloud lets you monitor how your AI model is performing for your project. The model is assessed by measuring the agreement between your manual annotations and the machine annotations across all manually annotated points in your project.

Four reports are provided:

  1. Site Validation allows you to compare model accuracy across sites, highlighting disagreement between the human and machine within a given benthic category.

  2. Model Validation provides an overall F1 score of model performance over time as training progresses.

  3. Label Validation shows the accuracy of the model, measured by the F1 score in classifying each label used within the project.

  4. Survey Cover Error outlines the absolute difference between the percentage cover estimated by the human and machine for each label.

To start exploring your model's performance, navigate to the "Report" page of your ReefCloud project within the portal.

Four-minute video tutorial on AI model interpretation and validation in ReefCloud

Site Validation

Site Validation allows you to compare machine accuracy across sites and between labels within a given benthic category. This can be useful for highlighting areas where the human and the machine disagree, which may indicate that the model needs more training.

Overview of the "Site Validation" tab

The map displays an orange circle for each site: the larger the circle, the greater the disagreement between the human and machine at that site. Clicking on a circle displays an Error value, which is the percentage of annotations for that site where disagreement occurs. For example, an Error of 4.45% shows that the machine has assigned a different label from the human in 4.45% of the annotated points (in other words, a 95.55% Accuracy score at that site).
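The relationship between a site's Error and Accuracy values can be illustrated with a short calculation. This is a minimal sketch, not ReefCloud code; the labels, values, and the `site_error` helper are all hypothetical:

```python
# Sketch: a site's Error is the % of annotated points where the machine
# label differs from the human label; Accuracy is the complement.
def site_error(human, machine):
    """Percentage of paired annotations where the machine disagrees."""
    assert len(human) == len(machine)
    disagreements = sum(h != m for h, m in zip(human, machine))
    return 100.0 * disagreements / len(human)

# Illustrative labels for six annotated points at one site
human   = ["HC", "HC", "SC", "ALG", "HC", "SND"]
machine = ["HC", "SC", "SC", "ALG", "HC", "SND"]

error = site_error(human, machine)   # one disagreement in six points
accuracy = 100.0 - error
```

With one disagreement out of six points, the Error is about 16.7% and the Accuracy about 83.3%.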

Large circles might indicate more training is needed at that site. It is generally encouraged to spread your annotation effort across all sites evenly.

Grey circles indicate sites where no human annotations have been completed, so the machine estimates of cover cannot be compared against a human benchmark. Blue circles indicate sites where there is complete agreement between the machine and human on labels.

For each site, you can further explore model performance for each label by selecting a Benthic Group in the drop-down menu. Note that this information will not display if you haven't completed Label Set Mapping, a step which defines for ReefCloud which Benthic Group each of your labels belongs to.

The Cover plot displays the estimated mean percentage cover of each label within a given benthic category across the site selected (between 0% and 100% cover), based on annotations across all images within that site. Error bars provide 95% confidence intervals around each estimate. Overlapping error bars suggest human (blue) and model (orange) estimates are in agreement over the mean percentage cover of a given label at that site.
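The per-label estimates on the Cover plot can be thought of as a mean across images with a confidence interval. The sketch below uses a simple normal-approximation 95% interval; ReefCloud's actual interval calculation may differ, and the per-image cover values are illustrative:

```python
# Sketch: mean percentage cover of one label across a site's images,
# with a normal-approximation ~95% confidence interval.
import statistics

def mean_cover_ci(covers, z=1.96):
    """Mean cover (%) across images with a z * SEM confidence interval."""
    mean = statistics.mean(covers)
    sem = statistics.stdev(covers) / len(covers) ** 0.5  # standard error
    return mean, mean - z * sem, mean + z * sem

# Illustrative % cover of "Hard Coral" estimated from each image at a site
per_image_cover = [22.0, 18.5, 25.0, 20.5, 19.0]
mean, lo, hi = mean_cover_ci(per_image_cover)  # mean is 21.0%
```

If the human and machine intervals computed this way overlap, the two cover estimates are consistent with each other, which is what overlapping error bars on the Cover plot indicate.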

The Error plot highlights the absolute difference in estimated cover by the machine compared to the human. This may help to identify labels which have been inconsistently annotated by the human (possible in teams where a number of annotators are contributing to training the model), or labels which the model may require more training on. If blue bars are very large in this plot, we suggest returning to the Classify Images page and using the Quality Control (QC) feature to review training for specific labels.

Using the QC feature, the user can filter the points by human classification of either group code or a specific label, and focus on reviewing the annotations provided to train the model for that label. For example, if the "Pocillopora" label is showing a large blue error value, use the QC feature drop down menu or "Filters" options in the Classify Images page to search for all instances of "Pocillopora" annotated by human or machine. Here, a user within a project can check and correct annotations or confirm any machine classifications of "Pocillopora" that haven't been assessed yet.


Model Validation

Model Validation provides an estimate of two scores: an overall Accuracy score and an F1 score. These two metrics are typically used to assess the quality of classification models. Hover over the points on the graph to view the values for each score. As a rule, the higher the Accuracy and F1 score values, the better a model is at estimating classifications; however, this is only valid if the model has received adequate training spread evenly across the dataset.

Overview of the "Model Validation" tab

Accuracy is the percentage of all correctly classified observations: it reflects the amount of agreement between the human and machine across all annotated points in your project. For example, an Accuracy of 0.72 shows that in 72% of annotated points the machine has generated the same label that the human has assigned. Accuracy is a good metric when classes are balanced.

F1 score is slightly harder to interpret, as it considers how the data are distributed. It does this by combining metrics for precision (the correct positive predictions relative to total positive predictions) and recall (the correct positive predictions relative to actual positives). F1 score is more useful when classes are imbalanced (e.g. some very rare classes or groups) and will penalise models that have too many false negatives more than accuracy will.
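The two metrics can be made concrete with a small worked example. This is a sketch of the standard definitions, not ReefCloud's implementation; the labels and the helper names are illustrative:

```python
# Sketch: Accuracy vs per-class F1 (harmonic mean of precision and recall).
def accuracy(human, machine):
    """Fraction of paired annotations where human and machine agree."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)

def f1_for_label(human, machine, label):
    """F1 score for one label, treating it as the 'positive' class."""
    pairs = list(zip(human, machine))
    tp = sum(h == label and m == label for h, m in pairs)  # true positives
    fp = sum(h != label and m == label for h, m in pairs)  # false positives
    fn = sum(h == label and m != label for h, m in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative labels for six annotated points
human   = ["HC", "HC", "HC", "SC", "ALG", "HC"]
machine = ["HC", "HC", "SC", "SC", "HC", "HC"]

acc = accuracy(human, machine)               # 4 of 6 points agree
f1_hc = f1_for_label(human, machine, "HC")   # precision and recall both 3/4
```

Here Accuracy is about 0.67, while the F1 score for "HC" is 0.75, because F1 weighs the false positive and false negative for that specific label rather than overall agreement.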


Since coral reef photo datasets are generally unbalanced, we suggest paying more attention to the F1 score than accuracy score.

The Model Validation plot shows how the Accuracy and F1 scores change as you invest time in training. Each time a user trains the model by annotating more points on the Classify Images page, the project model will learn, and new scores will be plotted on the graph. With consistent training effort, an improvement in model performance can be expected over time.

The Accuracy and F1 scores are important to report when publishing your data.

Label Validation

The Label Validation plot shows the accuracy of the machine, measured by the F1 score, in classifying each label used within the project. This plot shows the F1 score for each label, organised by group code, as well as the overall accuracy. The F1 score considers how data are distributed by combining two key metrics: precision and recall. Precision shows how often the machine is correct when classifying a particular label. Recall shows whether the ML model can find all the points with a specific label in the dataset (for further reading see 'Accuracy vs. precision vs. recall in machine learning: what's the difference?' by Evidently AI).
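The precision/recall distinction, and why rare labels tend to score worse, can be sketched as follows. The labels (an abundant "HC" and a rare "POC") and the helper are illustrative, not ReefCloud code:

```python
# Sketch: per-label precision and recall, the components of the per-label
# F1 shown on the Label Validation plot. A rare label illustrates how
# scarce classes tend to score worse than abundant ones.
def precision_recall(human, machine, label):
    pairs = list(zip(human, machine))
    tp = sum(h == label and m == label for h, m in pairs)
    predicted = sum(m == label for _, m in pairs)  # machine said `label`
    actual = sum(h == label for h, _ in pairs)     # human said `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

human   = ["HC"] * 6 + ["POC", "POC"]  # abundant vs rare label
machine = ["HC"] * 6 + ["POC", "HC"]   # machine misses one rare point

p_hc, r_hc = precision_recall(human, machine, "HC")
p_poc, r_poc = precision_recall(human, machine, "POC")
```

Missing a single "POC" point halves its recall (1.0 to 0.5), while the abundant "HC" label barely moves, which is why rare taxa often show poor per-label F1 scores even when overall accuracy is high.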

Overview of the "Label Validation" tab

In ReefCloud, we recommend you use the F1 score results as a guide. The F1 score is sensitive to label balance and label abundance, and in coral reef environments (where taxa are often unbalanced), it will often yield good scores for highly abundant classes and poor scores for rare classes. For this reason, we recommend you observe the F1 score results in conjunction with the predicted/machine vs actual/human coral cover estimates. Don't go chasing higher F1 scores through selective annotating of a particular label; instead, we recommend the following guidelines:

  • Annotate at least 30% of all human-annotatable points per survey, sampling uniformly across the habitat gradient (for example, choose to sample every 3rd image).

  • When there are multiple observers annotating the same project, and thus training the same model, make sure all observers are 'calibrated' to an agreed labelling standard for the taxonomic label set of choice.

  • Review machine predictions using the filtering tool in patch view within the Classifier to understand how the model is performing on your dataset for any given label.

Survey Cover Error

The Survey Error Estimate is the absolute difference between the percentage cover estimated by the machine and human for each label.

Overview of the "Survey Cover Error" tab

When the machine learning model is trained, a subset of the manually annotated points is set aside for validating the machine predictions. These test points form an independent dataset, not used to train the model, on which the survey error is assessed. The grey boxes in this figure denote reference thresholds at ~2% and ~5% error for high-resolution labels (i.e., genera) and functional groups (e.g., Hard Coral), respectively (see Gonzalez-Rivero et al., 2020 for more details). These thresholds provide guidance for acceptable error in automated cover estimates: below them, the error introduced by automated annotations will have little statistical influence on the detection of change in cover. For example, suppose the mean (vertical bar within each box along the y-axis) percentage cover estimate of 'Hard Coral' is 20% with an error estimate above 5%. In that case, the user may want to review annotations and focus on more training for the label 'Hard Coral' to reduce this error to an acceptable level.
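The per-label survey error itself is a straightforward absolute difference between the two cover estimates. A minimal sketch, with illustrative labels and values (not ReefCloud code):

```python
# Sketch: absolute cover error per label, i.e. the difference between
# the % cover estimated from human vs machine annotations, which can
# then be compared against the ~2% / ~5% guideline thresholds above.
def percent_cover(labels):
    """% cover of each label among a set of annotated points."""
    n = len(labels)
    return {lab: 100.0 * labels.count(lab) / n for lab in set(labels)}

# Illustrative held-out test points annotated by human and machine
human   = ["HC", "HC", "HC", "SC", "ALG", "SND", "HC", "SC"]
machine = ["HC", "HC", "SC", "SC", "HC", "SND", "HC", "HC"]

h_cover = percent_cover(human)
m_cover = percent_cover(machine)
errors = {lab: abs(h_cover[lab] - m_cover.get(lab, 0.0))
          for lab in h_cover}  # absolute difference per label
```

In this toy example the machine over-estimates "HC" (62.5% vs 50%) and misses "ALG" entirely, giving both labels a 12.5% cover error, well above the guideline thresholds, so both would warrant review and more training.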

For further information on error thresholds, please refer to the following resources:

  • Gonzalez-Rivero, Manuel, Oscar Beijbom, Alberto Rodriguez-Ramirez, Dominic EP Bryant, Anjani Ganase, Yeray Gonzalez-Marrero, Ana Herrera-Reveles et al. "Monitoring of coral reefs using artificial intelligence: A feasible and cost-effective approach." Remote Sensing 12, no. 3 (2020): 489. https://doi.org/10.3390/rs12030489

  • González-Rivero, Manuel, Oscar Beijbom, Alberto Rodriguez-Ramirez, Tadzio Holtrop, Yeray González-Marrero, Anjani Ganase, Chris Roelfsema, Stuart Phinn, and Ove Hoegh-Guldberg. "Scaling up ecological measurements of coral reefs using semi-automated field image collection and analysis." Remote Sensing 8, no. 1 (2016): 30. https://doi.org/10.3390/rs8010030

  • Nadon, Marc-Olivier, and Gray Stirling. "Field and simulation analyses of visual methods for sampling coral cover." Coral Reefs 25 (2006): 177-185. https://doi.org/10.1007/s00338-005-0074-5

  • Wyatt, Mathew, Sharyn Hickey, Ben Radford, Manuel Gonzalez-Rivero, Nader Boutros, Nikolaus Callow, Nicole Ryan, Arjun Chennu, Mohammed Bennamoun, and James Gilmour. "Safe AI for coral reefs: Benchmarking out-of-distribution detection algorithms for coral reef image surveys." Ecological Informatics (2025): 103207. https://doi.org/10.1016/j.ecoinf.2025.103207

Printable Report

A new feature, "Printable Report", is available to enhance the model validation process for users. You'll find it on the "Report" page by clicking the blue "Printable Report" button at the top right of the screen. An HTML document reflecting the ML results from your project will be generated and can be printed to PDF. The ML Validation Report is a dynamic report intended to guide users in training their AI model, as well as fulfilling the reporting needs of validating the data exported from ReefCloud.

Example of a Machine Learning Validation Report printed from a Demo dataset.
