Evaluating the accuracy of equivalent-source predictions using cross-validation

Leonardo Uieda, Santiado R. Soler

Info

About

Presented at EGU 2020 (online because of COVID-19), session "G4.3: Acquisition and processing of gravity and magnetic field data and their integrative interpretation". Details some of the work we've been doing in Verde and Harmonica for machine-learning style interpolation with equivalent-sources. In particular, applying state-of-the-art cross-validation strategies to estimate interpolation accuracy and tune equivalent-source parameters.

Short pitch

For the meeting, we had to introduce our presentation with a short pitch in the online chat.

We are looking into cross-validation (CV) to automatically determine the main parameters in equivalent source/layer processing: depth, damping, source layouts. But before we can do that, we need to understand how well CV estimates the error of equivalent source model predictions. We know from the ecological modeling literature that data that are spatially auto-correlated (close variables tend to be similar) tend to underestimate error in cross-validation.

Our main question for this presentation was: Does this also happen for potential field modeling with equivalent sources?

In short, yes it does.

We found that regular CV methods underestimate the prediction mean square error. But "block" versions of CV (again, borrowed from ecology) are capable of mitigating this bias.

Main message: take care when using cross-validation in potential field modeling. Prefer "block" versions of methods when possible.

Both the cross-validation and equivalent source models are open-source and implemented in Python libraries. The block cross-validation is being implemented into the next release of Verde and equivalent source processing is already available in Harmonica.

Abstract

We investigate the use of cross-validation (CV) techniques to estimate the accuracy of equivalent-source (also known as equivalent-layer) models for interpolation and processing of potential-field data. Our preliminary results indicate that some common CV algorithms (e.g., random permutations and k-folds) tend to overestimate the accuracy. We have found that blocked CV methods, where the data are split along spatial blocks instead of randomly, provide more conservative and realistic accuracy estimates. Beyond evaluating an equivalent-source model's performance, cross-validation can be used to automatically determine configuration parameters, like source depth and amount of regularization, that maximize prediction accuracy and avoid over-fitting.

Widely used in gravity and magnetic data processing, the equivalent-source technique consists of a linear model (usually point sources) used to predict the observed field at arbitrary locations. Upward-continuation, interpolation, gradient calculations, leveling, and reduction-to-the-pole can be performed simultaneously by using the model to make predictions (i.e., forward modelling). Likewise, the use of linear models to make predictions is the backbone of many machine learning (ML) applications. The predictive performance of ML models is usually evaluated through cross-validation, in which the data are split (usually randomly) into a training set and a validation set. Models are fit on the training set and their predictions are evaluated using the validation set using a goodness-of-fit metric, like the mean square error or the R² coefficient of determination. Many cross-validation methods exist in the literature, varying in how the data are split and how this process is repeated. Prior research from the statistical modelling of ecological data suggests that prediction accuracy is usually overestimated by traditional CV methods when the data are spatially auto-correlated. This issue can be mitigated by splitting the data along spatial blocks rather than randomly. We conducted experiments on synthetic gravity data to investigate the use of traditional and blocked CV methods in equivalent-source interpolation. We found that the overestimation problem also occurs and that more conservative accuracy estimates are obtained when applying blocked versions of random permutations and k-fold. Further studies need to be conducted to generalize these findings to upward-continuation, reduction-to-the-pole, and derivative calculation.

Open-source software implementations of the equivalent-source and blocked cross-validation (in progress) methods are available in the Python libraries Harmonica and Verde, which are part of the Fatiando a Terra project (www.fatiando.org).

Article Level Metrics


Comments? Let me know on Twitter (tweet @leouieda).
Found a typo/mistake? Send a fix through Github. All you need is an account and 5 minutes!

Related pages