Skip to content

Fill missing values (NaNs)#541

Open
mdtanker wants to merge 36 commits intofatiando:mainfrom
mdtanker:issue#439
Open

Fill missing values (NaNs)#541
mdtanker wants to merge 36 commits intofatiando:mainfrom
mdtanker:issue#439

Conversation

@mdtanker
Copy link
Copy Markdown
Member

@mdtanker mdtanker commented Mar 17, 2026

This PR adds a fill_missing() function. It accepts either a xarray.DataArray or xarray.Dataset, and fills NaNs with one of the following interpolators: vd.KNeighbors (default), vd.Linear, vd.Cubic, vd.Trend, vd.Spline, or vd.SplineCV. If a dataset has multiple variables, it will fill NaNs for all variables. It returns the same datatype it was given, and it preserves coordinate and variable names.

This required adding an isinstance check for whether it's a dataset or a dataarray. But I think that is okay since, as is, this function will only work with those two types since it uses grid_to_table, which only accepts those options. If we want it to be simpler and to work with a 2d array, I'm happy to try and update it.

The users initialize a Verde interpolator class, setting whatever parameters they want:

interpolator = vd.KNeighbors()
interpolator = vd.Linear()
interpolator = vd.Cubic()
interpolator = vd.Trend(5)
interpolator = vd.Spline(damping=10)
interpolator = vd.SplineCV()

This will default to vd.KNeighbors(k=5).

They then pass it to the fill_missing function

filled_grid = vd.fill_missing(grid, interpolator)

For a simpler version which only uses KNeighbors you can see 6911c68.

Two interpolators, Linear and Cubic, don't allow for extrapolation if the NaNs are on the outer edge of the grid. If NaNs are still present after running the function, a warning is raised to inform users and suggest re-running the grid through fill_missing with a different interpolator that allows extrapolation, such as KNeighbors, Spline or SplineCV.

Questions / To Do:

  • allow passing weights to the various .fit() methods
  • fix the failing tests. They pass if rtol is lowered from 1e-4 to 1e-2, not sure why the filled values vary so much between the test matrix.
  • add the script below figure script to the documentation, or alter it.
    Here is a plot comparing the various interpolators. This could be good for a documentation page.
image
Code for plot
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import pyproj

import verde as vd

# We'll test this on the air temperature data from Texas
data = vd.datasets.fetch_texas_wind()
coordinates = (data.longitude.values, data.latitude.values)
region = vd.get_region(coordinates)

# Use a Mercator projection for our Cartesian gridder
projection = pyproj.Proj(proj="merc", lat_ts=data.latitude.mean())

# The output grid spacing will 15 arc-minutes
spacing = 15 / 60

# Now we can chain a blocked mean and spline together.
chain = vd.Chain(
    [
        ("mean", vd.BlockReduce(np.mean, spacing=spacing * 111e3)),
        ("spline", vd.SplineCV()),
    ]
)
chain.fit(projection(*coordinates), data.air_temperature_c)

# Now we can create a geographic grid of air temperature by providing a
# projection function to the grid method and mask points that are too far from
# the observations
grid_full = chain.grid(
    region=region,
    spacing=spacing,
    projection=projection,
    dims=["latitude", "longitude"],
    data_names="temperature",
)
max_dist = 2 * spacing * 111e3
grid_masked = vd.distance_mask(
    coordinates, maxdist=max_dist, grid=grid_full, projection=projection
)

grid_filled_nearest = vd.fill_missing(grid_masked,vd.KNeighbors())
grid_filled_linear = vd.fill_missing(grid_masked,vd.Linear())
grid_filled_cubic = vd.fill_missing(grid_masked,vd.Cubic())
# grid_filled_spline = vd.fill_missing(grid_masked,vd.Spline())
grid_filled_spline = vd.fill_missing(grid_masked,vd.SplineCV())

filled_grids = [
    grid_filled_nearest.temperature,
    grid_filled_linear.temperature,
    grid_filled_cubic.temperature,
    grid_filled_spline.temperature,
]
dif_grids = [g - grid_full.temperature for g in filled_grids]

titles = [
    "Filled with mean of 5 nearest neighbors",
    "Filled with linear interpolation",
    "Filled with cubic interpolation",
    "Filled with biharmonic spline interpolation",
]
# Plot the masked grid and the original data points
crs = ccrs.PlateCarree()
num_rows = 5
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(
    num_rows, 2, figsize=(10, 20), subplot_kw=dict(projection=ccrs.Mercator())
)

temp_cpt_lims = vd.minmax(grid_full.temperature, min_percentile=2, max_percentile=98)
diff_max_abs = vd.maxabs(dif_grids, percentile=90)

axes = [ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8,ax9,ax10]

# plot first 2 grids
ax1.plot(*coordinates, ".k", markersize=1, transform=crs)
tmp = grid_full.temperature.plot.pcolormesh(
    ax=ax1,
    vmin=temp_cpt_lims[0],
    vmax=temp_cpt_lims[1],
    cmap="plasma",
    transform=crs,
    add_colorbar=False,
)
cb = plt.colorbar(tmp, ax=ax1, orientation="vertical", fraction=0.046, pad=0.15)
cb.set_label("Air temperature (C)")
ax1.set_title("Original grid")

ax2.plot(*coordinates, ".k", markersize=1, transform=crs)
tmp = grid_masked.temperature.plot.pcolormesh(
    ax=ax2,
    vmin=temp_cpt_lims[0],
    vmax=temp_cpt_lims[1],
    cmap="plasma",
    transform=crs,
    add_colorbar=False,
)
cb = plt.colorbar(tmp, ax=ax2, orientation="vertical", fraction=0.046, pad=0.15)
cb.set_label("Air temperature (C)")
ax2.set_title("Masked within {round(max_dist/1e3)} km of data")

# iterate over rows for rest of plots
ind = 2
for r in range(num_rows-1):
    ax = axes[ind]

    ax.plot(*coordinates, ".k", markersize=1, transform=crs)
    tmp = filled_grids[r].plot.pcolormesh(
        ax=ax,
        vmin=temp_cpt_lims[0],
        vmax=temp_cpt_lims[1],
        cmap="plasma",
        transform=crs,
        add_colorbar=False,
    )

    cb = plt.colorbar(tmp, ax=ax, orientation="vertical", fraction=0.046, pad=0.15)
    cb.set_label("Air temperature (C)")
    ax.set_title(titles[r])


    ind+=1
    ax = axes[ind]

    ax.plot(*coordinates, ".k", markersize=1, transform=crs)
    tmp = dif_grids[r].plot.pcolormesh(
        ax=ax,
        cmap="RdBu_r",
        vmin = -diff_max_abs,
        vmax = diff_max_abs,
        transform=crs,
        add_colorbar=False,
    )
    cb = plt.colorbar(tmp, ax=ax, orientation="vertical", fraction=0.046, pad=0.15)
    cb.set_label("Air temperature (C)")
    ax.set_title("Difference to original")

    ind+=1

for ax in axes:
    # Use an utility function to add tick labels and land and ocean features to the
    # map.
    vd.datasets.setup_texas_wind_map(ax, region=region)

plt.tight_layout()
plt.show()

Relevant issues/PRs:

This PR addresses #439 and will support fatiando/harmonica#396.

Pedro Silva and others added 9 commits March 19, 2024 20:29
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
@mdtanker mdtanker changed the title Issue#439 Fill nans Mar 17, 2026
@mdtanker mdtanker marked this pull request as draft March 18, 2026 09:43
@mdtanker mdtanker marked this pull request as ready for review March 18, 2026 15:51
Copy link
Copy Markdown
Member

@leouieda leouieda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Left some suggestions. While this will work for now, a much better method of filling holes would be to:

  1. Find the holes by connecting adjacent nan pixels.
  2. Get the data surrounding a hole (maybe go a few pixels deep) and use that to fit a Spline.
  3. Predict using the spline only inside the hole.
  4. Repeat for all holes.

This isn't so easy to implement since finding and connecting holes and selecting the data surrounding it will be tricky and probably require some numba code to do efficiently.

@mdtanker mdtanker changed the title Fill nans Fill missing values (NaNs) Mar 20, 2026
@mdtanker
Copy link
Copy Markdown
Member Author

mdtanker commented Mar 20, 2026

@leouieda thanks for the feedback! Yeah fitting splines to just the nearby data would be ideal. That would also be quite useful for merging overlapping grids (#451), where the overlap of grids would be set to NaN, then filled in using that method. I've attempted something below. It's based on distance right now, not the number of nearest points.

Here's a quick test showing it is definitely faster and is almost as good at filling nans compared to using the full grid.

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import pyproj
from time import perf_counter
import verde as vd

data = vd.datasets.fetch_texas_wind()
coordinates = (data.longitude.values, data.latitude.values)
region = vd.get_region(coordinates)
projection = pyproj.Proj(proj="merc", lat_ts=data.latitude.mean())
spacing = 15 / 60
chain = vd.Chain(
    [
        ("mean", vd.BlockReduce(np.mean, spacing=spacing * 111e3)),
        ("spline", vd.Spline()),
    ]
)
chain.fit(projection(*coordinates), data.air_temperature_c)

grid_full = chain.grid(
    region=region,
    spacing=spacing,
    projection=projection,
    dims=["latitude", "longitude"],
    data_names="temperature",
)
max_dist = 3 * spacing * 111e3
grid_masked = vd.distance_mask(
    coordinates, maxdist=max_dist, grid=grid_full, projection=projection
)

t0 = perf_counter()
grid_filled_spline = vd.fill_missing(grid_masked, vd.Spline())
t1 = perf_counter()
time_dif = t1-t0
dif = grid_full.temperature - grid_filled_spline.temperature

t0 = perf_counter()
grid_filled_spline_dist = vd.fill_missing(grid_masked, vd.Spline(), maxdist=.8)
t1 = perf_counter()
time_dif_maxdist = t1-t0
dif_maxdist = grid_full.temperature - grid_filled_spline_dist.temperature

print("No maxdist:")
print("\tRMSE:", np.sqrt(np.mean(dif.values**2)))
print(f"\tperf time: {time_dif} sec")

print("Maxdist of 0.8 degrees:")
print("\tRMSE:", np.sqrt(np.mean(dif_maxdist.values**2)))
print(f"\tperf time: {time_dif_maxdist} sec")
No maxdist:
	RMSE: 0.2557852127068975
	perf time: 1.2976604170398787 sec
Maxdist of 0.8 degrees:
	RMSE: 0.28706877240137524
	perf time: 0.19047226099064574 sec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants