Conversation
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
Co-authored-by: Leonardo Uieda <leo@uieda.com>
leouieda
left a comment
There was a problem hiding this comment.
Looking good! Left some suggestions. While this will work for now, a much better method of filling holes would be to:
- Find the holes by connecting adjacent nan pixels.
- Get the data surrounding a hole (maybe go a few pixels deep) and use that to fit a
Spline. - Predict using the spline only inside the hole.
- Repeat for all holes.
This isn't so easy to implement since finding and connecting holes and selecting the data surrounding it will be tricky and probably require some numba code to do efficiently.
|
@leouieda thanks for the feedback! Yeah fitting splines to just the nearby data would be ideal. That would also be quite useful for merging overlapping grids (#451), where the overlap of grids would be set to NaN, then filled in using that method. I've attempted something below. It's based on distance right now, not the number of nearest points. Here's a quick test showing it is definitely faster and is almost as good at filling nans compared to using the full grid. import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import pyproj
from time import perf_counter
import verde as vd
data = vd.datasets.fetch_texas_wind()
coordinates = (data.longitude.values, data.latitude.values)
region = vd.get_region(coordinates)
projection = pyproj.Proj(proj="merc", lat_ts=data.latitude.mean())
spacing = 15 / 60
chain = vd.Chain(
[
("mean", vd.BlockReduce(np.mean, spacing=spacing * 111e3)),
("spline", vd.Spline()),
]
)
chain.fit(projection(*coordinates), data.air_temperature_c)
grid_full = chain.grid(
region=region,
spacing=spacing,
projection=projection,
dims=["latitude", "longitude"],
data_names="temperature",
)
max_dist = 3 * spacing * 111e3
grid_masked = vd.distance_mask(
coordinates, maxdist=max_dist, grid=grid_full, projection=projection
)
t0 = perf_counter()
grid_filled_spline = vd.fill_missing(grid_masked, vd.Spline())
t1 = perf_counter()
time_dif = t1-t0
dif = grid_full.temperature - grid_filled_spline.temperature
t0 = perf_counter()
grid_filled_spline_dist = vd.fill_missing(grid_masked, vd.Spline(), maxdist=.8)
t1 = perf_counter()
time_dif_maxdist = t1-t0
dif_maxdist = grid_full.temperature - grid_filled_spline_dist.temperature
print("No maxdist:")
print("\tRMSE:", np.sqrt(np.mean(dif.values**2)))
print(f"\tperf time: {time_dif} sec")
print("Maxdist of 0.8 degrees:")
print("\tRMSE:", np.sqrt(np.mean(dif_maxdist.values**2)))
print(f"\tperf time: {time_dif_maxdist} sec")No maxdist:
RMSE: 0.2557852127068975
perf time: 1.2976604170398787 sec
Maxdist of 0.8 degrees:
RMSE: 0.28706877240137524
perf time: 0.19047226099064574 sec |
This PR adds a
fill_missing()function. It accepts either axarray.DataArrayorxarray.Dataset, and fills NaNs with one of the following interpolators:vd.KNeighbors(default),vd.Linear,vd.Cubic,vd.Trend,vd.Spline, orvd.SplineCV. If a dataset has multiple variables, it will fill NaNs for all variables. It returns the same datatype it was given, and it preserves coordinate and variable names.This required adding an
isinstancecheck for whether it's a dataset or a dataarray. But I think that is okay since, as is, this function will only work with those two types since it usesgrid_to_table, which only accepts those options. If we want it to be simpler and to work with a 2d array, I'm happy to try and update it.The users initialize a Verde interpolator class, setting whatever parameters they want:
This will default to
vd.KNeighbors(k=5).They then pass it to the
fill_missingfunctionFor a simpler version which only uses KNeighbors you can see 6911c68.
Two interpolators,
LinearandCubic, don't allow for extrapolation if the NaNs are on the outer edge of the grid. If NaNs are still present after running the function, a warning is raised to inform users and suggest re-running the grid throughfill_missingwith a different interpolator that allows extrapolation, such asKNeighbors,SplineorSplineCV.Questions / To Do:
.fit()methodsrtolis lowered from 1e-4 to 1e-2, not sure why the filled values vary so much between the test matrix.Here is a plot comparing the various interpolators. This could be good for a documentation page.
Code for plot
Relevant issues/PRs:
This PR addresses #439 and will support fatiando/harmonica#396.