Speed up wave.resource module #352
base: develop
Conversation
This reverts commit a2d5f61.
…aframes and 2+ var datasets
@ssolson This PR is ready for review. Tests should pass now with some modifications to the type handling functions, and an appropriate …
Pylint added new warnings around the way it wants users to handle positional arguments. I'll address it.
Addressed in #357. Let's merge #357, then make sure the tests pass here.
Thanks @ssolson. I'll merge that in here and fix a couple of minor items with some examples.
@akeeste TODO:
@ssolson this PR is now ready for review and all tests are passing. I tightened up the timing on the environmental contours, extreme response, and PacWave examples. A straightforward test case on the difference in computational expense is using a wave resource function (e.g. …)
@akeeste overall this addresses the issue. Thanks for putting this together. I have just a couple questions and a few minor clean up items.
@@ -442,11 +440,9 @@ def test_mler_export_time_series(self):
     mler["WaveSpectrum"] = self.mler["Norm_Spec"].values
     mler["Phase"] = self.mler["phase"].values
     k = resource.wave_number(wave_freq, 70)
-    k = k.fillna(0)
+    np.nan_to_num(k, 0)
This returns k, so it would need to be:
k = np.nan_to_num(k, 0)
However, k has no NaNs, so I don't think this is needed.
The zero frequency in wave_freq results in a NaN value. The call to np.nan_to_num() updates the input k in my testing, which is useful if k is not a numpy array, as the input data is both updated and retains its type. If it's clearer for this particular test, I can update to redefine k.
This is not a limiting factor to getting the PR through, but are you saying np.nan_to_num modifies k in place when you run it?
The docs say the function returns the modified array, and that was my experience when I paused the code here:
https://numpy.org/doc/2.0/reference/generated/numpy.nan_to_num.html
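One detail that may reconcile the two observations above: the second positional argument of np.nan_to_num is copy, so np.nan_to_num(k, 0) actually passes copy=False, which modifies a float ndarray in place, while the default copy=True returns a cleaned copy. A minimal sketch:

```python
import numpy as np

k = np.array([np.nan, 1.0, 2.0])

# Default copy=True: returns a cleaned copy; the original keeps its NaN
out = np.nan_to_num(k)
still_nan = np.isnan(k[0])  # True: k was not modified

# The second positional argument of nan_to_num is `copy`, so
# np.nan_to_num(k, 0) means copy=False: for a float ndarray the input
# array itself is modified in place (and also returned)
np.nan_to_num(k, 0)
now_zero = k[0] == 0.0  # True: k was modified in place
```

This would explain why the call appeared to update k in one test session while the docs describe a returned array: both behaviors are real, selected by the copy argument.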
@@ -95,7 +95,8 @@ def test_kfromw(self):
     expected = self.valdata1[i]["k"]
     k = wave.resource.wave_number(f, h, rho)
-    calculated = k.loc[:, "k"].values
+    # calculated = k.loc[:, "k"].values
delete
mhkit/utils/type_handling.py
Outdated
# Rename to "variable" to match how multiple Dataset variables get converted into a DataArray dimension
data = xr.DataArray(data)
if data.dims[1] == "dim_1":
    # Slight chance their is already a name for the columns
their => there
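For readers following along, here is a rough, hypothetical sketch of the kind of normalization a convert_to_dataarray helper performs; the rename of the unnamed dim_1 column axis to "variable" mirrors the snippet under review, but the exact MHKiT implementation may differ:

```python
import numpy as np
import pandas as pd
import xarray as xr

def convert_to_dataarray(data):
    # Hypothetical sketch: normalize scalar/ndarray/Series/DataFrame/
    # Dataset inputs to a single xarray DataArray.
    if isinstance(data, (int, float)):
        data = np.asarray([data])
    if isinstance(data, np.ndarray):
        return xr.DataArray(data)
    if isinstance(data, pd.Series):
        return xr.DataArray.from_series(data)
    if isinstance(data, pd.DataFrame):
        da = xr.DataArray(data)
        if da.dims[1] == "dim_1":  # unnamed columns axis
            # Rename to "variable" to match how multiple Dataset
            # variables get converted into a DataArray dimension
            da = da.rename({"dim_1": "variable"})
        return da
    if isinstance(data, xr.Dataset):
        # Requires all variables to share dimensions
        return data.to_array(dim="variable")
    if isinstance(data, xr.DataArray):
        return data
    raise TypeError(f"Unsupported input type: {type(data)}")
```

With this shape, a multi-column DataFrame and a multi-variable Dataset both end up with a "variable" dimension, which is what lets downstream functions treat them uniformly.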
mhkit/wave/performance.py
Outdated
+    LM: pandas DataFrame, xarray DatArray, or xarray Dataset
         Capture length
-    JM: pandas DataFrame or xarray Dataset
+    JM: pandas DataFrame, xarray DatArray, or xarray Dataset
         Wave energy flux
-    frequency: pandas DataFrame or xarray Dataset
+    frequency: pandas DataFrame, xarray DatArray, or xarray Dataset
xarray "DataArray", not "DatArray"
@@ -87,26 +86,24 @@ def elevation_spectrum(
             + "temporal spacing for eta."
         )
-    S = xr.Dataset()
-    for var in eta.data_vars:
This for loop allowed users to process multiple wave heights in a DataFrame, etc. Removing it means the user can only process one eta at a time.
Does this update require us to remove this functionality? Would it make sense to be able to parse each variable in a Dataset into a DataArray? Or to have the user create any needed loop outside of the function for simplicity?
E.g. the following test will fail. Can we make this work?
def test_elevation_spectrum_multiple_variables(self):
    time = np.linspace(0, 100, 1000)
    eta1 = np.sin(2 * np.pi * 0.1 * time)
    eta2 = np.sin(2 * np.pi * 0.2 * time)
    eta3 = np.sin(2 * np.pi * 0.3 * time)
    eta_dataset = xr.Dataset({
        'eta1': (['time'], eta1),
        'eta2': (['time'], eta2),
        'eta3': (['time'], eta3)
    }, coords={'time': time})
    sample_rate = 10
    nnft = 256
    spectra = wave.resource.elevation_spectrum(
        eta_dataset, sample_rate, nnft
    )
If so, let's finish out this test and add it to the test suite.
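One way to keep Dataset support without repeating a loop inside each resource function is a thin wrapper that applies a DataArray-based function per variable. This is a hypothetical sketch of that pattern, not MHKiT code; the toy double function stands in for a real spectrum function:

```python
import numpy as np
import xarray as xr

def apply_per_variable(func, data, *args, **kwargs):
    # Hypothetical helper: apply a DataArray-based function to each
    # variable of a Dataset and reassemble the results into a Dataset.
    # Non-Dataset inputs pass straight through to func.
    if isinstance(data, xr.Dataset):
        return xr.Dataset(
            {name: func(da, *args, **kwargs) for name, da in data.data_vars.items()}
        )
    return func(data, *args, **kwargs)

def double(da):
    # Toy stand-in for a function that operates on one DataArray
    return da * 2

ds = xr.Dataset(
    {"eta1": ("time", np.array([1.0, 2.0])), "eta2": ("time", np.array([3.0, 4.0]))},
    coords={"time": [0, 1]},
)
result = apply_per_variable(double, ds)
```

The per-variable loop then lives in one place, and each resource function only ever sees a single DataArray.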
Users can still input Datasets, but right now all variables must have the same dimensions; the Dataset is converted to a DataArray up front, and the function then returns a DataArray.
I'll look at reinstating a dataset/dataframe loop so that those types are returned.
I added this loop again. Our previous slowdown of large pandas DataFrames --> xarray Datasets could occur in these two functions now, but I don't think that is a typical use case. For example, in the case described in #331, it's unlikely a user would have thousands of different wave elevation time series and convert them all to wave spectra (likewise for thousands of distinct spectra being converted to elevation time series). If that case does come up, the slowdown should not be due to our implementation but to the large quantity of data involved.
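The slowdown being discussed can be reproduced with a small micro-benchmark (sizes and names here are illustrative, not from the PR): building one Dataset variable per column creates many DataArray objects, while a single 2D DataArray holds the same data in one object. Timings vary by machine, so only the shapes are checked:

```python
import time
import numpy as np
import xarray as xr

n_series, n_time = 200, 500
data = np.random.rand(n_time, n_series)
time_coord = np.arange(n_time)

# Many variables in a Dataset: one DataArray object per column
t0 = time.perf_counter()
ds = xr.Dataset(
    {f"eta{i}": ("time", data[:, i]) for i in range(n_series)},
    coords={"time": time_coord},
)
t_dataset = time.perf_counter() - t0

# Single 2D DataArray: one object, with columns as a dimension
t0 = time.perf_counter()
da = xr.DataArray(
    data, dims=["time", "variable"], coords={"time": time_coord}
)
t_dataarray = time.perf_counter() - t0
```

Scaling n_series into the thousands makes the per-variable construction cost dominate, which matches the behavior reported in #331.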
     omega = xr.DataArray(
         data=2 * np.pi * f, dims=frequency_dimension, coords={frequency_dimension: f}
     )
-    eta = xr.Dataset()
-    for var in S.data_vars:
Same as above: this removed the ability to iterate over multiple columns, but we still accept Datasets and multi-column pandas DataFrames.
mhkit/wave/resource.py
Outdated
@@ -1153,7 +1164,7 @@ def wave_number(f, h, rho=1025, g=9.80665, to_pandas=True):
     """
     if isinstance(f, (int, float)):
         f = np.asarray([f])
-    f = convert_to_dataarray(f)
+    # f = convert_to_dataarray(f)
delete
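For context on the wave_number discussion, here is an illustrative pure-numpy sketch of the dispersion relation the function solves (not MHKiT's implementation); it also shows why f = 0 yields a NaN, since both the residual and its derivative vanish there:

```python
import numpy as np

def wave_number_sketch(f, h, g=9.80665):
    # Illustrative only: solve the linear dispersion relation
    #   (2*pi*f)**2 = g * k * tanh(k * h)
    # for the wave number k via Newton's method, starting from the
    # deep-water guess k = omega**2 / g.
    # Note: f = 0 gives residual = derivative = 0, hence 0/0 -> NaN,
    # matching the zero-frequency NaN discussed earlier in this thread.
    omega = 2 * np.pi * np.atleast_1d(np.asarray(f, dtype=float))
    k = omega**2 / g
    for _ in range(50):
        t = np.tanh(k * h)
        residual = g * k * t - omega**2
        derivative = g * t + g * k * h * (1 - t**2)
        k = k - residual / derivative
    return k
```

Because the residual is monotone in k, Newton iteration from the deep-water guess converges quickly for nonzero frequencies in both deep and shallow water.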
…levation_spectrum
@ssolson I addressed all your comments and again allowed datasets into …
@ssolson This is a follow-up to my other wave PRs and resolves #331. Handling the various edge cases robustly in pure numpy is difficult, so I want to first resolve #331 by using DataArrays throughout the wave resource functions instead of Datasets.
Similar to Ryan's testing mentioned in #331, I found that using DataArrays/pandas gives a ~1000x speedup vs Datasets for very large input data. This should restore MHKiT's speed to its previous state. Using a pure numpy base would give an additional 5-10x speedup over DataArrays, but I think the current work with DataArrays will: