Speed up wave.resource module #352

Open
wants to merge 38 commits into base: develop

Conversation

akeeste
Contributor

@akeeste akeeste commented Sep 17, 2024

@ssolson This is a follow-up to my other wave PRs and resolves #331. Handling the various edge cases robustly in pure numpy is difficult, so I want to first resolve #331 by using DataArrays throughout the wave resource functions instead of Datasets.

Similar to Ryan's testing mentioned in #331, I found that using DataArrays/pandas gives a 1000x speedup vs. Datasets for very large input data (a rough illustration of the comparison follows the list below). This should restore MHKiT's speed to its previous state. Using a pure numpy base would give an additional 5-10x speedup over DataArrays, but I think the current work with DataArrays will:

  • be sufficient for our users
  • be easier to develop with
  • be easier to handle edge cases
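For context, a rough timing sketch of the Dataset vs. DataArray comparison (illustrative sizes, not taken from this PR or from #331):

    # Hypothetical timing sketch: converting a wide pandas DataFrame to an
    # xarray Dataset (one variable per column) vs. a single 2-D DataArray.
    import time
    import numpy as np
    import pandas as pd
    import xarray as xr

    # e.g. ~100 frequency bins by several thousand spectra
    df = pd.DataFrame(
        np.random.rand(100, 5000), columns=[f"s{i}" for i in range(5000)]
    )

    t0 = time.perf_counter()
    ds = df.to_xarray()    # Dataset with one variable per column (slow when wide)
    t1 = time.perf_counter()
    da = xr.DataArray(df)  # single 2-D DataArray (fast)
    t2 = time.perf_counter()

    print(f"to Dataset:   {t1 - t0:.3f} s")
    print(f"to DataArray: {t2 - t1:.3f} s")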

@akeeste akeeste marked this pull request as ready for review October 2, 2024 18:34
@akeeste akeeste marked this pull request as draft October 2, 2024 18:44
@akeeste akeeste marked this pull request as ready for review October 10, 2024 17:24
@akeeste
Contributor Author

akeeste commented Oct 10, 2024

@ssolson This PR is ready for review. Tests should pass now

With some modifications to the type handling functions, and an appropriate frequency_dimension input where required, the wave.resource functions should handle pandas Series, pandas DataFrames, and xarray DataArrays regardless of input shape, dimension names, dimension order, etc. I largely moved away from converting data to xarray Datasets because they are slow and more difficult to work with.
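As a usage illustration, here is a small sketch (not code from the PR; it assumes the frequency_dimension keyword behaves as described above) showing the same toy spectrum passed as a Series, a DataFrame, and a DataArray with a custom dimension name:

    # Hypothetical usage sketch of the flexible type handling described above.
    import numpy as np
    import pandas as pd
    import xarray as xr
    from mhkit import wave

    f = np.linspace(0.01, 1.0, 100)                   # frequency [Hz]
    S_vals = 1e-3 * np.exp(-((f - 0.1) ** 2) / 1e-3)  # toy spectral density [m^2/Hz]

    S_series = pd.Series(S_vals, index=f)
    S_frame = pd.DataFrame({"S": S_vals}, index=f)
    S_da = xr.DataArray(S_vals, dims="freq_custom", coords={"freq_custom": f})

    Te1 = wave.resource.energy_period(S_series)
    Te2 = wave.resource.energy_period(S_frame)
    Te3 = wave.resource.energy_period(S_da, frequency_dimension="freq_custom")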

@akeeste
Contributor Author

akeeste commented Oct 10, 2024

mhkit.loads.graphics is unchanged, so I'm not sure why the pylint loads test is now failing on the number of positional arguments there. This branch is up to date with develop

@ssolson
Contributor

ssolson commented Oct 14, 2024

mhkit.loads.graphics is unchanged, so I'm not sure why the pylint loads test is now failing on the number of positional arguments there. This branch is up to date with develop

Pylint added new warnings around the way it wants users to handle positional arguments. I'll address it.
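For reference, the kind of change this check typically asks for is making trailing arguments keyword-only (the function and argument names below are hypothetical, not the actual mhkit.loads.graphics signature):

    # Hypothetical sketch: a bare * forces the remaining arguments to be
    # passed by keyword, which addresses pylint's positional-argument warning.
    def plot_example(x, y_mean, y_max, y_min, *, y_stdev=None, ax=None, title=None):
        ...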

@ssolson
Contributor

ssolson commented Oct 14, 2024

mhkit.loads.graphics is unchanged, so I'm not sure why the pylint loads test is now failing on the number of positional arguments there. This branch is up to date with develop

Pylint added new warnings around the way it wants users to handle positional arguments. I'll address it.

Addressed in #357

Let's merge #357 then make sure the tests pass here.

@akeeste
Contributor Author

akeeste commented Oct 14, 2024

Thanks @ssolson. I'll merge that in here and fix a couple minor items with some examples

@akeeste
Contributor Author

akeeste commented Oct 14, 2024

@akeeste TODO:

  • fix a couple last issues with example notebooks
  • reduce the time required for the example notebooks to bring execution speed back to the previous benchmark

@akeeste
Contributor Author

akeeste commented Oct 14, 2024

@ssolson this PR is now ready for review and all tests are passing. I tightened up the timing on the environmental contours, the three extreme response, and the PacWave examples.

A straightforward test case for the difference in computational expense is to run a wave resource function (e.g. energy_period) on a year of NDBC spectral data, or to repeat Ryan's script in #331.
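For example, a rough benchmark sketch with synthetic data standing in for a year of NDBC spectra (the NDBC download step is omitted and the sizes are illustrative; this is not code from the PR):

    # Hypothetical benchmark: time energy_period on roughly a year of hourly spectra.
    import time
    import numpy as np
    import pandas as pd
    from mhkit import wave

    f = np.linspace(0.03, 0.5, 47)                     # frequency bins [Hz]
    times = pd.date_range("2023-01-01", periods=8760, freq="h")
    S = pd.DataFrame(np.random.rand(len(f), len(times)), index=f, columns=times)

    t0 = time.perf_counter()
    Te = wave.resource.energy_period(S)
    print(f"energy_period on {S.shape[1]} spectra: {time.perf_counter() - t0:.2f} s")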

@ssolson ssolson added the wave module and Clean Up labels Oct 16, 2024
Contributor

@ssolson ssolson left a comment


@akeeste overall this addresses the issue. Thanks for putting this together. I have just a couple questions and a few minor clean up items.

@@ -442,11 +440,9 @@ def test_mler_export_time_series(self):
mler["WaveSpectrum"] = self.mler["Norm_Spec"].values
mler["Phase"] = self.mler["phase"].values
k = resource.wave_number(wave_freq, 70)
k = k.fillna(0)
np.nan_to_num(k, 0)
Contributor


This returns k, so it would need to be:

k = np.nan_to_num(k, 0)

However, k has no NaNs so I don't think this is needed.

Contributor Author


The zero frequency in wave_freq results in a NaN value. The call to np.nan_to_num() updates the input k in my testing, which is useful when k is not a numpy array because the input data is both updated and retains its type. If it's clearer for this particular test, I can update it to redefine k.

Contributor

@ssolson ssolson Oct 18, 2024


This is not a limiting factor to getting the PR through, but are you saying np.nan_to_num modifies k in place when you run it?

The docs say the function returns the modified array and that was my experience when I paused the code here:
https://numpy.org/doc/2.0/reference/generated/numpy.nan_to_num.html
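One thing worth verifying here: in NumPy's signature the second positional argument of nan_to_num is copy, not the fill value, so np.nan_to_num(k, 0) requests an in-place conversion (and also returns the array), while the keyword form returns a new array. A small sketch of the difference:

    import numpy as np

    k = np.array([np.nan, 1.0, 2.0])
    np.nan_to_num(k, 0)     # second positional arg is copy -> copy=False, modifies k
    print(k)                # [0. 1. 2.]

    k2 = np.array([np.nan, 1.0, 2.0])
    out = np.nan_to_num(k2, nan=0.0)  # default copy=True: k2 unchanged
    print(k2, out)          # [nan  1.  2.] [0. 1. 2.]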

@@ -95,7 +95,8 @@ def test_kfromw(self):

expected = self.valdata1[i]["k"]
k = wave.resource.wave_number(f, h, rho)
calculated = k.loc[:, "k"].values
# calculated = k.loc[:, "k"].values
Contributor


delete

# Rename to "variable" to match how multiple Dataset variables get converted into a DataArray dimension
data = xr.DataArray(data)
if data.dims[1] == "dim_1":
# Slight chance their is already a name for the columns
Contributor


their => there

Comment on lines 309 to 313
LM: pandas DataFrame, xarray DatArray, or xarray Dataset
Capture length
JM: pandas DataFrame or xarray Dataset
JM: pandas DataFrame, xarray DatArray, or xarray Dataset
Wave energy flux
frequency: pandas DataFrame or xarray Dataset
frequency: pandas DataFrame, xarray DatArray, or xarray Dataset
Contributor


xarray "DataArray", not "DatArray"

@@ -87,26 +86,24 @@ def elevation_spectrum(
+ "temporal spacing for eta."
)

S = xr.Dataset()
for var in eta.data_vars:
Contributor


This for loop allowed users to process multiple wave heights in a DataFrame, etc. Removing it means the user can only process one eta at a time.

Does this update require us to remove this functionality? Would it make sense to parse each variable in a Dataset into a DataArray? Or to have the user create any needed loop outside of the function for simplicity?

E.g. the following test will fail. Can we make this work?

    def test_elevation_spectrum_multiple_variables(self):
        time = np.linspace(0, 100, 1000)
        eta1 = np.sin(2 * np.pi * 0.1 * time)  
        eta2 = np.sin(2 * np.pi * 0.2 * time)  
        eta3 = np.sin(2 * np.pi * 0.3 * time)  
        
        eta_dataset = xr.Dataset({
            'eta1': (['time'], eta1),
            'eta2': (['time'], eta2),
            'eta3': (['time'], eta3)
        }, coords={'time': time})
        
        sample_rate = 10  
        nnft = 256 
        
        spectra = wave.resource.elevation_spectrum(
            eta_dataset, sample_rate, nnft
        )

If so, let's finish out this test and add it to the test suite.

Contributor Author


Users can still input Datasets, but right now all variables must have the same dimensions because the Dataset is converted to a DataArray up front. The function then returns a DataArray.

I'll look at reinstating a Dataset/DataFrame loop so that those types are returned.

Contributor Author


I added this loop again. Our previous slowdown from converting large pandas DataFrames to xarray Datasets could now occur in these two functions, but I don't think that is a typical use case. For example, in the case described in #331, it's unlikely a user would have thousands of different wave elevation time series and convert them all to wave spectra (likewise for thousands of distinct spectra being converted to elevation time series). If that case does come up, the slowdown should not be due to our implementation but to the large quantity of data involved.
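For illustration, a rough sketch (not the actual diff) of the per-variable loop pattern being described, where a DataArray code path runs once per Dataset variable and the results are reassembled into a Dataset; the helper name is hypothetical:

    import xarray as xr

    def apply_per_variable(func, data, *args, **kwargs):
        # Hypothetical helper: run a DataArray-based function on each variable
        # of a Dataset and rebuild a Dataset from the results.
        if isinstance(data, xr.Dataset):
            return xr.Dataset(
                {name: func(da, *args, **kwargs) for name, da in data.data_vars.items()}
            )
        return func(data, *args, **kwargs)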


omega = xr.DataArray(
data=2 * np.pi * f, dims=frequency_dimension, coords={frequency_dimension: f}
)

eta = xr.Dataset()
for var in S.data_vars:
Contributor


Same as above: this removed the ability to iterate over multiple columns, but we still accept Datasets and multi-column pandas DataFrames.

@@ -1153,7 +1164,7 @@ def wave_number(f, h, rho=1025, g=9.80665, to_pandas=True):
"""
if isinstance(f, (int, float)):
f = np.asarray([f])
f = convert_to_dataarray(f)
# f = convert_to_dataarray(f)
Contributor


delete

@akeeste
Contributor Author

akeeste commented Oct 17, 2024

@ssolson I addressed all your comments and re-enabled Dataset inputs to wave.resource.surface_elevation and wave.resource.elevation_spectrum.
