setup_bathymetry hangs on simple domain #168

Open
ashjbarnes opened this issue Apr 24, 2024 · 12 comments

Labels
question ❓ Further information is requested

Comments

@ashjbarnes
Collaborator

I've been trying to reproduce the figure for the paper, and have therefore been re-making bathymetry. Strangely, some tasks that used to be really simple and fast (e.g. my region of study at 1/12 degree used to run on one node in ~2 min) now hang.

On some further testing, it's now failing because it can't allocate stupid amounts of memory. Somewhere along the line we've messed up this function. I'm not sure how it's still passing the GitHub Actions! There's nothing really special about my domain.

I'll keep troubleshooting.

@ashjbarnes ashjbarnes added the bug 🐞 Something isn't working label Apr 24, 2024
@navidcy
Contributor

navidcy commented Apr 24, 2024

OK, good to raise this, but as it reads now it's only a note to self?

If you don't figure it out, please add an MWE here to showcase the error you get so that it's documented. E.g., what is a "simple domain"?

@navidcy
Contributor

navidcy commented Apr 24, 2024

(Why does the figure for #100 need bathymetry?)

@ashjbarnes
Collaborator Author

Because I only have the bathymetry regridded for the high-res domain. To expand the domain and add a low-res border, I need more bathymetry (which also gives the land mask).

Simple domain:

yextent = [-56,-26]
xextent = [142,180]
resolution = 0.05

Everything else is set to defaults. This exact code worked fine on 24 cores on a previous version, but I've yet to figure out which changes caused the issue.

@navidcy
Contributor

navidcy commented Apr 24, 2024

Thanks!

The kwargs are called longitude_extent and latitude_extent now; not sure if that's your mistake.
If you post the actual code you run plus the error, I might be able to help.

But don't worry about tracing the history of things; just try to make this work on the current version.

@navidcy
Contributor

navidcy commented Apr 24, 2024

I still don't understand why you need bathymetry for the figure proposed in #100, but that seems like something we should discuss in #100.

@ashjbarnes
Collaborator Author

No, it wasn’t to do with wrong variable names. Everything is input with the updated argument names, but something breaks (“numpy can’t allocate 2Eib of data”) or hangs forever. It used to take 2 min. More investigation needed.
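
For scale, the size quoted in that error can be converted into an element count. This is only a back-of-the-envelope sketch: it assumes "2Eib" means 2 EiB (2 × 2^60 bytes) and that the array holds 8-byte float64 values, neither of which is confirmed in this thread.

```python
import math

# Assumption: "2Eib" in the quoted error means 2 EiB = 2 * 2**60 bytes.
EIB = 2**60
requested_bytes = 2 * EIB

# Assumption: the array elements are float64 (8 bytes each).
BYTES_PER_FLOAT64 = 8
requested_elements = requested_bytes // BYTES_PER_FLOAT64

# Side length of a square 2D float64 array of that size, for intuition.
square_side = math.isqrt(requested_elements)

print(f"requested elements: {requested_elements:,}")
print(f"equivalent square array: {square_side:,} x {square_side:,}")
```

A request of this size is far beyond any node's memory, so it may indicate an array shape computed from unexpected inputs rather than a domain that is merely large.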

@navidcy
Contributor

navidcy commented Apr 24, 2024

Could you provide a code snippet that gives the error when I copy-paste it into Python or a Jupyter notebook?

@navidcy
Contributor

navidcy commented Apr 25, 2024

I made an MWE.

import regional_mom6 as rmom6

import os
import xarray as xr
from pathlib import Path
from dask.distributed import Client

scratch = "/scratch/v45/nc3020"
gdata = "/g/data/v45/nc3020"
home = "/home/552/nc3020"

expt_name = "bathymetry_mwe"

input_dir = f"{scratch}/regional_mom6_configs/{expt_name}/"
run_dir = f"{home}/mom6_rundirs/{expt_name}/"
toolpath_dir = "/home/157/ahg157/repos/mom5/src/tools/"
tmp_dir = f"{gdata}/{expt_name}"

for path in (run_dir, tmp_dir, input_dir):
    os.makedirs(str(path), exist_ok=True)
    
expt = rmom6.experiment(
    longitude_extent = (142, 180),
    latitude_extent = (-56, -26),
    resolution = 1/20,
    date_range = ["2003-01-01 00:00:00", "2003-01-05 00:00:00"],
    number_vertical_layers = 75,
    layer_thickness_ratio = 10,
    depth = 4500,
    mom_run_dir = run_dir,
    mom_input_dir = input_dir,
    toolpath_dir = toolpath_dir
)

expt.setup_bathymetry(
    bathymetry_path='/g/data/ik11/inputs/GEBCO_2022/GEBCO_2022.nc',
    longitude_coordinate_name='lon',
    latitude_coordinate_name='lat',
    vertical_coordinate_name='elevation',
    minimum_layers=1
    )

expt.bathymetry.depth.plot()
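
As a rough guide to how much source data this MWE feeds into the regridding, the number of GEBCO points inside the domain can be estimated. The sketch assumes GEBCO_2022 is on its standard global 15 arc-second grid and that the subset is held in memory as float64. It ignores any padding that regional_mom6 may add around the domain, so the numbers are indicative only.

```python
ARCSEC_PER_DEGREE = 3600
GEBCO_SPACING_ARCSEC = 15  # assumption: standard GEBCO_2022 grid spacing


def source_points(longitude_extent, latitude_extent,
                  spacing_arcsec=GEBCO_SPACING_ARCSEC):
    """Return (nlon, nlat) source points covering whole-degree extents."""
    per_degree = ARCSEC_PER_DEGREE // spacing_arcsec
    nlon = (longitude_extent[1] - longitude_extent[0]) * per_degree
    nlat = (latitude_extent[1] - latitude_extent[0]) * per_degree
    return nlon, nlat


nlon, nlat = source_points((142, 180), (-56, -26))
n_points = nlon * nlat
size_gib = n_points * 8 / 2**30  # assumption: float64 in memory

print(f"source subset: {nlon} x {nlat} = {n_points:,} points")
print(f"approximate size as float64: {size_gib:.2f} GiB")
```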

@navidcy
Contributor

navidcy commented Apr 25, 2024

The above gives

Begin regridding bathymetry...

If this process hangs it means that the chosen domain might be too big to handle this way. After ensuring access to appropriate computational resources, try calling ESMF directly from a terminal in the input directory via

mpirun ESMF_Regrid -s bathymetry_original.nc -d bathymetry_unfinished.nc -m bilinear --src_var elevation --dst_var elevation --netcdf4 --src_regional --dst_regional

For details see https://xesmf.readthedocs.io/en/latest/large_problems_on_HPC.html

Aftewards, we run 'tidy_bathymetry' method to skip the expensive interpolation step, and finishing metadata, encoding and cleanup.
Regridding in parallel: True

and hangs there for at least 10-15 min, after which I lost patience and killed the kernel.
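
When testing the terminal route suggested in that message, one option is to run the command under a timeout so that a hang becomes an explicit result instead of an indefinite wait. The helper below is a hypothetical sketch, not part of regional_mom6. The `regrid_command` list simply repeats the command printed above and is not executed here; the demonstration uses a stand-in command that sleeps.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command; return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return completed.returncode


# The command from the message above, as an argument list (not run here).
regrid_command = [
    "mpirun", "ESMF_Regrid",
    "-s", "bathymetry_original.nc",
    "-d", "bathymetry_unfinished.nc",
    "-m", "bilinear",
    "--src_var", "elevation",
    "--dst_var", "elevation",
    "--netcdf4", "--src_regional", "--dst_regional",
]

# Stand-in for a hanging job: sleeps longer than the timeout allows.
demo_command = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(demo_command, timeout_seconds=1))  # None
```

Whether terminating `mpirun` this way also cleans up all of its worker processes depends on the MPI installation, so treat that as something to check.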

However, if I change to

    longitude_extent = (142, 144),
    latitude_extent = (-56, -52),
    resolution = 1/4,

I get this plot after a few seconds...

[Figure: plot of expt.bathymetry.depth for the reduced domain]

I don't see the claimed bug!

On the contrary, the code warns the user with "If this process hangs it means ...", so not only is there no bug, but the code also tells users what to do if they want to wait less.
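
For reference, the two configurations compared above differ greatly in size. The sketch below counts target cells for each, assuming a plain regular latitude-longitude grid at the stated resolution; the grid that regional_mom6 actually builds may differ in detail, so only the ratio is meant to be informative.

```python
def grid_shape(longitude_extent, latitude_extent, resolution):
    """Return (nx, ny) for a regular lat/lon grid (a simplification)."""
    nx = round((longitude_extent[1] - longitude_extent[0]) / resolution)
    ny = round((latitude_extent[1] - latitude_extent[0]) / resolution)
    return nx, ny


large = grid_shape((142, 180), (-56, -26), 1 / 20)
small = grid_shape((142, 144), (-56, -52), 1 / 4)

large_cells = large[0] * large[1]
small_cells = small[0] * small[1]

print(f"large domain: {large[0]} x {large[1]} = {large_cells:,} cells")
print(f"small domain: {small[0]} x {small[1]} = {small_cells:,} cells")
print(f"ratio: {large_cells / small_cells:.1f}")
```

Under these assumptions the reduced test has several thousand times fewer target cells than the original domain, so it exercises a much lighter regridding problem.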

@navidcy navidcy changed the title setup bathymetry hangs on simple domain setup_bathymetry hangs on simple domain Apr 25, 2024
@navidcy navidcy added question ❓ Further information is requested and removed bug 🐞 Something isn't working labels Apr 25, 2024
@ashjbarnes
Collaborator Author

Thanks. The point, though, is that the code used to work with the same-sized example and the same-sized compute, just in the Jupyter notebook. So something has still messed up the code's efficiency.

@navidcy
Contributor

navidcy commented Apr 25, 2024

OK. A performance issue :)

@ashjbarnes
Collaborator Author

I've tried with mpirun and that breaks too, despite being given ample resources (96 cores, 250 GB memory). This points to an issue with the hgrid and raw bathymetry files, as these are what are fed into the mpirun script. Or with xESMF itself somehow? I'll keep looking into it, but it might take me a while.
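
One possible starting point for that investigation is a sanity check on the coordinate arrays in the two files before they reach the regridder. The helper below is hypothetical and uses plain Python lists. In practice the values would be read from the NetCDF files (for example with xarray), which is not shown, and the expected range is whatever the domain is supposed to cover.

```python
import math


def coordinate_problems(values, lower, upper):
    """Return a list of problems found in a 1D coordinate array."""
    problems = []
    if not all(math.isfinite(v) for v in values):
        # Later checks are meaningless if NaN or inf values are present.
        problems.append("non-finite values")
        return problems
    if any(b <= a for a, b in zip(values, values[1:])):
        problems.append("not strictly increasing")
    if values and (min(values) < lower or max(values) > upper):
        problems.append("values outside expected range")
    return problems


# Synthetic examples standing in for longitudes read from a file.
good = [142.0 + 0.05 * i for i in range(5)]
bad = [142.0, 142.05, float("nan"), 142.15]

print(coordinate_problems(good, 142, 180))  # []
print(coordinate_problems(bad, 142, 180))   # ['non-finite values']
```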
