Fix preset sharding options and add tests #1

Merged
merged 3 commits into from
Jul 26, 2024


melissalinkert

This fixes the three "preset" --shard options (SINGLE, CHUNK, and SUPERCHUNK) based on comments in zarr-developers/zarr-java#5.

Using --shard SINGLE attempts to create a v3 dataset with a single shard covering the entire array. --shard CHUNK creates one shard per chunk, and --shard SUPERCHUNK attempts to create shards with 2x2 chunks per shard.
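
As a rough illustration of the three presets described above, the following Python sketch derives a shard shape from an array shape and a chunk shape. The function name `preset_shard_shape` and the choice to apply the 2x2 grouping to the last two (Y, X) dimensions are assumptions made for illustration; the converter itself is written in Java and may compute these differently.

```python
def preset_shard_shape(preset, array_shape, chunk_shape):
    """Return the shard shape implied by a preset name (hypothetical helper)."""
    name = preset.upper()
    if name == "SINGLE":
        # One shard covering the entire array.
        return tuple(array_shape)
    if name == "CHUNK":
        # One shard per chunk.
        return tuple(chunk_shape)
    if name == "SUPERCHUNK":
        # 2x2 chunks per shard. Assumption: the grouping applies to the
        # last two dimensions and the other dimensions keep the chunk size.
        shape = list(chunk_shape)
        shape[-1] *= 2
        shape[-2] *= 2
        return tuple(shape)
    raise ValueError(f"unknown shard preset: {preset}")


array_shape = (1, 1, 1, 2048, 2048)
chunk_shape = (1, 1, 1, 1024, 1024)
for preset in ("SINGLE", "CHUNK", "SUPERCHUNK"):
    print(preset, preset_shard_shape(preset, array_shape, chunk_shape))
```

For a 2048x2048 plane with 1024x1024 chunks, this sketch gives the same shard shape for SINGLE and SUPERCHUNK, which is the case where the two presets would be expected to coincide.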

The --shard option has no effect on writing v2 data, so it should only be used in the context of converting v2 to v3.

As far as I can tell, and as the chunkAndShardCompatible method suggests, the chunk, shard, and array shapes all need to divide evenly into each other, or an exception will be thrown. Those cases should now be caught with a warning, but use caution when testing with v2 input data whose array shape is not an exact multiple of the chunk size.
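
The divisibility requirement can be sketched as follows. This is not the body of chunkAndShardCompatible; it is a minimal Python illustration of one reading of the rule, namely that each shard must hold a whole number of chunks and the array must hold a whole number of shards in every dimension. The function name and signature below are hypothetical.

```python
def shapes_divide_evenly(array_shape, chunk_shape, shard_shape):
    """Check chunk | shard | array along every dimension (hypothetical helper)."""
    if not (len(array_shape) == len(chunk_shape) == len(shard_shape)):
        return False
    for array_dim, chunk_dim, shard_dim in zip(array_shape, chunk_shape, shard_shape):
        if chunk_dim <= 0 or shard_dim <= 0:
            return False
        if shard_dim % chunk_dim != 0:
            # A shard must contain a whole number of chunks.
            return False
        if array_dim % shard_dim != 0:
            # The array must contain a whole number of shards.
            return False
    return True


# Evenly divisible: 1024 divides 2048, and 2048 divides 2048.
print(shapes_divide_evenly((2048, 2048), (1024, 1024), (2048, 2048)))
# Not divisible: the array is not a multiple of the shard size.
print(shapes_divide_evenly((1000, 1000), (256, 256), (512, 512)))
```

Under this reading, a caller would fall back to unsharded output and log a warning whenever the check returns False, rather than letting the write fail.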

There is a placeholder here for allowing custom shard sizes to be specified; I plan to implement that in a separate PR.

This wasn't necessary when building locally, but hopefully fixes CI builds.

sbesson commented Jul 25, 2024

Tested against a 384-well plate (https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0128E/9701.zarr) with images of size 2048x2048, using the different sharding options:

[sbesson@pilot-zarr3-dev ~]$ time ./zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0128/9701.zarr/ /data/idr0128/9701_v3_default.zarr/
...
real    1m57.082s
user    1m0.785s
sys     0m11.581s
[sbesson@pilot-zarr3-dev ~]$ time ./zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0128/9701.zarr/ /data/idr0128/9701_v3_single.zarr/ --shard SINGLE
...

real    13m39.567s
user    8m46.289s
sys     0m48.962s
[sbesson@pilot-zarr3-dev ~]$ time ./zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0128/9701.zarr/ /data/idr0128/9701_v3_chunk.zarr/ --shard CHUNK
...
real    2m22.479s
user    1m3.591s
sys     0m12.202s
[sbesson@pilot-zarr3-dev ~]$ time ./zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr /data/idr0128/9701.zarr/ /data/idr0128/9701_v3_superchunk.zarr/ --shard SUPERCHUNK
...
real    8m7.093s
user    4m22.764s
sys     0m29.344s

8.0G    9701.zarr
8.0G    9701_v3_chunk.zarr
8.0G    9701_v3_default.zarr
8.0G    9701_v3_single.zarr
8.0G    9701_v3_superchunk.zarr
40G     total
[sbesson@pilot-zarr3-dev idr0128]$ find 9701.zarr/ -type f | wc
   8468    8468  231481
[sbesson@pilot-zarr3-dev idr0128]$ find 9701_v3_chunk.zarr/ -type f | wc
   7699    7699  298006
[sbesson@pilot-zarr3-dev idr0128]$ find 9701_v3_default.zarr/ -type f | wc
   7699    7699  313404
[sbesson@pilot-zarr3-dev idr0128]$ find 9701_v3_single.zarr/ -type f | wc
   3859    3859  149705
[sbesson@pilot-zarr3-dev idr0128]$ find 9701_v3_superchunk.zarr/ -type f | wc
   5395    5395  233685
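
To compare the runs more directly, the wall-clock (`real`) values reported by `time` above can be converted to seconds and expressed relative to the default conversion. The snippet below is only a convenience sketch: the helper name `to_seconds` is made up here, and the timing strings are copied from the output above.

```python
import re


def to_seconds(value):
    """Convert a `time`-style duration such as '13m39.567s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` timings copied from the plate conversion runs above.
real_times = {
    "default": "1m57.082s",
    "SINGLE": "13m39.567s",
    "CHUNK": "2m22.479s",
    "SUPERCHUNK": "8m7.093s",
}

baseline = to_seconds(real_times["default"])
for name, value in real_times.items():
    seconds = to_seconds(value)
    print(f"{name}: {seconds:.1f} s ({seconds / baseline:.2f}x default)")
```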

The SUPERCHUNK configuration raised warnings about incompatible sizes, which is surprising: the 2048x2048 array should be divided into four 1024x1024 inner chunks, so I would expect this configuration to be identical to the --shard SINGLE option.

20:22:00.081 [main] INFO com.glencoesoftware.zarr.Convert -- found 4 resolutions
20:22:00.081 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/21/0/0
20:22:00.899 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/21/0/1
20:22:00.899 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
20:22:00.921 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/21/0/2
20:22:00.922 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
20:22:00.929 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/21/0/3
20:22:00.929 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
20:22:01.268 [main] INFO com.glencoesoftware.zarr.Convert -- found 4 resolutions
20:22:01.269 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/22/0/0
20:22:02.260 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/22/0/1
20:22:02.260 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
20:22:02.289 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/22/0/2
20:22:02.290 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
20:22:02.296 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /data/idr0128/9701.zarr/O/22/0/3
20:22:02.296 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
20:22:02.351 [main] INFO com.glencoesoftware.zarr.Convert -- found 4 resolutions
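
In the log above, the warning appears for arrays 1, 2, and 3 of each image but not for array 0. One possible reading, offered only as a guess, is that a 2x2 shard no longer fits the smaller resolution levels. The Python sketch below works through that arithmetic under two assumptions that are not confirmed anywhere in this thread: each resolution level halves the XY size, and the chunk size is 1024 capped at the level size.

```python
def superchunk_fits(array_xy, chunk_xy):
    """True if a shard of 2 chunks per axis fits the array evenly (hypothetical check)."""
    shard_xy = chunk_xy * 2
    return shard_xy <= array_xy and array_xy % shard_xy == 0


# Assumed pyramid: each level halves the XY size of the previous one.
level_sizes = [2048, 1024, 512, 256]

for level, size in enumerate(level_sizes):
    # Assumed chunking: 1024, capped at the level size.
    chunk = min(size, 1024)
    status = "sharded" if superchunk_fits(size, chunk) else "skipped"
    print(f"level {level}: array {size}x{size}, chunk {chunk}x{chunk} -> {status}")
```

If those assumptions hold, only the full-resolution level can take a 2x2 shard, which would match the pattern of warnings; checking the actual array and chunk shapes in the input metadata would confirm or rule this out.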


@sbesson sbesson left a comment

Tested the conversion on a 3D image (https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0048A/9846151.zarr/)

[sbesson@pilot-zarr3-dev idr0048]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9846151.zarr/ 9846151_v3_default.zarr/
21:19:11.481 [main] INFO com.glencoesoftware.zarr.Convert -- opened 9846151.zarr/0
21:19:11.498 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
21:19:14.222 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
21:19:14.258 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/0
23:17:45.866 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/1
23:50:01.207 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/2
23:58:55.521 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/3
00:02:02.136 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/4
00:04:15.280 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/5

real    166m17.857s
user    21m3.538s
sys     5m7.594s
[sbesson@pilot-zarr3-dev idr0048]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9846151.zarr/ 9846151_v3_single.zarr/ --shard single
04:30:52.405 [main] INFO com.glencoesoftware.zarr.Convert -- opened 9846151.zarr/0
04:30:52.414 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
04:30:54.712 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
04:30:54.834 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/0
04:30:54.835 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
06:29:44.030 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/1
06:29:44.030 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
07:01:48.022 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/2
07:01:48.022 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
07:11:04.974 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/3
07:14:04.809 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/4
07:16:17.589 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/5

real    166m58.189s
user    20m39.577s
sys     5m3.107s
[sbesson@pilot-zarr3-dev idr0048]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9846151.zarr/ 9846151_v3_superchunk.zarr/ --shard superchunk
07:44:39.263 [main] INFO com.glencoesoftware.zarr.Convert -- opened 9846151.zarr/0
07:44:39.270 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
07:44:41.304 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
07:44:41.379 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/0
07:44:41.380 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
09:31:50.220 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/1
09:31:50.221 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
09:56:35.121 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/2
09:56:35.121 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
10:02:39.938 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/3
10:04:34.103 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/4
10:05:33.205 [main] INFO com.glencoesoftware.zarr.Convert -- opened array 9846151.zarr/0/5

real    141m30.396s
user    19m27.082s
sys     4m37.969s

[sbesson@pilot-zarr3-dev idr0048]$ ls
9846151.zarr  9846151_v3_default.zarr  9846151_v3_single.zarr  9846151_v3_superchunk.zarr
[sbesson@pilot-zarr3-dev idr0048]$ du -csh *
133G	9846151.zarr
212G	9846151_v3_default.zarr
212G	9846151_v3_single.zarr
212G	9846151_v3_superchunk.zarr
768G	total
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151.zarr -type f | wc
 121987  121987 3562972
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151_v3_default.zarr/ -type f | wc
 121984  121984 5148693
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151_v3_single.zarr/ -type f | wc
 121984  121984 5026709
[sbesson@pilot-zarr3-dev idr0048]$ find 9846151_v3_superchunk.zarr/ -type f | wc
 121984  121984 5514645

The difference in size made me realize that the default compression is none, so I ran another round of conversion against the sample plate with sharding and blosc compression, using the increased verbosity added in the last commit:

[sbesson@pilot-zarr3-dev idr0128]$ time ~/zarr2zarr-0.0.1-SNAPSHOT/bin/zarr2zarr 9701.zarr/ 9701_v3_single_blosc.zarr --shard single --compression blosc --debug
...
real	12m19.188s
user	5m44.453s
sys	0m37.196s
[sbesson@pilot-zarr3-dev idr0128]$ du -csh *
8.0G	9701.zarr
8.0G	9701_v3_chunk.zarr
8.0G	9701_v3_default.zarr
8.0G	9701_v3_single.zarr
5.5G	9701_v3_single_blosc.zarr
8.0G	9701_v3_superchunk.zarr
46G	total

Possibly the next thing to look into is why sharding is rejected for the 3D image. But I am happy for this to be merged.

@sbesson sbesson merged commit a0ecef2 into glencoesoftware:master Jul 26, 2024
4 checks passed
@melissalinkert melissalinkert mentioned this pull request Jul 30, 2024