
More sharding work #2

Merged · 3 commits · Aug 26, 2024
Conversation

melissalinkert (Member)

Follow-up to #1 and zarr-developers/zarr-java#5.

This fixes the check for valid shard/chunk size pairs, adds support for sharding in more than two dimensions, and allows custom shard sizes to be specified as a comma-separated array, e.g. --shard 1,1,2,1024,1024.
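For context, the shard/chunk validity rule being fixed can be sketched as below. This is an illustrative sketch only, assuming the usual Zarr v3 sharding rule; the class and method names are hypothetical, not the actual zarr2zarr API:

```java
// ShardCheck.java -- illustrative sketch, not the zarr2zarr implementation.
public class ShardCheck {

    // Parse a comma-separated --shard value such as "1,1,2,1024,1024".
    public static int[] parseShard(String value) {
        String[] tokens = value.split(",");
        int[] shape = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            shape[i] = Integer.parseInt(tokens[i].trim());
        }
        return shape;
    }

    // A shard (outer chunk) and inner chunk shape form a valid pair when
    // both have the same rank and every shard dimension is a positive,
    // exact multiple of the matching inner chunk dimension.
    public static boolean isValidShard(int[] shard, int[] chunk) {
        if (shard.length != chunk.length) {
            return false;
        }
        for (int i = 0; i < shard.length; i++) {
            if (chunk[i] <= 0 || shard[i] % chunk[i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

Because the check loops over all dimensions, it works for any rank, which is what lets sharding apply beyond two dimensions.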

@sbesson sbesson self-requested a review August 14, 2024 20:35
@sbesson sbesson left a comment

Tested using the 9846151.zarr sample from https://idr.github.io/ome-ngff-samples/ (dimensions 2947 x 5192 x 1402 x 3 x 1) also used for the NGFF challenge.

The conversion command was executed with zstd compression (to speed up conversion and reduce dataset size) for several shard-size variations:

| --shard | Outer chunk shape | Inner chunk shape | Conversion time | Dataset size | Number of chunks |
| --- | --- | --- | --- | --- | --- |
| (none) | [1,1,1,1024,1024] | (none) | 222m29.226s | 76G | 121975 |
| SUPERCHUNK | [1,2,2,2048,2048] | [1,1,1,1024,1024] | 753m55.036s | 76G | 16825 |
| 1,1,1,4096,4096 | [1,1,1,4096,4096] | [1,1,1,1024,1024] | 686m33.655s | 76G | 92572 |
| 1,1,4,4096,4096 | [1,1,4,4096,4096] | [1,1,1,1024,1024] | 1966m36.416s | 76G | 67348 |

Overall, the numbers are consistent with the expectations from the shard dimensions. It is clear that as things stand sharding adds a conversion overhead. I'll try and reproduce the last measurement as the time feels like an outlier.

Thinking of possible next features up for discussion:

  • chunk dimensions: for the dataset above, it would be really interesting to test a 3D rechunking (with and without shards)
  • performance: I assume any form of parallelization would need to happen at the outer chunk level
  • metadata update: to be compliant with the proposed RFC-2 also used by https://github.com/ome/ome2024-ngff-challenge

sbesson (Member) commented Aug 23, 2024

I was able to reproduce the conversion times in #2 (review) i.e. ~1900m for --shard 1,1,4,4096,4096

I also ran a conversion with --shard 1,2,2,2048,2048, the shard that the SUPERCHUNK option selects for the highest resolution. Below are the conversion times alongside the number of chunks for every resolution:

| --shard | Conversion time | Chunks (0) | Chunks (1) | Chunks (2) | Chunks (3) | Chunks (4) | Chunks (5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (none) | 222m29.226s | 75708 | 25236 | 8412 | 4206 | 4206 | 4206 |
| SUPERCHUNK | 753m55.036s | 8412 | 2804 | 1402 | 1402 | 1402 | 1402 |
| 1,1,1,4096,4096 | 686m33.655s | 8412 | 4206 | 8412 | 4206 | 4206 | 4206 |
| 1,1,4,4096,4096 | 1966m36.416s | 2106 | 1053 | 8412 | 4206 | 4206 | 4206 |
| 1,2,2,2048,2048 | 777m52.603s | 8412 | 2804 | 4206 | 4206 | 4206 | 4206 |

I had not fully appreciated that SUPERCHUNK is adapting to every resolution using the array dimensions

[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_superchunk_zstd.zarr/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,5192,2947],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,2,2,2048,2048]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"sharding_indexed","configuration":{"chunk_shape":[1,1,1,1024,1024],"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"}],"index_codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"crc32c","name":"crc32c"}],"index_location":"end"},"name":"sharding_indexed"},{"name":"zstd","configuration":{"level":5,"checksum":true},"name":"zstd"}],"dimension_names":null,"attributes":{}}
[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_superchunk_zstd.zarr/0/3/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,649,368],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,2,2,1298,736]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"sharding_indexed","configuration":{"chunk_shape":[1,1,1,649,368],"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"}],"index_codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"crc32c","name":"crc32c"}],"index_location":"end"},"name":"sharding_indexed"},{"name":"zstd","configuration":{"level":5,"checksum":true},"name":"zstd"}],"dimension_names":null,"attributes":{}}

while a custom --shard is a fixed value, which falls back to simple chunks if it exceeds the array dimensions

[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_2x2x2048x2048_zstd.zarr/0/0/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,5192,2947],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,2,2,2048,2048]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"sharding_indexed","configuration":{"chunk_shape":[1,1,1,1024,1024],"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"}],"index_codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"crc32c","name":"crc32c"}],"index_location":"end"},"name":"sharding_indexed"},{"name":"zstd","configuration":{"level":5,"checksum":true},"name":"zstd"}],"dimension_names":null,"attributes":{}}
[sbesson@pilot-zarr3-dev zarr2zarr]$ cat 9846151_2x2x2048x2048_zstd.zarr/0/3/zarr.json 
{"zarr_format":3,"node_type":"array","shape":[1,3,1402,649,368],"data_type":"uint16","chunk_grid":{"name":"regular","configuration":{"chunk_shape":[1,1,1,649,368]},"name":"regular"},"chunk_key_encoding":{"name":"default","configuration":{"separator":"/"},"name":"default"},"fill_value":255,"codecs":[{"name":"bytes","configuration":{"endian":"little"},"name":"bytes"},{"name":"zstd","configuration":{"level":5,"checksum":true},"name":"zstd"}],"dimension_names":null,"attributes":{}}

Note that in the first scenario there is still an issue: the shard/outer chunk shape for the 4th resolution, [1,2,2,1298,736], has values greater than the array dimensions [1,3,1402,649,368]
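For what it's worth, the adaptive behavior visible in the two SUPERCHUNK zarr.json dumps above (shard [1,2,2,2048,2048] for inner chunk [1,1,1,1024,1024] at resolution 0, shard [1,2,2,1298,736] for inner chunk [1,1,1,649,368] at resolution 3) is consistent with doubling the inner chunk along every dimension after the first. The sketch below only reproduces that observation; it is not the actual implementation, and the names are hypothetical:

```java
// SuperChunk.java -- illustrative only: derives a per-resolution shard by
// doubling the inner chunk shape along every dimension after the first,
// matching what the zarr.json dumps above show. Note this reproduces the
// bug too: the result can exceed the array dimensions, as at resolution 3.
public class SuperChunk {

    public static int[] shardFor(int[] innerChunk) {
        int[] shard = innerChunk.clone();
        for (int i = 1; i < shard.length; i++) {
            shard[i] = innerChunk[i] * 2;
        }
        return shard;
    }
}
```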

Semi-related: I tried including the latest zarr-java 0.0.4 release, which adds some sharding validation, on top of this branch. Several sharding unit tests now fail during conversion with exceptions:

dev.zarr.zarrjava.ZarrException: Shape [1, 1, 10, 1024, 1024] can not contain chunk shape [1, 1, 4, 2048, 2048]
	at app//dev.zarr.zarrjava.v3.ArrayMetadata.<init>(ArrayMetadata.java:106)
	at app//dev.zarr.zarrjava.v3.ArrayMetadata.<init>(ArrayMetadata.java:69)
	at app//dev.zarr.zarrjava.v3.ArrayMetadataBuilder.build(ArrayMetadataBuilder.java:141)
	at app//com.glencoesoftware.zarr.Convert.convertToV3(Convert.java:367)
	at app//com.glencoesoftware.zarr.test.ConversionTest.test3DSharding(ConversionTest.java:403)

I believe this is related to the scenario above as the specified shard exceeds the array dimensions for a resolution lower than the largest one. We probably need to decide what the behavior of the library should be in these scenarios. I see two options:

  • drop the sharding entirely and use simple chunks
  • trim the shard shape along the dimensions where it exceeds the inner chunk/array shape

@sbesson sbesson left a comment

Discussed earlier with @melissalinkert. The zarr-java 0.0.4 issue has been raised upstream to confirm whether this is the intended behavior.

Merging this PR as it gives us a first API to generate Zarr v3 datasets with arbitrary shard sizes. I'll also file some issues for the various points that came up as part of the code/functional review.
