
Add --chunk option to allow re-chunking when writing v3 #8

Merged: 1 commit, Sep 13, 2024

Conversation

melissalinkert (Member)

--chunk, similar to --shard, takes a comma-separated chunk size. If not specified, the input dataset's original chunk size will be used as before. This currently only applies to v2 -> v3 (not v3 -> v2).
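As an illustration of the comma-separated size format, here is a minimal Python sketch of how such a spec maps onto a chunk grid. This is purely illustrative arithmetic, not zarr2zarr's actual implementation (the tool is written in Java), and the 5D shape below is hypothetical:

```python
import math

def parse_size(spec: str) -> tuple:
    # "1,1,4,512,512" -> (1, 1, 4, 512, 512), one entry per dimension
    return tuple(int(x) for x in spec.split(","))

def chunk_grid(shape, chunk):
    # number of chunks along each dimension (edge chunks may be partial)
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunk))

# hypothetical 5D (t, c, z, y, x) array shape
shape = (1, 1, 16, 8192, 8192)
chunk = parse_size("1,1,4,512,512")
print(chunk_grid(shape, chunk))  # (1, 1, 4, 16, 16)
```

If `--chunk` is omitted, the converter keeps the input dataset's chunk size, so the grid is unchanged from the source.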

@sbesson linked an issue on Sep 11, 2024 that may be closed by this pull request
sbesson (Member) commented on Sep 11, 2024

An initial round of validation with various chunk and shard sizes looks good, both in terms of conversion times and binary layout. I am reaching the point where we need some additional validation of the generated data for different options. The current proposal is to generate a set of Zarr datasets derived from idr0048 using the combination of this work and the first commit of #7, upload them to a public bucket on EBI Embassy, and use the in-progress work from ome/ome-ngff-validator#36 to validate the output from both a metadata and a binary perspective.

melissalinkert (Member, Author)

Current proposal would be to generate a set of Zarr datasets derived from idr0048 using the combination of this work and the first commit of #7, upload to a public bucket of EBI Embassy and use the in-progress work from ome/ome-ngff-validator#36 to validate the output both from a metadata and binary perspective.

Sounds like a good plan. I haven't reviewed #7 yet since it's a draft and has failing tests, but having concrete datasets to move that forward definitely makes sense.

@sbesson (Member) left a comment

Using a similar methodology and the same input data as in #2 (review), I converted the input Zarr v2 dataset to v3 with a range of chunk and shard sizes:

| --chunk | --shard | Conversion time | Chunks (0) | Chunks (1) | Chunks (2) | Chunks (3) | Chunks (4) | Chunks (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | 74m25.481s | 75708 | 25236 | 8412 | 4206 | 4206 | 4206 |
| 1,1,1,1024,1024 | 1,1,4,4096,4096 | 102m12.047s | 2106 | 1053 | 1053 | 1053 | 1053 | 1053 |
| 1,1,4,512,512 | 1,1,4,4096,4096 | 103m57.428s | 2106 | 1053 | 1053 | 1053 | 1053 | 1053 |
| 1,1,16,256,256 | 1,1,16,2048,2048 | 85m55.359s | 1584 | 528 | 264 | 264 | 264 | 264 |
| 1,1,32,128,128 | 1,1,64,1024,1024 | 81m57.966s | 1188 | 396 | 132 | 66 | 66 | 66 |
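The per-level counts in a table like this follow from ceil-dividing each pyramid level's shape by the shard size, and the number of inner chunks per shard from dividing the shard size by the chunk size. A rough Python sketch of that arithmetic, using a hypothetical full-resolution shape since the idr0048 dataset dimensions are not listed here:

```python
import math

def block_count(shape, block):
    # how many blocks (shards or chunks) are needed to tile `shape`;
    # edge blocks may be partially filled
    n = 1
    for s, b in zip(shape, block):
        n *= math.ceil(s / b)
    return n

# hypothetical full-resolution level shape (t, c, z, y, x)
level = (1, 1, 64, 16384, 16384)
shard = (1, 1, 16, 2048, 2048)  # as in --shard 1,1,16,2048,2048
chunk = (1, 1, 16, 256, 256)    # as in --chunk 1,1,16,256,256

print(block_count(level, shard))  # 256 shards at this level
print(block_count(shard, chunk))  # 64 inner chunks per shard
```

Note that Zarr v3's sharding codec requires the shard shape to be an integer multiple of the inner chunk shape in every dimension, which all of the combinations tested above satisfy.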

Times and numbers of shards are consistent with my expectations. Currently looking into validating some of this output with ome/ome-ngff-validator#36.

The default conversion works without issue (see https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/zarr2zarr/idr0048/9846151_zstd.zarr/0/), but all datasets generated with a combination of chunks/shards fail to load chunks in the browser; see e.g.
https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/zarr2zarr/idr0048/9846151_1024x1024chunks_4x4096x4096shards_zstd.zarr/0/ or https://deploy-preview-36--ome-ngff-validator.netlify.app/?source=https://uk1s3.embassy.ebi.ac.uk/zarr2zarr/idr0048/9846151_16x256x256chunks_16x2048x2048shards_zstd.zarr/0/

Given this also affects converted datasets with the same inner chunk size (1024x1024), I think the problem is outside the scope of this PR. Proposing to merge this and, as a follow-up, investigate why the chunks/shards created by this library fail to load.

melissalinkert (Member, Author)

Proposing to merge and investigate why the chunks/shards created by this library fail to load as a follow-up investigation

Sounds good to me.

@sbesson merged commit 5439f5e into glencoesoftware:master on Sep 13, 2024
4 checks passed
Successfully merging this pull request may close these issues.

Add option to specify chunk sizes