Added ability to de/serialize from/to avro container. #119

Open
wants to merge 2 commits into
base: master

@wdcossey wdcossey commented Oct 4, 2023

Avro Container

  • Added AvroConvert.DeserializeContainer() for deserialization (of IEnumerable<>).
  • Added AvroConvert.SerializeContainer() for serialization (of IEnumerable<>).
  • Added IsEnumerable() to Type extensions.
  • Workaround for deserialization added to Resolver.ResolveArray()
  • Added some [basic] tests.

*** I haven't had time to test everything as I only needed serialization for Azure Data Explorer
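
For readers unfamiliar with the helper mentioned in the list above, here is a minimal sketch of what an `IsEnumerable()` extension on `Type` could look like. This is an assumption for illustration, not the code from this PR: the class name `TypeExtensionsSketch` is hypothetical, and excluding `string` is a design choice the PR may or may not share.

```csharp
using System;
using System.Collections;

// Hypothetical sketch; the PR's actual implementation may differ.
public static class TypeExtensionsSketch
{
    // Returns true for collection-like types (arrays, List<T>, Dictionary<,>, ...).
    // string implements IEnumerable<char> but is excluded here (assumption),
    // because a serializer normally treats it as a single value.
    public static bool IsEnumerable(this Type type)
    {
        if (type == null) throw new ArgumentNullException(nameof(type));

        return type != typeof(string)
            && typeof(IEnumerable).IsAssignableFrom(type);
    }
}
```

With this sketch, `typeof(int[]).IsEnumerable()` and `typeof(System.Collections.Generic.List<int>).IsEnumerable()` are true, while `typeof(string).IsEnumerable()` and `typeof(int).IsEnumerable()` are false. Dictionaries also count as enumerable here, so a caller that wants to skip them needs a separate check.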

@wdcossey wdcossey mentioned this pull request Oct 4, 2023
@AdrianStrugala
Owner

Thank you for your contribution, I will do my best to review it today. Looks pretty nice at first glance!

Updated code summary.
Removed unused code.
Enforced `IEnumerable` on `DeserializeContainer<T>(...)` methods.
@wdcossey
Author

wdcossey commented Oct 5, 2023

> Thank you for your contribution, I will do my best to review it today. Looks pretty nice at first glance!

Pushed a bug-fix and some enhancements.

@@ -0,0 +1,146 @@
#region license
Owner

The default AvroConvert.Deserialize() method handles objects serialized in this way correctly. Could you please revert the changes done in the deserialization part?

@AdrianStrugala
Owner

Very nice PR, thank you. Just two minor comments from my side. When you address them, I will merge the PR, write a short doc, and create the next release.

@gmanvel
Contributor

gmanvel commented Nov 18, 2023

@AdrianStrugala @wdcossey what's the decision on this PR? In general it looks like a deviation from the AvroConvert API, which, in my impression, follows the Newtonsoft.Json JsonConvert approach: JsonConvert.SerializeObject takes care of serializing all object types, and there are no specialized methods for specific types. It would also mean all existing clients need to change their code to benefit from this. We could instead change the Serialize method to apply this approach whenever the passed object is a collection, e.g.

/// <summary>
/// Serializes given object into Avro format (including header with metadata)
/// Choosing <paramref name="codecType"/> reduces output object size
/// </summary>
public static byte[] Serialize(object obj, CodecType codecType)
{
    var schema = Schema.Create(obj);

    // Collections (except dictionaries) are written item by item,
    // so each element is appended to the container separately.
    if (schema is ArraySchema && !obj.GetType().IsDictionary())
    {
        var enumerator = ((IEnumerable)obj).GetEnumerator();

        // Enumerate in a single pass: the item schema comes from the first
        // element, and Reset() is not supported by many enumerators.
        // An empty collection falls through to the default path below.
        if (enumerator.MoveNext())
        {
            var itemSchema = Schema.Create(enumerator.Current);

            using (MemoryStream resultStream = new MemoryStream())
            {
                using (var writer = new Encoder(itemSchema, resultStream, codecType))
                {
                    do
                    {
                        writer.Append(enumerator.Current);
                    } while (enumerator.MoveNext());
                }

                return resultStream.ToArray();
            }
        }
    }

    // Default path: the whole object is written as a single entry.
    using (MemoryStream resultStream = new MemoryStream())
    {
        using (var writer = new Encoder(schema, resultStream, codecType))
        {
            writer.Append(obj);
        }

        return resultStream.ToArray();
    }
}

On the other hand, this is potentially a breaking change. While AvroConvert.Deserialize can successfully deserialize .avro files generated this way, the byte content of files generated before and after this change is not the same.

I suggest making a decision and implementing this change in the library, as the performance improvements are significant:

| UserCount | Original Mean (ms) | Improved Mean (ms) | Mean Improvement (%) | Original Allocated (MB) | Improved Allocated (MB) | Allocation Improvement (%) |
|---|---|---|---|---|---|---|
| 100 | 2.932 | 0.9109 | 68.9% | 2.15 | 1.57 | 27.0% |
| 1000 | 12.314 | 7.8920 | 35.9% | 19.46 | 12.79 | 34.3% |
| 10000 | 123.433 | 103.5033 | 16.1% | 217.91 | 151.68 | 30.4% |
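
The improvement columns above appear to follow from (original - improved) / original. A quick way to check them is sketched below; the class and method names are illustrative and not part of AvroConvert or the benchmark.

```csharp
using System.Globalization;

// Illustrative helper for re-deriving the improvement columns.
public static class ImprovementCheck
{
    // Relative improvement in percent: (original - improved) / original * 100.
    public static double Percent(double original, double improved) =>
        (original - improved) / original * 100.0;

    // Formats the result with one decimal place, independent of the
    // machine's culture settings.
    public static string Format(double original, double improved) =>
        Percent(original, improved).ToString("F1", CultureInfo.InvariantCulture) + "%";
}
```

For the UserCount = 100 row, `ImprovementCheck.Format(2.932, 0.9109)` gives `68.9%` for the mean time and `ImprovementCheck.Format(2.15, 1.57)` gives `27.0%` for the allocations, matching the table.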

Benchmark comparing the AvroConvert v3.4.0 NuGet package with an AvroConvert.Serialize that supports serializing array items into separate blocks:

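using System.Linq;
using AutoFixture;
using BenchmarkDotNet.Attributes;
using SolTechnology.Avro;

// User and Offering are the benchmark's model types (not shown here).
[MemoryDiagnoser]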
public class AvroConvertSerializeArray
{
    [Params(100, 1_000, 10_000)]
    public int UserCount;
    private User[] _data;

    [GlobalSetup]
    public void Setup()
    {
        Fixture fixture = new Fixture();
        _data = fixture
            .Build<User>()
            .With(u => u.Offerings, fixture.CreateMany<Offering>(21).ToList)
            .CreateMany(UserCount)
            .ToArray();
    }

    [Benchmark]
    public byte[] Serialize() => AvroConvert.Serialize(_data);

}

@AdrianStrugala
Owner

Hey,
I am going to implement this in a way similar to what you've suggested, Manvel.
The point is that this is in fact a breaking change, so I would make it part of the v4 release.
Adrian

3 participants