Avoid pickle/dill hell due to collections.namedtuple #7

Open
casassg opened this issue Jun 24, 2020 · 0 comments

casassg commented Jun 24, 2020

Several libraries monkeypatch collections.namedtuple because namedtuples are difficult to pickle, namely pyspark (https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/serializers.py#L385) and beam (https://github.com/apache/beam/blob/v2.21.0/sdks/python/apache_beam/internal/pickler.py#L150).

This makes it difficult to use TFX when pyspark is installed in your environment, as both libraries try to hijack pickling at the same time.

I know the root cause is pyspark and beam both trying to solve this through monkey patching, but I'm wondering if it's possible to move away from namedtuples altogether in the TFX BSL codebase. I found the issue while trying to launch a Dataflow job from my JupyterLab environment, which has PySpark installed.

The issue arises when dill serializes the namedtuple used for the default values in:

ColumnInfo = collections.namedtuple(

Serialization here goes through pyspark's code, because pyspark has hijacked the serializer.

Is there any way to avoid using namedtuples in tfx_bsl altogether? That would save future people in a similar environment from the chaos I had to go through to find this issue.
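As a sketch of what a replacement could look like, a frozen dataclass gives immutable records with defaults without ever calling collections.namedtuple. This assumes Python 3.7+; `ExampleInfo` and its fields are hypothetical placeholders, not the real ColumnInfo definition, and I have not verified that this alone fixes the Dataflow launch.

```python
import copy
import dataclasses


@dataclasses.dataclass(frozen=True)
class ExampleInfo:
    """Hypothetical record; the field names are placeholders."""
    name: str
    size: int = 0


info = ExampleInfo(name="feature_a", size=3)

# copy.copy goes through __reduce_ex__, the same protocol pickle uses,
# so this exercises the default (unpatched) reduction path.
clone = copy.copy(info)
```

The trade-off is that a dataclass is not a tuple, so any caller relying on indexing or tuple unpacking would need to change.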

Related: https://issues.apache.org/jira/browse/SPARK-22674
Filed new issue in PySpark: https://issues.apache.org/jira/browse/SPARK-32079
