Several libraries monkeypatch `collections.namedtuple` because it is difficult to pickle (namely pyspark: https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/serializers.py#L385 and beam: https://github.com/apache/beam/blob/v2.21.0/sdks/python/apache_beam/internal/pickler.py#L150). This makes it difficult to use TFX when you have pyspark in your environment, as both libraries try to hijack the pickling at the same time.
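For context, here is a minimal sketch of the monkeypatching pattern both libraries use (simplified, not their exact code): the stock factory is wrapped so every generated class gets a custom `__reduce__` that rebuilds the class from its name and fields at unpickling time. When two libraries do this, whichever patched last wins.

```python
import collections

_old_namedtuple = collections.namedtuple  # keep a handle on the real factory

def _restore(name, fields, values):
    # Rebuild the class from scratch at unpickling time, then the instance.
    return _old_namedtuple(name, fields)(*values)

def _hijacked_namedtuple(name, fields, **kwargs):
    cls = _old_namedtuple(name, fields, **kwargs)
    # Replace __reduce__ so pickling no longer depends on the class being
    # importable from the module where it was defined.
    cls.__reduce__ = lambda self: (_restore, (name, cls._fields, tuple(self)))
    return cls

collections.namedtuple = _hijacked_namedtuple  # the actual hijack
```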
I know this is really an issue of pyspark and beam both trying to solve the same problem with monkeypatching, but I'm wondering if it's possible to move away from namedtuples in the tfx_bsl codebase. I found the issue while trying to launch a Dataflow job from my JupyterLab environment, which has PySpark installed.

The issue came up when dill serialized the namedtuple used for the default values in:
tfx-bsl/tfx_bsl/coders/csv_decoder.py, line 56 (at commit a49260a)
Serialization there goes through pyspark's replacement, because pyspark has hijacked `collections.namedtuple`.
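A minimal reproduction sketch (assuming pyspark and dill are installed; `ColumnInfo` here is an illustrative stand-in, not the actual tfx_bsl type):

```python
import pyspark  # noqa: F401 -- importing pyspark alone applies its namedtuple hijack
import collections
import dill

# Any namedtuple defined after the import carries pyspark's custom __reduce__.
ColumnInfo = collections.namedtuple("ColumnInfo", ["name", "type"])
default = ColumnInfo(name=None, type=None)

# dill serializes the pyspark-injected reconstruction function by reference,
# so the payload can only be unpickled where pyspark is importable -- e.g.
# not on a Dataflow worker that doesn't have pyspark installed.
payload = dill.dumps(default)
print(dill.loads(payload))
```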
Wondering, is there any way to avoid using namedtuples in tfx_bsl altogether? That would spare future people in a similar environment the chaos I had to go through to find this.

Related: https://issues.apache.org/jira/browse/SPARK-22674
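One possible direction (a hedged sketch, not a claim about what tfx_bsl should adopt): a frozen dataclass pickles by reference to its module-level class, so it is untouched by libraries that monkeypatch `collections.namedtuple`.

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass(frozen=True)
class ColumnInfo:
    """Immutable record standing in for a namedtuple; fields are illustrative."""
    name: Optional[str] = None
    type: Optional[int] = None

# Pickles via the standard by-reference mechanism, independent of any
# namedtuple monkeypatching.
info = ColumnInfo(name="feature_a", type=1)
```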
Filed new issue in PySpark: https://issues.apache.org/jira/browse/SPARK-32079