Several libraries monkeypatch `collections.namedtuple` because it is difficult to pickle (namely pyspark: https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/serializers.py#L385 and beam: https://github.com/apache/beam/blob/v2.21.0/sdks/python/apache_beam/internal/pickler.py#L150). This makes it difficult to use TFX when you have pyspark in your environment, as both libraries try to hijack the pickling at the same time.
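For context, here is a minimal sketch of the monkeypatching pattern both libraries use (simplified, not their exact code): the stock factory is wrapped so every generated class gets a custom `__reduce__` that rebuilds the class from its name and fields at unpickling time. When two libraries do this, whichever patched last wins.

```python
import collections

_old_namedtuple = collections.namedtuple  # keep a handle on the real factory

def _restore(name, fields, values):
    # Rebuild the class from scratch at unpickling time, then the instance.
    return _old_namedtuple(name, fields)(*values)

def _hijacked_namedtuple(name, fields, **kwargs):
    cls = _old_namedtuple(name, fields, **kwargs)
    # Replace __reduce__ so pickling no longer depends on the class being
    # importable from the module where it was defined.
    cls.__reduce__ = lambda self: (_restore, (name, cls._fields, tuple(self)))
    return cls

collections.namedtuple = _hijacked_namedtuple  # the actual hijack
```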
I know this is really an issue of pyspark and beam both trying to solve the same problem with monkeypatching, but I'm wondering if it's possible to move away from namedtuples in the tfx_bsl codebase. I found the issue while trying to launch a Dataflow job from my JupyterLab environment, which has PySpark installed.

The issue came up when dill serialized the namedtuple used for the default values in:
tfx-bsl/tfx_bsl/coders/csv_decoder.py, line 56 (at commit a49260a)
Serialization there goes through pyspark's replacement, because pyspark has hijacked `collections.namedtuple`.
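A minimal reproduction sketch (assuming pyspark and dill are installed; `ColumnInfo` here is an illustrative stand-in, not the actual tfx_bsl type):

```python
import pyspark  # noqa: F401 -- importing pyspark alone applies its namedtuple hijack
import collections
import dill

# Any namedtuple defined after the import carries pyspark's custom __reduce__.
ColumnInfo = collections.namedtuple("ColumnInfo", ["name", "type"])
default = ColumnInfo(name=None, type=None)

# dill serializes the pyspark-injected reconstruction function by reference,
# so the payload can only be unpickled where pyspark is importable -- e.g.
# not on a Dataflow worker that doesn't have pyspark installed.
payload = dill.dumps(default)
print(dill.loads(payload))
```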
Wondering, is there any way to avoid using namedtuples in tfx_bsl altogether? That would spare future people in a similar environment the chaos I had to go through to find this.

Related: https://issues.apache.org/jira/browse/SPARK-22674
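One possible direction (a hedged sketch, not a claim about what tfx_bsl should adopt): a frozen dataclass pickles by reference to its module-level class, so it is untouched by libraries that monkeypatch `collections.namedtuple`.

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass(frozen=True)
class ColumnInfo:
    """Immutable record standing in for a namedtuple; fields are illustrative."""
    name: Optional[str] = None
    type: Optional[int] = None

# Pickles via the standard by-reference mechanism, independent of any
# namedtuple monkeypatching.
info = ColumnInfo(name="feature_a", type=1)
```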
Filed new issue in PySpark: https://issues.apache.org/jira/browse/SPARK-32079