Implement File Format Reader/Writer #72

Merged
sjrusso8 merged 19 commits into sjrusso8:main on Oct 11, 2024

lexara-prime-ai
Contributor

Description

This pull request enables DataFrames to be read from and written to several file formats (CSV, JSON, ORC, Parquet, and text), configured through per-format option structs that implement the ConfigOpts trait. Key changes include:

  • DataFrameWriter:

    • Implemented functionality to write DataFrames to csv, json, orc, parquet and text formats.
    • Added support for configuration options for each file format (i.e., csv, json, orc, parquet and text).
    • ConfigOpts trait implemented for JsonOptions, OrcOptions, and ParquetOptions.
    • Unit tests added to validate configuration options, similar to TableOptions tests in DataFusion.
  • DataFrameReader:

    • Enabled reading DataFrames from csv, json, orc, parquet and text formats.
    • Supported predefined reading options using the ConfigOpts trait for flexible configuration.
    • Added advanced options like schema merging and recursive file lookup for Parquet files.
    • Developed tests ensuring correct functionality of the DataFrameReader configurations and file parsing.
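
The description above says each options struct is turned into key-value pairs through the ConfigOpts trait. A minimal sketch of that idea follows; the method name `to_options`, the struct fields, and their types are assumptions made for illustration and may not match the code in this pull request.

```rust
use std::collections::HashMap;

/// Hypothetical shape of the trait: turn a typed options struct into
/// the string key-value pairs a reader or writer would consume.
pub trait ConfigOpts {
    fn to_options(&self) -> HashMap<String, String>;
}

/// Illustrative subset of Parquet read options (field names assumed).
#[derive(Debug, Clone, Default)]
pub struct ParquetOptions {
    pub merge_schema: Option<bool>,
    pub recursive_file_lookup: Option<bool>,
}

impl ConfigOpts for ParquetOptions {
    fn to_options(&self) -> HashMap<String, String> {
        let mut options = HashMap::new();
        // Only options the caller set explicitly are emitted.
        if let Some(v) = self.merge_schema {
            options.insert("mergeSchema".to_string(), v.to_string());
        }
        if let Some(v) = self.recursive_file_lookup {
            options.insert("recursiveFileLookup".to_string(), v.to_string());
        }
        options
    }
}
```

Leaving unset fields out of the map lets the engine fall back to its own defaults instead of having them restated on the client side.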

Related Issue(s)

Documentation

lexara-prime-ai added 12 commits August 27, 2024 15:47
…o8#53)

- Added CsvOptions struct to support CSV read options like `header`, `delimiter`, and `nullValue`.
- Implemented ConfigOpts trait for CsvOptions to convert options into key-value pairs.
- Updated DataFrameReader to include `csv` method that accepts CsvOptions.
…o8#54)

- Added documentation for the CsvOptions struct.
- Updated the csv method in DataFrameReader to support both single string slices and arrays of string slices as input paths.
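
One way a method can accept either a single string slice or an array of string slices is a small conversion trait. The sketch below is only an illustration of that technique: `IntoPaths` and `csv_paths` are hypothetical names, and the pull request may solve this differently (for example with an `IntoIterator` bound).

```rust
/// Hypothetical helper trait: anything that can be viewed as a list of paths.
pub trait IntoPaths {
    fn into_paths(self) -> Vec<String>;
}

impl IntoPaths for &str {
    fn into_paths(self) -> Vec<String> {
        vec![self.to_string()]
    }
}

impl<const N: usize> IntoPaths for [&str; N] {
    fn into_paths(self) -> Vec<String> {
        self.iter().map(|p| p.to_string()).collect()
    }
}

impl IntoPaths for &[&str] {
    fn into_paths(self) -> Vec<String> {
        self.iter().map(|p| p.to_string()).collect()
    }
}

/// Stand-in for a reader method; it only normalizes the input paths.
pub fn csv_paths<P: IntoPaths>(paths: P) -> Vec<String> {
    paths.into_paths()
}
```

With this in place, `csv_paths("a.csv")` and `csv_paths(["a.csv", "b.csv"])` both compile and yield a `Vec<String>` of paths.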
…so8#54)

- Added JsonOptions struct to support JSON read options like `schema`, `multi_line`, `encoding`, and more.
- Implemented ConfigOpts trait for JsonOptions to convert options into key-value pairs.
- Updated DataFrameReader to include `json` method that accepts JsonOptions.
- Documented all available JSON options, including example usage for setting options when reading JSON files. [TO DO]
- Write tests to validate JSON options functionality.
…o8#54)

- Example usage provided for setting ORC options when reading files.
- Write tests to validate ORC options functionality.
…russo8#54)

- Added ParquetOptions struct to support Parquet read options like `mergeSchema`, `pathGlobFilter`, and `recursiveFileLookup`.
- Implemented ConfigOpts trait for ParquetOptions to convert options into key-value pairs.
- Updated DataFrameReader to include `parquet` method that accepts ParquetOptions.
- Example usage provided for setting Parquet options when reading files.
- Write tests to validate Parquet options functionality.
…so8#54)

- Added TextOptions struct to support text read options like `wholetext`, `lineSep`, and `pathGlobFilter`.
- Implemented ConfigOpts trait for TextOptions to convert options into key-value pairs.
- Updated DataFrameReader to include `text` method that accepts TextOptions.
- Example usage provided for setting text options when reading files.
- Write tests to validate text options functionality.
…riter (sjrusso8#54)

- Added TextOptions struct to support text write options such as `whole_text` and `line_sep`.
- Added ParquetOptions struct to support Parquet write options like `merge_schema`, `path_glob_filter`, and `datetime_rebase_mode`.
- Implemented `write` method in DataFrameWriter to handle configuration for text and Parquet file formats.
- Example usage provided for setting text and Parquet options when writing DataFrames.
- Write tests to validate text and Parquet file writing functionality.
…russo8#54)

- Added support for reading and writing .csv, .json, .orc, .parquet, and .text file formats.
- Implemented the `ConfigOpts` trait for each file type's options struct to manage options in a structured way.
- Added example method signatures for file reading using a configurable options object passed into methods.
Owner

@sjrusso8 sjrusso8 left a comment

Overall this looks good! Left some comments for your consideration. I think the keys used when creating the HashMap need to be the camelCase option names from the Spark docs.
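
To illustrate the point about key names, here is a hedged sketch in which the Rust fields stay snake_case while the emitted keys use the option names quoted earlier in this thread (`wholetext`, `lineSep`). The struct layout and the `keys_are_spark_style` helper are hypothetical and not taken from the pull request.

```rust
use std::collections::HashMap;

/// Illustrative text options; Rust fields stay snake_case.
#[derive(Debug, Clone, Default)]
pub struct TextOptions {
    pub whole_text: Option<bool>,
    pub line_sep: Option<String>,
}

impl TextOptions {
    /// Emit the option names used earlier in this thread
    /// (`wholetext`, `lineSep`) rather than the Rust field names.
    pub fn to_options(&self) -> HashMap<String, String> {
        let pairs = vec![
            ("wholetext", self.whole_text.map(|v| v.to_string())),
            ("lineSep", self.line_sep.clone()),
        ];
        pairs
            .into_iter()
            .filter_map(|(key, value)| value.map(|v| (key.to_string(), v)))
            .collect()
    }
}

/// Guard that could back a unit test: no emitted key may contain an underscore.
pub fn keys_are_spark_style(options: &HashMap<String, String>) -> bool {
    options.keys().all(|k| !k.contains('_'))
}
```

A check like `keys_are_spark_style` is deliberately crude: it catches a snake_case key slipping through, but it does not verify that a key is one the engine actually recognizes.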

core/src/readwriter.rs: 10 review threads, all resolved (2 marked outdated)
…jrusso8#54)

    - Implemented additional fields in ParquetOptions, including compression.
    - Updated test_dataframe_read_parquet_with_options to ensure valid compression codec usage.
    - Enhanced test_dataframe_read_text_with_options to properly read lines by setting line_sep and disabling whole_text.
    - Added #[derive(Debug, Clone)] to all options structs.
    - Updated expected path_glob_filter type to string.
    - Added the compression field to ParquetOptions, OrcOptions, and JsonOptions.
    - Updated documentation for all Options structs to include descriptions for new and existing fields.
…usso8#54)

    - Introduced CommonFileOptions to handle the common configuration fields:
        - path_glob_filter
        - recursive_file_lookup
        - ignore_corrupt_files
        - ignore_missing_files
        - modified_before
        - modified_after

    - Updated CsvOptions, JsonOptions, OrcOptions, ParquetOptions, and TextOptions
    to use CommonFileOptions for the shared fields.

    - Updated the new() constructors for each file format options struct to initialize
    CommonFileOptions.

    - Refactored tests for each file format (e.g., ORC, CSV) to utilize the new
    CommonFileOptions, ensuring that both format-specific and shared options
    are properly tested.

    - Updated and verified tests for DataFrame reading and writing operations with updated options.
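
The CommonFileOptions refactor described in this commit can be pictured as plain struct composition. The sketch below is an assumption-laden illustration: the field types, the `common` field name, the `extend_into` helper, and the `merge_schema` field on OrcOptions are hypothetical, and only a subset of the shared fields is shown.

```rust
use std::collections::HashMap;

/// Shared fields, a subset of those listed in the commit (types assumed).
#[derive(Debug, Clone, Default)]
pub struct CommonFileOptions {
    pub path_glob_filter: Option<String>,
    pub recursive_file_lookup: Option<bool>,
    pub ignore_corrupt_files: Option<bool>,
}

impl CommonFileOptions {
    /// Append whichever shared options are set to an existing map.
    pub fn extend_into(&self, options: &mut HashMap<String, String>) {
        if let Some(v) = &self.path_glob_filter {
            options.insert("pathGlobFilter".to_string(), v.clone());
        }
        if let Some(v) = self.recursive_file_lookup {
            options.insert("recursiveFileLookup".to_string(), v.to_string());
        }
        if let Some(v) = self.ignore_corrupt_files {
            options.insert("ignoreCorruptFiles".to_string(), v.to_string());
        }
    }
}

/// Format-specific struct that embeds the shared options.
#[derive(Debug, Clone, Default)]
pub struct OrcOptions {
    pub merge_schema: Option<bool>,
    pub common: CommonFileOptions,
}

impl OrcOptions {
    /// Constructor that also initializes the shared options.
    pub fn new() -> Self {
        Self { merge_schema: None, common: CommonFileOptions::default() }
    }

    /// Format-specific keys first, then the shared ones.
    pub fn to_options(&self) -> HashMap<String, String> {
        let mut options = HashMap::new();
        if let Some(v) = self.merge_schema {
            options.insert("mergeSchema".to_string(), v.to_string());
        }
        self.common.extend_into(&mut options);
        options
    }
}
```

Keeping the shared keys in one place means a rename or a new common option only has to be handled once rather than in every format struct.
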
@lexara-prime-ai
Contributor Author

Hi @sjrusso8, I just updated the PR.

@sjrusso8
Owner

@lexara-prime-ai LGTM! Just update the README.md and mark these as closed. Then I'll merge in the change.

@lexara-prime-ai
Contributor Author

Just updated the README.md @sjrusso8 👍

@sjrusso8 sjrusso8 merged commit 84f170a into sjrusso8:main Oct 11, 2024
3 checks passed
Successfully merging this pull request may close these issues.

Implement File Format Reader/Writer
2 participants