# PySpark Apache Beam Runner

## Overview

This project introduces a custom Apache Beam runner that leverages PySpark directly. Note that this is not a portability-framework-compliant runner. Unlike the default Spark runner shipped with Beam, it is designed for environments where a SparkSession is available but a Spark master server is not. This makes it particularly useful for serverless environments where jobs are triggered without a long-running cluster, sidestepping the expectations of Beam's default Spark runner.

### Why Another Spark Runner?

1. **Serverless Compatibility**: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks such as EMR Serverless.
2. **Python-Centric Approach**: The entire stack (compilation, optimization, and execution planning) happens in Python, which can be advantageous for Python-focused teams and environments.
3. **Simplified Setup**: Reduces the complexity of job submission by removing the need to reach a listening Spark master.
4. **Direct PySpark Integration**: Uses an existing PySpark SparkSession directly.

## Features

- Direct integration with PySpark
- Serverless compatibility
- Simplified setup process
- Python-centric execution stack
- Support for key Beam transforms (Create, ReadFromText, ParDo, Flatten, GroupByKey, CombinePerKey)
- Efficient handling of side inputs and the DoFn lifecycle

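As a mental model for the grouping transforms listed above, `CombinePerKey` groups `(key, value)` pairs by key and folds each group's values with a combine function. The sketch below is a plain-Python illustration of those semantics only (it is not the runner's actual Spark-backed implementation, and `combine_per_key` is a hypothetical helper name):

```python
from itertools import groupby
from operator import itemgetter

def combine_per_key(pairs, combine_fn):
    """Illustrative semantics of Beam's CombinePerKey: group (key, value)
    pairs by key, then reduce each group's values with combine_fn."""
    # groupby only merges adjacent items, so sort by key first
    pairs = sorted(pairs, key=itemgetter(0))
    return [(key, combine_fn(value for _, value in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

print(combine_per_key([("a", 1), ("b", 2), ("a", 3)], sum))
# [('a', 4), ('b', 2)]
```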
## Prerequisites

- Apache Spark
- Apache Beam
- Python 3.8 or later

## Installation

To use this custom runner, install it with pip:

```bash
pip install beam-pyspark-runner
```

## Usage

Here's a basic example of how to use the PySpark Beam Runner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_pyspark_runner import PySparkRunner

# Define a pipeline with the custom runner implemented here
with beam.Pipeline(runner=PySparkRunner(), options=PipelineOptions()) as p:
    # For example, a classic word count:
    lines = p | 'ReadLines' >> beam.io.ReadFromText('input.txt')
    counts = (lines
              | 'Split' >> beam.FlatMap(lambda line: line.split())
              | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
              | 'GroupAndSum' >> beam.CombinePerKey(sum))
    counts | 'WriteOutput' >> beam.io.WriteToText('output.txt')

# Leaving the `with` block executes the pipeline using PySpark
```

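To check your expectations against the pipeline above without a Spark environment, the same logic can be expressed in plain Python. This is only an illustration of what the Beam transforms compute (`word_count` is a hypothetical helper, not part of this package):

```python
from collections import Counter

def word_count(lines):
    """Plain-Python equivalent of the pipeline above:
    FlatMap(split) -> Map(word -> (word, 1)) -> CombinePerKey(sum)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # split each line into words and tally
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```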
## Current Status and Limitations

This project is currently in development. While it supports many common Beam transforms, some advanced features may not be fully implemented. Contributions and feedback are welcome!

## Contributing

We welcome contributions to the PySpark Apache Beam Runner! Here are ways you can contribute:

1. Report bugs or request features by opening an issue
2. Improve documentation
3. Submit pull requests with bug fixes or new features

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Acknowledgements

This project builds upon the great work done by the Apache Beam and Apache Spark communities. We're grateful for their ongoing efforts in advancing distributed data processing technologies.