Merge pull request #6 from moradology/feature/docs-update

Update readme with more info about usage
moradology · Sep 4, 2024 · b544406 · b544406
2 parents e9281b7 + f340985
commit b544406
Showing 1 changed file with 62 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -1,34 +1,79 @@
 # PySpark Apache Beam Runner
 
 ## Overview
-(WHY? Doesn't Beam ship with a Spark runner?)
 
-This project introduces a custom Apache Beam runner that leverages PySpark directly.
-This is not a 'portability' framework compliant runner! It is designed for environments
-where a SparkSession is available but a Spark master server is not. This is useful for
-e.g. serverless environments where jobs are triggered without a long-running cluster,
-sidestepping the expectations of Beam's default Spark runner.
+This project introduces a custom Apache Beam runner that leverages PySpark directly. Unlike the default Spark runner shipped with Beam, this runner is designed for environments where a SparkSession is available but a Spark master server is not. This makes it particularly useful for serverless environments where jobs are triggered without a long-running cluster.
 
-The other benefit is that this strategy for building a runner helps to keep the stack as
-python-centric as possible. The compilation process, the optimizations, the execution
-planning - these all happen in python (for better or worse). Depending on your needs,
-this might be a significant advantage.
+### Why Another Spark Runner?
+
+1. **Serverless Compatibility**: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks such as EMR Serverless.
+2. **Python-Centric Approach**: The entire stack - compilation, optimizations, and execution planning - happens in Python, which can be advantageous for Python-focused teams and environments.
+3. **Simplified Setup**: Potentially reduces the complexity of job submission by avoiding the need for port listening on a Spark master.
+4. **Direct PySpark Integration**: Utilizes an assumed PySpark SparkSession directly.
 
 ## Features
-- **Direct Integration with PySpark**: Utilizes a PySpark  assumed SparkSession directly.
-- **Serverless Compatibility**: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks.
-- **Simplified Setup**: Potentially reduces the complexity of job submission by avoiding the need for port listening on a Spark master.
 
-## Getting Started
+- Direct integration with PySpark
+- Serverless compatibility
+- Simplified setup process
+- Python-centric execution stack
+- Support for key Beam transforms (Create, ReadFromText, ParDo, Flatten, GroupByKey, CombinePerKey)
+- Efficient handling of side inputs and DoFn lifecycle
+
+## Prerequisites
 
-### Prerequisites
 - Apache Spark
 - Apache Beam
 - Python 3.8 or later
 
-### Installation
-To use this custom runner, just `pip install` as you would any library
+## Installation
+
+To use this custom runner, simply install it using pip:
 
 ```bash
 pip install beam-pyspark-runner
 ```
+
+## Usage
+
+Here's a basic example of how to use the PySpark Beam Runner:
+
+```python
+from apache_beam import Pipeline
+from apache_beam.options.pipeline_options import PipelineOptions
+from beam_pyspark_runner import PySparkRunner
+
+# Define a pipeline with the custom runner implemented here
+with Pipeline(runner=PySparkRunner(), options=PipelineOptions()) as p:
+    # Your Beam pipeline definition here
+    # For example:
+    lines = p | 'ReadLines' >> beam.io.ReadFromText('input.txt')
+    counts = (lines
+              | 'Split' >> beam.Map(lambda x: x.split())
+              | 'PairWithOne' >> beam.Map(lambda words: [(w, 1) for w in words])
+              | 'GroupAndSum' >> beam.CombinePerKey(sum))
+    counts | 'WriteOutput' >> beam.io.WriteToText('output.txt')
+
+# The pipeline will be executed using PySpark
+```
+
+## Current Status and Limitations
+
+This project is currently in development. While it supports many common Beam transforms, some advanced features may not be fully implemented. Contributions and feedback are welcome!
+
+## Contributing
+
+We welcome contributions to the PySpark Apache Beam Runner! Here are ways you can contribute:
+
+1. Report bugs or request features by opening an issue
+2. Improve documentation
+3. Submit pull requests with bug fixes or new features
+
+## License
+
+This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
+
+## Acknowledgements
+
+This project builds upon the great work done by the Apache Beam and Apache Spark communities.
+We're grateful for their ongoing efforts in advancing distributed data processing technologies.