csv-batcher

Vertical scaling

A lightweight, Python-based, multi-process CSV batcher: it splits a large CSV into smaller chunks and uses multiprocessing to process them in parallel. Suitable for use with pandas DataFrames, as a standalone tool, or alongside other tools that work with large CSV files (or files that simply require timely processing).

Installation

pip install csv-batcher

GitHub

https://github.com/tangledpath/csv-batcher

Documentation

https://tangledpath.github.io/csv-batcher/csv_batcher.html

Further exercises

  • Possibly implement pooling with Celery (for use in Django apps, etc.), which would enable horizontal scaling; a rough sketch follows.
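A minimal sketch of what Celery-based pooling might look like. Nothing here exists in csv-batcher today: the app name, broker URL, and process_chunk task are hypothetical, shown only to illustrate dispatching chunk files to workers on other machines.

from celery import Celery
import pandas as pd

# Hypothetical Celery app; the Redis broker URL is an assumption for illustration.
app = Celery("csv_tasks", broker="redis://localhost:6379/0")

@app.task
def process_chunk(csv_chunk_filename):
    # Each Celery worker would process one chunk file independently.
    df = pd.read_csv(csv_chunk_filename, skipinitialspace=True, index_col=None)
    return len(df)

# Chunk filenames (produced by the CSV-splitting step) would then be dispatched
# to workers, e.g.:
#   results = [process_chunk.delay(f) for f in chunk_filenames]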

Usage

Arguments sent to the callback function are controlled by creating the pooler with the callback_with argument set to one of the CallbackWith enum values:
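In short, each mode determines what the callback receives (summarizing the full examples that follow):

from csv_batcher.csv_pooler import CallbackWith

CallbackWith.DATAFRAME_ROW   # callback receives each row as a pandas Series
CallbackWith.DATAFRAME       # callback receives each chunk as a DataFrame
CallbackWith.CSV_FILENAME    # callback receives the filename of each chunk CSV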

As dataframe row

from csv_batcher.csv_pooler import CSVPooler, CallbackWith

# Callback function passed to pooler; accepts a dataframe row
#   as a pandas Series (via apply)
def process_dataframe_row(row):
    return row.iloc[0]

pooler = CSVPooler(
    "5mSalesRecords.csv",
    process_dataframe_row,
    callback_with=CallbackWith.DATAFRAME_ROW,
    pool_size=16
)
for processed_batch in pooler.process():
    print(processed_batch)

As dataframe

from csv_batcher.csv_pooler import CSVPooler, CallbackWith

# Used from process_dataframe's apply:
def process_dataframe_row(row):
    return row.iloc[0]

# Callback function passed to pooler; accepts a dataframe:
def process_dataframe(df):
    foo = df.apply(process_dataframe_row, axis=1)
    # Or do something more complicated....
    return len(df)

pooler = CSVPooler(
    "5mSalesRecords.csv",
    process_dataframe,
    callback_with=CallbackWith.DATAFRAME,
    pool_size=16
)
for processed_batch in pooler.process():
    print(processed_batch)

As CSV filename

import pandas as pd
from csv_batcher.csv_pooler import CSVPooler, CallbackWith

# Used from process_csv_filename's apply:
def process_dataframe_row(row):
    return row.iloc[0]

def process_csv_filename(csv_chunk_filename):
    # print("processing ", csv_chunk_filename)
    df = pd.read_csv(csv_chunk_filename, skipinitialspace=True, index_col=None)
    foo = df.apply(process_dataframe_row, axis=1)
    return len(df)

pooler = CSVPooler(
    "5mSalesRecords.csv",
    process_csv_filename,
    callback_with=CallbackWith.CSV_FILENAME,
    chunk_lines=10000,
    pool_size=16
)
for processed_batch in pooler.process():
    print(processed_batch)

Development

Linting

ruff check . # Find linting errors
ruff check . --fix # Auto-fix linting errors (where possible)

Documentation

# Shows in browser
poetry run pdoc csv_batcher
# Generates to ./docs
poetry run pdoc csv_batcher -o ./docs
# OR (recommended)
bin/build.sh

Testing

clear; pytest

Publishing

poetry publish --build -u __token__ -p $PYPI_TOKEN
# OR (recommended)
bin/publish.sh
