Create 2024-03-14-duck-db.md (#1043)
* Create 2024-03-14-duck-db.md

* Update 2024-03-14-duck-db.md

* Apply suggestions from code review

Co-authored-by: ben8t <46634684+Ben8t@users.noreply.github.com>

* update

* [feature] add image properly

* Update 2024-03-14-duck-db.md

* Update 2024-03-14-duck-db.md

* Update 2024-03-14-duck-db.md

---------

Co-authored-by: ben8t <46634684+Ben8t@users.noreply.github.com>
Co-authored-by: Benoit <benoit@kestra.io>
3 people authored Mar 14, 2024
1 parent eec0944 commit 00df9da
Showing 4 changed files with 97 additions and 0 deletions.
97 changes: 97 additions & 0 deletions content/blogs/2024-03-14-duck-db.md
@@ -0,0 +1,97 @@
---
title: "Orchestrate your Lakehouse with Kestra and DuckDB"
description: "DuckDB and Kestra offer a solution that reduces the costs and complexities of your lakehouse architecture"
date: 2024-03-14T14:00:00
category: Solutions
author:
  name: Benoit Pimpaud
  image: "bpimpaud"
image: /blogs/2024-03-14-duck-db.jpg
---

**Combining Kestra and DuckDB to create a lakehouse architecture offers a modern approach to managing data by merging the strengths of data lakes and warehouses. This significantly reduces costs and complexities.**

**This post summarizes a talk I gave at the first DuckDB Meetup in Paris, where I explored how DuckDB's in-process columnar database integrates with the orchestration capabilities of Kestra. It also covers the transition from traditional data storage methods to the more efficient and flexible lakehouse model, and shares best practices for using DuckDB within Kestra to build scalable, cost-effective data ecosystems.**


## Data Lake, Data Warehouse or Data Lakehouse?

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. A data warehouse, by contrast, is a system that structures data to support use cases such as reporting and data analysis, and it is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place, where it is used to create analytical reports for workers throughout the enterprise.

The lakehouse architecture merges the flexibility of data lakes with the structured query capabilities of data warehouses. It consists of three layers: a query engine, a transaction layer, and a storage layer.

The transaction layer introduces an abstraction over raw storage, enabling direct table-like access to raw data, facilitated by Open Table Formats such as [Apache Iceberg](https://iceberg.apache.org/) or [Apache Hudi](https://hudi.apache.org/). This design enhances analytical and operational workflows, providing ACID transactions and efficient data management within a unified platform.

### We Need a Control Plane

To manage various data sources, applications, and infrastructure, a robust control plane is essential. Kestra aligns with the concept of a control plane by providing a centralized, comprehensive orchestration layer for managing data workflows, infrastructure, and applications. It simplifies complex operations, enabling users to define, automate, and monitor processes across different environments. This orchestration layer acts as a control plane by offering visibility and control over various components, streamlining execution, and enhancing efficiency in data-driven ecosystems.

## About DuckDB

DuckDB is often described as the SQLite for analytics. It’s an open-source embedded OLAP database that runs in-process rather than relying on the traditional client-server architecture. As with SQLite, there’s no need to install a database server to get started. Installing DuckDB instantly turns your laptop into an OLAP engine capable of aggregating large volumes of data at an impressive speed using just the CLI and SQL. It’s especially useful for reading data from local files or objects stored in cloud storage buckets (e.g., Parquet files on S3).

This lightweight embedded database allows fast queries in virtually any environment with almost no setup. However, it doesn’t offer high concurrency or user management, and it doesn’t scale horizontally. That’s where MotherDuck can help.

## Orchestrate Your Lakehouse with Kestra and DuckDB

In our exploration of DuckDB's integration within Kestra environments, we've identified several layers of complexity that showcase the applications of DuckDB as the query engine for lakehouse architectures. From automating basic queries to implementing advanced data management and analytics, these levels reflect the practical scenarios and solutions our user community is actively deploying:

- **Level 1:** focuses on automating queries and managing dependencies within the stack, using DuckDB for query execution and an S3 bucket for storage, with Kestra orchestrating the workflow. This level is straightforward and fits projects that need to be efficient and rely on simple technologies. However, it does not address SQL query management, metadata management, or "big data" capabilities: because it relies solely on the DuckDB engine, it can be hard to scale horizontally and distribute compute.

![image](/blogs/2024-03-14-duck-db/level1.png)
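
As an illustration of Level 1, a flow along these lines could run a DuckDB aggregation directly over Parquet files in S3. This is a minimal sketch rather than a reference implementation: the bucket path, column names, region, and secret names are hypothetical, and the `fetch` property reflects the DuckDB plugin as of early 2024, so check the plugin documentation for your Kestra version.

```yaml
id: duckdb_s3_level1
namespace: blueprint

tasks:
  - id: aggregate_orders
    type: io.kestra.plugin.jdbc.duckdb.Query
    # Assumption: the bucket, columns, region, and secret names are placeholders.
    sql: |
      INSTALL httpfs;
      LOAD httpfs;
      SET s3_region = 'eu-west-1';
      SET s3_access_key_id = '{{ secret("AWS_ACCESS_KEY_ID") }}';
      SET s3_secret_access_key = '{{ secret("AWS_SECRET_ACCESS_KEY") }}';
      SELECT order_date, SUM(amount) AS total_amount
      FROM read_parquet('s3://my-bucket/orders/*.parquet')
      GROUP BY order_date
      ORDER BY order_date;
    fetch: true  # expose the result rows as task outputs
```

Here the S3 bucket is the storage layer, DuckDB is the query engine, and Kestra only schedules and monitors the query.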


- **Level 2:** advances by incorporating SQL query management through tools like SQLMesh or dbt, still orchestrated by Kestra. It improves on Level 1 by decoupling the business logic (the SQL queries) from the orchestration logic. For example, a dbt project can be stored in a dedicated GitHub repository and pulled at runtime in Kestra. This way, dependencies between upstream tasks (data ingestion or data wrangling) and downstream tasks (alerting, refreshing a BI tool, etc.) are easy to manage and monitor. It also simplifies backfilling and retry strategy management.
Still, it remains limited by the absence of metadata management and "big data" support.

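```yaml
id: dbt_duckdb
namespace: blueprint

tasks:
  # Both child tasks share one working directory, so dbt sees the cloned project
  - id: dbt
    type: io.kestra.core.tasks.flows.WorkingDirectory
    tasks:
      # Pull the dbt project at runtime
      - id: cloneRepository
        type: io.kestra.plugin.git.Clone
        url: https://github.com/kestra-io/dbt-example
        branch: main

      # Run the dbt models in a container that ships dbt with the DuckDB adapter
      - id: dbt-run
        type: io.kestra.plugin.dbt.cli.DbtCLI
        runner: DOCKER
        docker:
          image: ghcr.io/kestra-io/dbt-duckdb:latest
        commands:
          - dbt run
```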
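
The retry and scheduling aspects mentioned for Level 2 could be expressed as in the sketch below, which repeats the flow above with two additions: a `retry` block on the dbt task and a `triggers` section. The flow ID, cron expression, and retry values are arbitrary assumptions, and the trigger type follows the `io.kestra.core` naming used in this post, which may differ in later Kestra releases.

```yaml
id: dbt_duckdb_scheduled
namespace: blueprint

tasks:
  - id: dbt
    type: io.kestra.core.tasks.flows.WorkingDirectory
    tasks:
      - id: cloneRepository
        type: io.kestra.plugin.git.Clone
        url: https://github.com/kestra-io/dbt-example
        branch: main

      - id: dbt-run
        type: io.kestra.plugin.dbt.cli.DbtCLI
        runner: DOCKER
        docker:
          image: ghcr.io/kestra-io/dbt-duckdb:latest
        commands:
          - dbt run
        # Assumption: retry a failed run up to 3 times, 5 minutes apart
        retry:
          type: constant
          interval: PT5M
          maxAttempt: 3

triggers:
  # Assumption: run every day at 06:00
  - id: daily
    type: io.kestra.core.models.triggers.types.Schedule
    cron: "0 6 * * *"
```

The SQL models stay in the dbt repository, so changing the schedule or the retry policy never touches the business logic.
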

- **Level 3:** represents a full-fledged lakehouse model, adding metadata management, ACID properties, and time travel through the integration of Apache Iceberg as the transaction layer. This level keeps the advances of Level 2 and addresses its limitations, offering a comprehensive solution for data management and analytics. Note that this integration is still in an early phase: writing to Iceberg tables from DuckDB is still in development, and several challenges could arise. Still, it's something open-source [developers are working on](https://github.com/duckdb/duckdb_iceberg), and it could become a mature and very interesting solution soon.

![image](/blogs/2024-03-14-duck-db/level3.png)

Each level builds on the previous one, adding more sophisticated features and capabilities, and illustrates Kestra's flexibility in orchestrating complex data ecosystems with DuckDB as the query engine.
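
For Level 3, reading an Iceberg table from DuckDB could look roughly like the flow below. It is a sketch under assumptions: the table location is hypothetical, S3 credentials are omitted for brevity, and it only reads, since write support was still in development at the time of writing. The `iceberg_scan` and `iceberg_snapshots` functions come from the DuckDB `iceberg` extension linked above.

```yaml
id: duckdb_iceberg_level3
namespace: blueprint

tasks:
  # Query the current state of the table
  - id: read_iceberg
    type: io.kestra.plugin.jdbc.duckdb.Query
    # Assumption: the table location is a placeholder.
    sql: |
      INSTALL httpfs;
      LOAD httpfs;
      INSTALL iceberg;
      LOAD iceberg;
      SELECT COUNT(*) AS row_count
      FROM iceberg_scan('s3://my-bucket/warehouse/orders');
    fetch: true

  # List the snapshots recorded in the table metadata
  - id: list_snapshots
    type: io.kestra.plugin.jdbc.duckdb.Query
    sql: |
      INSTALL httpfs;
      LOAD httpfs;
      INSTALL iceberg;
      LOAD iceberg;
      SELECT *
      FROM iceberg_snapshots('s3://my-bucket/warehouse/orders');
    fetch: true
```

Each task loads the extensions again because, in this sketch, every `Query` task opens its own in-memory DuckDB database.
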

## The End of Big Data?

This article from [MotherDuck](https://motherduck.com/blog/big-data-is-dead/) highlights a shift towards more manageable data sizes in many organizations, suggesting that large-scale distributed computing might not be necessary for everyone. Instead, using single-node databases like DuckDB for most data scenarios, where volumes are under 10GB, offers a cost-effective, simplified approach. For tasks that demand distributed computing, tools like Databricks or BigQuery provide the necessary scale. This perspective encourages a balanced approach to data analytics, prioritizing efficiency and practicality over the pursuit of handling ever-increasing data sizes.

Having a control plane like Kestra is key here: it lets you manage different projects with the tools that suit them. For example, some pipelines can use DuckDB as the backbone for fast, efficient small-data processing, while other workflows run in parallel and use Spark to distribute heavy processing across several nodes. Kestra bridges the gap between these solutions by adding dependency management and branching, making it easy for any engineer to connect pipelines.
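
As a sketch of that idea, a parent flow could run a DuckDB-based pipeline and a Spark-based pipeline side by side and continue only once both have finished. The flow IDs and namespace below are hypothetical, and the task type names follow the `io.kestra.core` naming used in this post.

```yaml
id: control_plane
namespace: blueprint

tasks:
  - id: run_pipelines
    type: io.kestra.core.tasks.flows.Parallel
    tasks:
      # Hypothetical flow doing small-data processing with DuckDB
      - id: small_data
        type: io.kestra.core.tasks.flows.Subflow
        namespace: blueprint
        flowId: duckdb_pipeline
        wait: true            # block until the subflow completes
        transmitFailed: true  # fail this task if the subflow fails

      # Hypothetical flow submitting heavy processing to Spark
      - id: heavy_data
        type: io.kestra.core.tasks.flows.Subflow
        namespace: blueprint
        flowId: spark_pipeline
        wait: true
        transmitFailed: true

  # Runs only after both subflows have succeeded
  - id: notify
    type: io.kestra.core.tasks.log.Log
    message: Both pipelines finished
```
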

## Engineers Are Your First Budget Cost, Not Compute

While the lakehouse architecture can greatly reduce computing costs, an equally crucial aspect is the human element of development.

Kestra's intuitive, declarative syntax enhances productivity, enabling teams to focus more on innovation and less on the intricacies of orchestration. It facilitates the separation of business and orchestration logic through a fully managed control plane, whether through the web user interface or through code assets such as Terraform resources or [GitOps patterns](https://kestra.io/blogs/2024-02-06-gitops).

To go further with the integration of DuckDB and Kestra, you can read our article on [how to automate your data analysis using Kestra and DuckDB](https://kestra.io/blogs/2023-04-25-automate-data-analysis-with-kestra-and-duckdb) or our article about [when you need to move from DuckDB to MotherDuck](https://kestra.io/blogs/2023-07-28-duckdb-vs-motherduck).

Join the Slack [community](https://kestra.io/slack) if you have any questions or need assistance.

Follow us on [Twitter](https://twitter.com/kestra_io) for the latest news.

Check the code in our [GitHub repository](https://github.com/kestra-io/kestra) and give us a star if you like the project.
Binary file added public/blogs/2024-03-14-duck-db.jpg
Binary file added public/blogs/2024-03-14-duck-db/level1.png
Binary file added public/blogs/2024-03-14-duck-db/level3.png
