Unfolder

License: MIT pre-commit

An extremely fast directory exploration tool to find:

  • Largest files
  • Duplicated files
  • ... to be continued

In directories of any size and structure.

A bar chart with benchmark results

Example of analyzing the Apache Airflow codebase

  • ⚡️ Analyzes large software projects in < 1 sec
  • 🤝 Respects .gitignore files
  • 🏠 Works locally and doesn't send your data anywhere
  • 📖 Performs only read operations and doesn't modify files
  • 💾 Has been tested on directories up to 100 GB in size, with 20,000 files and 5,000 subfolders

Use cases

Unfolder can be useful for:

  • Software maintainers to reduce repo size and eliminate duplicate files, within or across projects.
  • Project managers to avoid extra data storage costs and keep a single location for each key artifact.

Benchmarks

Unfolder analyzes codebases of large open-source projects in under half a second:

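| Project        | Files | Folders | Elapsed time, ms |
|----------------|-------|---------|------------------|
| Apache Airflow | 7,558 | 1,713   | 310              |
| Ruff           | 7,374 | 615     | 182              |
| React          | 6,467 | 532     | 156              |
| CPython        | 5,182 | 420     | 136              |
| Kedro          | 527   | 122     | 176              |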

Time values are measured during local runs on a MacBook Pro with Apple M1 Max chip, 32 GB RAM.

Getting started

Warning

This is a personal pet project of Yury Fedotov, implemented to learn Rust by doing. It may contain bugs or incomplete features. It is shared "as is" without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

Use at your own risk. The author is not responsible for any damage, data loss, or other issues that may arise from using this software.

Installation

Currently, only installation from source is supported:

  1. Make sure you have the Rust toolchain set up.
  2. Clone the project repo locally and cd into it.
  3. Run cargo build --release to build the binary executable for the tool.
  4. Run cargo install --path . to install this executable and make it available as the unfolder command in your shell.

Usage

The tool currently has just one CLI command which is available as:

unfolder path/to/directory/

In addition to the path to a directory, it takes three optional arguments:

| Argument | Short | Long | Options | Default |
|----------|-------|------|---------|---------|
| List of file extensions to analyze | -e | --extensions | Comma-separated, e.g. py,png | All |
| Minimum file size to consider for duplicate analysis | | --min_file_size | One of the following aliases: blank, config, code, excel, document, image, gif, audio, video, large | code (100 KB) |
| Number of largest files to return based on size | -n | --n_top | Any positive integer | 5 |

So for example:

unfolder path/to/directory/ -e csv,pkl,png,gif --min_file_size image

Would:

  • Analyze path/to/directory/.
  • Consider only files with csv, pkl, png and gif extensions.
  • While identifying duplicates, ignore files smaller than the size the image alias implies (10 MB).

You can also run unfolder -h to get info on arguments.
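
To make the semantics of -e and -n concrete, here is a minimal Rust sketch of an extension filter followed by a "top N by size" selection. It is not taken from Unfolder's source: the function names (parse_extensions, has_allowed_extension, top_n_largest) are hypothetical, and the case-insensitive extension match and the tie-breaking by path are assumptions made for the illustration.

```rust
use std::path::{Path, PathBuf};

/// Split a comma-separated list such as "csv,pkl,png" into extensions.
/// (Hypothetical helper; the real CLI parsing may differ.)
fn parse_extensions(raw: &str) -> Vec<&str> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// True if `path` has one of the allowed extensions.
/// An empty list means "all files", matching the documented default.
/// Assumption: the comparison ignores ASCII case.
fn has_allowed_extension(path: &Path, extensions: &[&str]) -> bool {
    if extensions.is_empty() {
        return true;
    }
    match path.extension().and_then(|e| e.to_str()) {
        Some(ext) => extensions
            .iter()
            .any(|allowed| allowed.eq_ignore_ascii_case(ext)),
        None => false,
    }
}

/// Keep files with an allowed extension and return the `n_top` largest.
/// Ties in size are broken by path so the result is deterministic.
fn top_n_largest(
    files: &[(PathBuf, u64)],
    extensions: &[&str],
    n_top: usize,
) -> Vec<(PathBuf, u64)> {
    let mut kept: Vec<(PathBuf, u64)> = files
        .iter()
        .filter(|(path, _)| has_allowed_extension(path, extensions))
        .cloned()
        .collect();
    // Largest first, then by path for equal sizes.
    kept.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    kept.truncate(n_top);
    kept
}
```

In this sketch, calling top_n_largest with the extensions parsed from "csv,pkl,png,gif" and n_top = 5 corresponds to running unfolder with -e csv,pkl,png,gif and the default -n value.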

FAQ

What makes Unfolder fast?

  • Unfolder is written in pure Rust, which gives a very performant baseline.
  • It leverages parallelism to analyze files faster, with a speedup that grows with the number of CPU cores available.
  • To check for duplicate files, it leverages a very fast hashing algorithm: xxHash64.
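
The ideas above (parallel work plus content hashing) can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not Unfolder's actual implementation: hash_file and find_duplicates are hypothetical names, the work is split across plain std scoped threads, and the standard library's DefaultHasher (SipHash) stands in for xxHash64 so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Hash a file's full contents.
/// DefaultHasher is a std-only stand-in; the README names xxHash64.
fn hash_file(path: &Path) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok(hasher.finish())
}

/// Group files whose content hashes match, ignoring files smaller than
/// `min_size` bytes. Equal hashes strongly suggest, but do not prove,
/// identical contents.
fn find_duplicates(paths: Vec<PathBuf>, min_size: u64, workers: usize) -> Vec<Vec<PathBuf>> {
    // Keep only regular files at or above the size threshold.
    let candidates: Vec<PathBuf> = paths
        .into_iter()
        .filter(|p| {
            fs::metadata(p)
                .map(|m| m.is_file() && m.len() >= min_size)
                .unwrap_or(false)
        })
        .collect();

    let workers = workers.max(1);
    let chunk_size = ((candidates.len() + workers - 1) / workers).max(1);

    // Hash each chunk on its own scoped thread; unreadable files are skipped.
    let hashed: Vec<(u64, PathBuf)> = thread::scope(|s| {
        let handles: Vec<_> = candidates
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .filter_map(|p| hash_file(p).ok().map(|h| (h, p.clone())))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("hashing thread panicked"))
            .collect()
    });

    // Bucket paths by hash and keep buckets with more than one file.
    let mut groups: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for (hash, path) in hashed {
        groups.entry(hash).or_default().push(path);
    }
    let mut duplicates: Vec<Vec<PathBuf>> =
        groups.into_values().filter(|g| g.len() > 1).collect();
    for group in &mut duplicates {
        group.sort();
    }
    duplicates.sort();
    duplicates
}
```

Here min_size plays the role of the --min_file_size threshold, and workers would typically be set to the number of available cores. A production tool would likely stream large files through the hasher instead of reading them fully into memory.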

Can Unfolder delete files?

No. As can be validated from its open-source code, it performs only read operations.