Main repository for "DProvDB: Differentially Private Query Processing with Multi-Analyst Provenance", accepted to appear in Proc. of the ACM on Management of Data (PACMMOD/SIGMOD'2024)

DProvDB

Main repository for "DProvDB: Differentially Private Query Processing with Multi-Analyst Provenance", accepted to appear in Proc. of the ACM on Management of Data (PACMMOD/SIGMOD'2024) [bibtex] [tech report]

Brief Intro

The DProvDB project aims to build an online DP query processing system that serves multiple data analysts with different trust levels. These analysts are prohibited from colluding by law or regulation, but they have an incentive to collude in order to obtain more accurate query answers. Our goal is to develop DP algorithms that minimize the privacy loss when analysts are compromised, and to build a system, DProvDB, that maximizes the total number of queries that can be answered under a fixed privacy budget.

This repository contains the implementation of the DProvDB system.

TL;DR

We provide automatic scripts for installing DProvDB, running it, and reproducing the experimental results presented in the paper.

# install prerequisite packages and DProvDB
bash ./install.sh

# run DProvDB
bash ./run.sh

# plot results
bash ./plot.sh

Note that the experiment code can take a while to run (typically 1-2 hours, since it performs 4 runs of every experiment). Afterwards, the plots for the paper can be found in the “/plot_code” directory.

If you want to manually install and run or test with your own test cases, we suggest reading the rest of the Code Guide.

Repository Structure

├── DProvDB/
│   ├── src/                              *Main directory
│   │   ├── main/scala/DProvDB/           *System source code
│   │   └── test/resources/schema.yaml    *Database configuration
│   ├── data/                             *Datasets and experimental results
│   ├── chorus/                           *Chorus submodule
│   ├── DProvDB_full.pdf                  *Technical report
│   └── build.sbt                         *Project dependencies

Installation Guide

Hardware Requirements

  • This codebase has been tested on Linux (Ubuntu 22.04), Mac (M1 chip), and Windows 10 (with WSL). We recommend Linux as a stable testbed, and the instructions below are written for Linux.

  • The system code is written in Scala/Java, with some supporting scripts in Python, so the codebase should not depend on any specific hardware.

  • 16 GB RAM should be sufficient for reproducing the experimental results for this codebase.

Datasets

The experimental results are evaluated over two datasets:

  • The Adult dataset from UCI Machine Learning Repository (accessible in the code repository).

  • The TPC-H dataset generated using the TPC-H kit with 1 GB scale factor (https://github.com/gregrahn/tpch-kit).

Software Requirements

  • Java Runtime Environment, with both the Maven (mvn) and sbt build tools.

  • Scala version 2.12.2

  • PostgreSQL version 16.1

  • CMake, GCC (for generating TPC-H dataset)
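Before installing, it can help to confirm that the tools listed above are on the PATH. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and the command names (`java`, `mvn`, `sbt`, `psql`, `cmake`, `gcc`) are assumed to be the usual executables for the listed requirements. It checks only for presence, not for the versions given above.

```shell
# Hypothetical helper: print each named tool that is NOT found on the PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Assumed executable names for the requirements listed above.
missing="$(missing_tools java mvn sbt psql cmake gcc)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```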

Installing DProvDB

  • First, clone the repository and change into it.

git clone https://github.com/DProvDB/DProvDB.git
cd DProvDB

  • Install Chorus as a submodule.

git submodule add https://github.com/uvm-plaid/chorus.git chorus
git submodule update --init --recursive

cd chorus
mvn install
cd ..

Note: Chorus may be missing some dependencies. If the build fails with a dependency error, add the following line to the SBT file.

libraryDependencies += "com.google.guava" % "guava" % "28.0-jre"

If the slf4j version is not compatible with your platform, try changing the version of the "org.slf4j" package in "chorus/pom.xml" to "1.7.13" (available: Sept, 2024).

  • Prepare the data by loading it into PostgreSQL. The commands below use Adult as an example; the TPC-H dataset can be imported similarly using the TPC-H kit.

createdb adult
psql adult

CREATE TABLE adult (AGE INTEGER NOT NULL,
WORKCLASS VARCHAR(55) NOT NULL,
FNLWQT INTEGER NOT NULL,
EDUCATION VARCHAR(55) NOT NULL,
EDUCATION_NUM INTEGER NOT NULL,
MARITAL_STATUS VARCHAR(55) NOT NULL,
OCCUPATION VARCHAR(55) NOT NULL,
RELATIONSHIP VARCHAR(55) NOT NULL,
RACE VARCHAR(55) NOT NULL,
SEX VARCHAR(55) NOT NULL,
CAPITAL_GAIN INTEGER NOT NULL,
CAPITAL_LOSS INTEGER NOT NULL,
HOURS_PER_WEEK INTEGER NOT NULL,
NATIVE_COUNTRY VARCHAR(55) NOT NULL,
SALARY VARCHAR(55) NOT NULL);

\copy adult FROM './data/adult.data' DELIMITER ',' CSV

CREATE USER link WITH PASSWORD '12345';
GRANT ALL PRIVILEGES ON TABLE adult TO link;

exit;

Note: This test PostgreSQL instance listens on the default port 5432. If you are using another port or a different username/password, please update lines 40-42 in ‘src/main/scala/Experiments/Experiments.scala’ accordingly.
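To check that the loaded table is reachable with these settings before running any experiments, one option is to query it through a connection URI. This is a sketch under stated assumptions: the helper name `pg_uri` is hypothetical, the host `localhost` is assumed (the guide only specifies the port), and the defaults mirror the user, password, port, and database created above. A password containing special characters would need percent-encoding in the URI.

```shell
# Hypothetical helper: build a PostgreSQL connection URI.
# Defaults follow the setup above; "localhost" is an assumption.
pg_uri() {
  user=${1:-link}
  password=${2:-12345}
  port=${3:-5432}
  db=${4:-adult}
  printf 'postgresql://%s:%s@localhost:%s/%s\n' "$user" "$password" "$port" "$db"
}

echo "Connection URI: $(pg_uri)"

# Once PostgreSQL is running, uncomment to count the loaded rows:
# psql "$(pg_uri)" -c 'SELECT COUNT(*) FROM adult;'
```

If you changed the port, pass it as the third argument, e.g. `pg_uri link 12345 5433`.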

You can also use your own data with DProvDB, but the database schema must then be configured in 'src/test/resources/schema.yaml'.

Testing with DProvDB

Manually Run and Evaluate DProvDB

Running DProvDB Code

From the ‘/DProvDB’ directory, run the following command with the arguments filled in.

The command takes four arguments:

  • [args1]: dataset, must be "adult" or "tpch";

  • [args2]: task, must be "RRQ" or "EQW";

  • [args3]: table, e.g., "adult", or "orders";

  • [args4]: a string of five letters selecting which experiments to run, with "T" to run and "F" to skip the experiment at that position; e.g., "TFTFT" runs every experiment except the 2nd and the 4th.

sbt "run [args1] [args2] [args3] [args4]"

For example, to run all five experiments on the adult dataset using RRQ, use

sbt "run adult RRQ adult TTTTT"
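The positional meaning of the fourth argument can be illustrated with a small helper that lists which experiments a flag string selects. This is only a sketch: `selected_experiments` is a hypothetical name, and the strict check for exactly five uppercase "T"/"F" letters is an assumption based on the description above, not necessarily how DProvDB itself validates its input.

```shell
# Hypothetical helper: print the 1-based positions of the experiments
# that a five-letter T/F flag string selects.
selected_experiments() {
  flags=$1
  case $flags in
    [TF][TF][TF][TF][TF]) ;;   # exactly five letters, each T or F
    *) echo "expected exactly 5 letters, each T or F: '$flags'" >&2
       return 1 ;;
  esac
  i=1
  out=""
  while [ "$i" -le 5 ]; do
    letter=$(printf '%s' "$flags" | cut -c "$i")
    if [ "$letter" = "T" ]; then
      out="$out $i"
    fi
    i=$((i + 1))
  done
  printf '%s\n' "${out# }"     # drop the leading space
}

selected_experiments TFTFT   # prints: 1 3 5
```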

Note: the experimental results data file is automatically stored in ‘data/’.
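To cover both tasks for one dataset, the sbt invocation can be generated for each task in turn. The sketch below only prints the commands so they can be reviewed before being run one at a time; `print_runs` is a hypothetical helper, not a script shipped with the repository, and it validates only the dataset name.

```shell
# Hypothetical helper: print (do not run) one sbt command per task.
print_runs() {
  dataset=$1
  table=$2
  flags=$3
  case $dataset in
    adult|tpch) ;;
    *) echo "dataset must be \"adult\" or \"tpch\"" >&2
       return 1 ;;
  esac
  for task in RRQ EQW; do
    printf 'sbt "run %s %s %s %s"\n' "$dataset" "$task" "$table" "$flags"
  done
}

print_runs adult adult TTTTT
# prints:
#   sbt "run adult RRQ adult TTTTT"
#   sbt "run adult EQW adult TTTTT"
```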

How to cite:

@inproceedings{zhang2024dprovdb,
  author={Zhang, Shufan and He, Xi},
  title={DProvDB: Differentially Private Query Processing with Multi-Analyst Provenance}, 
  journal={Proceedings of the ACM on Management of Data (SIGMOD'2024)},
  url={https://arxiv.org/abs/2309.10240},
  note={to appear}
}

Correspondence

📬 Shufan Zhang 📜 Homepage
📬 Xi He 📜 Homepage

License

BSD-3-Clause License
