Welcome to my project's README! This is where you get a sneak peek into Mexico's crime data. 🦸♂️🇲🇽
The goal? Simple. We're building a dashboard to showcase criminal stats in Mexico. It's a blend of cool tech, data magic, and a sprinkle of detective work. 🔍💻
- Google Cloud Platform (GCP):
  - Dataproc: Sparkin' up those data jobs.
  - Google Cloud Storage: Where our data calls home.
  - BigQuery: Our data warehouse chill spot.
  - Compute Engine: Hosting a VM for our shenanigans.
- Apache Airflow: The conductor orchestrating our data symphony.
- DBT (Data Build Tool): Crafting our data models with finesse. 🔨
- Docker: Wrapping Airflow, DBT, and friends in neat little containers. 🐳
- Terraform: Building castles in the cloud (GCP). 🏰
Imagine a world where data flows like a river, getting processed, stored, and visualized seamlessly. That's our game plan! 🌊📈
Our data journey is like riding a wave 🏄♂️—it starts with a splash and ends with a splashier insight! Here's the lowdown:
- Data Ingestion 🚀:
  - Raw data rides in from all corners of Mexico, packing juicy details about crimes, locations, and times. Think of it as our data surfers catching the gnarliest waves of info.
  - We scoop up this data and stash it in our beachfront hangout: Google Cloud Storage (GCS). It's like the cool surf shack where all the rad data hangs out.
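Curious what the splash looks like in practice? Here's a minimal sketch of hauling a raw extract into GCS; the bucket and file names below are placeholders, not the project's real ones:

```bash
# copy a local raw extract into the GCS "surf shack"
# (bucket name and object path are hypothetical)
gsutil cp data/raw/mexico_crimes.csv gs://<your-raw-data-bucket>/raw/crimes/
```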
- Data Processing 🌊:
  - Time to ride the wave! With Apache Airflow as our wave master, we shred through tasks like data cleanup, transformation, and enrichment. 🤙
  - Spark jobs on Google Dataproc clusters do the heavy lifting, turning raw data into polished pearls ready for analysis.
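Behind the scenes, those Airflow tasks boil down to submitting PySpark jobs to Dataproc. A hedged sketch of the equivalent manual submission (cluster name, region, and GCS paths are placeholders, not the project's real ones):

```bash
# submit a PySpark cleanup/transform job to a Dataproc cluster
# (cluster name, region, and GCS paths are hypothetical)
gcloud dataproc jobs submit pyspark gs://<your-code-bucket>/jobs/clean_crimes.py \
    --cluster=<your-dataproc-cluster> \
    --region=<your-region> \
    -- gs://<your-raw-data-bucket>/raw/crimes/
```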
- Data Storage 🏄♀️:
  - Our polished pearls make their way to the BigQuery beach house, where they chillax in structured tables. BigQuery is our go-to spot for sippin' on SQL queries and soaking up insights.
  - Sometimes, we stash intermediate goodies here too, for quick access during our surfing sessions.
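Once the tables are chilling in BigQuery, you can sip on them straight from the CLI. A quick sketch (project, dataset, table, and column names are placeholders):

```bash
# run a standard-SQL query against the warehouse
# (project, dataset, table, and column names are hypothetical)
bq query --use_legacy_sql=false \
  'SELECT state, COUNT(*) AS crime_count
   FROM `<your-gcp-project>.<your_dataset>.crimes`
   GROUP BY state
   ORDER BY crime_count DESC
   LIMIT 10'
```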
- Data Modeling 💪:
  - Now it's time to sculpt! With DBT (Data Build Tool), we mold our data into sleek, well-defined shapes. Think of it as giving our insights a killer beach bod.
  - SQL magic in DBT transforms our data, adding layers of meaning and depth, turning raw numbers into stories worth telling.
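Day-to-day sculpting happens through the dbt CLI. A minimal sketch, assuming your dbt profile already points at BigQuery (the model selector is a hypothetical name):

```bash
# build the models, then test them
dbt run --select staging      # "staging" is a placeholder model/folder name
dbt test
dbt docs generate             # optional: generate the docs site
```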
- Data Visualization 🌊:
  - Hang on tight, 'cause here come the waves of insight! 🌊 We hop on the LookerStudio visualization tool to craft custom dashboards and ride those data waves.
  - Interactive dashboards make it a breeze for everyone to catch a glimpse of the big kahuna: key metrics, trends, and patterns in Mexico's crime scene.
And that's how we roll—riding the data waves, catching insights, and turning them into action-packed adventures in the world of Mexican crime stats! 🔍
- Clone Me:
  ```bash
  git clone https://github.com/emilianolel/dez-project-emi.git
  ```
- Get Things Ready:
  ```bash
  cd dez-project-emi
  ```
- Hook Up GCP:
  - Spin up a GCP project.
  - Flip those API switches (Dataproc, BigQuery, LookerStudio).
  - Make some service accounts dance to your tune; a sketch of that setup, with example roles, follows right below this list.
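The exact roles depend on what your own setup needs, so treat the ones below as placeholders rather than gospel. Here's a minimal sketch with the gcloud CLI (the service-account name, roles, and key path are all assumptions):

```bash
# enable the APIs the stack leans on (service names are illustrative)
gcloud services enable dataproc.googleapis.com bigquery.googleapis.com storage.googleapis.com

# create a service account (the name is hypothetical)
gcloud iam service-accounts create dez-project-sa --display-name="DEZ Project SA"

# grant some example roles -- adjust to whatever your project really needs
gcloud projects add-iam-policy-binding <your-gcp-project> \
    --member="serviceAccount:dez-project-sa@<your-gcp-project>.iam.gserviceaccount.com" \
    --role="roles/storage.admin"
gcloud projects add-iam-policy-binding <your-gcp-project> \
    --member="serviceAccount:dez-project-sa@<your-gcp-project>.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"
gcloud projects add-iam-policy-binding <your-gcp-project> \
    --member="serviceAccount:dez-project-sa@<your-gcp-project>.iam.gserviceaccount.com" \
    --role="roles/dataproc.editor"

# download a key file to use as your secret.json later on
gcloud iam service-accounts keys create secret.json \
    --iam-account=dez-project-sa@<your-gcp-project>.iam.gserviceaccount.com
```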
- Fire Up Terraform: Open `terraform/variables.tf` and replace the variables with the ones that fit you better, such as your service account and `secret.json` file. Then run:
  ```bash
  cd terraform
  terraform init
  terraform apply
  ```
- Connect to VM:
  - Open Google Cloud Console and get the external ip of your VM.
  - Connect to it using:
    ```bash
    ssh -i <path/to/ssh/private/key> <username>@<external_ip_address>
    ```
  - If you get the following error message:
    ```
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the ED25519 key sent by the remote host is
    ```
    you can run:
    ```bash
    ssh-keygen -R <external_ip_address>
    ```
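If you'd rather skip juggling keys by hand, the gcloud CLI can handle the connection for you (the VM name and zone are placeholders):

```bash
# let gcloud manage the SSH keys and open the session
gcloud compute ssh <your-vm-name> --zone=<your-zone>
```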
- Install Docker:
  - From the project directory simply run:
    ```bash
    ./setup/install_docker.sh
    ```
  - Then run the following commands:
    ```bash
    # create the docker group and add your user to it so you can run docker without sudo
    sudo groupadd docker
    sudo usermod -aG docker $USER
    # pick up the new group membership in the current shell
    newgrp docker
    # verify the installation
    docker run hello-world
    ```
- Set up Airflow ENV Variables:
  - Open the `airflow/airflow.env` file and modify the following variables:
    ```bash
    # Params to touch
    USER_GCP_SECRET_PATH=.secrets/gcp/
    USER_GCP_SECRET_NAME=<your-secret-json>
    GCP_SERVICE_ACCOUNT=<your-service-account-id>@<your-gcp-project>.iam.gserviceaccount.com
    GCP_PROJECT=<your-gcp-project>
    ```
  - Run the following command in order to create the Dockerfile:
    ```bash
    ./docker/make_dockerfile.sh
    ```
- Launch Airflow:
  - Dockerize Airflow. In the file `./airflow/scripts/entrypoint.sh`, change the line `airflow db init` to `airflow db reset` (there's a one-liner for this edit right after this list).
  - Blast off using Docker Compose or Kubernetes.
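If you'd rather not open an editor for that swap, here's a one-liner sketch (it assumes the line appears exactly as written above):

```bash
# swap "airflow db init" for "airflow db reset" in the entrypoint script
sed -i 's/airflow db init/airflow db reset/' ./airflow/scripts/entrypoint.sh
```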
- Start Airflow:
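A minimal sketch for this step, assuming the repo ships a Docker Compose file for the Airflow stack (the file name and service layout are assumptions):

```bash
# build and start the Airflow containers in the background
docker compose up -d --build
# check that the webserver and scheduler came up happy
docker compose ps
```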
- Check Out the UI:
  - Visit http://localhost:8080.
  - Dive into the DAGs and hit play!
- Explore the Dashboard:
We're all about teamwork! If you're itching to hop on board, here's how:
- Fork this repo.
- Cook up your feature branch (`git checkout -b feature/YourFeature`).
- Add your magic (`git commit -am 'Added awesome stuff'`).
- Push it like you mean it (`git push origin feature/YourFeature`).
- Open that sweet, sweet Pull Request.
This project is licensed under the MIT License. Go ahead, have fun with it!