In our previous post, we explored why most companies don’t need complex distributed systems for their data needs. The small data movement shows that modern hardware and smart tools can handle far more than we give them credit for.
Now it’s time to build something.
This guide walks you through creating a complete analytics platform using DuckDB and Superset—no Spark clusters, no expensive warehouses, just open-source tools that get the job done. We’ll start with local development on your laptop, then show you how to make it accessible to your entire organization.
Let’s say you’re working at a company that currently does analytics in Excel spreadsheets and ad-hoc Python scripts. This doesn’t scale well: Excel struggles with even slightly larger datasets, and Python scripts quickly become messy when you’re trying to explore data visually. You have:
Our goal: Build a platform where you can:
The architecture (starting simple):
Data Sources (CSV, Excel, Databases)
↓
DuckDB (transformation & query)
↓
Superset (visualization)
Later, we’ll evolve this to support multiple users safely.
Let’s start on your laptop. This is where you’ll develop your transformations and build your first dashboards.
First, create a project directory and install DuckDB:
# Create project directory
mkdir small-data-platform
cd small-data-platform
# Create a Python virtual environment with uv
uv init
# Install DuckDB
uv add duckdb
If you don’t have uv installed yet, read here how to install it.
Note: DuckDB also works as a standalone CLI tool (just download the binary) or even in your browser. But we’re using Python here since most data folks are already comfortable with it.
Let’s load some CSV data. Create a file called load_data.py:
import duckdb
# Create a persistent database file
# This file will store your tables and can be reused
con = duckdb.connect('analytics.duckdb')
# Load CSV data directly - DuckDB infers the schema
con.execute("""
CREATE TABLE sales AS
SELECT * FROM read_csv('data/sales_2024.csv')
""")
# Load Excel files (yes, DuckDB can read Excel!)
con.execute("""
INSTALL excel;
LOAD excel;
CREATE TABLE product_usage AS
SELECT * FROM read_xlsx('data/usage_logs.xlsx')
""")
# Connect to your existing Postgres database
con.execute("""
INSTALL postgres;
LOAD postgres;
CREATE TABLE customers AS
SELECT * FROM postgres_scan(
'host=localhost port=5432 dbname=production user=readonly',
'public',
'customers'
)
""")
con.close()
print("✅ Data loaded successfully!")
Run it:
uv run load_data.py
You now have an analytics.duckdb file containing all your data. This file is a complete database that you can version control, back up, or share with colleagues.
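If you want a quick sanity check of what made it into the file, a few lines of Python will do (this snippet only assumes the tables created above):

import duckdb

# Open the database file read-only and list its contents
con = duckdb.connect("analytics.duckdb", read_only=True)
print(con.execute("SHOW TABLES").df())
print(con.execute("SELECT COUNT(*) AS row_count FROM sales").fetchone())
con.close()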
Now let’s transform this data into something useful. Create transformations.py:
import duckdb
con = duckdb.connect('analytics.duckdb')
# Create a view for monthly revenue by customer segment
con.execute("""
CREATE OR REPLACE VIEW monthly_revenue AS
SELECT
DATE_TRUNC('month', s.sale_date) as month,
c.segment,
c.region,
COUNT(DISTINCT s.customer_id) as active_customers,
SUM(s.amount) as total_revenue,
AVG(s.amount) as avg_order_value
FROM sales s
JOIN customers c ON s.customer_id = c.id
GROUP BY 1, 2, 3
ORDER BY 1 DESC, 2
""")
# Create a view for product usage metrics
con.execute("""
CREATE OR REPLACE VIEW product_metrics AS
SELECT
product_id,
DATE_TRUNC('week', event_date) as week,
COUNT(DISTINCT user_id) as weekly_active_users,
COUNT(*) as total_events,
COUNT(*) / COUNT(DISTINCT user_id) as events_per_user
FROM product_usage
GROUP BY 1, 2
""")
# Create a customer summary table for dashboarding
con.execute("""
CREATE OR REPLACE TABLE customer_summary AS
SELECT
c.id,
c.name,
c.segment,
c.region,
COUNT(s.id) as total_orders,
SUM(s.amount) as lifetime_value,
MAX(s.sale_date) as last_purchase_date,
DATEDIFF('day', MAX(s.sale_date), CURRENT_DATE) as days_since_purchase
FROM customers c
LEFT JOIN sales s ON c.id = s.customer_id
GROUP BY 1, 2, 3, 4
""")
con.close()
print("✅ Transformations complete!")
Why views vs tables? A view is recomputed every time you query it, so it always reflects the underlying data; a table is materialized once, which makes repeated dashboard queries faster but means you have to re-run the transformation when the source data changes.
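If you want to see the difference for yourself, here’s a small illustration (the sales_count_* names are just for this demo):

import duckdb

con = duckdb.connect("analytics.duckdb")

# A view re-runs its query every time; a table freezes the result at creation time
con.execute("CREATE OR REPLACE VIEW sales_count_view AS SELECT COUNT(*) AS n FROM sales")
con.execute("CREATE OR REPLACE TABLE sales_count_table AS SELECT COUNT(*) AS n FROM sales")

con.execute("BEGIN")
con.execute("INSERT INTO sales SELECT * FROM sales LIMIT 1")       # add a test row
print(con.execute("SELECT n FROM sales_count_view").fetchone())    # sees the new row
print(con.execute("SELECT n FROM sales_count_table").fetchone())   # still the old count
con.execute("ROLLBACK")                                            # undo the test row

con.execute("DROP VIEW sales_count_view")
con.execute("DROP TABLE sales_count_table")
con.close()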
Run your transformations:
uv run transformations.py
Now you can query your transformed data. Create query.py:
import duckdb
con = duckdb.connect("analytics.duckdb", read_only=True)
# Get revenue by segment from July-December 2024
df = con.execute("""
SELECT
month,
segment,
region,
total_revenue,
active_customers
FROM monthly_revenue
WHERE month >= '2024-07-01' AND month < '2025-01-01'
ORDER BY total_revenue DESC, month DESC
""").df()
print(df.head())
con.close()
The .df() at the end converts the result to a Pandas DataFrame, which is handy for plotting or further analysis.
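From there it’s one small step to a chart, for example with matplotlib (assuming you’ve added it to the project with uv add pandas matplotlib):

import duckdb
import matplotlib.pyplot as plt

con = duckdb.connect("analytics.duckdb", read_only=True)
df = con.execute("""
SELECT month, SUM(total_revenue) AS revenue
FROM monthly_revenue
GROUP BY month
ORDER BY month
""").df()
con.close()

# Quick line chart of total revenue per month
df.plot(x="month", y="revenue", title="Monthly revenue")
plt.tight_layout()
plt.savefig("monthly_revenue.png")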
Now let’s visualize this data. We’ll install Superset using Docker (the easiest way). Create a file named docker-compose.yml with the following content:
name: small-data-platform
services:
  superset:
    container_name: superset
    build:
      context: .
      dockerfile_inline: |
        FROM apache/superset:6.0.0-py312
        USER root
        RUN uv pip install psycopg2-binary duckdb 'duckdb-driver>=0.17.0'
        USER superset
    ports:
      - "8088:8088"
    volumes:
      - ./analytics.duckdb:/app/analytics.duckdb
    configs:
      - source: superset_config.py
        target: /app/superset_config.py
    environment:
      - SUPERSET_CONFIG_PATH=/app/superset_config.py
      - DATABASE_DB=superset
      - DATABASE_HOST=db
      - DATABASE_PASSWORD=superset
      - DATABASE_USER=superset
      - SUPERSET_LOAD_EXAMPLES=false
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=superset
      - POSTGRES_PASSWORD=superset
      - POSTGRES_DB=superset
    volumes:
      - superset_postgres_data:/var/lib/postgresql/data
volumes:
  superset_postgres_data:
configs:
  superset_config.py:
    content: |
      SECRET_KEY = "your_super_duper_secret_key_here"
      SQLALCHEMY_DATABASE_URI = "postgresql://superset:superset@db:5432/superset"
This file describes a minimal Superset setup with the dependencies required to visualize data from DuckDB.
Now run the following commands:
docker compose up --build -d
# you might need to wait a bit until it starts and then run
docker exec -it superset /app/docker/docker-init.sh
Open your browser to http://localhost:8088 and log in with admin/admin!
In Superset, you’ll need to add a database connection:
duckdb:////app/analytics.duckdb
Now you can create visualizations:
SELECT
month,
segment,
total_revenue
FROM monthly_revenue
ORDER BY month DESC
LIMIT 100
You now have a working analytics setup on your laptop! You can develop transformations, query data, and build dashboards—all locally.
Your local setup works great for you, but what about your colleagues? They need access to the dashboards without installing DuckDB and Python.
You have two main paths forward:
MotherDuck is a managed DuckDB service (think “DuckDB in the cloud”). Preset is managed Superset.
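For a sense of what the managed path looks like in code, connecting to MotherDuck from Python is just a different connection string (a minimal sketch; it assumes you have a MotherDuck account and a motherduck_token set in your environment):

import duckdb

# "md:" tells DuckDB to connect to MotherDuck instead of a local file;
# authentication is picked up from the motherduck_token environment variable.
con = duckdb.connect("md:")

# From here, the same SQL you ran locally works against cloud-hosted databases
print(con.execute("SHOW DATABASES").df())
con.close()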
This path is the fastest to get started:
Upload your analytics.duckdb file or connect your data sources
When this makes sense:
When to consider alternatives:
Let’s focus on the self-hosted path, which gives you more control and can run on European infrastructure.
This is where things get interesting. We’ll set up a shared analytics platform that multiple users can access simultaneously, while keeping costs reasonable and data in your preferred region.
Here’s what we’re building:
Your Laptop (Development)
↓ (write SQL transformations)
↓ (run locally, push results)
DuckLake on VM
├── Postgres (metadata)
├── OVH Object Storage (actual data)
└── DuckDB (query engine)
↓
Superset (visualization)
Why this architecture? Object storage keeps the bulk of the data cheap and durable, Postgres provides a transactional metadata catalog so multiple users can read and write safely, and DuckDB stays the single query engine you already know from local development.
You might be less comfortable with hyperscalers these days, whether due to data sovereignty concerns, GDPR compliance, or simply wanting to support European infrastructure. Good news: this entire stack runs perfectly on providers like OVH, Hetzner, or Stackit. A brief comparison written by Stijn can be found here: European Cloud Providers Compared.
We’ll use OVH in our examples, but the setup is nearly identical across providers. Stijn also wrote a short guide on how to use OVH with Terraform.
DuckLake is an open-source data lake format that uses:
a SQL database (Postgres in our case) for the metadata catalog
object storage for the actual data, stored as Parquet files
DuckDB as the query engine
Unlike Iceberg or Delta Lake, DuckLake is dead simple to set up and designed specifically for DuckDB. It allows multiple users to query and write to the same data lake without conflicts.
First, provision a VM. On OVH Cloud:
SSH into your VM:
ssh ubuntu@your-vm-ip
Create an OVH Object Storage bucket:
Choose a region (e.g. GRA for Gravelines, France)
Name the bucket small-data-platform
Create an S3 user and generate access credentials for the bucket
Save these credentials—you’ll need them.
Cost: €0.01/GB/month for storage, plus minimal egress (€0.01/GB for downloads)
For a 100GB data lake: ~€1/month storage + ~€5/month bandwidth = €6/month total.
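Before touching the VM, you can sanity-check the bucket credentials from your laptop with DuckDB’s httpfs extension (a throwaway snippet; the bucket name and endpoint match the examples in this post, so adjust them to yours):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Temporary secret holding the OVH S3 credentials from the previous step
con.execute("""
CREATE SECRET ovh_check (
    TYPE s3,
    KEY_ID 'your-ovh-access-key',
    SECRET 'your-ovh-secret-key',
    REGION 'gra',
    ENDPOINT 's3.gra.io.cloud.ovh.net',
    URL_STYLE 'path'
)
""")

# Write a tiny Parquet file and read it back to confirm the bucket is reachable
con.execute("COPY (SELECT 42 AS answer) TO 's3://small-data-platform/connectivity_check.parquet'")
print(con.execute("SELECT * FROM 's3://small-data-platform/connectivity_check.parquet'").fetchall())
con.close()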
# Update system
sudo apt update && sudo apt upgrade -y
# Install Docker and the Docker Compose plugin
sudo apt install -y docker.io docker-compose-v2
# Start Docker
sudo systemctl enable docker
sudo systemctl start docker
# Add your user to docker group
sudo usermod -aG docker $USER
# Log out and back in for this to take effect
Create a directory for your setup:
mkdir ~/small-data-platform
cd ~/small-data-platform
Create a docker-compose.yml file (we’ll reuse the local one and add a database for the DuckLake metadata):
name: small-data-platform
services:
  superset:
    container_name: superset
    build:
      context: .
      dockerfile_inline: |
        FROM apache/superset:6.0.0-py312
        USER root
        RUN uv pip install psycopg2-binary duckdb 'duckdb-driver>=0.17.0'
        USER superset
    ports:
      - "8088:8088"
    volumes:
      - ./analytics.duckdb:/app/analytics.duckdb
    configs:
      - source: superset_config.py
        target: /app/superset_config.py
    environment:
      - SUPERSET_CONFIG_PATH=/app/superset_config.py
      - SUPERSET_SECRET_KEY=${SUPERSET_SECRET_KEY}
      - DATABASE_DB=superset
      - DATABASE_HOST=db
      - DATABASE_PASSWORD=${SUPERSET_DB_PASSWORD}
      - DATABASE_USER=superset
      - SUPERSET_LOAD_EXAMPLES=false
    restart: unless-stopped
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=superset
      - POSTGRES_PASSWORD=${SUPERSET_DB_PASSWORD}
      - POSTGRES_DB=superset
    volumes:
      - superset_postgres_data:/var/lib/postgresql/data
    restart: unless-stopped
  ducklake-metadata:
    image: postgres:16
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=ducklake
      - POSTGRES_PASSWORD=${DUCKLAKE_DB_PASSWORD}
      - POSTGRES_DB=ducklake
    volumes:
      - ducklake-metadata-data:/var/lib/postgresql/data
    restart: unless-stopped
volumes:
  superset_postgres_data:
  ducklake-metadata-data:
configs:
  superset_config.py:
    content: |
      SQLALCHEMY_DATABASE_URI = "postgresql://superset:${SUPERSET_DB_PASSWORD}@db:5432/superset"
Create a .env file with your passwords:
# Generate secure passwords
DUCKLAKE_DB_PASSWORD=$(openssl rand -base64 32)
SUPERSET_DB_PASSWORD=$(openssl rand -base64 32)
SUPERSET_SECRET_KEY=$(openssl rand -base64 42)
# Save them to .env
cat > .env << EOF
DUCKLAKE_DB_PASSWORD=$DUCKLAKE_DB_PASSWORD
SUPERSET_DB_PASSWORD=$SUPERSET_DB_PASSWORD
SUPERSET_SECRET_KEY=$SUPERSET_SECRET_KEY
EOF
echo "✅ Credentials generated and saved to .env"
Start the services:
docker compose up -d
Initialize Superset:
docker exec -it superset /app/docker/docker-init.sh
On your local machine, install the DuckLake extension and initialize the DuckLake (fill in the appropriate secrets):
import duckdb
con = duckdb.connect('analytics.duckdb')
# Install DuckLake extension
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
# Configure connection to your VM's Postgres and OVH Object Storage
con.execute("""
CREATE OR REPLACE PERSISTENT SECRET s3_ducklake (
TYPE s3,
PROVIDER config,
KEY_ID 'your-ovh-access-key',
SECRET 'your-ovh-secret-key',
REGION 'gra',
ENDPOINT 's3.gra.io.cloud.ovh.net'
);
-- Using the DuckLake credentials for Postgres
CREATE OR REPLACE PERSISTENT SECRET postgres_secret (
TYPE postgres,
HOST 'your-vm-ip',
DATABASE 'ducklake',
USER 'ducklake',
PASSWORD 'your-DUCKLAKE_DB_PASSWORD'
);
-- DuckLake config secret
CREATE OR REPLACE PERSISTENT SECRET ducklake_secret (
TYPE ducklake,
METADATA_PATH '',
DATA_PATH 's3://small-data-platform/lake',
METADATA_PARAMETERS MAP {'TYPE': 'postgres','SECRET': 'postgres_secret'}
);
-- This initializes DuckLake
ATTACH 'ducklake:ducklake_secret' AS my_ducklake;
USE my_ducklake;
""")
con.close()
Now you can push your transformed data to the shared lake:
import duckdb
con = duckdb.connect('analytics.duckdb')
# Load DuckLake and attach
con.execute("LOAD ducklake")
con.execute("""ATTACH 'ducklake:ducklake_secret' AS my_ducklake;
USE my_ducklake;""")
# Copy your tables to DuckLake
con.execute("""
CREATE OR REPLACE TABLE my_ducklake.monthly_revenue AS
SELECT * FROM analytics.monthly_revenue
""")
con.execute("""
CREATE OR REPLACE TABLE my_ducklake.customer_summary AS
SELECT * FROM analytics.customer_summary
""")
con.execute("""
CREATE OR REPLACE TABLE my_ducklake.product_metrics AS
SELECT * FROM analytics.product_metrics
""")
print("✅ Data pushed to DuckLake!")
con.close()
Your data is now stored in OVH Object Storage, with metadata in Postgres. Anyone with access can query it.
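Concretely, any colleague who has the same DuckLake, Postgres, and S3 secrets configured on their machine can attach the lake and query it straight from Python:

import duckdb

# Attach the shared lake; the persistent secrets created earlier provide
# the Postgres metadata connection and the OVH S3 credentials.
con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:ducklake_secret' AS my_ducklake")

df = con.execute("""
SELECT segment, SUM(total_revenue) AS revenue
FROM my_ducklake.monthly_revenue
GROUP BY segment
ORDER BY revenue DESC
""").df()
print(df)
con.close()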
Now you need to set up Superset on the VM to query your DuckLake. (Kudos to the guide from Danial Lewis—I did have to fork the duckdb-engine repo to fix some DuckLake compatibility issues and publish it under the name duckdb-driver.)
Now in Superset’s UI:
duckdb:///ducklake:postgres:dbname=ducklake host=ducklake-metadata user=ducklake password={your-ducklake-db-password}
{
  "connect_args": {
    "config": {
      "s3_access_key_id": "your-ovh-access-key",
      "s3_secret_access_key": "your-ovh-secret-key",
      "s3_endpoint": "s3.gra.io.cloud.ovh.net",
      "s3_url_style": "path",
      "s3_use_ssl": false,
      "s3_region": "gra"
    }
  }
}
Awesome! You did it. You made it to the end of this very long piece of text. Now all your small data troubles are over! I hope this helps a bit in showing what’s possible with modern technology.
Now you can create user accounts for your colleagues and create your first charts and dashboards!
Your Superset instance is now running at http://your-vm-ip:8088. However, exposing it directly to the internet isn’t secure. (Especially insecure if we keep the admin:admin login 😅) So please don’t do this with data you are uncomfortable sharing with the world!
There are some options to make it a bit more secure:
You now have a complete small data platform that:
All code used in this tutorial can be found at: github.com/wolkwork/small-data-platform-tutorial
This stack will take you surprisingly far. But you’ll eventually outgrow it. Here are the warning signs and what to do:
Warning Sign 1: Queries are getting slow (>30 seconds)
Warning Sign 2: Multiple people running transformations causes conflicts
Warning Sign 3: Data is growing past 1TB
The key: Each upgrade should be motivated by an actual problem you’re experiencing, not theoretical future needs.
Where to go from here:
Resources to learn more:
This post is Part 2 of our Small Data Platform series. Read Part 1: Why Your Data Stack is Probably Over-Engineered if you haven’t already.
Originally by Nathan Clerkx on the Wolk Blog.